12 Advanced models for differential abundance
GLMs are the basis for advanced testing of differential abundance in sequencing data. This is necessary, as the sequencing data sets deviate from symmetric, continuous, Gaussian assumptions in many ways.
12.1 Particular properties of taxonomic profiling data
12.1.1 Discrete count data
Sequencing data consists of discrete counts:
## Sample-1 Sample-2 Sample-3
## Actinomycetaceae 0 1 0
## Aerococcus 0 0 0
## Aeromonas 0 0 0
## Akkermansia 18 97 67
## Alcaligenes faecalis et rel. 1 2 3
12.1.3 Rarity
Long tails of rare taxa:
12.1.4 Overdispersion
Variance exceeds the mean:
means <- apply(abundances(d),1,mean)
variances <- apply(abundances(d),1,var)
# Calculate mean and variance over samples for each taxon
library(reshape2)
library(dplyr)
df <- melt(abundances(d))
names(df) <- c("Taxon", "Sample", "Reads")
df <- df %>% group_by(Taxon) %>%
summarise(mean = mean(Reads),
variance = var(Reads))
# Illustrate overdispersion
library(scales)
p <- ggplot(df, aes(x = mean, y = variance)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1)) +
scale_x_log10(labels = scales::scientific) +
scale_y_log10(labels = scales::scientific) +
labs(title = "Overdispersion (variance > mean)")
print(p)
12.2 Generalized linear models: a brief overview
Let us briefly discuss the ideas underlying generalized linear models.
The Generalized linear model (GLM) allows a richer family of probability distributions to describe the data. Intuitively speaking, GLMs allow the modeling of nonlinear, nonsymmetric, and nongaussian associations. GLMs consist of three elements:
A probability distribution for the data (from exponential family)
A linear predictor targeting the mean, or expectation: \(Xb\)
A link function g such that \(E(Y) = \mu = g^{-1}(Xb)\).
Let us fit Poisson with (natural) log-link just to demonstrate how generalized linear models could be fitted in R. We fit the abundance (read counts) assuming that the data is Poisson distributed, and the logarithm of its mean, or expectation, is obtained with a linear model. For further examples in R, you can also check the statmethods website.
# Load again the example data
d <- dietswap
df <- data.frame(Abundance = abundances(d)[taxa,],
Group = meta(d)$nationality)
res <- glm(Abundance ~ 1, data = df, family = "poisson")
Investigate the model output:
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 2.09286 | 0.02357 | 88.79275 | 0 |
Note the link between mean and estimated coefficient (\(\mu = e^{Xb}\)):
## [1] 8.108108
## (Intercept)
## 8.108108
12.3 DESeq2: differential abundance testing for sequencing data
12.3.1 Fitting DESeq2
[DESeq2 analysis]((https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) accommodates those particular assumptions about sequencing data.
# Start by converting phyloseq object to deseq2 format
library(DESeq2)
d <- dietswap # Phyloseq data
ds2 <- phyloseq_to_deseq2(d, ~ group + nationality)
# Run DESeq2 analysis (all taxa at once!)
dds <- DESeq(ds2)
# Investigate results
deseq.results <- as.data.frame(results(dds))
deseq.results$taxon <- rownames(results(dds))
# Sort (arrange) by pvalue and effect size
library(knitr)
deseq.results <- deseq.results %>%
arrange(pvalue, log2FoldChange)
# Print the result table
# Let us only show significant hits
knitr::kable(deseq.results %>%
filter(pvalue < 0.05 & log2FoldChange > 1.5),
digits = 5)
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | taxon |
---|---|---|---|---|---|---|
29.20535 | 1.91205 | 0.13432 | 14.23457 | 0.00000 | 0.00000 | Clostridium difficile et rel. |
51.65152 | 3.04116 | 0.28687 | 10.60107 | 0.00000 | 0.00000 | Mitsuokella multiacida et rel. |
12.39749 | 1.83825 | 0.18531 | 9.91994 | 0.00000 | 0.00000 | Klebisiella pneumoniae et rel. |
44.16494 | 1.78333 | 0.23072 | 7.72937 | 0.00000 | 0.00000 | Megasphaera elsdenii et rel. |
66.93783 | 1.68345 | 0.25330 | 6.64609 | 0.00000 | 0.00000 | Escherichia coli et rel. |
3.63459 | 1.53142 | 0.23140 | 6.61792 | 0.00000 | 0.00000 | Weissella et rel. |
5.74035 | 3.07334 | 0.47848 | 6.42308 | 0.00000 | 0.00000 | Serratia |
0.42171 | 1.70079 | 0.47147 | 3.60743 | 0.00031 | 0.00075 | Moraxellaceae |
12.3.2 Comparison between DESeq2 and standard models
For comparison purposes, assess significances and effect sizes based on Wilcoxon test.
test.taxa <- taxa(d)
pvalue.wilcoxon <- c()
foldchange <- c()
for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
df <- data.frame(Abundance = abundances(d)[taxa,],
Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
# Calculate pvalue and effect size (difference beween log means)
pvalue.wilcoxon[[taxa]] <- wilcox.test(Abundance ~ Group, data = df)$p.value
foldchange[[taxa]] <- coef(lm(Log10_Abundance ~ Group, data = df))[[2]]
}
# Correct p-values for multiple testing
pvalue.wilcoxon.adjusted <- p.adjust(pvalue.wilcoxon)
par(mfrow = c(1,2))
plot(deseq.results$padj, pvalue.wilcoxon.adjusted,
xlab = "DESeq2 adjusted p-value",
ylab = "Wilcoxon adjusted p-value",
main = "P-value comparison")
abline(v = 0.05, h = 0.05, lty = 2)
plot(deseq.results$log2FoldChange, foldchange,
xlab = "DESeq2",
ylab = "Linear model",
main = "Effect size comparison")
abline(0,1)
For systematic comparisons between various methods for differential abundance testing, see this paper.