Microbial diversity analysis using R tools
Microbial diversity analysis using R tools
This the microbial diversity analysis tutorial (Version 2.01)
Introduction
This is a simplified version of various methods available these days to microbial ecologists. The ideology of putting all of this together is to share the information and also clarify the ‘ease’(you see we didn’t say ‘simple’) of using R-software and related packages. The analyses shown here are basic and aimed mostly at introducing the reader to commonly used packages, scripts and data analysis methods. Descision making related to different parameters will still be soley upon the user, although output generated using NG-Tax/DADA2/Deblur may need very little polishing, depending on the settings that were used to get OTUs/ASVs. OTUs i.e. Operational taxonomic units and ASVs i.e Amplicon sequence variants are terminologies that will approach used to process raw reads click here. It is advised that the reader does adequate referencing for each of the features and makes better choices on the methods, etc. during the analysis of her/his data, based on the biological question.
The main tools used here are Phyloseq, microbiome, Vegan, ggplot2, tidyverse packages.
Important Note 1:
Wisdom1 - There is no substitute for careful reading, so read the tutorial first and then start playing with it.
Wisdom2 - never skip a step or piece of text, you might need a file that was generated previously.
Wisdom 3 - If it took the authors weeks to make this tutorial don’t expect to grasp it in one afternoon. Take your time and don’t rush.
Kindly cite all the packages and tools that you have used in your analysis. Also make sure that you provide the scripts you used for analysis as supplementary material with your research article.
You can also find useful cheat sheets for R in the folder Useful “Useful R cheat sheets”.
or other simple commands on plotting or data transformation on Quick-R
============================================================================================
Here, we use data from Mock Communities (MCs) used to benchmark the NG-Tax pipeline. These MCs are synthetic communities of known composition. It represents data from 3 different Hiseq runs spread over 7 libraries with two different primers covering region V4 (F515-R806) and V5-V6 (F784-R1064) and different PCR settings (25,30 and 35 cycles and pooling of triplicate PCR reactions or a single one). More information with regards to PCR settings and generation of the data can be found here:NG-Tax. For other published test datasets you can check Qiita
Location of test data files: ./input_files
Therefore, this data will alow you to compare what you sequenced to perfect error free data and see how well your sequenced data can reproduce the theoretical composition and underlying biological signals.
Download the complete repository to your local PC
In our lab the commonly used tool for OTU picking is NG-Tax. This approach removes a lot of spurious OTUs arising from PCR and sequencing errors. Although, based on the analysis of the positive controls (Mock Communities) further filtering or data transformations might be necessary. This small tutorial aims to provide tools on which to base these decisions.
An outline of the basic steps:
- Import OTU table/Biom file, metadata, tree files into R using the Phyloseq package.
NOTE
A new function read_phyloseq
has been created in Microbiome package, please update to the latest version for easy creation of phyloseq object.
- Pre-proccessing of the phyloseq object.
- Alpha diversity analysis
- Beta diveristy analysis (unconstrained, constrained and with and without rarefaction).
- Other useful R packages.
Loading the libraries/packages
These are necessary to run the commands If you are setting up RStudio for the first time you need to install a wide range of R-packages. This can be found in the file required_packages.R
in the ./information
folder.
You may need to do some changes and this might be specific for your PC. Read the required_packages.R
file properly.
For instructions on installing different R packages click here.
For instructions on how to load different file formats into R see import data.
#Load the required packages
library(ggplot2)
library(ape)
library(plyr)
library(vegan)
library(RColorBrewer)
library(reshape2)
library(scales)
library(data.table)
library(microbiome)
library(dplyr)
library(phyloseq)
library(DT) #for interactive tables
library(microbiomeutilities)
When you do your own analysis, create a new project in RStudio.
If you have the mapping file in .txt format then open it in excel and save it as CSV(comma delimited) “.csv”. Also good to renames samples column from “#SampleID” to “SampleID”. Remove the # sign. Do not keep any special characters like #,$ and also spaces etc. use underscore “_" to seperate two words for eg. SampleID can also be written as Sample_ID.
If you have downloaded this repository, open the .Rproj and follow the step outlined below.
Open the Microbiome_tutorial_V2.rmd in RStudio and run run the code chunks one by one and analyse.
**Remember the path in windows is given as “" however for R it has to be changed to”/“. **
When you are working on your own data kindly check that:
Making a phyloseq object
This is the basis for your analyses. In this phyloseq object, information on OTUs, taxonomy, the phylogenetic tree and metadata is stored. A single object with all this information provides a very convinient way of handling data. Please remember that the metadata (i.e. mapping) file has to be in *.csv format (columns have sample attributes). Below you can see how the mapping file has been used.
For more infromation: phyloseq
Things to be done in QIIME terminal (if required): Important Note 2: If you have error in loading the biom files stating JSON or HDF5 then you need to convert it in to a JSON format.
For this, use the following command within the linux installed with biom terminal and not in R!
# biom convert -i NGTaxMerged.biom -o ngtax_json.biom --table-type "OTU table" --to-json
For more information on the biom format please click here.
Let’s begin:
1] Read the biom and mapping file
From now on, all the commands are for R.
NOTE
Update to latest version of Microbiome package to use the read_phyloseq
function. This function can be used for reading other outputs (like .shared and consensus taxonomy files from mothur) into phyloseq object.
Read input to phyloseq object
# may take anywhere between 30 seconds to 2 or more minutes to create a phyloseq object depending on the size of biom file and your PCs processing strength.
pseq1 <- read_phyloseq(otu.file = "./input_files/NGTaxMerged_conv.biom", taxonomy.file = NULL, metadata.file = "./input_files/mappingMerged_edit.csv", type = "biom")
## Time to complete depends on OTU file size
Read the tree file.
Note: requires a package called ape
and the extension has to be “.tre” and not “.tree” (you can just change the name of the file extension)
# Load tree file
library(ape)
treefile_p1 <- read.tree("./input_files/combinedTree.tre")
Merge into phyloseq object.
ps1 <-merge_phyloseq(pseq1,treefile_p1)
# ps1 is the first phyloseq object.
print(ps1)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 466 taxa and 57 samples ]
## sample_data() Sample Data: [ 57 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 466 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 466 tips and 464 internal nodes ]
The next steps will guide you through the process of understanding the noise levels in your data and provide you with reasoning to eliminate some of it.
Important Note 4
All the things you do with your data from this point on are based on your decision making. The settings of the NG-Tax pipeline that you used to get this data are optimised. However, if you have very noisy data, first re-run your data with a more strict abundance cut-off before trying to remove it yourself.
Always keep track of the filtering steps you performed and make a note of it!
The decision making starts here! For instance: Do we expect or want to include Choloropast and Mitrochondria related sequences? if NO then filter them out, if yes skip to next step using ps1 as input file. But first ask yourself, WHY do I want to remove these sequences?
# check the data
print(ps1)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 466 taxa and 57 samples ]
## sample_data() Sample Data: [ 57 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 466 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 466 tips and 464 internal nodes ]
# check if any OTUs are not present in any samples
any(taxa_sums(ps1) == 0)
## [1] FALSE
The answer is FALSE, there are no OTUs that are not found in any samples. It is usually TRUE in the case when data is subset to remove some samples. OTUs unique to those sample are not removed along with the samples. Therefore, it is important to check this everytime the phyloseq object is filtered for samples using subset_samples
function.
ps1a <- prune_taxa(taxa_sums(ps1) > 0, ps1)
# check again if any OTUs are not present in any samples
any(taxa_sums(ps1a) == 0)
## [1] FALSE
After the prune_taxa
function is run to remove OTus with zero occurances, the answer is “FALSE”.
Check how many OTUs are kept.
# subtract the number of OTUs in original (ps1) with number of OTUs in new phyloseq (ps1a) object.
# no. of OTUs in original
ntaxa(ps1)
## [1] 466
# no. of OTUs in new
ntaxa(ps1a)
## [1] 466
ntaxa(ps1) - ntaxa(ps1a)
## [1] 0
rank_names(ps1) #we check the taxonomic rank information
## [1] "Domain" "Phylum" "Class" "Order" "Family" "Genus"
datatable(tax_table(ps1)) # the table is interactive you can scrol and search thorugh it for details.
# Check the taxonomy levels
rank_names(ps1a)
## [1] "Domain" "Phylum" "Class" "Order" "Family" "Genus"
Filter phyloseq
# Use the interactive table to check under which taxonomic rank chloroplast and Mitrochondria are mentioned. Specify them Class or Order in the code below.
ps1a <- subset_taxa(ps1,Class!="c__Chloroplast")
ps1b <- subset_taxa(ps1a,Order!="o__Mitochondria")
# Second question: Do we expect or want to include the archaeal sequences? if you have used primers which do not target them, then it is a spurious observation.
#Keep in mind: The fact that you picked up these sequences with any stringing cut-off from your otupicking algorithm means that they were relatively abundant, check the confidence of the sequences and taxonomic assignments before you decide to remove them.
#If you find high abundance/high confidence sequences that in theory should not be there, maybe more things are wrong with your data. Be critical. OR, your primer did pick up these sequences.
ps1 <- subset_taxa(ps1b,Domain!="k__Archaea")
# We again name it as ps1 to avoid complicating or creating lot of ps objects.
sort(sample_sums(ps1)) #check the number of reads/sample,we can see that one of the samples has only 1867 seqs, normally we would remove it, but because it mock community data, we can check wether this sequencing depth is enough to capture the theoretical composition, so for now we will leave it in. Just pay extra attention to where this sample ends up in downstream analysis.
## Mc.3.2.l01 Mc.4.2.l01 Mc.4.1.l01
## 1867 2540 2903
## Mc.3.1.l01 Mc.3.3.l01 Mc.4.3.l01
## 2912 5526 5842
## Mc.2.3.l01 Mc.2.1.l01 Mc.1.t.l44
## 6467 7167 16000
## Mc.1.t.l56 Mc.2.2.l01 Mc.4.1.l07
## 16000 19314 22661
## Mc.4.1.l06 Mc.2.2.l07 Mc.4.1.l03
## 30878 31764 31831
## Mc.2.1.l03 Mc.4.1.l05 Mc.2.2.35cy.l02
## 34773 37011 44499
## Mc.2.2.l06 Mc.2.2.l05 Mc.2.t.l44
## 45569 49657 51000
## Mc.1.3.l01 Mc.4.1.l04 Mc.2.t.l56
## 52479 53851 54000
## Mc.2.2.l03 Mc.4.2.l07 Mc.1.2.l01
## 54088 64714 69762
## Mc.2.2.l04 Mc.4.1.l02 Mc.3.t.l44
## 76177 77281 79000
## Mc.4.2.l02 Mc.3.t.l56 Mc.4.2.l05
## 81040 82000 83515
## Mc.4.2.l06 Mc.4.t.l44 Mc.3.3.l02
## 84618 97500 98645
## Mc.2.1.l06 Mc.4.2.l03 Mc.4.t.l56
## 99411 99638 100000
## Mc.2.2.30cy.l02 Mc.2.1.l05 Mc.2.2.100v.dup.l02
## 100574 101903 104536
## Mc.3.1.l02 Mc.1.1.l01 Mc.2.1.l07
## 107261 108440 122869
## Mc.4.2.l04 Mc.2.1.l02 Mc.2.1.l04
## 123830 171787 191415
## Mc.3.2.l02 Mc.2.2.l02 Mc.2.2.dup.l02
## 196934 212613 215903
## Mc.2.2.100v.l02 Mc.1.2.l02 Mc.2.3.l02
## 228492 232876 269422
## Mc.1.3.l02 Mc.4.3.l02 Mc.1.1.l02
## 284202 288597 315845
# Nevertheless, in case you want to remove some samples because of low sequencing depth or some other technical reasons you can remove it from the analysis with the following command:
# ps1.sub <-subset_samples(ps1, #phyloseq object
# SampleID !="SampleName") #metadata category name and != mean including all expect "SampleName"
# again we type the name and hit enter
# ps1 <- ps1.sub # this is our naively filtered phyloseq file which can undergo further filtering
# If you dont have anything to filter then continue with ps1 object for rest of the analysis.
datatable(tax_table(ps1))
In our Mock Community analysis using data from a Hiseq machine, we don’t have any of the aforementioned issues and hence we will continue using “ps1” as our phyloseq object for downstream analysis.
We have the phyloseq object ready for step wise analysis.
Exploring your sequence data
Distribution of the reads over the samples
For this, we need one extra R package to handle data objects called “data.table”. We will then explore our preliminary data. The table is interactive and you can scroll and search the table.
Let’s take a look at the distribution of the reads over the samples
library("data.table")
library(ggplot2)
ps1_df= data.table(as(sample_data(ps1), "data.frame"), Reads_per_sample = sample_sums(ps1), keep.rownames = TRUE)
ps1_df_plot = ggplot(ps1_df, aes(Reads_per_sample)) + geom_histogram() + ggtitle("Distribution of reads per sample") + ylab("Sample counts")
print(ps1_df_plot) #normal plot
ggsave("./output/Distribution_of_reads_per_sample.pdf", height = 6, width = 8) # you can create a sub-folder for figures to save all files in location. here, "./output/" notice the full stop or punt (in dutch) infornt of slash "./" this tells R to go inside a folder within the exsisting working directory and save the output.
# I include this now for better managing of projects.
Some samples at the right have high number of reads. This is a way to see if you have more or less even sequencing depth. Here, the mock communities were seqeucnced at different depths. Don’t worry, differences of up to 50x were quite normal.
Now let’s check the distribution of the OTUs in our data set.
# We make a data table with information on the OTUs
ps1.dt.taxa = data.table(tax_table(ps1),OTUabundance = taxa_sums(ps1),OTU = taxa_names(ps1))
ps1.dt.tax.plot <- ggplot(ps1.dt.taxa, aes(OTUabundance)) + geom_histogram() + ggtitle("Histogram of OTU (unique sequence) counts") + theme_bw()
print(ps1.dt.tax.plot)
#write.table(ps1.dt.taxa, "ps1_dt_tax_plot.txt", sep = "\t")
ggsave("./output/Histogram of OTU counts.pdf", height = 6, width = 7)
We observe a distribution that is skewed to the left side of the plot. This means that we have some OTUs that are repeated > 300.000 times, but a large number of OTUs are repeated < 100.000 (which is of course is still high). However we see that a very large proportion of our OTUs is repeated a lot less. Remember we have 466 OTUs in total. Now it seems that these are close to zero, but let’s have a closer look. Only a small proportion of reads is repeated less than 250x
library(ggplot2)
plot.zoom <- ggplot(ps1.dt.taxa, aes(OTUabundance)) + geom_histogram() + ggtitle("Histogram of Total Counts") + xlim(0, 1000) + ylim (0,50) + theme_bw()
print(plot.zoom)
## Warning: Removed 166 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
ggsave("./output/histogram of counts_zoomed_0_to_1000.pdf", height = 6, width = 7)
## Warning: Removed 166 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
If we zoom in even further, we see that only very little OTUs contain less than 20 reads. In real case senario, if your sequencing and OTU identification did not go well you might see a skewed distribution and may indicate the need for further filtering and processing.
library(ggplot2)
plot.zoom <- ggplot(ps1.dt.taxa, aes(OTUabundance)) + geom_histogram(breaks=seq(0, 20, by =1)) + ggtitle("Histogram of Total Counts") + theme_bw()
print(plot.zoom)
ggsave("./output/histogram of counts_zoomed_0_to_20.pdf", height = 6, width = 7)
Let’s find out what these numbers actually mean with regards to the total amount of data in this set.
Read summary
#total number of reads in the dataset
reads_per_OTU <- taxa_sums(ps1)
print(sum(reads_per_OTU))
## [1] 5251399
# we have a total of 5251399 reads
#We saw in the last graph that only a small fraction of the OTUs consist of less than 10 reads, so how much OTUs are this and how many reads do they contain?
print(length(reads_per_OTU[reads_per_OTU < 10]))
## [1] 109
# there are 109 OTus that contain less than 10 reads
print(sum(reads_per_OTU[reads_per_OTU < 10]))
## [1] 512
# more importantly, these OTUs only contain 512 reads
print((sum(reads_per_OTU[reads_per_OTU < 10])/sum(reads_per_OTU))*100)
## [1] 0.009749783
# which is only 0.00974% of the data, not bad right?
# To put this into context; out of the 466 OTUs, a 109 OTUs contain less than 10 reads, which is:
print((109/466)*100)
## [1] 23.39056
# 23.4% of the OTUs. This sounds like a large number, but remember; they only contain 0.00974% of the total data! This means that OTUs with a lower confidence (ie repeated 10 times or less are only a very small part of this dataset). Check this distribution with your own dataset.
#Important!: From this we can conclude that a very small part of your data can have a significant effect on your downstream analyses, if you use methods that use OTUs instead of the information that is containd in de sequences, such as Unifrac and phylogenetic diversity.
#Some other other checks that are normally done on sequencing data. You might recognize them from literature
# Check for singletons
ps1.dt.taxa[(OTUabundance <= 0), .N] # no singletons
## [1] 0
#check for doubletons
ps1.dt.taxa[(OTUabundance <= 2), .N]
## [1] 11
# out of 466, we have 11 doubletons
print((11/466)*100)
## [1] 2.360515
print((sum(reads_per_OTU[reads_per_OTU <= 2])/sum(reads_per_OTU))*100)
## [1] 0.000418936
# only 2.3% of the OTUs are doubletons and they consist of 0.000418% of the data
What we do now is important and also needs to be documented for reproducibility. We plot the cumulative sum of taxa to make decisions for filtering.
Important Note 6 :
As is clear from the former paragraph, in most instances with NG-Tax there will not be any need to filter out OTUs!! This is just for validation of the data. Remember, if you start filtering afterwards you create a bias, because the following filtering steps ignore the distribution of your data among the samples!!
Based on how the mock community data looks like, you can don’t need to do any filtering. However, if your data is problematic We will discuss some things you can do in later sections using examples.
You can filter by abundance ie. OTUs that contain less than a set number of reads are discarded. You can plot the OTU distribution.
Filtering Threshold, Minimum Total Counts
library(ggplot2)
ps1.cumsum = ps1.dt.taxa[, .N, by = OTUabundance]
setkey(ps1.cumsum, OTUabundance)
ps1.cumsum[, CumSum := cumsum(N)]
# Define the plot
ps1.cumsum.plot = ggplot(ps1.cumsum, aes(OTUabundance, CumSum)) +
geom_point() +
xlab("Filtering Threshold, Minimum Total Counts") +
ylab("OTUs Filtered") +
ggtitle("OTUs that would be filtered vs. the minimum count threshold") + theme_bw()
print(ps1.cumsum.plot)
ggsave("./output/OTUs that would be filtered vs the minimum count threshold.pdf", height = 6, width = 7)
Zoom in on the OTUs that contain <100 reads.
This will give an idea about the loss of OTUs when you use a certain count threshold.
library(ggplot2)
ps1.cumsum.plot.zoom <- ps1.cumsum.plot + xlim(0, 100) + theme_bw() #this will give an idea of the potential loss in otus you might expect in our dataset.
print(ps1.cumsum.plot.zoom)
## Warning: Removed 280 rows containing missing values (geom_point).
ggsave("./output/OTUs that would be filtered vs the minimum count threshold_zoom_0_to_100.pdf", height = 6, width = 7)
## Warning: Removed 280 rows containing missing values (geom_point).
In the next plot we can see the variance in the reads/OTU. These plots might be informative if you have very skewed data and see patterns. Here, however we see no clear patterns. We also know that the data represents the theoretical MCs quite good, so no need to draw any conclusions. However, check whether your own data deviates from this dataset.
Variance
library(ggplot2)
Variance.plot <- qplot(log10(apply(otu_table(ps1), 1, var)), xlab = "log10(variance)", main = "Variance in OTUs")
print(Variance.plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave("./output/OTU_distribution_NG-Tax.pdf", height = 6, width = 7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
CV
Alternatively, you can also check for Coefficeint of variation of OTUs to check for spurious observations.
ps1.rel <- microbiome::transform(ps1, "compositional") # transform to relative abundance
p <- plot_taxa_cv(ps1.rel)
print(p)
As can be seen, OTUs with low abundance have relatively high Coeeficient of variation. These may affect the differential abundance testing.
Alpha diversity calculations
Note: Some alpha diversity metrics such as ACE are richness estimators, which also consider rare OTUs to estimate the total richness. However if all these rare OTUs are really just error (and the vast majority of them are) then these estimates are meaningless anyway. Thus using NG-Tax, phyloseq will give an error and will ask for “unfiltered data’. We will use the plot_richness
function from the phyloseq
package for alpha diversity analysis. However, it is important that you document this and be aware that you results will be conservative. However, if you do use unfiltered data your answers will be useless and based on the number of reads you have per sample (as you know this depends on the barcode and the amount of reads you ordered, not on your sample composition). Also, as already explained; the number of OTUs and the amount of data/reads they contain is very skewed. That is why we advise phylogenetic diversity for all diversity analyses, unless other metrics give similar results.
The real composition of the MCs is as follows, the real values will depend slightly on the sequenced region (V4 is more stable so will contain less OTUs when sequenced, while V5-V6 will result in more; if things are done properly):
MC1: 17 species, equimolar MC2: 55 species, equimolar MC3: the same 55 species as MC2, but distributed differently (lower evenness=lower diversity) MC4: 43 species, with some at 0.1,0.01 and 0.001%. Again the same type of species as in the other MCs
Try to figure out what kind of pattern of the MCs you expect based on this information.
Phyogenetic diversity
Phylogenetic diversity is calculated using the picante
package.
# if you want to specify specific colors
my_colors <- c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578", "#CD9BCD", "#AD6F3B", "#673770","#D14285", "#652926", "#C84248", "#8569D5", "#5E738F","#D1A33D", "#8A7C64", "#599861", "steelblue" )
# install.packages("picante",repos="http://R-Forge.R-project.org")
# library(picante)
library(picante)
otu_table_ps1 <- as.data.frame(ps1@otu_table)
metadata_table_ps1 <- as.data.frame(ps1@sam_data)
df.pd <- pd(t(otu_table_ps1), treefile_p1,include.root=F) # t(ou_table) transposes the table for use in picante and the tre file comes from the first code chunck we used to read tree file (see making a phyloseq object section).
datatable(df.pd)
# now we need to plot PD
# check above how to get the metadata file from phyloseq object.
# We will add the results of PD to this file and then plot.
metadata_table_ps1$Phyogenetic_diversity <- df.pd$PD
plot.pd <- ggplot(metadata_table_ps1, aes(MC_type2, Phyogenetic_diversity)) + geom_boxplot(aes(fill = Region)) + geom_point(size = 2) + theme(axis.text.x = element_text(size=14, angle = 90)) + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) + theme_bw()
print(plot.pd)
We can see that the sequenced phylogenetic diversity very accurately reproduces the theoretical diversity.
For the convenience of using other metrics, we will show the plot for several options available in Phyloseq and Microbiome packages.
Non-Phylogenetic metrics
#plot the diversity using different metrics
p <- plot_richness(ps1, "Mock_type", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
p <- p + geom_boxplot(aes(fill = "Mock_type")) + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578"))
print(p)
## Warning: Removed 49 rows containing non-finite values (stat_boxplot).
## Warning: Removed 334 rows containing missing values (geom_errorbar).
#plot the diversity using different metrics and color the points by different variable region, the plot also contains the expected values and sequenced values to compare what you sequenced to what it is supposed to be. Check with your own expectations.
p <- plot_richness(ps1, "MC_type2", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
p <- p + geom_point(aes(color = Region)) + theme_bw()
print(p)
## Warning: Removed 334 rows containing missing values (geom_errorbar).
## Warning: Removed 49 rows containing missing values (geom_point).
To investigate in more detail, we will separate our data based on 16S rRNA gene hypervariable region and plot the alpha diversity metrics.
- Only the v4 region.
#to plot the regions separately (because we saw that they give slightly different results)
#filter out the samples based on region
ps1.V4 <- subset_samples(ps1, Region=="V4")
plot.v4.Adiv <- plot_richness(ps1.V4, "MC_type2", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
plot.v4.Adiv <- plot.v4.Adiv + geom_point(aes(color = Mock_type)) + theme_bw()
print(plot.v4.Adiv)
## Warning: Removed 246 rows containing missing values (geom_errorbar).
## Warning: Removed 41 rows containing missing values (geom_point).
- Only the V5-V6 region.
ps1.V5V6 <- subset_samples(ps1, Region=="V5-V6")
plot.V5V6.Adiv <- plot_richness(ps1.V5V6, "MC_type2", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
plot.V5V6.Adiv <- plot.V5V6.Adiv + geom_point(aes(color = Mock_type))
As you can see only Simpson/invSimpson reproduces the biological order of the theoretical MCs (i.e perfect data without sequencing/PCR error). Therefore, most of these metrics don’t show the underlying biological conclusions, but are based on other (non-biological) parameters. UNLESS the difference between your biological samples/groups was very large to begin with (such ~3x higher richness as in MC1 compared to the other MC).
However, when we use a metric such as phylogeny diversity or Faith’s diversity index we reproduce the biological order of the theoretical samples, with both regions. Therefore this metric is preferred.
Important note 6 :
Always challenge your data: Use phylogenetic diversity, check the other metrics and see in what ways they give similar results (or not!) and try to deduce why you observe these things.
Others
For more diversity indices please refer to Microbiome Package
Test of significance
Next we can test whether there are significant differences between the MC types
# We can check whether there are any significant differences in alpha diversity between the sample groups we are interested in.
ps1.adiv <- estimate_richness(ps1, measures = c("Chao1", "Shannon", "Observed", "InvSimpson"))
## Warning in estimate_richness(ps1, measures = c("Chao1", "Shannon", "Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
ps1.metadata <- as(sample_data(ps1), "data.frame")
head(ps1.metadata)
## BarcodeSequence LibraryNumber Direction
## Mc.1.1.l02 CTGCTGAA 2 p
## Mc.1.2.l02 GCCTTAAG 2 p
## Mc.1.3.l02 CTTGGCCT 2 p
## Mc.2.1.l02 GAACCGTT 2 p
## Mc.2.2.l02 GAACGTAT 2 p
## Mc.2.3.l02 GAACTAAG 2 p
## LibraryName
## Mc.1.1.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.1.2.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.1.3.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.2.1.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.2.2.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.2.3.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Facet MC_type1 MC_type2 ProjectName Region Mock_type
## Mc.1.1.l02 Mc_1 Mc_1 Mc_1 Mock_communities V4 Mc_1
## Mc.1.2.l02 Mc_1 Mc_1 Mc_1 Mock_communities V4 Mc_1
## Mc.1.3.l02 Mc_1 Mc_1 Mc_1 Mock_communities V4 Mc_1
## Mc.2.1.l02 Mc_2 Mc_2 Mc_2 Mock_communities V4 Mc_2
## Mc.2.2.l02 Mc_2 Mc_2 Mc_2 Mock_communities V4 Mc_2
## Mc.2.3.l02 Mc_2 Mc_2 Mc_2 Mock_communities V4 Mc_2
## Mock_replicate BarcodeNumber Description MockTypeGroup
## Mc.1.1.l02 Mc_1_1 33 Mc_1_1_l02 Mc_1
## Mc.1.2.l02 Mc_1_2 45 Mc_1_2_l02 Mc_1
## Mc.1.3.l02 Mc_1_3 35 Mc_1_3_l02 Mc_1
## Mc.2.1.l02 Mc_2_1 36 Mc_2_1_l02 Mc_2
## Mc.2.2.l02 Mc_2_2 37 Mc_2_2_l02 Mc_2
## Mc.2.3.l02 Mc_2_3 38 Mc_2_3_l02 Mc_2
#We add the columns with the alpha-div indices from ps3.adiv to ps3.adiv.metadata where all info including metadata is now available.
ps1.metadata$Observed <- ps1.adiv$Observed
ps1.metadata$Shannon <- ps1.adiv$Shannon
ps1.metadata$InvSimpson <- ps1.adiv$InvSimpson
ps1.metadata$Phyogenetic_diversity <- df.pd$PD # we also add Phylogenetic diversity values from previous code for analysis.
colnames(ps1.metadata) #check the last three coloumns will be the alpha diversity indices
## [1] "BarcodeSequence" "LibraryNumber"
## [3] "Direction" "LibraryName"
## [5] "Facet" "MC_type1"
## [7] "MC_type2" "ProjectName"
## [9] "Region" "Mock_type"
## [11] "Mock_replicate" "BarcodeNumber"
## [13] "Description" "MockTypeGroup"
## [15] "Observed" "Shannon"
## [17] "InvSimpson" "Phyogenetic_diversity"
For tutorial on doing statistics on diversity values please check the microbiome R package website click here.
Classification in NG-Tax
It is also useful to mark the unclassified OTUs at Domain and Phylum level. You can also do it at any taxonomic level, especially when you want to plot composition plots, but only when you know what you are doing (!) because “NA” or missing levels do not automatically mean “Unidentified” or “Unknown”. Remember that you can and maybe need to manually reclassify unidentified sequences. IMPORTANT: NG-Tax employs relatively strict cutoffs for classification (high confidence in correct classification) and defaults to “NA”, when there are too many mismatches with a sequence in the database. It is basically open reference OTU picking so If you have reasons, which are backed by biology, to reclassified OTUs that are classified as “NA”, do so.
# If there are NA's in you taxonomy table then do this
tax_table(ps1)[is.na(tax_table(ps1)[,1])] <- "k__"
tax_table(ps1)[is.na(tax_table(ps1)[,2])] <- "p__"
tax_table(ps1)[is.na(tax_table(ps1)[,3])] <- "c__"
tax_table(ps1)[is.na(tax_table(ps1)[,4])] <- "o__"
tax_table(ps1)[is.na(tax_table(ps1)[,5])] <- "f__"
tax_table(ps1)[is.na(tax_table(ps1)[,6])] <- "g__"
# Then you can convert those unclassified such as p__ without phylum name to this or skip this part
tax_table(ps1)[tax_table(ps1)[,"Phylum"]== "p__", "Phylum" ] <- "Unknown_Phylum"
NG-Tax assigns a name to an OTU using the SILVA database as a reference. Quite often these sequences are not fully classified to genus level in this database, depending on the type of organism. If this is the case, NG-Tax reports the name as “< empty >”. This means that this organism could not be classified further using the SILVA alignment. This is different from the cases where NG-Tax cannot classify a sequence because it cannot distinguish it from another one with high confidence and therefore goes down a rank. Example: a sequence finds a genus level match for: Escherichia/Shigella and Enterobacter in the database. Using short reads these two cannot be distinguished therefore NG-Tax classifies it one rank lower to family: Enterobacter. In this case no “< empty >” will be added to the classification, because this is due to lack of information in the sequence.
This extra information can be very useful in specific cases such as poorly studied/covered ecosystems or when you are interested in very distinct micro-organisms, but cannot find them back in your data.
Also, a sequence containing “< empty >” is very distinct from one that doesn’t. Keep this in mind when analyzing your data and check for taxa that are distinctive in your intervention of ecosystems.
Another way to detect classifications with a confidence of < 1 (meaning not all the sequences in the database that have a hit are classified to the same taxon) is the option in NG-Tax to show these using “*~“. This option is off by default.
Nevertheless when plotting the composition or any other plot where the taxonomic lines are shown, especially for publication, it might be useful to remove these for clearer plots.
#Some of the taxonomy levels may include "<empty>" entries. If you want to remove them for plotting then run the following code.
ps1.com <- ps1 # create a new pseq object
tax.mat <- tax_table(ps1.com)
tax.df <- as.data.frame(tax.mat)
tax.df[tax.df == "__*~"] <- "__"
tax.df[tax.df == "g__<empty>"] <- "g__"
tax.df[tax.df == "k__<empty>"] <- "k__"
tax.df[tax.df == "p__<empty>"] <- "p__"
tax.df[tax.df == "c__<empty>"] <- "c__"
tax.df[tax.df == "o__<empty>"] <- "o__"
tax.df[tax.df == "f__<empty>"] <- "f__"
tax_table(ps1.com) <- tax_table(as.matrix(tax.df))
Plot composition
Important note 7 :
There are different options for doing all the following analyses. You can use the separate OTUs for finer detail. Just remember they are unique sequences, not necessarily ‘species’. For this purpose, you need to use the ps1 object and then do:
Barplot composition
Barplots are an easy and intuitive way of visualizing the composition of your samples. However, the way this is implemented in phyloseq causes problems with the order of the taxa in the legend at higher taxonomic levels.
# We need to set Palette
taxic <- as.data.frame(ps1.com@tax_table) # this will help in setting large color options
colourCount = length(unique(taxic$Family)) #define number of variable colors based on number of Family (change the level accordingly to phylum/class/order)
getPalette = colorRampPalette(brewer.pal(12, "Paired")) # change the palette as well as the number of colors will change according to palette.
otu.df <- as.data.frame(otu_table(ps1.com)) # make a dataframe for OTU information.
# head(otu.df) # check the rows and columns
taxic$OTU <- row.names.data.frame(otu.df) # Add the OTU ids from OTU table into the taxa table at the end.
colnames(taxic) # You can see that we now have extra taxonomy levels.
## [1] "Domain" "Phylum" "Class" "Order" "Family" "Genus" "OTU"
library(knitr)
head(kable(taxic)) # check the table.
## [1] " Domain Phylum Class Order Family Genus OTU "
## [2] "-------- ------------ ------------------- ----------------------- ---------------------- ------------------------- ------------------------ --------"
## [3] "5656206 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacteriales f__Enterobacteriaceae g__ 5656206 "
## [4] "5656109 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacteriales f__Enterobacteriaceae g__ 5656109 "
## [5] "4444188 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacteriales f__Enterobacteriaceae g__ 4444188 "
## [6] "4444187 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacteriales f__Enterobacteriaceae g__ 4444187 "
taxmat <- as.matrix(taxic) # convert it into a matrix.
new.tax <- tax_table(taxmat) # convert into phyloseq compatible file.
tax_table(ps1.com) <- new.tax # incroporate into phyloseq Object
# now edit the unclassified taxa
tax_table(ps1.com)[tax_table(ps1.com)[, "Family"] == "f__", "Family"] <- "Unclassified family"
# We will also remove the 'f__' patterns for cleaner labels
tax_table(ps1.com)[, colnames(tax_table(ps1.com))] <- gsub(tax_table(ps1.com)[,
colnames(tax_table(ps1.com))], pattern = "[a-z]__", replacement = "")
# it would be nice to have the Taxonomic names in italics.
# for that we set this
guide_italics <- guides(fill = guide_legend(label.theme = element_text(size = 15,
face = "italic", colour = "Black", angle = 0)))
## Now we need to plot at family level, We can do it as follows:
# first remove the phy_tree
ps1.com@phy_tree <- NULL
# Second merge at family level
ps1.com.fam <- aggregate_taxa(ps1.com, "Family")
plot.composition.COuntAbun <- plot_composition(ps1.com.fam) + theme(legend.position = "bottom") +
scale_fill_manual("Family", values = getPalette(colourCount)) + theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Relative abundance") + guide_italics + theme(legend.title = element_text(size=18))
plot.composition.COuntAbun
ggsave("./output/Family_barplot_CountAbundance.pdf", height = 6, width = 8)
This plot is based on the reads/sample. You can see the reads are not evenly distributed over the samples, nevertheless their overall composition seems to be the same. The only thing that is different is the scaling. You don’t need any other normalization algorithms. To check this in the next step we plot the relative abundance.
Make it relative abundance
# the previous pseq object ps1.com.fam is only counts.
# Use transform function of microbiome to convert it to rel abun.
ps1.com.fam.rel <- microbiome::transform(ps1.com.fam, "compositional")
plot.composition.relAbun <- plot_composition(ps1.com.fam.rel) + theme(legend.position = "bottom") +
scale_fill_manual("Family", values = getPalette(colourCount)) + theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Relative abundance") + guide_italics + theme(legend.title = element_text(size=18))
plot.composition.relAbun
ggsave("./output/Family_barplot_RelAbundance.pdf", height = 6, width = 8)
For more information Microbiome tutorial
In the legends of these composition plots now consists of the taxonomic labels.
Beta diversity metrics
This approach, together with alpha-diversity is very sensitive to spurious otus, lots of zeros and skewed distribution of counts. Hence, different people have different preferences, such as rarefaction or normalization algorithms. Rarefaction is a common treatment, however for NG-Tax data this is unnecessary even though the sequencing depth ranges from 1867 to 315845 reads per sample, therefore we can just transform the data into relative abundance. In case you want to still rarefy your data, (REMEMBER! you throw away >90% of your data!) below are the commands for different ordination methods.
For more information:
Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.
Normalisation and data transformation.
What is Constrained and Unconstrained Ordination.
Unconstrained
Without rarefaction
ps1.rel <- transform(ps1, "relative.abundance")
set.seed(49275) #set seed for reproducible rooting of the tree
ordu.wt.uni = ordinate(ps1.rel, "PCoA", "wunifrac")
## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- 444439 -- in the phylogenetic tree in the data you provided.
scree.plot <- plot_scree(ordu.wt.uni, "Check for importance of axis in Scree plot for MCs, UniFrac/PCoA")
print(scree.plot)
wt.unifrac <- plot_ordination(ps1.rel, ordu.wt.uni,
color = "MC_type2",
shape = "Region")
wt.unifrac <- wt.unifrac + scale_fill_manual(values = c("#CBD588",
"#5F7FC7", "orange", "#DA5724", "#508578")) +
ggtitle("Weighted UniFrac relative abundance") +
geom_point(size = 3)
print(wt.unifrac)
set.seed(475) #set seed for reproducible rooting of the tree
ordu.unwt.uni = ordinate(ps1.rel, "PCoA", "unifrac", weighted = F)
## Warning in UniFrac(physeq, ...): Randomly assigning root as -- 5656250 --
## in the phylogenetic tree in the data you provided.
unwt.unifrac <- plot_ordination(ps1.rel, ordu.unwt.uni,
color = "MC_type2",
shape = "Region") +
ggtitle("Unweighted UniFrac relative abundance") +
geom_point(size = 3)
print(unwt.unifrac)
# Bray Curtis dissimilarity (purely OTU based, doesn't take
# the sequence of the OTUs into account)
ordu.bray = ordinate(ps1.rel, "PCoA", "bray", weighted = F)
bray <- plot_ordination(ps1.rel, ordu.bray,
color = "Mock_type",
shape = "Region")
bray <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7",
"orange", "#DA5724", "#508578")) +
ggtitle("Bray-Curtis relative abundance V4 and V5-V6") +
geom_point(size = 3) + theme_bw()
print(bray)
With rarefaction
#with rarefaction
set.seed(09809) # alway set a random number seed for reproducibility of the rarefaction
ps1.rar <- rarefy_even_depth(ps1)
## You set `rngseed` to FALSE. Make sure you've set & recorded
## the random seed of your session for reproducibility.
## See `?set.seed`
## ...
## 17OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
head(sample_sums(ps1.rar)) # check whether it is rarefied
## Mc.1.1.l02 Mc.1.2.l02 Mc.1.3.l02 Mc.2.1.l02 Mc.2.2.l02 Mc.2.3.l02
## 1867 1867 1867 1867 1867 1867
set.seed(999)
ordu.wt.uni = ordinate(ps1.rar, "PCoA", "wunifrac")
## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- 4444137 -- in the phylogenetic tree in the data you provided.
wt.unifrac.rar <- plot_ordination(ps1.rar, ordu.wt.uni,
color="MC_type2",
shape="Region")
wt.unifrac.rar <- wt.unifrac.rar + scale_fill_manual(values = c("MC_type2")) +
ggtitle("Weighted UniFrac rarefied to 1867 sequences per sample") +
geom_point(size = 3)
print(wt.unifrac.rar)
set.seed(949)
ordu.unwt.uni = ordinate(ps1.rar, "PCoA", "unifrac", weighted=F)
## Warning in UniFrac(physeq, ...): Randomly assigning root as -- 5656256 --
## in the phylogenetic tree in the data you provided.
unwt.unifrac.rare <- plot_ordination(ps1.rar, ordu.unwt.uni, color="MC_type2", shape="Region") +
ggtitle("Unweighted UniFrac rarefied to 1867 sequences per sample") +
geom_point(size = 3)
print(unwt.unifrac.rare)
#Bray Curtis dissimilarity (purely OTU based, doesn't take the sequence of the OTUs into account)
ordu.bray = ordinate(ps1.rar, "PCoA", "bray", weighted=F)
bray.rar <- plot_ordination(ps1.rel, ordu.bray, color="Mock_type", shape="Region")
bray.rar <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) +
ggtitle("Bray-Curtis rarefied to 1867 sequences per sample, region V4 and V5-V6") +
geom_point(size = 3)
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
print(bray.rar)
You can see rarefaction is unnecessary, furthermore it seems that separation of the MCs is even better without rarefaction as you retain minor signals.
Bray-Curtis based dissimilarity using separate regions
Because Bray-Curtis based dissimilarity seems to differentiate between the two regions instead of the biological difference (i.e. MC type) we can take look at a single region.
ps1V4.rel <- transform_sample_counts(ps1.V4, function(x) x / sum(x) )
ordu.brayV4 = ordinate(ps1V4.rel, "PCoA", "bray")
bray <- plot_ordination(ps1V4.rel, ordu.brayV4,
color="MC_type2",
shape="Region")
bray <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) +
ggtitle("Bray-Curtis relative abundance V4") +
geom_point(size = 3)
print(bray)
ps1.V5V6.rel <- transform_sample_counts(ps1.V5V6, function(x) x / sum(x) )
ordu.brayV5V6 = ordinate(ps1.V5V6.rel, "PCoA", "bray")
bray <- plot_ordination(ps1.V5V6.rel, ordu.brayV5V6,
color="MC_type2",
shape="Shape")
## Warning in plot_ordination(ps1.V5V6.rel, ordu.brayV5V6, color =
## "MC_type2", : Shape variable was not found in the available data you
## provided.No shape mapped.
bray <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) +
ggtitle("Bray-Curtis relative abundance V5V6") +
geom_point(size = 3) + theme_bw()
print(bray)
From these plots we can conclude that methods that use information contained in the sequence, like Unifrac, are more powerful and less dependent on small (sequencing/PCR) errors.
More on Adonis test or GUSTA ME
metadf <- data.frame(sample_data(ps1.V5V6.rel))
set.seed(28567)
unifrac.dist <- UniFrac(ps1.V5V6.rel,
weighted = TRUE,
normalized = TRUE,
parallel = FALSE,
fast = TRUE)
## Warning in UniFrac(ps1.V5V6.rel, weighted = TRUE, normalized = TRUE,
## parallel = FALSE, : Randomly assigning root as -- 5656234 -- in the
## phylogenetic tree in the data you provided.
# Adonis test
adonis(unifrac.dist ~ Facet, data = metadf)
##
## Call:
## adonis(formula = unifrac.dist ~ Facet, data = metadf)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## Facet 4 0.154051 0.038513 9.4929 0.77538 0.001 ***
## Residuals 11 0.044627 0.004057 0.22462
## Total 15 0.198678 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Facet has a significant effect on the community.
But is it really possible or is it a chance event.
Check beta dispersity
ps.disper <- betadisper(unifrac.dist, metadf$Facet)
permutest(ps.disper)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 4 0.0054760 0.00136901 1.9121 999 0.179
## Residuals 11 0.0078755 0.00071596
The dispersity is not significant means our previous adonis results are good.
Constrained
[Partial] Constrained Analysis of Principal Coordinates (CAPSCALE) or distance-based RDA, via capscale. See capscale.phyloseq for more details. In particular, a formula argument must be provided.
# CAP ordinate using Bray Curtis dissimilarity only looking at V4
set.seed(23234)
ps1V4.rel_bray <- phyloseq::distance(ps1V4.rel, method = "bray") # CAP ordinate
cap_ord <- ordinate(physeq = ps1V4.rel,
method = "CAP",
distance = ps1V4.rel_bray,
formula = ~ Mock_type)
# chech which asix are explaining how mauch variation
scree.cap <- plot_scree(cap_ord, "Scree Plot for MCs in Constrained Analysis of Principal Coordinates (CAPSCALE)")
print(scree.cap)
# CAP plot
cap_plot <- plot_ordination(physeq = ps1V4.rel,
ordination = cap_ord,
color = "Mock_type",
shape = "Region",
axes = c(1,2)) +
geom_point(aes(colour = Mock_type), size = 3) +
geom_point(size = 3) +
scale_color_manual(
values = c( "#CBD588", "#5F7FC7", "orange","#DA5724", "#508578"))
cap_plot + ggtitle("CAP_Plot") + theme_bw()
ggsave("./output/CAP_plot.pdf", height = 8, width = 10)
# Now add the environmental variables as arrows
arrowmat <- vegan::scores(cap_ord, display = "bp")
# Add labels, make a data.frame
arrowdf <- data.frame(labels = rownames(arrowmat), arrowmat)
# Define the arrow aesthetic mapping
arrow_map <- aes(xend = CAP1,
yend = CAP2,
x = 0, y = 0,
shape = NULL,
color = NULL,
label = labels)
label_map <- aes(x = 1.3 * CAP1, y = 1.3 * CAP2,
shape = NULL,
color = NULL,
label = labels)
arrowhead = arrow(length = unit(0.02, "npc"))
##now plot the arrow
cap_plot <- cap_plot + geom_segment(mapping = arrow_map, size = .7,
data = arrowdf, color = "black",
arrow = arrowhead) +
geom_text(mapping = label_map, size = 4,
data = arrowdf,
show.legend = TRUE) + ggtitle("CAP_Plot") + theme_bw()
## Warning: Ignoring unknown aesthetics: shape, label
## Warning: Ignoring unknown aesthetics: shape
print(cap_plot)
ggsave("./output/CAP_plot_arrows.pdf", height = 8, width = 10)
#further editing can be done using any of the image processing tools to easily clarify labels.
Testing for significance in beta diversity
##statistical testing
set.seed(19743)
cap_anova <- anova(cap_ord, by="terms", permu=999) # we test the impact of the environmental variable seperately. Kindly check for these functions in the help section on the right side.
print(cap_anova)
## Permutation test for capscale under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## Model: capscale(formula = distance ~ Mock_type, data = data)
## Df SumOfSqs F Pr(>F)
## Mock_type 3 2.37632 48.212 0.001 ***
## Residual 37 0.60789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
set.seed(19773)
#we see that Mock type has a significant impact on the beta diversity using only region V4. (Remember; for Bray Curtis the sequenced region was more important)
#testing various factors (increase the permutation for real data)
anova(cap_ord) ## Global test of the significance of the analysis.
## Permutation test for capscale under reduced model
## Permutation: free
## Number of permutations: 999
##
## Model: capscale(formula = distance ~ Mock_type, data = data)
## Df SumOfSqs F Pr(>F)
## Model 3 2.37632 48.212 0.001 ***
## Residual 37 0.60789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# calculate R squared values
RsquareAdj (cap_ord)
## $r.squared
## [1] 0.7962975
##
## $adj.r.squared
## [1] 0.779781
Check this link for more possibilities in ordination analysis
Additional examples
Utilities/handy commands
# Saving the individuals objects within phyloseq as *.csv
write_phyloseq(ps1V4.rel, "OTU", path = "./output")
## Writing OTU in the file ./output/otu_table.csv
## [1] "./output"
write_phyloseq(ps1V4.rel, "TAXONOMY", path = "./output")
## Writing TAXONOMY in the file ./output/taxonomy_table.csv
## [1] "./output"
write_phyloseq(ps1V4.rel, "METADATA", path = "./output")
## Writing METADATA in the file ./output/metadata_table.csv
## [1] "./output"
# select only a subset of samples_P1
# as an example, only samples and other data for theoretical will be selected from original phyloseq object "ps1"
ps1.subset.theoretical <- subset_samples(ps1, Facet == "Theoretical") # notice the == sign
print(ps1.subset.theoretical)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 466 taxa and 8 samples ]
## sample_data() Sample Data: [ 8 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 466 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 466 tips and 464 internal nodes ]
# you can see that we have only 8 samples and no samples from mock community.
# select only a subset of samples
# as an example, we will now remove all information from theoretical samples from original phyloseq object "ps1"
ps1.subset.no.theoretical <- subset_samples(ps1, Facet != "Theoretical") # notice the != sign
print(ps1.subset.no.theoretical)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 466 taxa and 49 samples ]
## sample_data() Sample Data: [ 49 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 466 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 466 tips and 464 internal nodes ]
# you can see that we have 49 samples and no samples from theoretical.
# get only genus level information
ps1.genus <- tax_glom(ps1, taxrank = "Genus")
ps1.df <- psmelt(ps1.genus)
# Code : write.csv(ps1.df, file='otus_genus_metadata.csv')
Aggregating your OTUs to a specifc taxonomic level
This is somewhat similar to summarize_taxa
function in QIIME but this one uses phylogenetic tree for merging OTUs. So much better because the reference databases are not perfect (i.e reads that are close in sequence but for some reason are classified differently will be more separate based on their name than the sequence actually would show).
tax_glom is good but we might also have data where a lot of OTUs are unassigned. In such scenarios, we can use tip_glom(), which uses the phylogenetic tree to merge closely related otus.
ps1.tip.glom <- tip_glom(ps1, h = 0.04) # h = "X" is an iterative process
# Using a value of 0.04 as the height where the tree should be cut, we get close to the number of genus we expect in our mock community. But again, to get a good feeling for your data play with these settings and see whether it is appropriate for your data.
p.tree <- plot_tree(ps1.tip.glom,
label.tips="taxa_names",
color = "Facet",
title="tree")
print(p.tree)
# ps1.tip.glom can used for comparative analysis at the genus level.
Dendrograms
# install ggdendro from CRAN or using the install button on the right panel.
library(ggplot2)
library(ggdendro)
clustering_samples_dendeogram <- hclust(phyloseq::distance(ps1, method="Unifrac"))
dendeogram.plot <- ggdendrogram(clustering_samples_dendeogram,
rotate = FALSE, size = 2) +
ggtitle("UniFrac distance based hierarchical clustering")
print(dendeogram.plot)
ggsave("./output/dendrogram.pdf", height = 8, width = 18)
Heatmaps
heat.sample <- plot_taxa_heatmap(ps1, subset.top = 10,
VariableA = "MC_type2",
heatcolors = brewer.pal(10, "Blues"),
transformation = "compositional")
## Top 10 OTUs selected
## First converted to compositional
## then top taxa were selected
## Warning in brewer.pal(10, "Blues"): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors
Core microbiota
ps1.rel <- microbiome::transform(ps1, "compositional")
ps.rel.f <- format_to_besthit(ps1.rel)
#Set different detection levels and prevalence
prevalences <- seq(.5, 1, .5) #0.5 = 95% prevalence
detections <- 10^seq(log10(1e-3), log10(.2), length = 10)
#(1e-3) = 0.001% abundance; change "-3" to -2 to increase to 0.01%
p <- plot_core(ps.rel.f, plot.type = "heatmap",
colours = rev(brewer.pal(10, "Spectral")),
min.prevalence = 0.6,
prevalences = prevalences,
detections = detections) +
xlab("Detection Threshold (Relative Abundance (%))")
print(p)
Selecting taxa to plot
Let us plot only the top OTUs
ps1.f <- format_to_besthit(ps1)
top_otu <- top_taxa(ps1.f, 10)
print(top_otu)
## [1] "OTU-444469:f__Enterobacteriaceae" "OTU-44441:Bacteroides"
## [3] "OTU-44442:Alistipes" "OTU-44440:Faecalibacterium"
## [5] "OTU-44447:f__Lachnospiraceae" "OTU-44446:f__Enterobacteriaceae"
## [7] "OTU-44443:Akkermansia" "OTU-444471:f__Enterobacteriaceae"
## [9] "OTU-444470:Bacteroides" "OTU-444473:Fusobacterium"
The function below will plot relative abundance of the top 10 OTUs that are specified in top_otu
.
p <- plot_select_taxa(ps1.f, select.taxa= top_otu, "Region", "Paired", plot.type = "stripchart")
## An additonal column Sam_rep with sample names is created for reference purpose
print(p)
Use of other useful R packages
Metagenomeseq:
Information can be found here
Metagenomeseq
Metagenomeseq paper
Also check the following website
This is another good package that can be used for a variety of analyses. The examples shown here are easy ones and you will need to play around with settings as per your requirements. Also with the legends in the figures. This is just an example without specifics. WE want to show how you can switch between packages for various analysis.
library(metagenomeSeq)
# convert phyloseq object "ps1" to "mseq1"
mseq1 <- phyloseq_to_metagenomeSeq(ps1)
# identify and set the variable you want to investigate, in this case it is the mock community type.
Type.community = pData(mseq1)$Facet
heatmapColColors = brewer.pal(12, "Set3")[as.integer(factor(Type.community))] # change Set3 to any color combination you wish.
# R color brewer options click here https://www.r-bloggers.com/choosing-colour-palettes-part-ii-educated-choices/
heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50) #color for heatmap
plotMRheatmap(obj = mseq1, n = 200, cexRow = 0.4, cexCol = 0.4,
trace = "none", col = heatmapCols, ColSideColors = heatmapColColors)
# there is a small change on how we save the file from this one
pdf("./output/heatmap.pdf", height = 8, width = 18) # create an empty file
plotMRheatmap(obj = mseq1, n = 200,
cexRow = 0.4,
cexCol = 0.4,
trace = "none",
col = heatmapCols,
ColSideColors = heatmapColColors) # plot the figure
title(main = "This is only example figure!!")
dev.off() # close the file and autosave
## png
## 2
OTU prevalence
This will help in identifying the distribution of OTUs with taxonomic information. This will also help you to evaluate the composition of different groups that you are interested in (such as environment or healthy/disease), while retaining information of individual samples.
ps1.prev<- ps1
tax_table(ps1.prev)[tax_table(ps1.prev)[,"Family"]== "f__", "Family" ] <- "f__Unclassified Family"
# We will also remove the "f__" patterns for cleaner labels
tax_table(ps1.prev)[,colnames(tax_table(ps1.prev))] <- gsub(tax_table(ps1.prev)[,colnames(tax_table(ps1.prev))],pattern="[a-z]__",replacement="")
p.prev.plot <- plot_taxa_prevalence(ps1.prev, 'Family')
plot(p.prev.plot)
Useful website for ggplot2 basics. eg. how to add p-values in plots, etc.
Add P-values and Significance Levels to ggplots
We acknowledge the work of and tools created by:
Paul J. McMurdie and colleagues
Leo Lahti
Index
This is useful for beginners and provides a link to which function belongs to which package. If you face issues then you can specifically look at the package websites for more information.
Functions :: Package
read_phyloseq
:: microbiome
read.tree
:: ape
merge_phyloseq
:: phyloseq
subset_taxa
:: phyloseq
data.table
:: data.table
ggplot
:: ggplot2
taxa_sums
:: phyloseq
qplot
:: ggplot2
ggsave
:: ggplot2
pd
:: picante
plot_richness
:: phyloseq
estimate_richness
:: phyloseq
subset_samples
:: phyloseq
kruskal.test
:: stats
pairwise.wilcox.test
:: stats
plot_composition
:: microbiome
transform
:: microbiome
ordinate
:: phyloseq
plot_ordination
:: phyloseq
rarefy_even_depth
:: phyloseq
anova
:: stats
tax_glom
:: phyloseq
psmelt
:: phyloseq
hclust
:: stats
ggdendrogram
:: ggdendro
phyloseq_to_metagenomeSeq
:: phyloseq
pData
:: metagenomeseq
colorRampPalette
:: grDevices
plotMRheatmap
:: metagenomeseq
plot_taxa_prevalence
:: microbiome
plot_taxa_cv
:: microbiomeutilities
plot_taxa_heatmap
:: microbiomeutilities
Kindly cite this work as follows: “Leo Lahti, Sudarshan Shetty et al. (2017). Tools for microbiome analysis in R. Version 1.1.2. URL: http://microbiome.github.com/microbiome. Check also the relevant references listed in the manual page of each function.
The tutorial utilizes tools from a number of other R extensions, including dplyr (Wickham, Francois, Henry, et al., 2017), ggplot2 (Wickham, 2009), phyloseq (McMurdie and Holmes, 2013), tidyr (Wickham, 2017), vegan (Oksanen, Blanchet, Friendly, et al., 2017).
This website theme was created by modifiying the rmdformats readthedown format.
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
## [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.1252
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] metagenomeSeq_1.20.0 glmnet_2.0-13
## [3] foreach_1.4.3 Matrix_1.2-12
## [5] limma_3.34.3 Biobase_2.38.0
## [7] BiocGenerics_0.24.0 ggdendro_0.1-20
## [9] bindrcpp_0.2 knitr_1.18
## [11] picante_1.6-2 nlme_3.1-131
## [13] microbiomeutilities_0.99.0 DT_0.2
## [15] dplyr_0.7.4 microbiome_1.1.10008
## [17] phyloseq_1.22.3 data.table_1.10.4-3
## [19] scales_0.5.0 reshape2_1.4.3
## [21] RColorBrewer_1.1-2 vegan_2.4-6
## [23] lattice_0.20-35 permute_0.9-4
## [25] plyr_1.8.4 ape_5.0
## [27] ggplot2_2.2.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 matrixStats_0.52.2 rprojroot_1.3-1
## [4] tools_3.4.1 backports_1.1.1 R6_2.2.2
## [7] KernSmooth_2.23-15 lazyeval_0.2.1 mgcv_1.8-22
## [10] questionr_0.6.2 colorspace_1.3-2 ade4_1.7-8
## [13] tidyselect_0.2.3 gridExtra_2.3 compiler_3.4.1
## [16] labeling_0.3 bookdown_0.5 caTools_1.17.1
## [19] stringr_1.3.0 digest_0.6.12 rmarkdown_1.8.5
## [22] XVector_0.18.0 pkgconfig_2.0.1 htmltools_0.3.6
## [25] highr_0.6 htmlwidgets_1.0 rlang_0.2.0
## [28] rstudioapi_0.7 shiny_1.0.5 bindr_0.1
## [31] jsonlite_1.5 gtools_3.5.0 magrittr_1.5
## [34] biomformat_1.6.0 Rcpp_0.12.15 munsell_0.4.3
## [37] S4Vectors_0.16.0 viridis_0.5.0 stringi_1.1.6
## [40] yaml_2.1.16 MASS_7.3-47 zlibbioc_1.24.0
## [43] rhdf5_2.22.0 gplots_3.0.1 grid_3.4.1
## [46] gdata_2.18.0 ggrepel_0.7.0 miniUI_0.1.1
## [49] Biostrings_2.46.0 splines_3.4.1 multtest_2.34.0
## [52] pillar_1.1.0 igraph_1.1.2 ggpubr_0.1.6
## [55] codetools_0.2-15 stats4_3.4.1 glue_1.2.0
## [58] evaluate_0.10.1 rmdformats_0.3.3 httpuv_1.3.5
## [61] gtable_0.2.0 purrr_0.2.4 tidyr_0.8.0
## [64] assertthat_0.2.0 mime_0.5 xtable_1.8-2
## [67] survival_2.41-3 viridisLite_0.3.0 tibble_1.4.2
## [70] pheatmap_1.0.8 iterators_1.0.8 IRanges_2.12.0
## [73] cluster_2.0.6