Microbial diversity analysis using R tools

Introduction
- An outline of the basic steps:
Loading the libraries/packages
Making a phyloseq object
Exploring your sequence data
Alpha diversity calculations
Test of significance
- Classification in NG-Tax
Plot composition
- Barplot composition
Beta diversity metrics
- Unconstrained
- Constrained
Additional examples
References
Index

This the microbial diversity analysis tutorial (Version 2.01)

Introduction

This is a simplified version of various methods available these days to microbial ecologists. The ideology of putting all of this together is to share the information and also clarify the ‘ease’(you see we didn’t say ‘simple’) of using R-software and related packages. The analyses shown here are basic and aimed mostly at introducing the reader to commonly used packages, scripts and data analysis methods. Descision making related to different parameters will still be soley upon the user, although output generated using NG-Tax/DADA2/Deblur may need very little polishing, depending on the settings that were used to get OTUs/ASVs. OTUs i.e. Operational taxonomic units and ASVs i.e Amplicon sequence variants are terminologies that will approach used to process raw reads click here. It is advised that the reader does adequate referencing for each of the features and makes better choices on the methods, etc. during the analysis of her/his data, based on the biological question.

The main tools used here are Phyloseq, microbiome, Vegan, ggplot2, tidyverse packages.

Important Note 1:

Wisdom1 - There is no substitute for careful reading, so read the tutorial first and then start playing with it.
Wisdom2 - never skip a step or piece of text, you might need a file that was generated previously.
Wisdom 3 - If it took the authors weeks to make this tutorial don’t expect to grasp it in one afternoon. Take your time and don’t rush.

Kindly cite all the packages and tools that you have used in your analysis. Also make sure that you provide the scripts you used for analysis as supplementary material with your research article.
You can also find useful cheat sheets for R in the folder Useful “Useful R cheat sheets”.
or other simple commands on plotting or data transformation on Quick-R
============================================================================================

Here, we use data from Mock Communities (MCs) used to benchmark the NG-Tax pipeline. These MCs are synthetic communities of known composition. It represents data from 3 different Hiseq runs spread over 7 libraries with two different primers covering region V4 (F515-R806) and V5-V6 (F784-R1064) and different PCR settings (25,30 and 35 cycles and pooling of triplicate PCR reactions or a single one). More information with regards to PCR settings and generation of the data can be found here:NG-Tax. For other published test datasets you can check Qiita

Location of test data files: ./input_files

Therefore, this data will alow you to compare what you sequenced to perfect error free data and see how well your sequenced data can reproduce the theoretical composition and underlying biological signals.

Download the complete repository to your local PC

In our lab the commonly used tool for OTU picking is NG-Tax. This approach removes a lot of spurious OTUs arising from PCR and sequencing errors. Although, based on the analysis of the positive controls (Mock Communities) further filtering or data transformations might be necessary. This small tutorial aims to provide tools on which to base these decisions.

An outline of the basic steps:

Import OTU table/Biom file, metadata, tree files into R using the Phyloseq package.

NOTE
A new function read_phyloseq has been created in Microbiome package, please update to the latest version for easy creation of phyloseq object.

Pre-proccessing of the phyloseq object.
Alpha diversity analysis
Beta diveristy analysis (unconstrained, constrained and with and without rarefaction).
Other useful R packages.

Loading the libraries/packages

These are necessary to run the commands If you are setting up RStudio for the first time you need to install a wide range of R-packages. This can be found in the file required_packages.R in the ./information folder.

You may need to do some changes and this might be specific for your PC. Read the required_packages.R file properly.

For instructions on installing different R packages click here.

For instructions on how to load different file formats into R see import data.

#Load the required packages
library(ggplot2)
library(ape)
library(plyr)
library(vegan)
library(RColorBrewer)
library(reshape2)
library(scales)
library(data.table)
library(microbiome)
library(dplyr)
library(phyloseq)
library(DT) #for interactive tables
library(microbiomeutilities)

When you do your own analysis, create a new project in RStudio.
If you have the mapping file in .txt format then open it in excel and save it as CSV(comma delimited) “.csv”. Also good to renames samples column from “#SampleID” to “SampleID”. Remove the # sign. Do not keep any special characters like #,$ and also spaces etc. use underscore “_" to seperate two words for eg. SampleID can also be written as Sample_ID.

If you have downloaded this repository, open the .Rproj and follow the step outlined below.

Open the Microbiome_tutorial_V2.rmd in RStudio and run run the code chunks one by one and analyse.

**Remember the path in windows is given as “" however for R it has to be changed to”/“. **

When you are working on your own data kindly check that:

Making a phyloseq object

This is the basis for your analyses. In this phyloseq object, information on OTUs, taxonomy, the phylogenetic tree and metadata is stored. A single object with all this information provides a very convinient way of handling data. Please remember that the metadata (i.e. mapping) file has to be in *.csv format (columns have sample attributes). Below you can see how the mapping file has been used.

For more infromation: phyloseq

Things to be done in QIIME terminal (if required): Important Note 2: If you have error in loading the biom files stating JSON or HDF5 then you need to convert it in to a JSON format.

For this, use the following command within the linux installed with biom terminal and not in R!

# biom convert -i NGTaxMerged.biom -o ngtax_json.biom --table-type "OTU table" --to-json

For more information on the biom format please click here.

Let’s begin:

1] Read the biom and mapping file

From now on, all the commands are for R.

NOTE
Update to latest version of Microbiome package to use the read_phyloseq function. This function can be used for reading other outputs (like .shared and consensus taxonomy files from mothur) into phyloseq object.

Read input to phyloseq object

# may take anywhere between 30 seconds to 2 or more minutes to create a phyloseq object depending on the size of biom file and your PCs processing strength.

pseq1 <- read_phyloseq(otu.file = "./input_files/NGTaxMerged_conv.biom", taxonomy.file = NULL, metadata.file = "./input_files/mappingMerged_edit.csv", type = "biom")

## Time to complete depends on OTU file size

Read the tree file.

Note: requires a package called ape and the extension has to be “.tre” and not “.tree” (you can just change the name of the file extension)

# Load tree file
library(ape)
treefile_p1 <- read.tree("./input_files/combinedTree.tre")

Merge into phyloseq object.

ps1 <-merge_phyloseq(pseq1,treefile_p1)
# ps1 is the first phyloseq object.
print(ps1)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 466 taxa and 57 samples ]
## sample_data() Sample Data:       [ 57 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 466 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 466 tips and 464 internal nodes ]

The next steps will guide you through the process of understanding the noise levels in your data and provide you with reasoning to eliminate some of it.

Important Note 4

All the things you do with your data from this point on are based on your decision making. The settings of the NG-Tax pipeline that you used to get this data are optimised. However, if you have very noisy data, first re-run your data with a more strict abundance cut-off before trying to remove it yourself.

Always keep track of the filtering steps you performed and make a note of it!

The decision making starts here! For instance: Do we expect or want to include Choloropast and Mitrochondria related sequences? if NO then filter them out, if yes skip to next step using ps1 as input file. But first ask yourself, WHY do I want to remove these sequences?

# check the data

print(ps1)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 466 taxa and 57 samples ]
## sample_data() Sample Data:       [ 57 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 466 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 466 tips and 464 internal nodes ]

# check if any OTUs are not present in any samples
any(taxa_sums(ps1) == 0)

## [1] FALSE

The answer is FALSE, there are no OTUs that are not found in any samples. It is usually TRUE in the case when data is subset to remove some samples. OTUs unique to those sample are not removed along with the samples. Therefore, it is important to check this everytime the phyloseq object is filtered for samples using subset_samples function.

ps1a <- prune_taxa(taxa_sums(ps1) > 0, ps1)

# check again if any OTUs are not present in any samples
any(taxa_sums(ps1a) == 0)

## [1] FALSE

After the prune_taxa function is run to remove OTus with zero occurances, the answer is “FALSE”.

Check how many OTUs are kept.

# subtract the number of OTUs in original (ps1) with number of OTUs in new phyloseq (ps1a) object.
# no. of OTUs in original 
ntaxa(ps1)

## [1] 466

# no. of OTUs in new 
ntaxa(ps1a)

## [1] 466

ntaxa(ps1) - ntaxa(ps1a)

## [1] 0

rank_names(ps1) #we check the taxonomic rank information

## [1] "Domain" "Phylum" "Class"  "Order"  "Family" "Genus"

datatable(tax_table(ps1)) # the table is interactive you can scrol and search thorugh it for details.

# Check the taxonomy levels

rank_names(ps1a)

## [1] "Domain" "Phylum" "Class"  "Order"  "Family" "Genus"

Filter phyloseq

# Use the interactive table to check under which taxonomic rank chloroplast and Mitrochondria are mentioned. Specify them Class or Order in the code below.    

ps1a <- subset_taxa(ps1,Class!="c__Chloroplast")

ps1b <- subset_taxa(ps1a,Order!="o__Mitochondria")

# Second question: Do we expect or want to include the archaeal sequences? if you have used primers which do not  target them, then it is a spurious observation.
#Keep in mind: The fact that you picked up these sequences with any stringing cut-off from your otupicking algorithm means that they were relatively abundant, check the confidence of the sequences and taxonomic assignments before you decide to remove them.
#If you find high abundance/high confidence sequences that in theory should not be there, maybe more things are wrong with your data. Be critical. OR, your primer did pick up these sequences.

ps1 <- subset_taxa(ps1b,Domain!="k__Archaea")

# We again name it as ps1 to avoid complicating or creating lot of ps objects.

sort(sample_sums(ps1)) #check the number of reads/sample,we can see that one of the samples has only 1867 seqs, normally we would remove it, but because it mock community data, we can check wether this sequencing depth is enough to capture the theoretical composition, so for now we will leave it in. Just pay extra attention to where this sample ends up in downstream analysis.

##          Mc.3.2.l01          Mc.4.2.l01          Mc.4.1.l01 
##                1867                2540                2903 
##          Mc.3.1.l01          Mc.3.3.l01          Mc.4.3.l01 
##                2912                5526                5842 
##          Mc.2.3.l01          Mc.2.1.l01          Mc.1.t.l44 
##                6467                7167               16000 
##          Mc.1.t.l56          Mc.2.2.l01          Mc.4.1.l07 
##               16000               19314               22661 
##          Mc.4.1.l06          Mc.2.2.l07          Mc.4.1.l03 
##               30878               31764               31831 
##          Mc.2.1.l03          Mc.4.1.l05     Mc.2.2.35cy.l02 
##               34773               37011               44499 
##          Mc.2.2.l06          Mc.2.2.l05          Mc.2.t.l44 
##               45569               49657               51000 
##          Mc.1.3.l01          Mc.4.1.l04          Mc.2.t.l56 
##               52479               53851               54000 
##          Mc.2.2.l03          Mc.4.2.l07          Mc.1.2.l01 
##               54088               64714               69762 
##          Mc.2.2.l04          Mc.4.1.l02          Mc.3.t.l44 
##               76177               77281               79000 
##          Mc.4.2.l02          Mc.3.t.l56          Mc.4.2.l05 
##               81040               82000               83515 
##          Mc.4.2.l06          Mc.4.t.l44          Mc.3.3.l02 
##               84618               97500               98645 
##          Mc.2.1.l06          Mc.4.2.l03          Mc.4.t.l56 
##               99411               99638              100000 
##     Mc.2.2.30cy.l02          Mc.2.1.l05 Mc.2.2.100v.dup.l02 
##              100574              101903              104536 
##          Mc.3.1.l02          Mc.1.1.l01          Mc.2.1.l07 
##              107261              108440              122869 
##          Mc.4.2.l04          Mc.2.1.l02          Mc.2.1.l04 
##              123830              171787              191415 
##          Mc.3.2.l02          Mc.2.2.l02      Mc.2.2.dup.l02 
##              196934              212613              215903 
##     Mc.2.2.100v.l02          Mc.1.2.l02          Mc.2.3.l02 
##              228492              232876              269422 
##          Mc.1.3.l02          Mc.4.3.l02          Mc.1.1.l02 
##              284202              288597              315845

# Nevertheless, in case you want to remove some samples because of low sequencing depth or some other technical reasons you can remove it from the analysis with the following command:
# ps1.sub <-subset_samples(ps1, #phyloseq object
                       #  SampleID !="SampleName") #metadata category name and != mean including all expect "SampleName"

# again we type the name and hit enter
# ps1 <- ps1.sub # this is our naively filtered phyloseq file which can undergo further filtering

# If you dont have anything to filter then continue with ps1 object for rest of the analysis.
datatable(tax_table(ps1))

In our Mock Community analysis using data from a Hiseq machine, we don’t have any of the aforementioned issues and hence we will continue using “ps1” as our phyloseq object for downstream analysis.

We have the phyloseq object ready for step wise analysis.

Exploring your sequence data

Distribution of the reads over the samples

For this, we need one extra R package to handle data objects called “data.table”. We will then explore our preliminary data. The table is interactive and you can scroll and search the table.

Let’s take a look at the distribution of the reads over the samples

library("data.table")
library(ggplot2)
ps1_df= data.table(as(sample_data(ps1), "data.frame"), Reads_per_sample = sample_sums(ps1), keep.rownames = TRUE)
ps1_df_plot = ggplot(ps1_df, aes(Reads_per_sample)) + geom_histogram() + ggtitle("Distribution of reads per sample") + ylab("Sample counts") 

print(ps1_df_plot) #normal plot

ggsave("./output/Distribution_of_reads_per_sample.pdf", height = 6, width = 8) # you can create a sub-folder for figures to save all files in location. here, "./output/" notice the full stop or punt (in dutch) infornt of slash "./" this tells R to go inside a folder within the exsisting working directory and save the output.

# I include this now for better managing of projects.

Some samples at the right have high number of reads. This is a way to see if you have more or less even sequencing depth. Here, the mock communities were seqeucnced at different depths. Don’t worry, differences of up to 50x were quite normal.

Now let’s check the distribution of the OTUs in our data set.

# We make a data table with information on the OTUs
ps1.dt.taxa = data.table(tax_table(ps1),OTUabundance = taxa_sums(ps1),OTU = taxa_names(ps1))
ps1.dt.tax.plot <- ggplot(ps1.dt.taxa, aes(OTUabundance)) + geom_histogram() + ggtitle("Histogram of OTU (unique sequence) counts") + theme_bw()
print(ps1.dt.tax.plot)

#write.table(ps1.dt.taxa, "ps1_dt_tax_plot.txt", sep = "\t")
ggsave("./output/Histogram of OTU counts.pdf", height = 6, width = 7)

We observe a distribution that is skewed to the left side of the plot. This means that we have some OTUs that are repeated > 300.000 times, but a large number of OTUs are repeated < 100.000 (which is of course is still high). However we see that a very large proportion of our OTUs is repeated a lot less. Remember we have 466 OTUs in total. Now it seems that these are close to zero, but let’s have a closer look. Only a small proportion of reads is repeated less than 250x

library(ggplot2)
plot.zoom <- ggplot(ps1.dt.taxa, aes(OTUabundance)) + geom_histogram() + ggtitle("Histogram of Total Counts") + xlim(0, 1000) + ylim (0,50) + theme_bw()
print(plot.zoom)

## Warning: Removed 166 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

ggsave("./output/histogram of counts_zoomed_0_to_1000.pdf", height = 6, width = 7)

## Warning: Removed 166 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

If we zoom in even further, we see that only very little OTUs contain less than 20 reads. In real case senario, if your sequencing and OTU identification did not go well you might see a skewed distribution and may indicate the need for further filtering and processing.

library(ggplot2)
plot.zoom <- ggplot(ps1.dt.taxa, aes(OTUabundance)) + geom_histogram(breaks=seq(0, 20, by =1)) + ggtitle("Histogram of Total Counts") + theme_bw()
print(plot.zoom)

ggsave("./output/histogram of counts_zoomed_0_to_20.pdf", height = 6, width = 7)

Let’s find out what these numbers actually mean with regards to the total amount of data in this set.

Read summary

#total number of reads in the dataset
reads_per_OTU <- taxa_sums(ps1)
print(sum(reads_per_OTU))

## [1] 5251399

# we have a total of 5251399 reads

#We saw in the last graph that only a small fraction of the OTUs consist of less than 10 reads, so how much OTUs are this and how many reads do they contain?

print(length(reads_per_OTU[reads_per_OTU < 10]))

## [1] 109

# there are 109 OTus that contain less than 10 reads
print(sum(reads_per_OTU[reads_per_OTU < 10]))

## [1] 512

# more importantly, these OTUs only contain 512 reads
print((sum(reads_per_OTU[reads_per_OTU < 10])/sum(reads_per_OTU))*100)

## [1] 0.009749783

# which is only 0.00974% of the data, not bad right?

# To put this into context; out of the 466 OTUs, a 109 OTUs contain less than 10 reads, which is:
print((109/466)*100)

## [1] 23.39056

# 23.4% of the OTUs. This sounds like a large number, but remember; they only contain 0.00974% of the total data! This means that OTUs with a lower confidence (ie repeated 10 times or less are only a very small part of this dataset). Check this distribution with your own dataset.

#Important!: From this we can conclude that a very small part of your data can have a significant effect on your downstream analyses, if you use methods that use OTUs instead of the information that is containd in de sequences, such as Unifrac and phylogenetic diversity.

#Some other other checks that are normally done on sequencing data. You might recognize them from literature
# Check for singletons

ps1.dt.taxa[(OTUabundance <= 0), .N] # no singletons

## [1] 0

#check for doubletons
ps1.dt.taxa[(OTUabundance <= 2), .N]

## [1] 11

# out of 466, we have 11 doubletons
print((11/466)*100)

## [1] 2.360515

print((sum(reads_per_OTU[reads_per_OTU <= 2])/sum(reads_per_OTU))*100)

## [1] 0.000418936

# only 2.3% of the OTUs are doubletons and they consist of 0.000418% of the data

What we do now is important and also needs to be documented for reproducibility. We plot the cumulative sum of taxa to make decisions for filtering.

Important Note 6 :
As is clear from the former paragraph, in most instances with NG-Tax there will not be any need to filter out OTUs!! This is just for validation of the data. Remember, if you start filtering afterwards you create a bias, because the following filtering steps ignore the distribution of your data among the samples!!
Based on how the mock community data looks like, you can don’t need to do any filtering. However, if your data is problematic We will discuss some things you can do in later sections using examples.

You can filter by abundance ie. OTUs that contain less than a set number of reads are discarded. You can plot the OTU distribution.

Filtering Threshold, Minimum Total Counts

library(ggplot2)
ps1.cumsum = ps1.dt.taxa[, .N, by = OTUabundance]
setkey(ps1.cumsum, OTUabundance)
ps1.cumsum[, CumSum := cumsum(N)]
# Define the plot
ps1.cumsum.plot = ggplot(ps1.cumsum, aes(OTUabundance, CumSum)) + 
  geom_point() +
  xlab("Filtering Threshold, Minimum Total Counts") +
  ylab("OTUs Filtered") +
  ggtitle("OTUs that would be filtered vs. the minimum count threshold") + theme_bw()

print(ps1.cumsum.plot)

ggsave("./output/OTUs that would be filtered vs the minimum count threshold.pdf", height = 6, width = 7)

Zoom in on the OTUs that contain <100 reads.
This will give an idea about the loss of OTUs when you use a certain count threshold.

library(ggplot2)
ps1.cumsum.plot.zoom <- ps1.cumsum.plot + xlim(0, 100) + theme_bw() #this will give an idea of the potential loss in otus you might expect in our dataset.

print(ps1.cumsum.plot.zoom)

## Warning: Removed 280 rows containing missing values (geom_point).

ggsave("./output/OTUs that would be filtered vs the minimum count threshold_zoom_0_to_100.pdf", height = 6, width = 7)

## Warning: Removed 280 rows containing missing values (geom_point).

In the next plot we can see the variance in the reads/OTU. These plots might be informative if you have very skewed data and see patterns. Here, however we see no clear patterns. We also know that the data represents the theoretical MCs quite good, so no need to draw any conclusions. However, check whether your own data deviates from this dataset.

Variance

library(ggplot2)
Variance.plot <- qplot(log10(apply(otu_table(ps1), 1, var)), xlab = "log10(variance)", main = "Variance in OTUs")
print(Variance.plot)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggsave("./output/OTU_distribution_NG-Tax.pdf", height = 6, width = 7)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

CV

Alternatively, you can also check for Coefficeint of variation of OTUs to check for spurious observations.

ps1.rel <- microbiome::transform(ps1, "compositional") # transform to relative abundance
p <- plot_taxa_cv(ps1.rel)

print(p)

As can be seen, OTUs with low abundance have relatively high Coeeficient of variation. These may affect the differential abundance testing.

Alpha diversity calculations

Note: Some alpha diversity metrics such as ACE are richness estimators, which also consider rare OTUs to estimate the total richness. However if all these rare OTUs are really just error (and the vast majority of them are) then these estimates are meaningless anyway. Thus using NG-Tax, phyloseq will give an error and will ask for “unfiltered data’. We will use the plot_richness function from the phyloseq package for alpha diversity analysis. However, it is important that you document this and be aware that you results will be conservative. However, if you do use unfiltered data your answers will be useless and based on the number of reads you have per sample (as you know this depends on the barcode and the amount of reads you ordered, not on your sample composition). Also, as already explained; the number of OTUs and the amount of data/reads they contain is very skewed. That is why we advise phylogenetic diversity for all diversity analyses, unless other metrics give similar results.

The real composition of the MCs is as follows, the real values will depend slightly on the sequenced region (V4 is more stable so will contain less OTUs when sequenced, while V5-V6 will result in more; if things are done properly):

MC1: 17 species, equimolar MC2: 55 species, equimolar MC3: the same 55 species as MC2, but distributed differently (lower evenness=lower diversity) MC4: 43 species, with some at 0.1,0.01 and 0.001%. Again the same type of species as in the other MCs

Try to figure out what kind of pattern of the MCs you expect based on this information.

Phyogenetic diversity

Phylogenetic diversity is calculated using the picante package.

# if you want to specify specific colors
my_colors <- c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578", "#CD9BCD", "#AD6F3B", "#673770","#D14285", "#652926", "#C84248", "#8569D5", "#5E738F","#D1A33D", "#8A7C64", "#599861", "steelblue" )

# install.packages("picante",repos="http://R-Forge.R-project.org") 
# library(picante)
library(picante)
otu_table_ps1 <- as.data.frame(ps1@otu_table)
metadata_table_ps1  <- as.data.frame(ps1@sam_data)

df.pd <- pd(t(otu_table_ps1), treefile_p1,include.root=F) # t(ou_table) transposes the table for use in picante and the tre file comes from the first code chunck we used to read tree file (see making a phyloseq object section).


datatable(df.pd)

# now we need to plot PD
# check above how to get the metadata file from phyloseq object.
# We will add the results of PD to this file and then plot.
metadata_table_ps1$Phyogenetic_diversity <- df.pd$PD 

plot.pd <- ggplot(metadata_table_ps1, aes(MC_type2, Phyogenetic_diversity)) + geom_boxplot(aes(fill = Region)) + geom_point(size = 2) + theme(axis.text.x = element_text(size=14, angle = 90)) + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) + theme_bw()
print(plot.pd)

We can see that the sequenced phylogenetic diversity very accurately reproduces the theoretical diversity.

For the convenience of using other metrics, we will show the plot for several options available in Phyloseq and Microbiome packages.

Non-Phylogenetic metrics

#plot the diversity using different metrics
p <- plot_richness(ps1, "Mock_type", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))

## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

p <- p + geom_boxplot(aes(fill = "Mock_type")) + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578"))
print(p)

## Warning: Removed 49 rows containing non-finite values (stat_boxplot).

## Warning: Removed 334 rows containing missing values (geom_errorbar).

#plot the diversity using different metrics and color the points by different variable region, the plot also contains the expected values and sequenced values to compare what you sequenced to what it is supposed to be. Check with your own expectations.
p <- plot_richness(ps1, "MC_type2", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))

## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

p <- p + geom_point(aes(color = Region)) + theme_bw()
print(p)

## Warning: Removed 334 rows containing missing values (geom_errorbar).

## Warning: Removed 49 rows containing missing values (geom_point).

To investigate in more detail, we will separate our data based on 16S rRNA gene hypervariable region and plot the alpha diversity metrics.

Only the v4 region.

#to plot the regions separately (because we saw that they give slightly different results)
#filter out the samples based on region

ps1.V4 <- subset_samples(ps1, Region=="V4")
plot.v4.Adiv <- plot_richness(ps1.V4, "MC_type2", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))

## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

plot.v4.Adiv <- plot.v4.Adiv + geom_point(aes(color = Mock_type)) + theme_bw()
print(plot.v4.Adiv)

## Warning: Removed 246 rows containing missing values (geom_errorbar).

## Warning: Removed 41 rows containing missing values (geom_point).

Only the V5-V6 region.

ps1.V5V6 <- subset_samples(ps1, Region=="V5-V6")

plot.V5V6.Adiv <- plot_richness(ps1.V5V6, "MC_type2", measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))

## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

plot.V5V6.Adiv <- plot.V5V6.Adiv + geom_point(aes(color = Mock_type))

As you can see only Simpson/invSimpson reproduces the biological order of the theoretical MCs (i.e perfect data without sequencing/PCR error). Therefore, most of these metrics don’t show the underlying biological conclusions, but are based on other (non-biological) parameters. UNLESS the difference between your biological samples/groups was very large to begin with (such ~3x higher richness as in MC1 compared to the other MC).

However, when we use a metric such as phylogeny diversity or Faith’s diversity index we reproduce the biological order of the theoretical samples, with both regions. Therefore this metric is preferred.

Important note 6 :
Always challenge your data: Use phylogenetic diversity, check the other metrics and see in what ways they give similar results (or not!) and try to deduce why you observe these things.

Others

For more diversity indices please refer to Microbiome Package

Test of significance

Next we can test whether there are significant differences between the MC types

# We can check whether there  are any significant differences in alpha diversity between the sample groups we are interested in.
ps1.adiv <- estimate_richness(ps1, measures = c("Chao1", "Shannon", "Observed", "InvSimpson"))

## Warning in estimate_richness(ps1, measures = c("Chao1", "Shannon", "Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

ps1.metadata <- as(sample_data(ps1), "data.frame")
head(ps1.metadata)

##            BarcodeSequence LibraryNumber Direction
## Mc.1.1.l02        CTGCTGAA             2         p
## Mc.1.2.l02        GCCTTAAG             2         p
## Mc.1.3.l02        CTTGGCCT             2         p
## Mc.2.1.l02        GAACCGTT             2         p
## Mc.2.2.l02        GAACGTAT             2         p
## Mc.2.3.l02        GAACTAAG             2         p
##                                                                                                                      LibraryName
## Mc.1.1.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.1.2.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.1.3.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.2.1.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.2.2.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
## Mc.2.3.l02 NG-6514_barcode__A09385_lib19055_1407_Read1_tagsorting.zip,NG-6514_barcode__A09385_lib19055_1407_Read2_tagsorting.zip
##            Facet MC_type1 MC_type2      ProjectName Region Mock_type
## Mc.1.1.l02  Mc_1     Mc_1     Mc_1 Mock_communities     V4      Mc_1
## Mc.1.2.l02  Mc_1     Mc_1     Mc_1 Mock_communities     V4      Mc_1
## Mc.1.3.l02  Mc_1     Mc_1     Mc_1 Mock_communities     V4      Mc_1
## Mc.2.1.l02  Mc_2     Mc_2     Mc_2 Mock_communities     V4      Mc_2
## Mc.2.2.l02  Mc_2     Mc_2     Mc_2 Mock_communities     V4      Mc_2
## Mc.2.3.l02  Mc_2     Mc_2     Mc_2 Mock_communities     V4      Mc_2
##            Mock_replicate BarcodeNumber Description MockTypeGroup
## Mc.1.1.l02         Mc_1_1            33  Mc_1_1_l02          Mc_1
## Mc.1.2.l02         Mc_1_2            45  Mc_1_2_l02          Mc_1
## Mc.1.3.l02         Mc_1_3            35  Mc_1_3_l02          Mc_1
## Mc.2.1.l02         Mc_2_1            36  Mc_2_1_l02          Mc_2
## Mc.2.2.l02         Mc_2_2            37  Mc_2_2_l02          Mc_2
## Mc.2.3.l02         Mc_2_3            38  Mc_2_3_l02          Mc_2

#We add the columns with the alpha-div indices from ps3.adiv to ps3.adiv.metadata where all info including metadata is now available.
ps1.metadata$Observed <- ps1.adiv$Observed 
ps1.metadata$Shannon <- ps1.adiv$Shannon
ps1.metadata$InvSimpson <- ps1.adiv$InvSimpson
ps1.metadata$Phyogenetic_diversity <- df.pd$PD # we also add Phylogenetic diversity values from previous code for analysis.

colnames(ps1.metadata) #check the last three coloumns will be the alpha diversity indices

##  [1] "BarcodeSequence"       "LibraryNumber"        
##  [3] "Direction"             "LibraryName"          
##  [5] "Facet"                 "MC_type1"             
##  [7] "MC_type2"              "ProjectName"          
##  [9] "Region"                "Mock_type"            
## [11] "Mock_replicate"        "BarcodeNumber"        
## [13] "Description"           "MockTypeGroup"        
## [15] "Observed"              "Shannon"              
## [17] "InvSimpson"            "Phyogenetic_diversity"

For tutorial on doing statistics on diversity values please check the microbiome R package website click here.

Classification in NG-Tax

It is also useful to mark the unclassified OTUs at Domain and Phylum level. You can also do it at any taxonomic level, especially when you want to plot composition plots, but only when you know what you are doing (!) because “NA” or missing levels do not automatically mean “Unidentified” or “Unknown”. Remember that you can and maybe need to manually reclassify unidentified sequences. IMPORTANT: NG-Tax employs relatively strict cutoffs for classification (high confidence in correct classification) and defaults to “NA”, when there are too many mismatches with a sequence in the database. It is basically open reference OTU picking so If you have reasons, which are backed by biology, to reclassified OTUs that are classified as “NA”, do so.

# If there are NA's in you taxonomy table then do this 

tax_table(ps1)[is.na(tax_table(ps1)[,1])] <- "k__"
tax_table(ps1)[is.na(tax_table(ps1)[,2])] <- "p__"
tax_table(ps1)[is.na(tax_table(ps1)[,3])] <- "c__"
tax_table(ps1)[is.na(tax_table(ps1)[,4])] <- "o__"
tax_table(ps1)[is.na(tax_table(ps1)[,5])] <- "f__"
tax_table(ps1)[is.na(tax_table(ps1)[,6])] <- "g__"


# Then you can convert those unclassified such as p__ without phylum name to this or skip this part
tax_table(ps1)[tax_table(ps1)[,"Phylum"]== "p__", "Phylum" ] <- "Unknown_Phylum"

NG-Tax assigns a name to an OTU using the SILVA database as a reference. Quite often these sequences are not fully classified to genus level in this database, depending on the type of organism. If this is the case, NG-Tax reports the name as “< empty >”. This means that this organism could not be classified further using the SILVA alignment. This is different from the cases where NG-Tax cannot classify a sequence because it cannot distinguish it from another one with high confidence and therefore goes down a rank. Example: a sequence finds a genus level match for: Escherichia/Shigella and Enterobacter in the database. Using short reads these two cannot be distinguished therefore NG-Tax classifies it one rank lower to family: Enterobacter. In this case no “< empty >” will be added to the classification, because this is due to lack of information in the sequence.

This extra information can be very useful in specific cases such as poorly studied/covered ecosystems or when you are interested in very distinct micro-organisms, but cannot find them back in your data.

Also, a sequence containing “< empty >” is very distinct from one that doesn’t. Keep this in mind when analyzing your data and check for taxa that are distinctive in your intervention of ecosystems.

Another way to detect classifications with a confidence of < 1 (meaning not all the sequences in the database that have a hit are classified to the same taxon) is the option in NG-Tax to show these using “*~“. This option is off by default.

Nevertheless when plotting the composition or any other plot where the taxonomic lines are shown, especially for publication, it might be useful to remove these for clearer plots.

#Some of the taxonomy levels may include "<empty>" entries. If you want to remove them for plotting then run the following code.  

ps1.com <- ps1 # create a new pseq object
tax.mat <- tax_table(ps1.com)
tax.df <- as.data.frame(tax.mat)

tax.df[tax.df == "__*~"] <- "__"

tax.df[tax.df == "g__<empty>"] <- "g__"
tax.df[tax.df == "k__<empty>"] <- "k__"
tax.df[tax.df == "p__<empty>"] <- "p__"
tax.df[tax.df == "c__<empty>"] <- "c__"
tax.df[tax.df == "o__<empty>"] <- "o__"
tax.df[tax.df == "f__<empty>"] <- "f__"
tax_table(ps1.com) <- tax_table(as.matrix(tax.df))

Plot composition

Important note 7 :
There are different options for doing all the following analyses. You can use the separate OTUs for finer detail. Just remember they are unique sequences, not necessarily ‘species’. For this purpose, you need to use the ps1 object and then do:

Barplot composition

Barplots are an easy and intuitive way of visualizing the composition of your samples. However, the way this is implemented in phyloseq causes problems with the order of the taxa in the legend at higher taxonomic levels.

# We need to set Palette
taxic <- as.data.frame(ps1.com@tax_table)  # this will help in setting large color options

colourCount = length(unique(taxic$Family))  #define number of variable colors based on number of Family (change the level accordingly to phylum/class/order)
getPalette = colorRampPalette(brewer.pal(12, "Paired"))  # change the palette as well as the number of colors will change according to palette.


otu.df <- as.data.frame(otu_table(ps1.com))  # make a dataframe for OTU information.
# head(otu.df) # check the rows and columns

taxic$OTU <- row.names.data.frame(otu.df)  # Add the OTU ids from OTU table into the taxa table at the end.
colnames(taxic)  # You can see that we now have extra taxonomy levels.

## [1] "Domain" "Phylum" "Class"  "Order"  "Family" "Genus"  "OTU"

library(knitr)
head(kable(taxic))  # check the table.

## [1] "          Domain        Phylum               Class                    Order                   Family                     Genus                     OTU     "
## [2] "--------  ------------  -------------------  -----------------------  ----------------------  -------------------------  ------------------------  --------"
## [3] "5656206   k__Bacteria   p__Proteobacteria    c__Gammaproteobacteria   o__Enterobacteriales    f__Enterobacteriaceae      g__                       5656206 "
## [4] "5656109   k__Bacteria   p__Proteobacteria    c__Gammaproteobacteria   o__Enterobacteriales    f__Enterobacteriaceae      g__                       5656109 "
## [5] "4444188   k__Bacteria   p__Proteobacteria    c__Gammaproteobacteria   o__Enterobacteriales    f__Enterobacteriaceae      g__                       4444188 "
## [6] "4444187   k__Bacteria   p__Proteobacteria    c__Gammaproteobacteria   o__Enterobacteriales    f__Enterobacteriaceae      g__                       4444187 "

taxmat <- as.matrix(taxic)  # convert it into a matrix.
new.tax <- tax_table(taxmat)  # convert into phyloseq compatible file.
tax_table(ps1.com) <- new.tax  # incroporate into phyloseq Object



# now edit the unclassified taxa
tax_table(ps1.com)[tax_table(ps1.com)[, "Family"] == "f__", "Family"] <- "Unclassified family"

# We will also remove the 'f__' patterns for cleaner labels
tax_table(ps1.com)[, colnames(tax_table(ps1.com))] <- gsub(tax_table(ps1.com)[, 
    colnames(tax_table(ps1.com))], pattern = "[a-z]__", replacement = "")

# it would be nice to have the Taxonomic names in italics.
# for that we set this
guide_italics <- guides(fill = guide_legend(label.theme = element_text(size = 15, 
    face = "italic", colour = "Black", angle = 0)))


## Now we need to plot at family level, We can do it as follows:

# first remove the phy_tree

ps1.com@phy_tree <- NULL

# Second merge at family level

ps1.com.fam <- aggregate_taxa(ps1.com, "Family")



plot.composition.COuntAbun <- plot_composition(ps1.com.fam) + theme(legend.position = "bottom") + 
    scale_fill_manual("Family", values = getPalette(colourCount)) + theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
  ggtitle("Relative abundance") + guide_italics + theme(legend.title = element_text(size=18))
  
plot.composition.COuntAbun

ggsave("./output/Family_barplot_CountAbundance.pdf", height = 6, width = 8)

This plot is based on the reads/sample. You can see the reads are not evenly distributed over the samples, nevertheless their overall composition seems to be the same. The only thing that is different is the scaling. You don’t need any other normalization algorithms. To check this in the next step we plot the relative abundance.

Make it relative abundance

# the previous pseq object ps1.com.fam is only counts.

# Use transform function of microbiome to convert it to rel abun.

ps1.com.fam.rel <- microbiome::transform(ps1.com.fam, "compositional")

plot.composition.relAbun <- plot_composition(ps1.com.fam.rel) + theme(legend.position = "bottom") + 
    scale_fill_manual("Family", values = getPalette(colourCount)) + theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
  ggtitle("Relative abundance") + guide_italics + theme(legend.title = element_text(size=18))
  
plot.composition.relAbun

ggsave("./output/Family_barplot_RelAbundance.pdf", height = 6, width = 8)

For more information Microbiome tutorial

In the legends of these composition plots now consists of the taxonomic labels.

Beta diversity metrics

This approach, together with alpha-diversity is very sensitive to spurious otus, lots of zeros and skewed distribution of counts. Hence, different people have different preferences, such as rarefaction or normalization algorithms. Rarefaction is a common treatment, however for NG-Tax data this is unnecessary even though the sequencing depth ranges from 1867 to 315845 reads per sample, therefore we can just transform the data into relative abundance. In case you want to still rarefy your data, (REMEMBER! you throw away >90% of your data!) below are the commands for different ordination methods.

For more information:

Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.

Normalisation and data transformation.

What is Constrained and Unconstrained Ordination.

Unconstrained

Without rarefaction

ps1.rel <- transform(ps1, "relative.abundance")

set.seed(49275)  #set seed for reproducible rooting of the tree
ordu.wt.uni = ordinate(ps1.rel, "PCoA", "wunifrac")

## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- 444439 -- in the phylogenetic tree in the data you provided.

scree.plot <- plot_scree(ordu.wt.uni, "Check for importance of axis in Scree plot for MCs, UniFrac/PCoA")
print(scree.plot)

wt.unifrac <- plot_ordination(ps1.rel, ordu.wt.uni, 
                              color = "MC_type2", 
                              shape = "Region")

wt.unifrac <- wt.unifrac + scale_fill_manual(values = c("#CBD588", 
    "#5F7FC7", "orange", "#DA5724", "#508578")) + 
  ggtitle("Weighted UniFrac relative abundance") + 
  geom_point(size = 3)

print(wt.unifrac)

set.seed(475)  #set seed for reproducible rooting of the tree
ordu.unwt.uni = ordinate(ps1.rel, "PCoA", "unifrac", weighted = F)

## Warning in UniFrac(physeq, ...): Randomly assigning root as -- 5656250 --
## in the phylogenetic tree in the data you provided.

unwt.unifrac <- plot_ordination(ps1.rel, ordu.unwt.uni, 
                                color = "MC_type2", 
                                shape = "Region") + 
  ggtitle("Unweighted UniFrac relative abundance") + 
  geom_point(size = 3)

print(unwt.unifrac)

# Bray Curtis dissimilarity (purely OTU based, doesn't take
# the sequence of the OTUs into account)

ordu.bray = ordinate(ps1.rel, "PCoA", "bray", weighted = F)
bray <- plot_ordination(ps1.rel, ordu.bray, 
                        color = "Mock_type", 
                        shape = "Region")

bray <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", 
    "orange", "#DA5724", "#508578")) + 
  ggtitle("Bray-Curtis relative abundance V4 and V5-V6") + 
    geom_point(size = 3) + theme_bw()

print(bray)

With rarefaction

#with rarefaction

set.seed(09809) # alway set a random number seed for reproducibility of the rarefaction
ps1.rar <- rarefy_even_depth(ps1)

## You set `rngseed` to FALSE. Make sure you've set & recorded
##  the random seed of your session for reproducibility.
## See `?set.seed`

## ...

## 17OTUs were removed because they are no longer 
## present in any sample after random subsampling

## ...

head(sample_sums(ps1.rar)) # check whether it is rarefied

## Mc.1.1.l02 Mc.1.2.l02 Mc.1.3.l02 Mc.2.1.l02 Mc.2.2.l02 Mc.2.3.l02 
##       1867       1867       1867       1867       1867       1867

set.seed(999)
ordu.wt.uni = ordinate(ps1.rar, "PCoA", "wunifrac")

## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- 4444137 -- in the phylogenetic tree in the data you provided.

wt.unifrac.rar <- plot_ordination(ps1.rar, ordu.wt.uni, 
                                  color="MC_type2", 
                                  shape="Region") 

wt.unifrac.rar <- wt.unifrac.rar + scale_fill_manual(values = c("MC_type2")) + 
  ggtitle("Weighted UniFrac rarefied to 1867 sequences per sample") + 
  geom_point(size = 3)

print(wt.unifrac.rar)

set.seed(949)
ordu.unwt.uni = ordinate(ps1.rar, "PCoA", "unifrac", weighted=F)

## Warning in UniFrac(physeq, ...): Randomly assigning root as -- 5656256 --
## in the phylogenetic tree in the data you provided.

unwt.unifrac.rare <- plot_ordination(ps1.rar, ordu.unwt.uni, color="MC_type2", shape="Region") + 
  ggtitle("Unweighted UniFrac rarefied to 1867 sequences per sample") + 
  geom_point(size = 3)

print(unwt.unifrac.rare)

#Bray Curtis dissimilarity (purely OTU based, doesn't take the sequence of the OTUs into account)

ordu.bray = ordinate(ps1.rar, "PCoA", "bray", weighted=F)
bray.rar <- plot_ordination(ps1.rel, ordu.bray, color="Mock_type", shape="Region")
bray.rar <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) + 
  ggtitle("Bray-Curtis rarefied to 1867 sequences per sample, region V4 and V5-V6") + 
  geom_point(size = 3)

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

print(bray.rar)

You can see rarefaction is unnecessary, furthermore it seems that separation of the MCs is even better without rarefaction as you retain minor signals.

Bray-Curtis based dissimilarity using separate regions

Because Bray-Curtis based dissimilarity seems to differentiate between the two regions instead of the biological difference (i.e. MC type) we can take look at a single region.

ps1V4.rel <- transform_sample_counts(ps1.V4, function(x) x / sum(x) )

ordu.brayV4 = ordinate(ps1V4.rel, "PCoA", "bray")
bray <- plot_ordination(ps1V4.rel, ordu.brayV4, 
                        color="MC_type2", 
                        shape="Region")
bray <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) + 
  ggtitle("Bray-Curtis relative abundance V4") + 
  geom_point(size = 3)

print(bray)

ps1.V5V6.rel <- transform_sample_counts(ps1.V5V6, function(x) x / sum(x) )

ordu.brayV5V6 = ordinate(ps1.V5V6.rel, "PCoA", "bray")
bray <- plot_ordination(ps1.V5V6.rel, ordu.brayV5V6, 
                        color="MC_type2", 
                        shape="Shape")

## Warning in plot_ordination(ps1.V5V6.rel, ordu.brayV5V6, color =
## "MC_type2", : Shape variable was not found in the available data you
## provided.No shape mapped.

bray <- bray + scale_fill_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) + 
  ggtitle("Bray-Curtis relative abundance V5V6") + 
  geom_point(size = 3) + theme_bw()

print(bray)

From these plots we can conclude that methods that use information contained in the sequence, like Unifrac, are more powerful and less dependent on small (sequencing/PCR) errors.

Constrained

[Partial] Constrained Analysis of Principal Coordinates (CAPSCALE) or distance-based RDA, via capscale. See capscale.phyloseq for more details. In particular, a formula argument must be provided.

# CAP ordinate using Bray Curtis dissimilarity only looking at V4
set.seed(23234) 
ps1V4.rel_bray <- phyloseq::distance(ps1V4.rel, method = "bray") # CAP ordinate 
cap_ord <- ordinate(physeq = ps1V4.rel,  
                    method = "CAP", 
                    distance = ps1V4.rel_bray, 
                    formula = ~ Mock_type)

# chech which asix are explaining how mauch variation

scree.cap <- plot_scree(cap_ord, "Scree Plot for MCs in Constrained Analysis of Principal Coordinates (CAPSCALE)")
print(scree.cap)

# CAP plot 
cap_plot <- plot_ordination(physeq = ps1V4.rel, 
                            ordination = cap_ord, 
                            color = "Mock_type", 
                            shape = "Region",
                            axes = c(1,2)) + 
  geom_point(aes(colour = Mock_type), size = 3) + 
  geom_point(size = 3) +
  scale_color_manual(
    values = c( "#CBD588", "#5F7FC7", "orange","#DA5724", "#508578")) 
cap_plot + ggtitle("CAP_Plot")  + theme_bw()

ggsave("./output/CAP_plot.pdf", height = 8, width = 10)

# Now add the environmental variables as arrows 
arrowmat <- vegan::scores(cap_ord, display = "bp")
# Add labels, make a data.frame 
arrowdf <- data.frame(labels = rownames(arrowmat), arrowmat)
# Define the arrow aesthetic mapping 
arrow_map <- aes(xend = CAP1, 
                 yend = CAP2, 
                 x = 0, y = 0, 
                 shape = NULL, 
                 color = NULL, 
                 label = labels) 
label_map <- aes(x = 1.3 * CAP1, y = 1.3 * CAP2, 
                 shape = NULL, 
                 color = NULL,
                 label = labels)  

arrowhead = arrow(length = unit(0.02, "npc")) 

##now plot the arrow
cap_plot <- cap_plot + geom_segment(mapping = arrow_map, size = .7, 
                        data = arrowdf, color = "black", 
                        arrow = arrowhead) + 
  geom_text(mapping = label_map, size = 4,  
            data = arrowdf, 
            show.legend = TRUE) + ggtitle("CAP_Plot")  + theme_bw()

## Warning: Ignoring unknown aesthetics: shape, label

## Warning: Ignoring unknown aesthetics: shape

print(cap_plot)

ggsave("./output/CAP_plot_arrows.pdf", height = 8, width = 10)

#further editing can be done using any of the image processing tools to easily clarify labels.

Testing for significance in beta diversity

##statistical testing 
set.seed(19743) 
cap_anova <- anova(cap_ord, by="terms", permu=999) # we test the impact of the environmental variable seperately. Kindly check for these functions in the help section on the right side.

print(cap_anova)

## Permutation test for capscale under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## Model: capscale(formula = distance ~ Mock_type, data = data)
##           Df SumOfSqs      F Pr(>F)    
## Mock_type  3  2.37632 48.212  0.001 ***
## Residual  37  0.60789                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

set.seed(19773) 
#we see that Mock type has a significant impact on the beta diversity using only region V4. (Remember; for Bray Curtis the sequenced region was more important)  

#testing various factors (increase the permutation for real data) 
anova(cap_ord) ## Global test of the significance of the analysis.

## Permutation test for capscale under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: capscale(formula = distance ~ Mock_type, data = data)
##          Df SumOfSqs      F Pr(>F)    
## Model     3  2.37632 48.212  0.001 ***
## Residual 37  0.60789                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# calculate R squared values

RsquareAdj (cap_ord)

## $r.squared
## [1] 0.7962975
## 
## $adj.r.squared
## [1] 0.779781

Check this link for more possibilities in ordination analysis

Additional examples

Utilities/handy commands

# Saving the individuals objects within phyloseq as *.csv
write_phyloseq(ps1V4.rel, "OTU", path = "./output")

## Writing OTU in the file ./output/otu_table.csv

## [1] "./output"

write_phyloseq(ps1V4.rel, "TAXONOMY", path = "./output")

## Writing TAXONOMY in the file ./output/taxonomy_table.csv

## [1] "./output"

write_phyloseq(ps1V4.rel, "METADATA", path = "./output")

## Writing METADATA in the file ./output/metadata_table.csv

## [1] "./output"

# select only a subset of samples_P1
# as an example, only samples and other data for theoretical will be selected from original phyloseq object "ps1"
ps1.subset.theoretical <- subset_samples(ps1, Facet == "Theoretical") # notice the == sign

print(ps1.subset.theoretical)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 466 taxa and 8 samples ]
## sample_data() Sample Data:       [ 8 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 466 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 466 tips and 464 internal nodes ]

# you can see that we have only 8 samples and no samples from mock community.

# select only a subset of samples
# as an example, we will now remove all information from theoretical samples from original phyloseq object "ps1"
ps1.subset.no.theoretical <- subset_samples(ps1, Facet != "Theoretical") # notice the != sign

print(ps1.subset.no.theoretical)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 466 taxa and 49 samples ]
## sample_data() Sample Data:       [ 49 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 466 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 466 tips and 464 internal nodes ]

# you can see that we have 49 samples and no samples from theoretical.

# get only genus level information
ps1.genus <- tax_glom(ps1, taxrank = "Genus")
ps1.df <- psmelt(ps1.genus)
# Code : write.csv(ps1.df, file='otus_genus_metadata.csv')

Aggregating your OTUs to a specifc taxonomic level

This is somewhat similar to summarize_taxa function in QIIME but this one uses phylogenetic tree for merging OTUs. So much better because the reference databases are not perfect (i.e reads that are close in sequence but for some reason are classified differently will be more separate based on their name than the sequence actually would show).

tax_glom is good but we might also have data where a lot of OTUs are unassigned. In such scenarios, we can use tip_glom(), which uses the phylogenetic tree to merge closely related otus.

ps1.tip.glom <- tip_glom(ps1, h = 0.04) # h = "X" is an iterative process
# Using a value of 0.04 as the height where the tree should be cut, we get close to the number of genus we expect in our mock community. But again, to get a good feeling for your data play with these settings and see whether it is appropriate for your data.

p.tree <- plot_tree(ps1.tip.glom, 
                    label.tips="taxa_names", 
                    color = "Facet", 
                    title="tree")
print(p.tree)

# ps1.tip.glom can used for comparative analysis at the genus level.

Dendrograms

# install ggdendro from CRAN or using the install button on the right panel.
library(ggplot2)
library(ggdendro)
clustering_samples_dendeogram <- hclust(phyloseq::distance(ps1, method="Unifrac"))

dendeogram.plot <- ggdendrogram(clustering_samples_dendeogram, 
                                rotate = FALSE, size = 2) + 
  ggtitle("UniFrac distance based hierarchical clustering") 

print(dendeogram.plot)

ggsave("./output/dendrogram.pdf", height = 8, width = 18)

Heatmaps

heat.sample <- plot_taxa_heatmap(ps1, subset.top = 10,
    VariableA = "MC_type2",
    heatcolors = brewer.pal(10, "Blues"),
    transformation = "compositional")

## Top 10 OTUs selected

## First converted to compositional 
##  then top taxa were selected

## Warning in brewer.pal(10, "Blues"): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors

Core microbiota

ps1.rel <- microbiome::transform(ps1, "compositional")
ps.rel.f <- format_to_besthit(ps1.rel)

#Set different detection levels and prevalence
prevalences <- seq(.5, 1, .5) #0.5 = 95% prevalence
detections <- 10^seq(log10(1e-3), log10(.2), length = 10)
#(1e-3) = 0.001% abundance; change "-3" to -2 to increase to 0.01%

p <- plot_core(ps.rel.f, plot.type = "heatmap", 
               colours = rev(brewer.pal(10, "Spectral")),
               min.prevalence = 0.6, 
               prevalences = prevalences, 
               detections = detections) +
  xlab("Detection Threshold (Relative Abundance (%))")
print(p)

Selecting taxa to plot

Let us plot only the top OTUs

ps1.f <- format_to_besthit(ps1)

top_otu <- top_taxa(ps1.f, 10)

print(top_otu)

##  [1] "OTU-444469:f__Enterobacteriaceae" "OTU-44441:Bacteroides"           
##  [3] "OTU-44442:Alistipes"              "OTU-44440:Faecalibacterium"      
##  [5] "OTU-44447:f__Lachnospiraceae"     "OTU-44446:f__Enterobacteriaceae" 
##  [7] "OTU-44443:Akkermansia"            "OTU-444471:f__Enterobacteriaceae"
##  [9] "OTU-444470:Bacteroides"           "OTU-444473:Fusobacterium"

The function below will plot relative abundance of the top 10 OTUs that are specified in top_otu.

p <- plot_select_taxa(ps1.f, select.taxa= top_otu, "Region", "Paired", plot.type = "stripchart")

## An additonal column Sam_rep with sample names is created for reference purpose

print(p)

Use of other useful R packages

Metagenomeseq:

Information can be found here

Metagenomeseq
Metagenomeseq paper

Also check the following website

This is another good package that can be used for a variety of analyses. The examples shown here are easy ones and you will need to play around with settings as per your requirements. Also with the legends in the figures. This is just an example without specifics. WE want to show how you can switch between packages for various analysis.

library(metagenomeSeq)
# convert phyloseq object "ps1" to "mseq1"

mseq1 <- phyloseq_to_metagenomeSeq(ps1)
# identify and set the variable you want to investigate, in this case it is the mock community type.

Type.community = pData(mseq1)$Facet
heatmapColColors = brewer.pal(12, "Set3")[as.integer(factor(Type.community))] # change Set3 to any color combination you wish. 

# R color brewer options click here https://www.r-bloggers.com/choosing-colour-palettes-part-ii-educated-choices/

heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50) #color for heatmap

plotMRheatmap(obj = mseq1, n = 200, cexRow = 0.4, cexCol = 0.4,
trace = "none", col = heatmapCols, ColSideColors = heatmapColColors)

# there is a small change on how we save the file from this one

pdf("./output/heatmap.pdf", height = 8, width = 18) # create an empty file

plotMRheatmap(obj = mseq1, n = 200, 
              cexRow = 0.4, 
              cexCol = 0.4, 
              trace = "none", 
              col = heatmapCols, 
              ColSideColors = heatmapColColors) # plot the figure

title(main = "This is only example figure!!")

dev.off() # close the file and autosave

## png 
##   2

OTU prevalence

This will help in identifying the distribution of OTUs with taxonomic information. This will also help you to evaluate the composition of different groups that you are interested in (such as environment or healthy/disease), while retaining information of individual samples.

ps1.prev<- ps1

tax_table(ps1.prev)[tax_table(ps1.prev)[,"Family"]== "f__", "Family" ] <- "f__Unclassified Family"
# We will also remove the "f__" patterns for cleaner labels
tax_table(ps1.prev)[,colnames(tax_table(ps1.prev))] <- gsub(tax_table(ps1.prev)[,colnames(tax_table(ps1.prev))],pattern="[a-z]__",replacement="")

p.prev.plot <- plot_taxa_prevalence(ps1.prev, 'Family')

plot(p.prev.plot)

Useful website for ggplot2 basics. eg. how to add p-values in plots, etc.

Add P-values and Significance Levels to ggplots

Publication-ready-plots

We acknowledge the work of and tools created by:
Paul J. McMurdie and colleagues
Leo Lahti

References

For more tools check this website

Index

This is useful for beginners and provides a link to which function belongs to which package. If you face issues then you can specifically look at the package websites for more information.

Functions :: Package

`read_phyloseq` :: microbiome

`read.tree` :: ape

`merge_phyloseq` :: phyloseq

`subset_taxa` :: phyloseq

`data.table` :: data.table

`ggplot` :: ggplot2

`taxa_sums` :: phyloseq

`qplot` :: ggplot2

`ggsave` :: ggplot2

`pd` :: picante

`plot_richness` :: phyloseq

`estimate_richness` :: phyloseq

`subset_samples` :: phyloseq

`kruskal.test` :: stats

`pairwise.wilcox.test` :: stats

`plot_composition` :: microbiome

`transform` :: microbiome

`ordinate` :: phyloseq

`plot_ordination` :: phyloseq

`rarefy_even_depth` :: phyloseq

`anova` :: stats

`tax_glom` :: phyloseq

`psmelt` :: phyloseq

`hclust` :: stats

`ggdendrogram` :: ggdendro

`phyloseq_to_metagenomeSeq` :: phyloseq

`pData` :: metagenomeseq

`colorRampPalette` :: grDevices

`plotMRheatmap` :: metagenomeseq

`plot_taxa_prevalence` :: microbiome

`plot_taxa_cv` :: microbiomeutilities

`plot_taxa_heatmap` :: microbiomeutilities

Kindly cite this work as follows: “Leo Lahti, Sudarshan Shetty et al. (2017). Tools for microbiome analysis in R. Version 1.1.2. URL: http://microbiome.github.com/microbiome. Check also the relevant references listed in the manual page of each function.

The tutorial utilizes tools from a number of other R extensions, including dplyr (Wickham, Francois, Henry, et al., 2017), ggplot2 (Wickham, 2009), phyloseq (McMurdie and Holmes, 2013), tidyr (Wickham, 2017), vegan (Oksanen, Blanchet, Friendly, et al., 2017).

This website theme was created by modifiying the rmdformats readthedown format.

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
## [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
## [5] LC_TIME=Dutch_Netherlands.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] metagenomeSeq_1.20.0       glmnet_2.0-13             
##  [3] foreach_1.4.3              Matrix_1.2-12             
##  [5] limma_3.34.3               Biobase_2.38.0            
##  [7] BiocGenerics_0.24.0        ggdendro_0.1-20           
##  [9] bindrcpp_0.2               knitr_1.18                
## [11] picante_1.6-2              nlme_3.1-131              
## [13] microbiomeutilities_0.99.0 DT_0.2                    
## [15] dplyr_0.7.4                microbiome_1.1.10008      
## [17] phyloseq_1.22.3            data.table_1.10.4-3       
## [19] scales_0.5.0               reshape2_1.4.3            
## [21] RColorBrewer_1.1-2         vegan_2.4-6               
## [23] lattice_0.20-35            permute_0.9-4             
## [25] plyr_1.8.4                 ape_5.0                   
## [27] ggplot2_2.2.1             
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6       matrixStats_0.52.2 rprojroot_1.3-1   
##  [4] tools_3.4.1        backports_1.1.1    R6_2.2.2          
##  [7] KernSmooth_2.23-15 lazyeval_0.2.1     mgcv_1.8-22       
## [10] questionr_0.6.2    colorspace_1.3-2   ade4_1.7-8        
## [13] tidyselect_0.2.3   gridExtra_2.3      compiler_3.4.1    
## [16] labeling_0.3       bookdown_0.5       caTools_1.17.1    
## [19] stringr_1.3.0      digest_0.6.12      rmarkdown_1.8.5   
## [22] XVector_0.18.0     pkgconfig_2.0.1    htmltools_0.3.6   
## [25] highr_0.6          htmlwidgets_1.0    rlang_0.2.0       
## [28] rstudioapi_0.7     shiny_1.0.5        bindr_0.1         
## [31] jsonlite_1.5       gtools_3.5.0       magrittr_1.5      
## [34] biomformat_1.6.0   Rcpp_0.12.15       munsell_0.4.3     
## [37] S4Vectors_0.16.0   viridis_0.5.0      stringi_1.1.6     
## [40] yaml_2.1.16        MASS_7.3-47        zlibbioc_1.24.0   
## [43] rhdf5_2.22.0       gplots_3.0.1       grid_3.4.1        
## [46] gdata_2.18.0       ggrepel_0.7.0      miniUI_0.1.1      
## [49] Biostrings_2.46.0  splines_3.4.1      multtest_2.34.0   
## [52] pillar_1.1.0       igraph_1.1.2       ggpubr_0.1.6      
## [55] codetools_0.2-15   stats4_3.4.1       glue_1.2.0        
## [58] evaluate_0.10.1    rmdformats_0.3.3   httpuv_1.3.5      
## [61] gtable_0.2.0       purrr_0.2.4        tidyr_0.8.0       
## [64] assertthat_0.2.0   mime_0.5           xtable_1.8-2      
## [67] survival_2.41-3    viridisLite_0.3.0  tibble_1.4.2      
## [70] pheatmap_1.0.8     iterators_1.0.8    IRanges_2.12.0    
## [73] cluster_2.0.6

Microbial diversity analysis using R tools

Microbial diversity analysis using R tools

Introduction

An outline of the basic steps:

Loading the libraries/packages

Making a phyloseq object

Read input to phyloseq object

Read the tree file.

Merge into phyloseq object.

Filter phyloseq

Exploring your sequence data

Distribution of the reads over the samples

Read summary

Filtering Threshold, Minimum Total Counts

Variance

CV

Alpha diversity calculations

Phyogenetic diversity

Non-Phylogenetic metrics

Others

Test of significance

Classification in NG-Tax

Plot composition

Barplot composition

Beta diversity metrics

Unconstrained

Without rarefaction

With rarefaction

Bray-Curtis based dissimilarity using separate regions

Constrained

Additional examples

Utilities/handy commands

Aggregating your OTUs to a specifc taxonomic level

Dendrograms

Heatmaps

Core microbiota

Selecting taxa to plot

Use of other useful R packages

Metagenomeseq:

OTU prevalence

References

Index

Functions :: Package

read_phyloseq :: microbiome

read.tree :: ape

merge_phyloseq :: phyloseq

subset_taxa :: phyloseq

data.table :: data.table

ggplot :: ggplot2

taxa_sums :: phyloseq

qplot :: ggplot2

ggsave :: ggplot2

pd :: picante

plot_richness :: phyloseq

estimate_richness :: phyloseq

subset_samples :: phyloseq

kruskal.test :: stats

pairwise.wilcox.test :: stats

plot_composition :: microbiome

transform :: microbiome

ordinate :: phyloseq

plot_ordination :: phyloseq

rarefy_even_depth :: phyloseq

anova :: stats

tax_glom :: phyloseq

psmelt :: phyloseq

hclust :: stats

ggdendrogram :: ggdendro

phyloseq_to_metagenomeSeq :: phyloseq

pData :: metagenomeseq

colorRampPalette :: grDevices

plotMRheatmap :: metagenomeseq

plot_taxa_prevalence :: microbiome

plot_taxa_cv :: microbiomeutilities

plot_taxa_heatmap :: microbiomeutilities

`read_phyloseq` :: microbiome

`read.tree` :: ape

`merge_phyloseq` :: phyloseq

`subset_taxa` :: phyloseq

`data.table` :: data.table

`ggplot` :: ggplot2

`taxa_sums` :: phyloseq

`qplot` :: ggplot2

`ggsave` :: ggplot2

`pd` :: picante

`plot_richness` :: phyloseq

`estimate_richness` :: phyloseq

`subset_samples` :: phyloseq

`kruskal.test` :: stats

`pairwise.wilcox.test` :: stats

`plot_composition` :: microbiome

`transform` :: microbiome

`ordinate` :: phyloseq

`plot_ordination` :: phyloseq

`rarefy_even_depth` :: phyloseq

`anova` :: stats

`tax_glom` :: phyloseq

`psmelt` :: phyloseq

`hclust` :: stats

`ggdendrogram` :: ggdendro

`phyloseq_to_metagenomeSeq` :: phyloseq

`pData` :: metagenomeseq

`colorRampPalette` :: grDevices

`plotMRheatmap` :: metagenomeseq

`plot_taxa_prevalence` :: microbiome

`plot_taxa_cv` :: microbiomeutilities

`plot_taxa_heatmap` :: microbiomeutilities