4 Set-up and Pre-processing

This tutorial will introduce you to basic steps of microbial community analysis. More importantly on how to look at your data and filter appropriately. We will use the Human microbiome project phase I data.

4.1 OTU or ASVs or sOTUs

For past few years (maybe decade), identifying Operational taxonomic units (OTUs) from raw sequences used clustering approach. Using 97% identity cut-off was a standard approach and often closed reference OTU picking was accepted in the sicentific community. During the time of the development of tools and bioinformatics approaches this was possibly the best available method. However, as with many other fields of science, the knowledge has been updated. Evolution of bioinformatics approaches is a constant process. Based on current knowledge, the cons of 97% OTU picking stratergy (using clustering approaches) have out-weighed the pros (eg. less time).

Recent approaches are now focused towards Amplicon Seuence Variants/sOTUs:
* Oligotyping
* Deblur
* DADA2
* NG-Tax v2.0

All above approaches have one common theme, they avoid 97% clustering and focus on minor differences (in many cases single nucleotide variations) to identify unique ASVs/sOTU.

Note: Some how naming is different and variable. For this purpose and in this book, I will stick to ASVs when data from NG-tax is used.

In this, section, we will compare outputs from 97% OTU picking approach and NG-tax approach.
The data used here is the 16S rRNA gene variable region (V1-V3) for 97% OTU-pciking. The raw reads were processed using QIIME 1.9.1, SortMeRNA, and OTU picking was done using the closed-reference OTU-picking at 97% identity.

QIIME is now replaced by QIIME2 with several features has been developed. It includes support for diverse of sOTU/ASVs picking stratergies. However, here we only use the 97% OTU picking data generated by us previously using QIIME (v1)

More about QIIME2 click here and the article describing it can be found here Bolyen E, Rideout JR, Dillon MR, Bokulich NA… et al. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852-857.

For NG-Tax, we use the same raw data and processed through default settings.

Here, we do not aim to bench mark. For this course, we aim to show differences between results from two approaches.

For down stream analysis of *.biom files we use Phyloseq and microbiome.
Kindly cite all the packages and tools that were used in your analysis as listed at the end of each document in sessionInfo. Also make sure that you provide the workflow and scripts you used for analysis atleast as supplementary material with your research article.
Check Quick-R.

4.2 General overview

4.3 Structure

Let us create few folders to organize the analysis. While this can be personal preference, make sure you write the structure to guide others who do not know your data well.

# Create Folders as following
# Tables
dir.create("tables")

# Figures
dir.create("figures")

# Phyloseq objects
dir.create("phyobjects")

# Custom codes/notes
dir.create("codes_notes")

Load packages

library(microbiome) # data analysis and visualisation
library(phyloseq) # also the basis of data object. Data analysis and visualisation
library(microbiomeutilities) # some utility tools
library(RColorBrewer) # nice color options
library(ggpubr) # publication quality figures, based on ggplot2
library(DT) # interactive tables in html and markdown
library(data.table) # alternative to data.frame
library(dplyr) # data handling

4.4 Making a phyloseq object

This is the basis for the analyses demonstrated in this tutorial. In the phyloseq object, information on OTU abundances, taxonomy of OTUs, the phylogenetic tree and metadata is stored. A single object with all this information provides a convinient way of handling, manipulating and visualizing data.
For more infromation: phyloseq

Please remember that the metadata (i.e. mapping) file has to be in .csv format (columns have sample attributes). The read_phylseq function from microbiome package requires metadata in .csv format.

Important Note 2: If you have error in loading the biom files stating JSON or HDF5 then you need to convert it in to a JSON format.

For this, use the following command within the QIIME terminal and not in R!

# biom convert -i NGTaxMerged.biom -o ngtax_json.biom --table-type "OTU table" --to-json

For more information on the biom format please click here.

Important Note 3: The most recent version of NG-Tax does not have this issue.

NOTE
The read_phyloseq function can be used for reading other outputs (like .shared and consensus taxonomy files from mothur) into phyloseq object. type ?read_phyloseq in the console pane for more information.
If you don’t have your own biom file, we have a test dataset stored in input_data. Unzip the humanmicrobiome.zip and you will have the original biom file, copy it in the input_data folder.

4.5 Read input to phyloseq object

NG-Tax output

ps.ng.tax <- read_phyloseq(otu.file = "./input_data/humanmicrobiome.biom", 
                    taxonomy.file = NULL, 
                    metadata.file = "./input_data/metadata_table.csv", 
                    type = "biom")

## Time to complete depends on OTU file size

NOTE Creating the mapping file in Excel can give issues with *.csv* format. European format will use ; as seperator and American format. One way to change this in MS-Excel is go to Options->Advanced then uncheck Use system operators and change decimal to period . and Thousands to comma '

Notice above, we use relative path and not D:/myproject/input/mybiom.biom. This is important! With an RStudio project, the project folder is considered the root folder and any folders within this folder will be the branches to access data. Hence, sharing the Rproject folder with your article, users can conviniently re-run your analysis without having to play around with location of files.

4.6 Read the tree file.

Note: Requires a package called ape and the extension has to be “.tre” and not “.tree” (you can just change the name of the file extension)

# Load tree file
library(ape)

## 
## Attaching package: 'ape'

## The following object is masked from 'package:ggpubr':
## 
##     rotate

treefile_p1 <- read.tree("./input_data/humanmicrobiome.tree")

4.7 Merge into phyloseq object.

ps.ng.tax <- merge_phyloseq(ps.ng.tax, treefile_p1)
# ps.ng.tax is the first phyloseq object.

print(ps.ng.tax)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4710 taxa and 474 samples ]
## sample_data() Sample Data:       [ 474 samples by 30 sample variables ]
## tax_table()   Taxonomy Table:    [ 4710 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4710 tips and 4709 internal nodes ]

rank_names(ps.ng.tax) # we check the taxonomic rank information

## [1] "Domain" "Phylum" "Class"  "Order"  "Family" "Genus"

datatable(tax_table(ps.ng.tax)) # the table is interactive you can scrol and search thorugh it for details.

Before that we will clean the taxonomy table.

tax_table(ps.ng.tax)[, colnames(tax_table(ps.ng.tax))] <- gsub(tax_table(ps.ng.tax)[, colnames(tax_table(ps.ng.tax))], pattern = "[a-z]__", replacement = "")

tax_table(ps.ng.tax)[tax_table(ps.ng.tax)[, "Phylum"] == "", "Phylum"] <- "Unidentified"

# save the pseq object

saveRDS(ps.ng.tax, "./phyobjects/ps.ng.tax.rds")

4.8 Read data from OTU-picking stratergy

The data for tutorial is stored as *.rds file in the R project input_data folder within the main Microbial-bioinformatics-introductory-course-Material-2018-master folder.

The data is from the Human microbiome project phase I data.

ps.otu <- readRDS("./input_data/ps.sub.rds")
# microbiome::write_phyloseq(ps.otu, "METADATA")
# use print option to see the data saved as phyloseq object.


ps.ng.tax <- readRDS("./phyobjects/ps.ng.tax.rds")

print(ps.otu)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4125 taxa and 474 samples ]
## sample_data() Sample Data:       [ 474 samples by 31 sample variables ]
## tax_table()   Taxonomy Table:    [ 4125 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4125 tips and 4124 internal nodes ]

As can be seen, there is a difference in the number of OTU identifed by both approaches. NG-Tax has more OTUs identifed even aftering filtering the raw reads for rare sequences.

# check for features of data  
summarize_phyloseq(ps.ng.tax)

## Compositional = NO2

## 1] Min. number of reads = 24582] Max. number of reads = 1150233] Total number of reads = 46623174] Average number of reads = 9836.111814345995] Median number of reads = 8188.57] Sparsity = 0.9866421206338976] Any OTU sum to 1 or less? NO8] Number of singletons = 09] Percent of OTUs that are singletons 
##         (i.e. exactly one read detected across all samples)010] Number of sample variables are: 30BarcodeSequenceLinkerPrimerSequencerun_prefixbody_habitatbody_productbody_sitebodysitedna_extractedelevationenvenv_biomeenv_featureenv_materialenv_packagegeo_loc_namehost_common_namehost_scientific_namehost_subject_idhost_taxidlatitudelongitudephysical_specimen_locationphysical_specimen_remainingpsnpublicsample_typescientific_namesequencecentertitleDescription2

## [[1]]
## [1] "1] Min. number of reads = 2458"
## 
## [[2]]
## [1] "2] Max. number of reads = 115023"
## 
## [[3]]
## [1] "3] Total number of reads = 4662317"
## 
## [[4]]
## [1] "4] Average number of reads = 9836.11181434599"
## 
## [[5]]
## [1] "5] Median number of reads = 8188.5"
## 
## [[6]]
## [1] "7] Sparsity = 0.986642120633897"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 30"
## 
## [[11]]
##  [1] "BarcodeSequence"             "LinkerPrimerSequence"       
##  [3] "run_prefix"                  "body_habitat"               
##  [5] "body_product"                "body_site"                  
##  [7] "bodysite"                    "dna_extracted"              
##  [9] "elevation"                   "env"                        
## [11] "env_biome"                   "env_feature"                
## [13] "env_material"                "env_package"                
## [15] "geo_loc_name"                "host_common_name"           
## [17] "host_scientific_name"        "host_subject_id"            
## [19] "host_taxid"                  "latitude"                   
## [21] "longitude"                   "physical_specimen_location" 
## [23] "physical_specimen_remaining" "psn"                        
## [25] "public"                      "sample_type"                
## [27] "scientific_name"             "sequencecenter"             
## [29] "title"                       "Description"

summarize_phyloseq(ps.otu)

## Compositional = NO2

## 1] Min. number of reads = 20112] Max. number of reads = 441063] Total number of reads = 24572544] Average number of reads = 5184.080168776375] Median number of reads = 43287] Sparsity = 0.9641212121212126] Any OTU sum to 1 or less? YES8] Number of singletons = 6879] Percent of OTUs that are singletons 
##         (i.e. exactly one read detected across all samples)16.654545454545510] Number of sample variables are: 31X.SampleIDBarcodeSequenceLinkerPrimerSequencerun_prefixbody_habitatbody_productbody_sitebodysitedna_extractedelevationenvenv_biomeenv_featureenv_materialenv_packagegeo_loc_namehost_common_namehost_scientific_namehost_subject_idhost_taxidlatitudelongitudephysical_specimen_locationphysical_specimen_remainingpsnpublicsample_typescientific_namesequencecentertitleDescription2

## [[1]]
## [1] "1] Min. number of reads = 2011"
## 
## [[2]]
## [1] "2] Max. number of reads = 44106"
## 
## [[3]]
## [1] "3] Total number of reads = 2457254"
## 
## [[4]]
## [1] "4] Average number of reads = 5184.08016877637"
## 
## [[5]]
## [1] "5] Median number of reads = 4328"
## 
## [[6]]
## [1] "7] Sparsity = 0.964121212121212"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? YES"
## 
## [[8]]
## [1] "8] Number of singletons = 687"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)16.6545454545455"
## 
## [[10]]
## [1] "10] Number of sample variables are: 31"
## 
## [[11]]
##  [1] "X.SampleID"                  "BarcodeSequence"            
##  [3] "LinkerPrimerSequence"        "run_prefix"                 
##  [5] "body_habitat"                "body_product"               
##  [7] "body_site"                   "bodysite"                   
##  [9] "dna_extracted"               "elevation"                  
## [11] "env"                         "env_biome"                  
## [13] "env_feature"                 "env_material"               
## [15] "env_package"                 "geo_loc_name"               
## [17] "host_common_name"            "host_scientific_name"       
## [19] "host_subject_id"             "host_taxid"                 
## [21] "latitude"                    "longitude"                  
## [23] "physical_specimen_location"  "physical_specimen_remaining"
## [25] "psn"                         "public"                     
## [27] "sample_type"                 "scientific_name"            
## [29] "sequencecenter"              "title"                      
## [31] "Description"

Notice the Sparsity, it is high for both approaches, the data has many zeros. A common property of amplicon based microbiota data generated by sequencing.

4.9 Variablity

We will check the spread of variation for taxa identifed using both approaches. Coefficient of variation (C.V), i.e. sd(x)/mean(x) is a widely used approach to measure heterogeneity in OTU/ASV abundance data. The plot below shows CV-mean(relaive mean abundance) relationship in the scatter plot, where variation is calculated for each OTU/ASV across samples versus mean relative abundance.
Now plot C.V.

# the plot_taxa_cv will first convert the counts to relative abundances and then calculate the C.V.

p1 <- plot_taxa_cv(ps.ng.tax, plot.type = "scatter")

p1 + scale_x_log10()

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

p2 <- plot_taxa_cv(ps.otu, plot.type = "scatter")

p2 + scale_x_log10()

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

From the above two plots, we see that there are several OTUs which have high C.V. and low mean. The OTU-picking approach seems to be noisy and NG-Tax approach seems to have less noise. Thus, the ASVs identified by NG-Tax are of high confidence as they seem to be reprsented by high number of sequences in the raw data.
Note: The data we use here is from different body sites and have large difference in composition and community structure. The observed high variance shown here can also be explained in part by the difference between sample groups (stool, sebus, oral) biology of samples is also important.

Let us check for distribution of number of sequences retained after OTU picking and NG-tax approach.

p_seqdepth.otu <- plot_read_distribution(ps.otu, "scientific_name", "density")

## [1] "Done plotting"

p_seqdepth.ngtax <- plot_read_distribution(ps.ng.tax, "scientific_name", "density")

## [1] "Done plotting"

print(p_seqdepth.otu)

print(p_seqdepth.ngtax)

We see a better distribution of reads in NG-Tax data.

Moving on to distribution of taxa

# We make a data table with information on the OTUs
ps0_df_taxa <- data.table(tax_table(ps.ng.tax),
                          ASVabundance = taxa_sums(ps.ng.tax),
                          ASV = taxa_names(ps.ng.tax))

## Warning in setDT(value): Some columns are a multi-column type (such as a
## matrix column): [1, 2, 3, 4, 5, 6]. setDT will retain these columns as-is
## but subsequent operations like grouping and joining may fail. Please consider
## as.data.table() instead which will create a new column for each embedded column.

ps0_tax_plot <- ggplot(ps0_df_taxa, aes(ASVabundance)) +
  geom_histogram() + ggtitle("Histogram of ASVs (unique sequence) counts (NG-Tax)") +
  theme_bw() + scale_x_log10() + ylab("Frequency of ASVs") + xlab("Abundance (raw counts)")

print(ps0_tax_plot)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# We make a data table with information on the OTUs
ps1_df_taxa <- data.table(tax_table(ps.otu), 
                        OTUabundance = taxa_sums(ps.otu), 
                        OTU = taxa_names(ps.otu))

## Warning in setDT(value): Some columns are a multi-column type (such as a
## matrix column): [1, 2, 3, 4, 5, 6, 7]. setDT will retain these columns as-is
## but subsequent operations like grouping and joining may fail. Please consider
## as.data.table() instead which will create a new column for each embedded column.

ps1_tax_plot <- ggplot(ps1_df_taxa, aes(OTUabundance)) + 
  geom_histogram() + ggtitle("Histogram of OTU counts (OTU-picking 97%)") + 
  theme_bw() + scale_x_log10() + ylab("Frequency of OTUs") + xlab("Abundance (raw counts)")

print(ps1_tax_plot)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The ASVs distribution is close to normal distribution compared to OTU distribution for 97% otu-picking approach. This has implications in downstream statistical analysis like differential abundance testing (covered on Day3).

We can also check for prevalance abundance distribtuion of ASVs and OTUs.

p <- plot_taxa_prevalence(ps.ng.tax, "Phylum")
p

p2 <- plot_taxa_prevalence(ps.otu, "Phylum")
p2

The ASVs identified by NG-tax at low prevalance are represented by high abundance in the sequencing data. There is a higher chance that these are real observations.

Checking for potentially spurious OTUs/ASVs. Usually, we do not expect mitochondria and chloroplast as part of the human microbiome.

4.10 OTU-picking stratergy

ps0.1.otu <- subset_taxa(ps.otu, Family != "mitochondria")

print(ps0.1.otu)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3690 taxa and 474 samples ]
## sample_data() Sample Data:       [ 474 samples by 31 sample variables ]
## tax_table()   Taxonomy Table:    [ 3690 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 3690 tips and 3689 internal nodes ]

# also check how many are lost 

ntaxa(ps.otu)-ntaxa(ps0.1.otu)

## [1] 435

4.11 NG-tax stratergy

ps1.1.ngtax <- subset_taxa(ps.ng.tax, Family != "Mitochondria")

# also check how many lost 

ntaxa(ps.ng.tax)-ntaxa(ps1.1.ngtax)

## [1] 5

There is large different in the mitochondrial sequences detected by OTU-picking approach and NG-Tax approach. Usually, sequence of mitochondrial and chloroplast origin are present is low abundance.

Check how many total reads are there in the NG-tax data set.

#total number of reads in the dataset
reads_per_asv <- taxa_sums(ps1.1.ngtax)
print(sum(reads_per_asv))

## [1] 4662048

There are 4662048 reads in the total data set.
How many ASVs are less than 10 reads and how many reads do they contain?

print(length(reads_per_asv[reads_per_asv < 10]))

## [1] 654

print(sum(reads_per_asv[reads_per_asv < 10]))

## [1] 4405

To put this into context; out of the 4705 OTUs, a 654 OTUs contain less than 10 reads, which is:

print((654/4705)*100)

## [1] 13.90011

OTU picking approach.

#total number of reads in the dataset
reads_per_otu <- taxa_sums(ps0.1.otu)

print(sum(reads_per_otu))

## [1] 2398446

There are 2398446 reads in the total data set.
How many ASVs are less than 10 reads and how many reads do they contain?

print(length(reads_per_otu[reads_per_otu < 10]))

## [1] 1678

print(sum(reads_per_otu[reads_per_otu < 10]))

## [1] 5198

To put this into context; out of the 3690 OTUs, 654 OTUs contain less than 10 reads, which is:

print((1678/3690)*100)

## [1] 45.47425

This is a major drawback of the OTU picking strategy. This percent can be lowered with NG_tax, DADA2, Deblur like approaches.

Let us see how many singletons are there?

length(which(taxa_sums(ps0.1.otu) <= 1))

## [1] 613

length(which(taxa_sums(ps1.1.ngtax) <= 1))

## [1] 0

Check how many doubletons are there?

length(which(taxa_sums(ps0.1.otu) == 2))

## [1] 279

length(which(taxa_sums(ps1.1.ngtax) == 2))

## [1] 0

Check how many Singletons and doubletons are there?

length(which(taxa_sums(ps0.1.otu) <= 2))

## [1] 892

Singletons and doubletons

round((892/3690)*100, digits = 2)

## [1] 24.17

24.17% of the OTUs are doubletons or singletons. This is suggests that there can be potentially spurious OTUs.

It is important to understand that depending on origin of sample, the parameters in NG-tax have to be changed. If you expect your samples to have large diversity of which rare taxa are a major fraction, using less stringent parameters is important. Every data has its own properties.

It is commonly observed that a large fraction of taxa are rare. A nice reading for this topic is the review by Michael D. J. Lynch & Josh D. Neufeld Ecology and exploration of the rare biosphere.

sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Netherlands.1252  LC_CTYPE=English_Netherlands.1252   
## [3] LC_MONETARY=English_Netherlands.1252 LC_NUMERIC=C                        
## [5] LC_TIME=English_Netherlands.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ape_5.3                     dplyr_0.8.5                
##  [3] data.table_1.12.8           DT_0.12                    
##  [5] ggpubr_0.2.5                magrittr_1.5               
##  [7] RColorBrewer_1.1-2          microbiomeutilities_0.99.02
##  [9] microbiome_1.8.0            ggplot2_3.3.0              
## [11] phyloseq_1.30.0            
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.2       Rcpp_1.0.4          lattice_0.20-40    
##  [4] tidyr_1.0.2         Biostrings_2.54.0   assertthat_0.2.1   
##  [7] digest_0.6.25       foreach_1.4.8       R6_2.4.1           
## [10] plyr_1.8.6          stats4_3.6.3        evaluate_0.14      
## [13] pillar_1.4.3        zlibbioc_1.32.0     rlang_0.4.5        
## [16] vegan_2.5-6         S4Vectors_0.24.3    Matrix_1.2-18      
## [19] rmarkdown_2.1       labeling_0.3        splines_3.6.3      
## [22] Rtsne_0.15          stringr_1.4.0       htmlwidgets_1.5.1  
## [25] igraph_1.2.4.2      pheatmap_1.0.12     munsell_0.5.0      
## [28] compiler_3.6.3      xfun_0.12           pkgconfig_2.0.3    
## [31] BiocGenerics_0.32.0 multtest_2.42.0     mgcv_1.8-31        
## [34] htmltools_0.4.0     biomformat_1.14.0   tidyselect_1.0.0   
## [37] gridExtra_2.3       tibble_2.1.3        bookdown_0.18      
## [40] IRanges_2.20.2      codetools_0.2-16    viridisLite_0.3.0  
## [43] permute_0.9-5       crayon_1.3.4        withr_2.1.2        
## [46] MASS_7.3-51.5       grid_3.6.3          nlme_3.1-145       
## [49] jsonlite_1.6.1      gtable_0.3.0        lifecycle_0.2.0    
## [52] scales_1.1.0        stringi_1.4.6       farver_2.0.3       
## [55] XVector_0.26.0      ggsignif_0.6.0      reshape2_1.4.3     
## [58] viridis_0.5.1       vctrs_0.2.4         Rhdf5lib_1.8.0     
## [61] iterators_1.0.12    tools_3.6.3         ade4_1.7-15        
## [64] Biobase_2.46.0      glue_1.3.2          purrr_0.3.3        
## [67] crosstalk_1.1.0.1   parallel_3.6.3      survival_3.1-11    
## [70] yaml_2.2.1          colorspace_1.4-1    rhdf5_2.30.1       
## [73] cluster_2.1.0       knitr_1.28