4 Set-up and Pre-processing

This tutorial will introduce you to basic steps of microbial community analysis. More importantly on how to look at your data and filter appropriately. We will use the Human microbiome project phase I data.

4.1 OTU or ASVs or sOTUs

For past few years (maybe decade), identifying Operational taxonomic units (OTUs) from raw sequences used clustering approach. Using 97% identity cut-off was a standard approach and often closed reference OTU picking was accepted in the sicentific community. During the time of the development of tools and bioinformatics approaches this was possibly the best available method. However, as with many other fields of science, the knowledge has been updated. Evolution of bioinformatics approaches is a constant process. Based on current knowledge, the cons of 97% OTU picking stratergy (using clustering approaches) have out-weighed the pros (eg. less time).

Recent approaches are now focused towards Amplicon Seuence Variants/sOTUs:
* Oligotyping
* Deblur
* DADA2
* NG-Tax v2.0

All above approaches have one common theme, they avoid 97% clustering and focus on minor differences (in many cases single nucleotide variations) to identify unique ASVs/sOTU.

Note: Some how naming is different and variable. For this purpose and in this book, I will stick to ASVs when data from NG-tax is used.

In this, section, we will compare outputs from 97% OTU picking approach and NG-tax approach.
The data used here is the 16S rRNA gene variable region (V1-V3) for 97% OTU-pciking. The raw reads were processed using QIIME 1.9.1, SortMeRNA, and OTU picking was done using the closed-reference OTU-picking at 97% identity.

QIIME is now replaced by QIIME2 with several features has been developed. It includes support for diverse of sOTU/ASVs picking stratergies. However, here we only use the 97% OTU picking data generated by us previously using QIIME (v1)

More about QIIME2 click here and the article describing it can be found here Bolyen E, Rideout JR, Dillon MR, Bokulich NA… et al. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852-857.

For NG-Tax, we use the same raw data and processed through default settings.

Here, we do not aim to bench mark. For this course, we aim to show differences between results from two approaches.

For down stream analysis of *.biom files we use Phyloseq and microbiome.
Kindly cite all the packages and tools that were used in your analysis as listed at the end of each document in sessionInfo. Also make sure that you provide the workflow and scripts you used for analysis atleast as supplementary material with your research article.
Check Quick-R.

4.2 General overview

microbiome R package

4.4 Making a phyloseq object

This is the basis for the analyses demonstrated in this tutorial. In the phyloseq object, information on OTU abundances, taxonomy of OTUs, the phylogenetic tree and metadata is stored. A single object with all this information provides a convinient way of handling, manipulating and visualizing data.
For more infromation: phyloseq

Please remember that the metadata (i.e. mapping) file has to be in .csv format (columns have sample attributes). The read_phylseq function from microbiome package requires metadata in .csv format.

Important Note 2: If you have error in loading the biom files stating JSON or HDF5 then you need to convert it in to a JSON format.

For this, use the following command within the QIIME terminal and not in R!

For more information on the biom format please click here.

Important Note 3: The most recent version of NG-Tax does not have this issue.

NOTE
The read_phyloseq function can be used for reading other outputs (like .shared and consensus taxonomy files from mothur) into phyloseq object. type ?read_phyloseq in the console pane for more information.
If you don’t have your own biom file, we have a test dataset stored in input_data. Unzip the humanmicrobiome.zip and you will have the original biom file, copy it in the input_data folder.

4.5 Read input to phyloseq object

NG-Tax output

## Time to complete depends on OTU file size

NOTE Creating the mapping file in Excel can give issues with *.csv* format. European format will use ; as seperator and American format. One way to change this in MS-Excel is go to Options->Advanced then uncheck Use system operators and change decimal to period . and Thousands to comma '

Notice above, we use relative path and not D:/myproject/input/mybiom.biom. This is important! With an RStudio project, the project folder is considered the root folder and any folders within this folder will be the branches to access data. Hence, sharing the Rproject folder with your article, users can conviniently re-run your analysis without having to play around with location of files.

4.6 Read the tree file.

Note: Requires a package called ape and the extension has to be “.tre” and not “.tree” (you can just change the name of the file extension)

## 
## Attaching package: 'ape'
## The following object is masked from 'package:ggpubr':
## 
##     rotate

4.7 Merge into phyloseq object.

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4710 taxa and 474 samples ]
## sample_data() Sample Data:       [ 474 samples by 30 sample variables ]
## tax_table()   Taxonomy Table:    [ 4710 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4710 tips and 4709 internal nodes ]
## [1] "Domain" "Phylum" "Class"  "Order"  "Family" "Genus"

Before that we will clean the taxonomy table.

4.8 Read data from OTU-picking stratergy

The data for tutorial is stored as *.rds file in the R project input_data folder within the main Microbial-bioinformatics-introductory-course-Material-2018-master folder.

The data is from the Human microbiome project phase I data.

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4125 taxa and 474 samples ]
## sample_data() Sample Data:       [ 474 samples by 31 sample variables ]
## tax_table()   Taxonomy Table:    [ 4125 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4125 tips and 4124 internal nodes ]

As can be seen, there is a difference in the number of OTU identifed by both approaches. NG-Tax has more OTUs identifed even aftering filtering the raw reads for rare sequences.

## Compositional = NO2
## 1] Min. number of reads = 24582] Max. number of reads = 1150233] Total number of reads = 46623174] Average number of reads = 9836.111814345995] Median number of reads = 8188.57] Sparsity = 0.9866421206338976] Any OTU sum to 1 or less? NO8] Number of singletons = 09] Percent of OTUs that are singletons 
##         (i.e. exactly one read detected across all samples)010] Number of sample variables are: 30BarcodeSequenceLinkerPrimerSequencerun_prefixbody_habitatbody_productbody_sitebodysitedna_extractedelevationenvenv_biomeenv_featureenv_materialenv_packagegeo_loc_namehost_common_namehost_scientific_namehost_subject_idhost_taxidlatitudelongitudephysical_specimen_locationphysical_specimen_remainingpsnpublicsample_typescientific_namesequencecentertitleDescription2
## [[1]]
## [1] "1] Min. number of reads = 2458"
## 
## [[2]]
## [1] "2] Max. number of reads = 115023"
## 
## [[3]]
## [1] "3] Total number of reads = 4662317"
## 
## [[4]]
## [1] "4] Average number of reads = 9836.11181434599"
## 
## [[5]]
## [1] "5] Median number of reads = 8188.5"
## 
## [[6]]
## [1] "7] Sparsity = 0.986642120633897"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 30"
## 
## [[11]]
##  [1] "BarcodeSequence"             "LinkerPrimerSequence"       
##  [3] "run_prefix"                  "body_habitat"               
##  [5] "body_product"                "body_site"                  
##  [7] "bodysite"                    "dna_extracted"              
##  [9] "elevation"                   "env"                        
## [11] "env_biome"                   "env_feature"                
## [13] "env_material"                "env_package"                
## [15] "geo_loc_name"                "host_common_name"           
## [17] "host_scientific_name"        "host_subject_id"            
## [19] "host_taxid"                  "latitude"                   
## [21] "longitude"                   "physical_specimen_location" 
## [23] "physical_specimen_remaining" "psn"                        
## [25] "public"                      "sample_type"                
## [27] "scientific_name"             "sequencecenter"             
## [29] "title"                       "Description"
## Compositional = NO2
## 1] Min. number of reads = 20112] Max. number of reads = 441063] Total number of reads = 24572544] Average number of reads = 5184.080168776375] Median number of reads = 43287] Sparsity = 0.9641212121212126] Any OTU sum to 1 or less? YES8] Number of singletons = 6879] Percent of OTUs that are singletons 
##         (i.e. exactly one read detected across all samples)16.654545454545510] Number of sample variables are: 31X.SampleIDBarcodeSequenceLinkerPrimerSequencerun_prefixbody_habitatbody_productbody_sitebodysitedna_extractedelevationenvenv_biomeenv_featureenv_materialenv_packagegeo_loc_namehost_common_namehost_scientific_namehost_subject_idhost_taxidlatitudelongitudephysical_specimen_locationphysical_specimen_remainingpsnpublicsample_typescientific_namesequencecentertitleDescription2
## [[1]]
## [1] "1] Min. number of reads = 2011"
## 
## [[2]]
## [1] "2] Max. number of reads = 44106"
## 
## [[3]]
## [1] "3] Total number of reads = 2457254"
## 
## [[4]]
## [1] "4] Average number of reads = 5184.08016877637"
## 
## [[5]]
## [1] "5] Median number of reads = 4328"
## 
## [[6]]
## [1] "7] Sparsity = 0.964121212121212"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? YES"
## 
## [[8]]
## [1] "8] Number of singletons = 687"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)16.6545454545455"
## 
## [[10]]
## [1] "10] Number of sample variables are: 31"
## 
## [[11]]
##  [1] "X.SampleID"                  "BarcodeSequence"            
##  [3] "LinkerPrimerSequence"        "run_prefix"                 
##  [5] "body_habitat"                "body_product"               
##  [7] "body_site"                   "bodysite"                   
##  [9] "dna_extracted"               "elevation"                  
## [11] "env"                         "env_biome"                  
## [13] "env_feature"                 "env_material"               
## [15] "env_package"                 "geo_loc_name"               
## [17] "host_common_name"            "host_scientific_name"       
## [19] "host_subject_id"             "host_taxid"                 
## [21] "latitude"                    "longitude"                  
## [23] "physical_specimen_location"  "physical_specimen_remaining"
## [25] "psn"                         "public"                     
## [27] "sample_type"                 "scientific_name"            
## [29] "sequencecenter"              "title"                      
## [31] "Description"

Notice the Sparsity, it is high for both approaches, the data has many zeros. A common property of amplicon based microbiota data generated by sequencing.

4.9 Variablity

We will check the spread of variation for taxa identifed using both approaches. Coefficient of variation (C.V), i.e. sd(x)/mean(x) is a widely used approach to measure heterogeneity in OTU/ASV abundance data. The plot below shows CV-mean(relaive mean abundance) relationship in the scatter plot, where variation is calculated for each OTU/ASV across samples versus mean relative abundance.
Now plot C.V.

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

From the above two plots, we see that there are several OTUs which have high C.V. and low mean. The OTU-picking approach seems to be noisy and NG-Tax approach seems to have less noise. Thus, the ASVs identified by NG-Tax are of high confidence as they seem to be reprsented by high number of sequences in the raw data.
Note: The data we use here is from different body sites and have large difference in composition and community structure. The observed high variance shown here can also be explained in part by the difference between sample groups (stool, sebus, oral) biology of samples is also important.

Let us check for distribution of number of sequences retained after OTU picking and NG-tax approach.

## [1] "Done plotting"
## [1] "Done plotting"

We see a better distribution of reads in NG-Tax data.

Moving on to distribution of taxa

## Warning in setDT(value): Some columns are a multi-column type (such as a
## matrix column): [1, 2, 3, 4, 5, 6]. setDT will retain these columns as-is
## but subsequent operations like grouping and joining may fail. Please consider
## as.data.table() instead which will create a new column for each embedded column.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in setDT(value): Some columns are a multi-column type (such as a
## matrix column): [1, 2, 3, 4, 5, 6, 7]. setDT will retain these columns as-is
## but subsequent operations like grouping and joining may fail. Please consider
## as.data.table() instead which will create a new column for each embedded column.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The ASVs distribution is close to normal distribution compared to OTU distribution for 97% otu-picking approach. This has implications in downstream statistical analysis like differential abundance testing (covered on Day3).

We can also check for prevalance abundance distribtuion of ASVs and OTUs.

The ASVs identified by NG-tax at low prevalance are represented by high abundance in the sequencing data. There is a higher chance that these are real observations.

Checking for potentially spurious OTUs/ASVs. Usually, we do not expect mitochondria and chloroplast as part of the human microbiome.

4.10 OTU-picking stratergy

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3690 taxa and 474 samples ]
## sample_data() Sample Data:       [ 474 samples by 31 sample variables ]
## tax_table()   Taxonomy Table:    [ 3690 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 3690 tips and 3689 internal nodes ]
## [1] 435

4.11 NG-tax stratergy

## [1] 5

There is large different in the mitochondrial sequences detected by OTU-picking approach and NG-Tax approach. Usually, sequence of mitochondrial and chloroplast origin are present is low abundance.

Check how many total reads are there in the NG-tax data set.

## [1] 4662048

There are 4662048 reads in the total data set.
How many ASVs are less than 10 reads and how many reads do they contain?

## [1] 654
## [1] 4405

To put this into context; out of the 4705 OTUs, a 654 OTUs contain less than 10 reads, which is:

## [1] 13.90011

OTU picking approach.

## [1] 2398446

There are 2398446 reads in the total data set.
How many ASVs are less than 10 reads and how many reads do they contain?

## [1] 1678
## [1] 5198

To put this into context; out of the 3690 OTUs, 654 OTUs contain less than 10 reads, which is:

## [1] 45.47425

This is a major drawback of the OTU picking strategy. This percent can be lowered with NG_tax, DADA2, Deblur like approaches.

Let us see how many singletons are there?

## [1] 613
## [1] 0

Check how many doubletons are there?

## [1] 279
## [1] 0

Check how many Singletons and doubletons are there?

## [1] 892

Singletons and doubletons

## [1] 24.17

24.17% of the OTUs are doubletons or singletons. This is suggests that there can be potentially spurious OTUs.

It is important to understand that depending on origin of sample, the parameters in NG-tax have to be changed. If you expect your samples to have large diversity of which rare taxa are a major fraction, using less stringent parameters is important. Every data has its own properties.

It is commonly observed that a large fraction of taxa are rare. A nice reading for this topic is the review by Michael D. J. Lynch & Josh D. Neufeld Ecology and exploration of the rare biosphere.

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Netherlands.1252  LC_CTYPE=English_Netherlands.1252   
## [3] LC_MONETARY=English_Netherlands.1252 LC_NUMERIC=C                        
## [5] LC_TIME=English_Netherlands.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ape_5.3                     dplyr_0.8.5                
##  [3] data.table_1.12.8           DT_0.12                    
##  [5] ggpubr_0.2.5                magrittr_1.5               
##  [7] RColorBrewer_1.1-2          microbiomeutilities_0.99.02
##  [9] microbiome_1.8.0            ggplot2_3.3.0              
## [11] phyloseq_1.30.0            
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.2       Rcpp_1.0.4          lattice_0.20-40    
##  [4] tidyr_1.0.2         Biostrings_2.54.0   assertthat_0.2.1   
##  [7] digest_0.6.25       foreach_1.4.8       R6_2.4.1           
## [10] plyr_1.8.6          stats4_3.6.3        evaluate_0.14      
## [13] pillar_1.4.3        zlibbioc_1.32.0     rlang_0.4.5        
## [16] vegan_2.5-6         S4Vectors_0.24.3    Matrix_1.2-18      
## [19] rmarkdown_2.1       labeling_0.3        splines_3.6.3      
## [22] Rtsne_0.15          stringr_1.4.0       htmlwidgets_1.5.1  
## [25] igraph_1.2.4.2      pheatmap_1.0.12     munsell_0.5.0      
## [28] compiler_3.6.3      xfun_0.12           pkgconfig_2.0.3    
## [31] BiocGenerics_0.32.0 multtest_2.42.0     mgcv_1.8-31        
## [34] htmltools_0.4.0     biomformat_1.14.0   tidyselect_1.0.0   
## [37] gridExtra_2.3       tibble_2.1.3        bookdown_0.18      
## [40] IRanges_2.20.2      codetools_0.2-16    viridisLite_0.3.0  
## [43] permute_0.9-5       crayon_1.3.4        withr_2.1.2        
## [46] MASS_7.3-51.5       grid_3.6.3          nlme_3.1-145       
## [49] jsonlite_1.6.1      gtable_0.3.0        lifecycle_0.2.0    
## [52] scales_1.1.0        stringi_1.4.6       farver_2.0.3       
## [55] XVector_0.26.0      ggsignif_0.6.0      reshape2_1.4.3     
## [58] viridis_0.5.1       vctrs_0.2.4         Rhdf5lib_1.8.0     
## [61] iterators_1.0.12    tools_3.6.3         ade4_1.7-15        
## [64] Biobase_2.46.0      glue_1.3.2          purrr_0.3.3        
## [67] crosstalk_1.1.0.1   parallel_3.6.3      survival_3.1-11    
## [70] yaml_2.2.1          colorspace_1.4-1    rhdf5_2.30.1       
## [73] cluster_2.1.0       knitr_1.28