Bacterial and Archaeal 16S rRNA gene Amplicon Analysis Workflow
This section outlines the workflow required to analyse 16S rRNA amplicon sequences for Bacteria (27f – 519r) and Archaea (A2f-519r), to produce Amplicon Sequence Variant (ASV) information for the Australian Microbiome database
Analysis is completed on a per sequencing run (sequencing plate) basis. The workflow consists of the following stages:
A] Sequence preparation and merging
1. Merge paired end reads (non-merged reads are discarded)
2. Convert fastq file format to fasta file format
3. Add sampleID, runID and "sample=" information to the sequence headers
4. Concatenate all sequences per sequencing run into a single file
B] Sequence analysis
1. Quality screening, zotu/ASV calling and sequence mapping
2. Classify and remove flipped sequences
3. Replace arbitrary ASV ID's with the sequence itself in the table index
C] Following analysis, all data is combined to give a single dataset, by the following steps:
1. Merge tables into a single table
2. Remove controls from the abundance tables, to create separate samples and control datasets
3. Make a fasta from the abundance table
4. Classify Sequences
Software Required
The following software is used in the steps below:
1. FLASH2 (Magoc and Salzberg, 2011)
2. Mothur (Schloss, et al., 2009)
3. USEARCH (Edgar 2010)
4. Seqtk (https://github.com/lh3/seqtk)
5. Fastx (http://hannonlab.cshl.edu/fastx_toolkit/)
6. Python3
A] Sequence preparation and merging
Merge the paired end reads
Paired end reads are merged using FLASH2 (Magoc and Salzberg, 2011). Flash is run with the following arguments,
--min-overlap=30 --max-overlap=250
Following merging, the merge quality is manually checked by examining the FLASH log file for the percentage of reads that were merged. Plates with low merge rates (< 70%) are manually checked to see if the alignments can be improved.
Unmerged reads are discarded.
File naming
File names are formatted to the following format:
sampleID_plateID.fasta
The sampleID of the mock communities and negative controls on each plate are standardised to the following naming convention (all listed for completeness):
- 16S Bacteria mock communities: BACMOCK
- A16 Archaeal mock communities: ARCMOCK
- 18Sv4 Eukaryote mock communities: EUKMOCKv4
- 18Sv9 Eukaryote mock communities: EUKMOCKv9
- ITS Fungal mock communities: FUNMOCK
- Negative control: NEG
- STAN: STAN
fastq files are then converted to fasta using seqtk
Add sample and file name to the sequence headers
Sample identifiers are added to the header of each sequence for downstream processing. As each fasta is now named SampleID_plateID.fasta we simply add the file name with the extension to the sequence header. At this stage we also add any other information and delimiters that downstream programs will likely require. For USEARCH we add “sample=” and “;”, for Qiime we add “_”.
#!/bin/bash
#this initiates the loop for all files with extension .fasta in the list generated by ls -l
for file in `ls -1 *.fasta`
do
#tell me which file you're working on
echo working on $file
#cut the appropriate string from the name of the file.
VAR1=`echo $file | cut -d "." -f 1`
#tell me what you've cut and called VAR1
echo VAR1 is $VAR1
#use VAR1 to add the sample name to the sequence id. Substitute > for >VAR1_ in all files. The variable (VAR1) needs to be in quotation marks to be passed from bash to perl, the perl script needs to be embedded in the bash script with the ` ` marks
`perl -pi -e 's/\>/\>sample='$VAR1';_16S_/' $file;`
done
Identify unique sequences per sample
An abundance table of all unique sequences in each sample on the plate is then prepared. Unique sequences are first identified using fastx using the following Usearch command:
usearch64 -fastx_uniques SampleID_plateID.fasta -fastaout SampleID_plateID_uniques.fasta -sizeout
Unique sequences per sample (non-Denoised or quality filtered) are provided as a data output and are available from the Australian Microbiome Website.
Concatenate all sequences per sequencing run into a single file
All of the files for each plate are concatenated into a single file for zotu/ASV calling. The resulting file name is standardised to the format: plateID_all_16S.fasta
B] Sequence analysis
Quality screening, zotu/ASV calling and sequence mapping
The first step removes sequences that are too short or too long, have ambiguous bases, or have more than 12 homopolymers.
mothur "#set.dir(modifynames=F);summary.seqs(fasta=plateID_all.fasta, processors=10); screen.seqs(fasta=current, minlength=432, maxlength=520, maxambig=0, maxhomop=12, processors=10); summary.seqs()"
Reads are dereplicated
usearch64 -fastx_uniques plateID_all_16S.good.fasta -fastaout plateID_all_16S.good_uniques.fasta -sizeout
Sort unique reads by abundance
usearch64 -sortbysize plateID_all_16S.good_uniques.fasta -fastaout plateID_all_16S.good_sorted_uniques.fasta -sizeout
ASV/Zotus are called by UNOISE3, from sequences that have => 4 representatives
usearch64 -unoise3 plateID_all_16S.good_sorted_uniques.fasta -zotus plateID_all_16S.good_sorted_uniques_zotus.fasta -ampout plateID_all_16S.good_sorted_uniques_ampout.fasta -tabbedout plateID_all_16S.good_sorted_uniques_unoise3.txt -minsize 4
Map reads to zotus to generate abundances
Reads are mapped against the zotus using USEARCH. Note the termination conditions on the mapping run (-maxaccepts 0), this seems to be required to ensure the best match is found and to produce consistent results when adding multiple plates together, as we do later.
usearch64 -otutab plateID_all_16S.fasta -zotus plateID_all_16S.good_sorted_uniques_zotus_renamed.fasta -otutabout plateID_all_16S.good_sorted_uniques_zotutab_MA0.txt -mapout plateID_all_16S.good_sorted_uniques_zmap_MA0.txt -maxaccepts 0 -threads X
Classify and remove flipped sequences
A final QC step is performed to remove likely erroneous sequences. The ASVs are classified, with those that do not align to the 16S database in the correct orientation being removed. Those that need to be “flipped” to a new orientation are likely errors, since we know the reads should be in 27f – 519r orientation.
1. classify the seqs against 16S database
"#set.dir(modifynames=F); classify.seqs(fasta=plateID_all_16S.good_sorted_uniques_zotutab_relabelled_MA0.fasta, reference=gg_13_8_99.fasta, taxonomy=gg_13_8_99.gg.tax, cutoff=60, probs=FALSE)"
2. use the *acnos.flip list to remove sequences from the table
The above produces the final abundance table, with ASV sequences as index for each sequencing run. These tables are then combined as below to produce a single dataset for the Australian Microbiome.
Replace table indexes with ASV sequence
After denoising and mapping the ASV tables have an arbitrary zOTU number as the index, we replace this with the sequence that the arbitrary number represents. These sequences are unique strings and allow tables to be merged etc. downstream easily. They also negate the need to maintain a dictionary of ASVs and the sequences they represent.
C] Prepare the single dataset
Now we have an ASV abundance table for each plate, with ASV’s as row and SampleID_runID as column headers. To prepare this data for ingest into the AMD database the following steps are carried out:
1. Each table is converted from short to long format (from rectangular to 3 column, with columns ASV, sampleID, Abundance)
2. All of these 3 column tables are concatenated into a single table
3. Controls and samples are split into separate tables
4. Sequencing run ID's are removed from the column headers and any sample sequenced more than once has the abundances from these runs summed to give a single abundance per sample
5. A fasta file of unique ASV's is created from all ASV's in this final table
6. Sequences are classified.
Classify the sequences
Sequences are classified to provide taxonnomies relative to the Silva database (Quast et al., 2013; Yilmaz et al 2014; Glöckner 2017) as below:
mothur "#classify.seqs(fasta=seqs.fasta, reference=SILVA132.ng.fasta, taxonomy=SILVA132.tax, cutoff=60, probs=FALSE, processors=X)"
References
Magoc, T. and Salzberg, S. (2011). FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27(21): 2957-2963.
Schloss, P.D., et al., (2009). Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537-7541.
Edgar, R.C., (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19): 2460-2461.
Seqtk available at: https://github.com/lh3/seqtk (last accessed 23 Jan 2019).
Fastx available at: (http://hannonlab.cshl.edu/fastx_toolkit/) (last accessed 23 Jan 2019).
DeSantis T.Z., Hugenholtz P., Larsen N., Rojas M., Brodie E.L., Keller K., Huber T., Dalevi D., Hu P., Andersen G.L.(2006) Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl. Environ. Microbiol. 72(7): 5069-5072; DOI: 10.1128/AEM.03006-05
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596.
Yilmaz P, Parfrey LW, Yarza P, Gerken J, Pruesse E, Quast C, Schweer T, Peplies J, Ludwig W, Glöckner FO (2014) The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucl. Acids Res. 42:D643-D648
Glöckner FO, Yilmaz P, Quast C, Gerken J, Beccati A, Ciuprina A, Bruns G, Yarza P, Peplies J, Westram R, Ludwig W (2017) 25 years of serving the community with ribosomal RNA gene reference databases and tools. J. Biotechnol.