QC

The qc module uses a variety of tools to perform quality control on Illumina and Oxford Nanopore reads. The tools used are:

Tool	Technology	Description
bbtools	Illumina	A suite of tools for manipulating reads
fastp	Illumina	A tool designed to provide fast all-in-one preprocessing for FastQ files
fastqc	Illumina	A quality control tool for high throughput sequence data
fastq-scan	Nanopore	A tool for quickly scanning FASTQ files
lighter	Illumina	A tool for correcting sequencing errors in Illumina reads
NanoPlot	Nanopore	A tool for plotting long read sequencing data
nanoq	Nanopore	A tool for calculating quality metrics for Oxford Nanopore reads
porechop	Nanopore	A tool for removing adapters from Oxford Nanopore reads
rasusa	Nanopore	Randomly subsample sequencing reads to a specified coverage

Similar to the gather step, the qc step will also stop samples that fail to meet basic QC checks from continuing downstream.

Output Overview

Below is the default output structure for the qc step in Bactopia. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── main
│       └── qc
│           ├── extra
│           ├── logs
│           │   ├── nf-qc.{begin,err,log,out,run,sh,trace}
│           │   ├── output-fastp.log
│           │   └── versions.yml
│           ├── <SAMPLE_NAME>-low-read-count-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-coverage-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-depth-error.txt
│           ├── <SAMPLE_NAME>{|_R1|_R2}.error-fastq.gz
│           ├── <SAMPLE_NAME>{|_R1|_R2}.fastq.gz
│           └── summary
│               ├── <SAMPLE_NAME>.fastp.{html|json}
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}.json
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_fastqc.{html|zip}
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_NanoPlot-report.html
│               └── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_NanoPlot.tar.gz
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            ├── bactopia-timeline.html
            └── bactopia-trace.txt
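The layout above can be walked programmatically. The following is a minimal sketch, assuming the default structure shown in the tree; the function name `cleaned_fastqs` and its arguments are hypothetical and not part of Bactopia.

```python
from pathlib import Path


def cleaned_fastqs(bactopia_dir, sample):
    """Return the cleaned FASTQ files that exist for one sample.

    Assumes the default layout: <BACTOPIA_DIR>/<SAMPLE_NAME>/main/qc/
    """
    qc_dir = Path(bactopia_dir) / sample / "main" / "qc"
    # Single-end or Nanopore reads have no suffix; paired-end reads use _R1/_R2.
    candidates = [
        f"{sample}.fastq.gz",
        f"{sample}_R1.fastq.gz",
        f"{sample}_R2.fastq.gz",
    ]
    # Exact names are used so *.error-fastq.gz files are never picked up.
    return [qc_dir / name for name in candidates if (qc_dir / name).is_file()]
```

An empty list means the sample has no cleaned reads in its qc folder, for example because the sample failed a QC check.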

Results

Quality Control

Below is a description of the per-sample results from the qc subworkflow.

Filename	Description
<SAMPLE_NAME>.fastq.gz	A gzipped FASTQ file containing the cleaned Illumina single-end or Oxford Nanopore reads
<SAMPLE_NAME>_R{1|2}.fastq.gz	A gzipped FASTQ file containing the cleaned Illumina paired-end reads
<SAMPLE_NAME>-{final|original}.json	A JSON file containing the QC results generated by fastq-scan
<SAMPLE_NAME>-{final|original}_fastqc.html	(Illumina Only) An HTML report of the QC results generated by fastqc
<SAMPLE_NAME>-{final|original}_fastqc.zip	(Illumina Only) A zip file containing the complete set of fastqc results
<SAMPLE_NAME>-{final|original}_fastp.json	(Illumina Only) A JSON file containing the QC results generated by fastp
<SAMPLE_NAME>-{final|original}_fastp.html	(Illumina Only) An HTML report of the QC results generated by fastp
<SAMPLE_NAME>-{final|original}_NanoPlot-report.html	(ONT Only) An HTML report of the QC results generated by NanoPlot
<SAMPLE_NAME>-{final|original}_NanoPlot.tar.gz	(ONT Only) A tarball containing the complete set of NanoPlot results

Failed Quality Checks

Built into Bactopia are a few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. Excluding these samples prevents a single poor sample from causing the whole pipeline to fail.

Extension	Description
.error-fastq.gz	A gzipped FASTQ file of Illumina single-end or Oxford Nanopore reads that failed QC
_R{1|2}.error-fastq.gz	A gzipped FASTQ file of Illumina paired-end reads that failed QC
-low-read-count-error.txt	Sample failed the read count check and was excluded from further analysis
-low-sequence-coverage-error.txt	Sample failed the sequence coverage check and was excluded from further analysis
-low-sequence-depth-error.txt	Sample failed the sequenced basepairs check and was excluded from further analysis

Poor samples are excluded to prevent downstream failures

Samples that fail any of the QC checks will be excluded from further analysis. Those samples will generate a *-error.txt file with the error message. Excluding these samples prevents downstream failures that cause the whole workflow to fail.
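Because every failed check leaves a *-error.txt file in the sample's qc folder, excluded samples can be listed by scanning for those files. This is a minimal sketch, assuming the default output layout; the function name `failed_qc_samples` is hypothetical and not part of Bactopia.

```python
from pathlib import Path


def failed_qc_samples(bactopia_dir):
    """Map each failed sample name to the error messages found for it."""
    failures = {}
    # Error files live at <BACTOPIA_DIR>/<SAMPLE_NAME>/main/qc/<SAMPLE_NAME>-*-error.txt
    for error_file in sorted(Path(bactopia_dir).glob("*/main/qc/*-error.txt")):
        # parents[0] is qc, parents[1] is main, parents[2] is the sample folder.
        sample = error_file.parents[2].name
        failures.setdefault(sample, []).append(error_file.read_text().strip())
    return failures
```

Samples that passed every check do not appear in the returned dictionary.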

Example Error: After QC, too few reads remain

If, after cleaning reads, a sample has fewer than the minimum required number of reads, the sample will be excluded from further analysis. You can adjust this minimum read count using the --min_reads parameter.

Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required minimum Y read count. Further analysis is discontinued.

Example Error: After QC, too little sequence coverage remains

If, after cleaning reads, a sample fails to meet the minimum required sequence coverage, the sample will be excluded from further analysis. You can adjust this minimum coverage using the --min_coverage parameter.

Note: This check is only performed when a genome size is available.

Example Text from <SAMPLE_NAME>-low-sequence-coverage-error.txt
After QC, <SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp (Zx coverage). Further analysis is discontinued.

Example Error: After QC, too few sequenced basepairs remain

If, after cleaning reads, a sample fails to meet the minimum number of sequenced basepairs, the sample will be excluded from further analysis. You can adjust this minimum basepair count using the --min_basepairs parameter.

Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp. Further analysis is discontinued.
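The three checks described above can be summarized in a short sketch. It is not Bactopia's implementation: the function name and argument names are hypothetical, the thresholds in the usage note are illustrative rather than the pipeline defaults, and treating a value exactly equal to the minimum as passing is an assumption. The coverage threshold is expressed in basepairs as the minimum coverage multiplied by the genome size, matching the "Y bp (Zx coverage)" wording of the error message.

```python
def qc_failures(total_reads, total_bp, min_reads, min_basepairs,
                min_coverage, genome_size=None):
    """Return the names of the checks a sample would fail (hypothetical helper)."""
    failures = []
    if total_reads < min_reads:
        failures.append("low-read-count")
    if total_bp < min_basepairs:
        failures.append("low-sequence-depth")
    # The coverage check is only performed when a genome size is available.
    if genome_size:
        required_bp = min_coverage * genome_size
        if total_bp < required_bp:
            failures.append("low-sequence-coverage")
    return failures
```

For example, with an illustrative 10x minimum coverage and a 2,000,000 bp genome, a sample needs at least 20,000,000 bp after QC to pass the coverage check.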

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension	Description
.begin	An empty file used to designate the process started
.err	Contains STDERR outputs from the process
.log	Contains both STDERR and STDOUT outputs from the process
.out	Contains STDOUT outputs from the process
.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh	The script executed by bash for the process
.trace	The Nextflow Trace report for the process
versions.yml	A YAML formatted file with program versions

Parameters

QC

Parameter	Description
`--use_bbmap`	Illumina reads will be QC'd using BBMap Type: `boolean`
`--use_porechop`	Use Porechop to remove adapters from ONT reads Type: `boolean`
`--skip_qc`	The QC step will be skipped and it will be assumed the input sequences have already been QC'd. Type: `boolean`
`--skip_qc_plots`	QC plot creation by FastQC or NanoPlot will be skipped. Type: `boolean`
`--skip_error_correction`	FLASH error correction of reads will be skipped. Type: `boolean`
`--adapters`	A FASTA file containing adapters to remove Type: `string`, Default: `/home/robert_petit/bactopia/data/EMPTY_ADAPTERS`
`--adapter_k`	Kmer length used for finding adapters. Type: `integer`, Default: `23`
`--phix`	phiX174 reference genome to remove Type: `string`, Default: `/home/robert_petit/bactopia/data/EMPTY_PHIX`
`--phix_k`	Kmer length used for finding phiX174. Type: `integer`, Default: `31`
`--ktrim`	Trim reads to remove bases matching reference kmers Type: `string`, Default: `r`
`--mink`	Look for shorter kmers at read tips down to this length, when k-trimming or masking. Type: `integer`, Default: `11`
`--hdist`	Maximum Hamming distance for ref kmers (subs only) Type: `integer`, Default: `1`
`--tpe`	When kmer right-trimming, trim both reads to the minimum length of either Type: `string`, Default: `t`
`--tbo`	Trim adapters based on where paired reads overlap Type: `string`, Default: `t`
`--qtrim`	Trim read ends to remove bases with quality below trimq. Type: `string`, Default: `rl`
`--trimq`	Regions with average quality BELOW this will be trimmed if qtrim is set to something other than f Type: `integer`, Default: `6`
`--maq`	Reads with average quality (after trimming) below this will be discarded Type: `integer`, Default: `10`
`--minlength`	Reads shorter than this after trimming will be discarded Type: `integer`, Default: `35`
`--ftm`	If positive, right-trim length to be equal to zero, modulo this number Type: `integer`, Default: `5`
`--tossjunk`	Discard reads with invalid characters as bases Type: `string`, Default: `t`
`--ain`	When detecting pair names, allow identical names Type: `string`, Default: `f`
`--qout`	PHRED offset to use for output FASTQs Type: `string`, Default: `33`
`--maxcor`	Max number of corrections within a 20bp window Type: `integer`, Default: `1`
`--sampleseed`	Set to a positive number to use as the random number generator seed for sampling Type: `integer`, Default: `42`
`--ont_minlength`	ONT Reads shorter than this will be discarded Type: `integer`, Default: `1000`
`--ont_minqual`	Minimum average read quality filter of ONT reads Type: `integer`
`--porechop_opts`	Extra Porechop options in quotes Type: `string`
`--nanoplot_opts`	Extra NanoPlot options in quotes Type: `string`
`--bbduk_opts`	Extra BBDuk options in quotes Type: `string`
`--fastp_opts`	Extra fastp options in quotes Type: `string`
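When Bactopia is launched from a script, the QC parameters in the table can be assembled from a dictionary. The helper below is a minimal sketch: `qc_args` is a hypothetical name, only the parameter names come from the table, and it assumes boolean parameters are passed as bare flags while all other parameters take a value.

```python
def qc_args(options):
    """Turn a dict of QC parameter names and values into command-line arguments."""
    args = []
    for name, value in options.items():
        if value is True:
            # Assumption: boolean parameters are enabled by the bare flag.
            args.append(f"--{name}")
        elif value is False or value is None:
            # Disabled or unset parameters are left out entirely.
            continue
        else:
            args.extend([f"--{name}", str(value)])
    return args


# Illustrative values only, not recommended settings.
example = qc_args({"skip_qc_plots": True, "minlength": 50, "use_porechop": False})
```

The resulting list can be appended to whatever command is used to start the pipeline, for instance as part of the argument list given to `subprocess.run`.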

Citations

If you use Bactopia and qc in your analysis, please cite the following.

Bactopia
Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
BBTools
Bushnell B BBMap short read aligner, and other bioinformatic tools. (Link)
fastp
Chen S, Zhou Y, Chen Y, and Gu J fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. (2018)
fastq-scan
Petit III RA fastq-scan: generate summary statistics of input FASTQ sequences. (GitHub)
FastQC
Andrews S FastQC: a quality control tool for high throughput sequence data. (WebLink)
Lighter
Song L, Florea L, Langmead B Lighter: Fast and Memory-efficient Sequencing Error Correction without Counting. Genome Biol. 15(11):509 (2014)
NanoPlot
De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C NanoPack: visualizing and processing long-read sequencing data Bioinformatics Volume 34, Issue 15 (2018)
Nanoq
Steinig E Nanoq: Minimal but speedy quality control for nanopore reads in Rust (GitHub)
Pigz
Adler M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015)
Porechop
Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 3(10):e000132 (2017)
Rasusa
Hall MB Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019).