QC
The `qc` module uses a variety of tools to perform quality control on Illumina and
Oxford Nanopore reads. The tools used are:
Tool | Technology | Description |
---|---|---|
bbtools | Illumina | A suite of tools for manipulating reads |
fastp | Illumina | A tool designed to provide fast all-in-one preprocessing for FastQ files |
fastqc | Illumina | A quality control tool for high throughput sequence data |
fastq_scan | Nanopore | A tool for quickly scanning FASTQ files |
lighter | Illumina | A tool for correcting sequencing errors in Illumina reads |
NanoPlot | Nanopore | A tool for plotting long read sequencing data |
nanoq | Nanopore | A tool for calculating quality metrics for Oxford Nanopore reads |
porechop | Nanopore | A tool for removing adapters from Oxford Nanopore reads |
rasusa | Nanopore | Randomly subsample sequencing reads to a specified coverage |
Similar to the `gather` step, the `qc` step will also stop samples that fail to meet
basic QC checks from continuing downstream.
Output Overview
Below is the default output structure for the `qc` step in Bactopia. Where
possible, the file descriptions below were adapted from each tool's own description.
<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── main
│       └── qc
│           ├── extra
│           ├── logs
│           │   ├── nf-qc.{begin,err,log,out,run,sh,trace}
│           │   ├── output-fastp.log
│           │   └── versions.yml
│           ├── <SAMPLE_NAME>-low-read-count-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-coverage-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-depth-error.txt
│           ├── <SAMPLE_NAME>{|_R1|_R2}.error-fastq.gz
│           ├── <SAMPLE_NAME>{|_R1|_R2}.fastq.gz
│           └── summary
│               ├── <SAMPLE_NAME>.fastp.{html|json}
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}.json
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_fastqc.{html|zip}
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_NanoPlot-report.html
│               └── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_NanoPlot.tar.gz
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            ├── bactopia-timeline.html
            └── bactopia-trace.txt
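As a minimal sketch of how this layout could be navigated from a script, the Python below builds the per-sample `qc` path and returns whichever cleaned FASTQ files exist. The function names `qc_dir` and `cleaned_fastqs` are illustrative and not part of Bactopia; the only thing taken from this page is the `<BACTOPIA_DIR>/<SAMPLE_NAME>/main/qc` layout and the `{|_R1|_R2}.fastq.gz` naming.

```python
from pathlib import Path


def qc_dir(bactopia_dir, sample):
    """Return the expected qc folder for one sample (layout from the tree above)."""
    return Path(bactopia_dir) / sample / "main" / "qc"


def cleaned_fastqs(bactopia_dir, sample):
    """Return the cleaned FASTQ files that exist for a sample.

    Checks the single-end/ONT name and the paired-end _R1/_R2 names;
    *.error-fastq.gz files are never returned.
    """
    directory = qc_dir(bactopia_dir, sample)
    candidates = [
        directory / f"{sample}{suffix}.fastq.gz" for suffix in ("", "_R1", "_R2")
    ]
    return [path for path in candidates if path.is_file()]
```

An empty list means the sample has no cleaned reads in that folder, which is what you would expect for a sample that failed one of the QC checks.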
Results
Quality Control
Below is a description of the per-sample results from the `qc` subworkflow.
Filename | Description |
---|---|
<SAMPLE_NAME>.fastq.gz | A gzipped FASTQ file containing the cleaned Illumina single-end or Oxford Nanopore reads |
<SAMPLE_NAME>_R{1|2}.fastq.gz | A gzipped FASTQ file containing the cleaned Illumina paired-end reads |
<SAMPLE_NAME>-{final|original}.json | A JSON file containing the QC results generated by fastq-scan |
<SAMPLE_NAME>-{final|original}_fastqc.html | (Illumina Only) An HTML report of the QC results generated by fastqc |
<SAMPLE_NAME>-{final|original}_fastqc.zip | (Illumina Only) A zip file containing the complete set of fastqc results |
<SAMPLE_NAME>-{final|original}_fastp.json | (Illumina Only) A JSON file containing the QC results generated by fastp |
<SAMPLE_NAME>-{final|original}_fastp.html | (Illumina Only) An HTML report of the QC results generated by fastp |
<SAMPLE_NAME>-{final|original}_NanoPlot-report.html | (ONT Only) An HTML report of the QC results generated by NanoPlot |
<SAMPLE_NAME>-{final|original}_NanoPlot.tar.gz | (ONT Only) A tarball containing the complete set of NanoPlot results |
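To see what cleaning changed for a sample, the `original` and `final` JSON summaries can be compared side by side. The sketch below deliberately makes no assumption about the field names fastq-scan writes: it flattens whatever numeric fields both files share and pairs them up. The function names are illustrative, not part of Bactopia or fastq-scan.

```python
import json
from pathlib import Path


def numeric_fields(value, prefix=""):
    """Flatten nested dicts into {dotted.key: number}, skipping non-numeric values."""
    fields = {}
    if isinstance(value, dict):
        for key, child in value.items():
            fields.update(numeric_fields(child, f"{prefix}{key}."))
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        fields[prefix.rstrip(".")] = value
    return fields


def qc_change(original_json, final_json):
    """Return {field: (original, final)} for numeric fields present in both files."""
    original = numeric_fields(json.loads(Path(original_json).read_text()))
    final = numeric_fields(json.loads(Path(final_json).read_text()))
    return {key: (original[key], final[key]) for key in original if key in final}
```

For a single-end sample, the two inputs would be `<SAMPLE_NAME>-original.json` and `<SAMPLE_NAME>-final.json` from the `summary` folder.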
Failed Quality Checks
Bactopia includes a few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. Excluding these samples prevents complete pipeline failures.
Extension | Description |
---|---|
.error-fastq.gz | A gzipped FASTQ file of Illumina single-end or Oxford Nanopore reads that failed QC |
_R{1|2}.error-fastq.gz | A gzipped FASTQ file of Illumina paired-end reads that failed QC |
-low-read-count-error.txt | Sample failed the read count check and was excluded from further analysis |
-low-sequence-coverage-error.txt | Sample failed the sequence coverage check and was excluded from further analysis |
-low-sequence-depth-error.txt | Sample failed the sequenced basepair check and was excluded from further analysis |
Poor samples are excluded to prevent downstream failures
Samples that fail any of the QC checks will be excluded from further analysis.
Each of these samples will have a `*-error.txt` file containing the error message.
Excluding these samples prevents downstream failures that would cause the whole workflow to fail.
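Because every excluded sample leaves an error file behind, a short script can list which samples failed and why. The following is a sketch only: `failed_qc_samples` is a hypothetical helper, and it assumes the error files sit in `<BACTOPIA_DIR>/<SAMPLE_NAME>/main/qc/` as shown in the output overview.

```python
from pathlib import Path

# Error-file suffixes taken from the table above.
ERROR_SUFFIXES = (
    "-low-read-count-error.txt",
    "-low-sequence-coverage-error.txt",
    "-low-sequence-depth-error.txt",
)


def failed_qc_samples(bactopia_dir):
    """Map each failed sample name to a list of (error file name, message)."""
    failures = {}
    for error_file in sorted(Path(bactopia_dir).glob("*/main/qc/*-error.txt")):
        if not error_file.name.endswith(ERROR_SUFFIXES):
            continue
        # Path is <BACTOPIA_DIR>/<SAMPLE_NAME>/main/qc/<file>, so the sample
        # directory is three levels above the file.
        sample = error_file.parents[2].name
        message = error_file.read_text().strip()
        failures.setdefault(sample, []).append((error_file.name, message))
    return failures
```

Samples that passed every check do not appear in the returned dictionary.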
Example Error: After QC, too few reads remain
If, after cleaning reads, a sample has fewer than the minimum required reads, the
sample will be excluded from further analysis. You can adjust this minimum read
count using the `--min_reads` parameter.
Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required
minimum Y read count. Further analysis is discontinued.
Example Error: After QC, too little sequence coverage remains
If, after cleaning reads, a sample has failed to meet the minimum sequence
coverage required, the sample will be excluded from further analysis. You can
adjust this minimum coverage using the `--min_coverage` parameter.
Note: This check is only performed when a genome size is available.
Example Text from <SAMPLE_NAME>-low-sequence-coverage-error.txt
After QC, <SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not
exceed the required minimum Y bp (Z x coverage). Further analysis is
discontinued.
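The message reports both a basepair threshold (Y) and a coverage (Z), which suggests the threshold is the requested coverage multiplied by the genome size. The sketch below works through that arithmetic under that assumption. The function names and the example numbers are hypothetical, and how the pipeline treats a sample sitting exactly on the threshold is not stated here, so the `>=` comparison is also an assumption.

```python
def required_basepairs(min_coverage, genome_size):
    """Basepairs needed for min_coverage-fold coverage of a genome_size bp genome."""
    return min_coverage * genome_size


def estimated_coverage(total_bp, genome_size):
    """Fold coverage implied by total_bp sequenced basepairs."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bp / genome_size


def meets_min_coverage(total_bp, min_coverage, genome_size):
    """True when total_bp reaches the required basepairs.

    Assumption: a sample exactly at the threshold passes.
    """
    return total_bp >= required_basepairs(min_coverage, genome_size)


# Hypothetical example: 2.8 Mb genome, 10x requested, 14 Mb sequenced after QC.
genome_size = 2_800_000
print(required_basepairs(10, genome_size))               # 28000000
print(estimated_coverage(14_000_000, genome_size))       # 5.0
print(meets_min_coverage(14_000_000, 10, genome_size))   # False
```

This also shows why the check needs a genome size: without one, neither the required basepairs nor the estimated coverage can be computed.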
Example Error: After QC, too few sequenced basepairs remain
If, after cleaning reads, a sample has failed to meet the minimum number of
sequenced basepairs, the sample will be excluded from further analysis. You can
adjust this minimum basepair count using the `--min_basepairs` parameter.
Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the
required minimum Y bp. Further analysis is discontinued.
Audit Trail
Below are files that can assist you in understanding which parameters and program versions were used.
Logs
Each process that is executed will have a folder named `logs`. This folder contains
files that are helpful to review if the need ever arises.
Extension | Description |
---|---|
.begin | An empty file used to designate the process started |
.err | Contains STDERR outputs from the process |
.log | Contains both STDERR and STDOUT outputs from the process |
.out | Contains STDOUT outputs from the process |
.run | The script Nextflow uses to stage/unstage files and queue processes based on the given profile |
.sh | The script executed by bash for the process |
.trace | The Nextflow Trace report for the process |
versions.yml | A YAML formatted file with program versions |
Parameters
QC
Parameter | Description |
---|---|
--use_bbmap | Illumina reads will be QC'd using BBMap. Type: boolean |
--use_porechop | Use Porechop to remove adapters from ONT reads. Type: boolean |
--skip_qc | The QC step will be skipped and it will be assumed the input sequences have already been QC'd. Type: boolean |
--skip_qc_plots | QC plot creation by FastQC or NanoPlot will be skipped. Type: boolean |
--skip_error_correction | Error correction of reads (Lighter) will be skipped. Type: boolean |
--adapters | A FASTA file containing adapters to remove. Type: string, Default: /home/robert_petit/bactopia/data/EMPTY_ADAPTERS |
--adapter_k | Kmer length used for finding adapters. Type: integer, Default: 23 |
--phix | phiX174 reference genome to remove. Type: string, Default: /home/robert_petit/bactopia/data/EMPTY_PHIX |
--phix_k | Kmer length used for finding phiX174. Type: integer, Default: 31 |
--ktrim | Trim reads to remove bases matching reference kmers. Type: string, Default: r |
--mink | Look for shorter kmers at read tips down to this length, when k-trimming or masking. Type: integer, Default: 11 |
--hdist | Maximum Hamming distance for ref kmers (subs only). Type: integer, Default: 1 |
--tpe | When kmer right-trimming, trim both reads to the minimum length of either. Type: string, Default: t |
--tbo | Trim adapters based on where paired reads overlap. Type: string, Default: t |
--qtrim | Trim read ends to remove bases with quality below trimq. Type: string, Default: rl |
--trimq | Regions with average quality BELOW this will be trimmed if qtrim is set to something other than f. Type: integer, Default: 6 |
--maq | Reads with average quality (after trimming) below this will be discarded. Type: integer, Default: 10 |
--minlength | Reads shorter than this after trimming will be discarded. Type: integer, Default: 35 |
--ftm | If positive, right-trim length to be equal to zero, modulo this number. Type: integer, Default: 5 |
--tossjunk | Discard reads with invalid characters as bases. Type: string, Default: t |
--ain | When detecting pair names, allow identical names. Type: string, Default: f |
--qout | PHRED offset to use for output FASTQs. Type: string, Default: 33 |
--maxcor | Max number of corrections within a 20bp window. Type: integer, Default: 1 |
--sampleseed | Set to a positive number to use as the random number generator seed for sampling. Type: integer, Default: 42 |
--ont_minlength | ONT reads shorter than this will be discarded. Type: integer, Default: 1000 |
--ont_minqual | Minimum average read quality filter of ONT reads. Type: integer |
--porechop_opts | Extra Porechop options in quotes. Type: string |
--nanoplot_opts | Extra NanoPlot options in quotes. Type: string |
--bbduk_opts | Extra BBDuk options in quotes. Type: string |
--fastp_opts | Extra fastp options in quotes. Type: string |
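When runs are launched from a script, it can help to keep the QC settings in one place and turn them into command-line arguments. The sketch below is one way to do that, assuming boolean parameters are passed as bare flags and everything else as a flag followed by its value. The helper `qc_arguments` is hypothetical, and the command it prints is incomplete: a real run also needs input options that this page does not cover.

```python
import shlex


def qc_arguments(options):
    """Convert {parameter_name: value} into a flat list of CLI arguments.

    True becomes a bare flag, False/None is omitted, and any other value
    is passed as "--name value".
    """
    args = []
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            args.append(flag)
        elif value is False or value is None:
            continue
        else:
            args.extend([flag, str(value)])
    return args


# Parameter names are taken from the table above; the values are examples only.
args = qc_arguments({"skip_qc_plots": True, "use_porechop": False, "adapter_k": 21})
print(shlex.join(["bactopia", *args]))  # bactopia --skip_qc_plots --adapter_k 21
```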
Citations
If you use Bactopia and `qc` in your analysis, please cite the following.
- Bactopia
  Petit III RA, Read TD. Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
- BBTools
  Bushnell B. BBMap short read aligner, and other bioinformatic tools. (Link)
- fastp
  Chen S, Zhou Y, Chen Y, and Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. (2018)
- fastq-scan
  Petit III RA. fastq-scan: generate summary statistics of input FASTQ sequences. (GitHub)
- FastQC
  Andrews S. FastQC: a quality control tool for high throughput sequence data. (WebLink)
- Lighter
  Song L, Florea L, Langmead B. Lighter: Fast and Memory-efficient Sequencing Error Correction without Counting. Genome Biol. 15(11):509 (2014)
- NanoPlot
  De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics Volume 34, Issue 15 (2018)
- Nanoq
  Steinig E. Nanoq: Minimal but speedy quality control for nanopore reads in Rust. (GitHub)
- Pigz
  Adler M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015)
- Porechop
  Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 3(10):e000132 (2017)
- Rasusa
  Hall MB. Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019)