QC

The qc module uses a variety of tools to perform quality control on Illumina and Oxford Nanopore reads. The tools used are:

Tool        Technology  Description
bbtools     Illumina    A suite of tools for manipulating reads
fastp       Illumina    A tool designed to provide fast all-in-one preprocessing for FastQ files
fastqc      Illumina    A quality control tool for high throughput sequence data
fastq_scan  Nanopore    A tool for quickly scanning FASTQ files
lighter     Illumina    A tool for correcting sequencing errors in Illumina reads
NanoPlot    Nanopore    A tool for plotting long read sequencing data
nanoq       Nanopore    A tool for calculating quality metrics for Oxford Nanopore reads
porechop    Nanopore    A tool for removing adapters from Oxford Nanopore reads
rasusa      Nanopore    Randomly subsample sequencing reads to a specified coverage

Like the gather step, the qc step stops samples that fail basic QC checks from continuing downstream.

Output Overview

Below is the default output structure for the qc step in Bactopia. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── main
│       └── qc
│           ├── extra
│           ├── logs
│           │   ├── nf-qc.{begin,err,log,out,run,sh,trace}
│           │   ├── output-fastp.log
│           │   └── versions.yml
│           ├── <SAMPLE_NAME>-low-read-count-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-coverage-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-depth-error.txt
│           ├── <SAMPLE_NAME>{|_R1|_R2}.error-fastq.gz
│           ├── <SAMPLE_NAME>{|_R1|_R2}.fastq.gz
│           └── summary
│               ├── <SAMPLE_NAME>.fastp.{html|json}
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}.json
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_fastqc.{html|zip}
│               ├── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_NanoPlot-report.html
│               └── <SAMPLE_NAME>{|_R1|_R2}-{final|original}_NanoPlot.tar.gz
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            ├── bactopia-timeline.html
            └── bactopia-trace.txt
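
If you want to work with these outputs programmatically, the layout above can be walked with a few lines of Python. The following is a minimal sketch, not part of Bactopia: the function names find_qc_dirs and cleaned_fastqs are hypothetical, and the only assumption is the <BACTOPIA_DIR>/<SAMPLE_NAME>/main/qc layout shown in the tree.

```python
from pathlib import Path


def find_qc_dirs(bactopia_dir):
    """Map each sample name to its main/qc directory, if one exists."""
    root = Path(bactopia_dir)
    qc_dirs = {}
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if sample_dir.name == "bactopia-runs":
            continue  # run-level Nextflow reports, not a sample
        qc_dir = sample_dir / "main" / "qc"
        if qc_dir.is_dir():
            qc_dirs[sample_dir.name] = qc_dir
    return qc_dirs


def cleaned_fastqs(qc_dir):
    """Return the cleaned FASTQ files in a qc directory.

    The pattern requires a literal ".fastq.gz" ending, so files named
    "*.error-fastq.gz" (reads from failed samples) are not matched.
    """
    return sorted(Path(qc_dir).glob("*.fastq.gz"))
```

Calling find_qc_dirs on your Bactopia output directory returns one entry per sample that has a qc folder, and cleaned_fastqs then lists one file for single-end or Nanopore samples and two (_R1, _R2) for paired-end samples.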

Results

Quality Control

Below is a description of the per-sample results from the qc subworkflow.

Filename                                             Description
<SAMPLE_NAME>.fastq.gz                               A gzipped FASTQ file containing the cleaned Illumina single-end or Oxford Nanopore reads
<SAMPLE_NAME>_R{1|2}.fastq.gz                        A gzipped FASTQ file containing the cleaned Illumina paired-end reads
<SAMPLE_NAME>-{final|original}.json                  A JSON file containing the QC results generated by fastq-scan
<SAMPLE_NAME>-{final|original}_fastqc.html           (Illumina Only) An HTML report of the QC results generated by fastqc
<SAMPLE_NAME>-{final|original}_fastqc.zip            (Illumina Only) A zip file containing the complete set of fastqc results
<SAMPLE_NAME>-{final|original}_fastp.json            (Illumina Only) A JSON file containing the QC results generated by fastp
<SAMPLE_NAME>-{final|original}_fastp.html            (Illumina Only) An HTML report of the QC results generated by fastp
<SAMPLE_NAME>-{final|original}_NanoPlot-report.html  (ONT Only) An HTML report of the QC results generated by NanoPlot
<SAMPLE_NAME>-{final|original}_NanoPlot.tar.gz       (ONT Only) A tarball containing the complete set of NanoPlot results

Failed Quality Checks

Built into Bactopia are a few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. Excluding these samples prevents complete pipeline failures.

Extension                         Description
.error-fastq.gz                   A gzipped FASTQ file of Illumina single-end or Oxford Nanopore reads that failed QC
_R{1|2}.error-fastq.gz            A gzipped FASTQ file of Illumina paired-end reads that failed QC
-low-read-count-error.txt         Sample failed read count checks and was excluded from further analysis
-low-sequence-coverage-error.txt  Sample failed sequence coverage checks and was excluded from further analysis
-low-sequence-depth-error.txt     Sample failed sequenced basepair checks and was excluded from further analysis

Poor samples are excluded to prevent downstream failures

Samples that fail any of the QC checks will be excluded from further analysis. Those samples will generate a *-error.txt file with the error message. Excluding these samples prevents downstream failures that cause the whole workflow to fail.
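
To see which samples were excluded and why, you can scan the output directory for these files. The sketch below is only an illustration and not a Bactopia utility: the function name collect_qc_failures is hypothetical, and it assumes the error files sit in <SAMPLE_NAME>/main/qc as shown in the output overview.

```python
from pathlib import Path


def collect_qc_failures(bactopia_dir):
    """Return {sample: {check: message}} for every *-error.txt file found."""
    failures = {}
    pattern = "*/main/qc/*-error.txt"  # layout assumed from the output overview
    for error_file in sorted(Path(bactopia_dir).glob(pattern)):
        sample = error_file.parts[-4]  # <SAMPLE_NAME>/main/qc/<file>
        # Strip the "<SAMPLE_NAME>-" prefix and the "-error.txt" suffix to
        # get a short check name such as "low-read-count".
        check = error_file.name[len(sample) + 1 : -len("-error.txt")]
        message = error_file.read_text().strip()
        failures.setdefault(sample, {})[check] = message
    return failures
```

Samples that passed every check simply do not appear in the returned dictionary.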

Example Error: After QC, too few reads remain

If, after cleaning reads, a sample has fewer reads than the required minimum, the sample will be excluded from further analysis. You can adjust this minimum read count using the --min_reads parameter.

Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required minimum Y read count. Further analysis is discontinued.

Example Error: After QC, too little sequence coverage remains

If, after cleaning reads, a sample fails to meet the minimum required sequence coverage, the sample will be excluded from further analysis. You can adjust this minimum coverage using the --min_coverage parameter.

Note: This check is only performed when a genome size is available.

Example Text from <SAMPLE_NAME>-low-sequence-coverage-error.txt
After QC, <SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp (Zx coverage). Further analysis is discontinued.

Example Error: After QC, too few sequenced basepairs remain

If, after cleaning reads, a sample fails to meet the minimum number of sequenced basepairs, the sample will be excluded from further analysis. You can adjust this minimum basepair count using the --min_basepairs parameter.

Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp. Further analysis is discontinued.
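
The three checks described above can be summarized in a short sketch. This is not Bactopia's implementation: the function name qc_failures is hypothetical, the threshold values in the example are illustrative rather than Bactopia defaults, and both the strict "less than" comparison and the coverage threshold (minimum coverage times genome size) are assumptions inferred from the error messages.

```python
def qc_failures(total_reads, total_bp, min_reads, min_basepairs,
                min_coverage=None, genome_size=None):
    """Return the names of the basic QC checks a sample would fail."""
    failed = []
    if total_reads < min_reads:
        failed.append("low-read-count")
    # The coverage check only applies when a genome size is available.
    if genome_size and min_coverage is not None:
        required_bp = min_coverage * genome_size  # assumed: Y bp = Z x genome size
        if total_bp < required_bp:
            failed.append("low-sequence-coverage")
    if total_bp < min_basepairs:
        failed.append("low-sequence-depth")
    return failed


# Illustrative values only: 100 reads and 1 Mbp against a 2 Mbp genome.
print(qc_failures(total_reads=100, total_bp=1_000_000,
                  min_reads=1000, min_basepairs=500_000,
                  min_coverage=10, genome_size=2_000_000))
# prints ['low-read-count', 'low-sequence-coverage']
```

With no genome size supplied, the same sample would only be reported as failing the read count check.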

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension     Description
.begin        An empty file used to designate the process started
.err          Contains STDERR outputs from the process
.log          Contains both STDERR and STDOUT outputs from the process
.out          Contains STDOUT outputs from the process
.run          The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh           The script executed by bash for the process
.trace        The Nextflow Trace report for the process
versions.yml  A YAML formatted file with program versions

Parameters

QC

Parameter Description
--use_bbmap Illumina reads will be QC'd using BBMap
Type: boolean
--use_porechop Use Porechop to remove adapters from ONT reads
Type: boolean
--skip_qc The QC step will be skipped and the input sequences will be assumed to have already been QC'd.
Type: boolean
--skip_qc_plots QC plot creation by FastQC or NanoPlot will be skipped
Type: boolean
--skip_error_correction Error correction of reads (performed by lighter, per the tool list above) will be skipped.
Type: boolean
--adapters A FASTA file containing adapters to remove
Type: string, Default: /home/robert_petit/bactopia/data/EMPTY_ADAPTERS
--adapter_k Kmer length used for finding adapters.
Type: integer, Default: 23
--phix phiX174 reference genome to remove
Type: string, Default: /home/robert_petit/bactopia/data/EMPTY_PHIX
--phix_k Kmer length used for finding phiX174.
Type: integer, Default: 31
--ktrim Trim reads to remove bases matching reference kmers
Type: string, Default: r
--mink Look for shorter kmers at read tips down to this length, when k-trimming or masking.
Type: integer, Default: 11
--hdist Maximum Hamming distance for ref kmers (subs only)
Type: integer, Default: 1
--tpe When kmer right-trimming, trim both reads to the minimum length of either
Type: string, Default: t
--tbo Trim adapters based on where paired reads overlap
Type: string, Default: t
--qtrim Trim read ends to remove bases with quality below trimq.
Type: string, Default: rl
--trimq Regions with average quality BELOW this will be trimmed if qtrim is set to something other than f
Type: integer, Default: 6
--maq Reads with average quality (after trimming) below this will be discarded
Type: integer, Default: 10
--minlength Reads shorter than this after trimming will be discarded
Type: integer, Default: 35
--ftm If positive, right-trim length to be equal to zero, modulo this number
Type: integer, Default: 5
--tossjunk Discard reads with invalid characters as bases
Type: string, Default: t
--ain When detecting pair names, allow identical names
Type: string, Default: f
--qout PHRED offset to use for output FASTQs
Type: string, Default: 33
--maxcor Max number of corrections within a 20bp window
Type: integer, Default: 1
--sampleseed Set to a positive number to use as the random number generator seed for sampling
Type: integer, Default: 42
--ont_minlength ONT Reads shorter than this will be discarded
Type: integer, Default: 1000
--ont_minqual Minimum average read quality filter of ONT reads
Type: integer
--porechop_opts Extra Porechop options in quotes
Type: string
--nanoplot_opts Extra NanoPlot options in quotes
Type: string
--bbduk_opts Extra BBDuk options in quotes
Type: string
--fastp_opts Extra fastp options in quotes
Type: string
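
If you launch runs from a script, the parameters above can be kept in a dictionary and rendered as command-line arguments. The sketch below is only an illustration: the helper qc_args is hypothetical, the option values are arbitrary examples rather than recommendations, and it assumes boolean parameters are passed as bare flags while all others take a single value.

```python
import shlex


def qc_args(options):
    """Render a dict of QC options as a list of command-line arguments."""
    args = []
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            args.append(flag)  # boolean switch, passed without a value
        elif value is False or value is None:
            continue  # omit the flag so the pipeline default applies
        else:
            args.extend([flag, str(value)])
    return args


# Example values are illustrative only.
options = {
    "skip_qc_plots": True,
    "use_bbmap": False,
    "minlength": 50,
    "trimq": 10,
}
print(shlex.join(qc_args(options)))
# prints --skip_qc_plots --minlength 50 --trimq 10
```

The resulting list can be appended to whatever command you already use to start Bactopia.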

Citations

If you use Bactopia and qc in your analysis, please cite the following.