Workflow Overview

Bactopia is an extensive workflow integrating numerous steps in bacterial genome analysis. Throughout the workflow, some steps are always enabled and others are enabled only when supplemental datasets are available. Each step depicted in the image below is described in this section.

The software used directly in each step is also listed. Please check out the Acknowledgements section for the full list of software, as well as how to download and cite it.

Bactopia Workflow

Always Enabled Steps

The Always Enabled Steps are executed in every Bactopia run because they do not depend on external datasets.

Gather FASTQs

Specifies exactly where the input FASTQ/FASTAs are coming from. If you are using local inputs (e.g. --R1/--R2, --fastqs) it will verify they can be accessed.

If one or more accessions (--accession or --accessions) were given, the corresponding FASTQs (SRA/ENA) or assemblies (NCBI Assembly) are downloaded in this step. For every assembly, 2x250bp Illumina reads are simulated without insertions or deletions and with a minimum PHRED score of Q33.

Software Usage
ART Generate simulated Illumina reads from an assembly
ena-dl Download FASTQ files from ENA
ncbi-genome-download Download GenBank/RefSeq assemblies from NCBI Assembly database

Validate FASTQs

Determines whether the FASTQ file contains enough sequencing to continue processing. The --min_reads and --min_basepairs parameters set the minimum amount of sequencing required to continue. This step does not directly test the validity of the FASTQ format (although it would fail if the format is invalid).

Software Usage
fastq-scan Determine total read and basepairs of FASTQ
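
As a rough illustration of the check described above, the following Python sketch totals the reads and basepairs in a FASTQ stream and compares them with two thresholds. It is not how fastq-scan or Bactopia implement this step: the function names are hypothetical, the threshold values are made up for the example, and the use of "greater than or equal to" is an assumption.

```python
import io


def fastq_totals(handle):
    """Return (total_reads, total_basepairs) for an uncompressed FASTQ stream.

    Assumes simple 4-line FASTQ records (no wrapped sequence lines).
    """
    reads = 0
    basepairs = 0
    for index, line in enumerate(handle):
        # The sequence is the second line of every 4-line record.
        if index % 4 == 1:
            reads += 1
            basepairs += len(line.strip())
    return reads, basepairs


def has_enough_sequencing(reads, basepairs, min_reads, min_basepairs):
    """True when both (hypothetical) minimums are met."""
    return reads >= min_reads and basepairs >= min_basepairs


# Tiny in-memory FASTQ with two reads, used only for demonstration.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
reads, basepairs = fastq_totals(example)
print(reads, basepairs, has_enough_sequencing(reads, basepairs, 2, 10))
```

A sample that fails either threshold would stop here rather than continue to the later steps.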

Original Summary

Produces summary statistics (read lengths, quality scores, etc.) based on the original input FASTQs.

Software Usage
FastQC Generates an HTML report of original FASTQ summary statistics
fastq-scan Generates original FASTQ summary statistics in JSON format

Genome Size

The genome size is used by various programs in the Bactopia workflow. By default, if no genome size is given, one is estimated using Mash. Alternatively, a specific genome size can be provided, or its use can be disabled entirely, with the --genome_size parameter. See Genome Size Parameter to learn more about specifying the genome size.

Software Usage
Mash If not given, estimates genome size of sample

Quality Control

The input FASTQs go through a few cleanup steps. First, Illumina-related adapters and phiX contaminants are removed. Then reads that fail to pass length and/or quality requirements are filtered out. If the genome size is available, sequencing errors are corrected and the total sequencing is reduced to a specified coverage. After this step, all downstream analyses are based on the QC'd FASTQ and the original is no longer used.

Software Usage
BBTools Removes Illumina adapters, phiX contaminants, filters reads based on length and quality score, and reduces inputs to a specified coverage.
Lighter Corrects sequencing errors
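
The coverage reduction mentioned above depends on the genome size because coverage is the total number of sequenced basepairs divided by the genome size. The sketch below shows that arithmetic only. The function names and the numbers in the example are hypothetical, and it does not reproduce how BBTools selects which reads to keep.

```python
def estimated_coverage(total_basepairs, genome_size):
    """Coverage is total sequenced basepairs divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_basepairs / genome_size


def downsample_fraction(total_basepairs, genome_size, target_coverage):
    """Fraction of the sequencing to keep so coverage is at most the target.

    Returns 1.0 when the sample is already at or below the target.
    """
    if total_basepairs <= 0:
        return 1.0
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    target_basepairs = target_coverage * genome_size
    return min(1.0, target_basepairs / total_basepairs)


# Made-up example: 500 Mbp of reads for a 2.8 Mbp genome, reduced to 100x.
print(estimated_coverage(500_000_000, 2_800_000))
print(downsample_fraction(500_000_000, 2_800_000, 100))
```

Without a genome size, neither value can be computed, which is why this part of quality control is skipped when the genome size is unavailable.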

QC Summary

Produces summary statistics (read lengths, quality scores, etc.) based on the final set of QC'd FASTQs.

Software Usage
FastQC Generates an HTML report of QC'd FASTQ summary statistics
fastq-scan Generates QC'd FASTQ summary statistics in JSON format

Count 31-mers

All 31-basepair sequences (31-mers) are counted, and the singletons (31-mers counted only once) are filtered out.

Software Usage
McCortex Counts 31-mers in the input FASTQ
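
To make the idea concrete, here is a minimal Python sketch of counting k-mers and dropping singletons. It is a simplification and not McCortex's implementation: it keeps everything in memory and, as an assumption made for brevity, it does not merge a k-mer with its reverse complement. The function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k=31):
    """Count every k-length substring across the given read sequences."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        # Reads shorter than k contribute no k-mers (empty range).
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def drop_singletons(counts):
    """Keep only k-mers observed more than once."""
    return {kmer: n for kmer, n in counts.items() if n > 1}


# Small k so the toy reads below actually share k-mers.
toy_reads = ["ACGTA", "ACGTT"]
print(drop_singletons(count_kmers(toy_reads, k=3)))
```

Singletons are often the result of sequencing errors, so removing them reduces noise as well as the size of the count table.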

Minmer Sketch

A minmer sketch and a signature are created from the QC'd FASTQs for multiple values of k. If datasets are available, the sketches and signatures are used in downstream analyses.

Software Usage
Mash Produces a sketch (k=21,31) of the QC'd FASTQ
Sourmash Produces a signature (k=21,31,51) of the QC'd FASTQ
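
For readers unfamiliar with minmer (MinHash) sketches, the following Python sketch shows the general bottom-s idea: hash every k-mer, keep only the smallest hash values, and compare two sketches to estimate similarity. It is illustrative only. SHA-1 is swapped in as the hash function for convenience, the sketch size is arbitrary, reverse complements are ignored, and the function names are hypothetical, so the values will not match those produced by Mash or Sourmash.

```python
import hashlib


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 used as a stand-in hash)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_sketch(sequence, k=21, sketch_size=1000):
    """Return the sketch_size smallest distinct k-mer hashes, sorted."""
    hashes = {
        kmer_hash(sequence[start:start + k])
        for start in range(len(sequence) - k + 1)
    }
    return sorted(hashes)[:sketch_size]


def jaccard_estimate(sketch_a, sketch_b, sketch_size=1000):
    """Estimate Jaccard similarity from two bottom sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:sketch_size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for value in merged if value in shared) / len(merged)


seq = "ACGTACGTTAGCCGATAGCTAGCTAGGATCCA"
sketch = bottom_sketch(seq, k=21, sketch_size=5)
print(len(sketch), jaccard_estimate(sketch, sketch, sketch_size=5))
```

Because a sketch is a small fixed-size summary, two samples can be compared without revisiting the full set of reads.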

De novo Assembly

The QC'd FASTQs are assembled using the Shovill pipeline. This allows for a seamless assembly process using MEGAHIT, SKESA, SPAdes or Velvet. Alternatively, if long reads are available to complement Illumina paired-end reads, hybrid assembly is available through Unicycler.

Software Usage
assembly-scan Generates summary statistics of the final assembly
Shovill Manages multiple steps in the Illumina assembly process
Unicycler Manages multiple steps in the hybrid assembly process
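
As an example of the kind of summary statistic computed for an assembly, the sketch below calculates N50 from a list of contig lengths. N50 is a widely used assembly metric: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name is hypothetical and this is not taken from assembly-scan's source.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Made-up contig lengths for demonstration.
print(n50([100, 200, 300, 400]))
```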

Assembly Quality Assessment

After assembly, the de novo assembly is assessed for its biological quality (e.g. completeness and contamination) and its technical quality (e.g. misassemblies and errors) using CheckM and QUAST.

Software Usage
CheckM Assess the biological quality of a de novo assembly based on presence of marker genes
QUAST Summarizes the technical quality (e.g. misassemblies) of a de novo assembly

Genome Annotation

Genes are predicted and annotated from the assembled genome using Prokka. If available, a clustered RefSeq protein set is used for the first pass of annotation.

Software Usage
Prokka Predicts and annotates assembled genomes

Antimicrobial Resistance

Searches for antimicrobial resistance genes and associated point mutations in the annotated gene and protein sequences. If datasets are available, local assemblies can also be used to predict antibiotic resistance.

Software Usage
AMRFinderPlus Predicts antimicrobial resistance based on genes and point mutations

Dataset Enabled Steps

The remaining Dataset Enabled Steps are executed only when supplemental datasets are available. There are many datasets that Bactopia can take advantage of. To learn more about setting up these datasets, check out Build Datasets. These datasets can be broken into two groups, Public Datasets and User Datasets.

Public Datasets

Publicly available datasets can be used for further analysis.

Call Variants (Auto)

Variants are predicted using Snippy. The QC'd FASTQs are aligned to the nearest (based on Mash distance) RefSeq completed genome. By default, only the nearest genome is selected, but multiple genomes can be selected (--max_references) or this feature can be completely disabled (--disable_auto_variants).

Software Usage
Bedtools Generates the per-base coverage of the reference alignment
ncbi-genome-download Downloads the RefSeq completed genome
Snippy Manages multiple steps in the haploid variant calling process
vcf-annotator Adds annotations from reference GenBank to the final VCF
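
The reference selection described above amounts to ranking candidate genomes by Mash distance and keeping the closest ones. The Python sketch below illustrates that ranking under simple assumptions: the distances are already available as a dictionary, ties are broken alphabetically, and the reference names and the function name are hypothetical placeholders.

```python
def nearest_references(distances, max_references=1):
    """Return the names of the closest references, smallest distance first.

    distances maps a reference name to its Mash distance from the sample.
    """
    ranked = sorted(distances.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:max_references]]


# Placeholder names and made-up distances for demonstration.
example_distances = {"ref_A": 0.012, "ref_B": 0.003, "ref_C": 0.045}
print(nearest_references(example_distances))
print(nearest_references(example_distances, max_references=2))
```

Raising max_references in this sketch simply returns more of the ranked list, mirroring the idea of calling variants against more than one nearby genome.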

Minmer Query

Screens QC'd FASTQs and signatures against available Minmer Datasets.

Software Usage
Mash Screens against RefSeq and/or PLSDB sketches
Sourmash Screens signature against GenBank

Sequence Type

Uses a PubMLST.org MLST schema to determine the sequence type of the sample.

Software Usage
Ariba Runs QC'd FASTQ against an MLST database
BLAST Aligns MLST loci against the assembled genome

User Datasets

Another option is for users to provide their own data to include in the analysis.

BLAST Alignment

Each gene, protein, or primer sequence provided by the user is aligned against the assembled genome.

Software Usage
BLAST Aligns reference sequences against the assembled genome

Call Variants (User)

Uses the same procedure as Call Variants (Auto), except variants are called against each reference provided by the user.

Software Usage
Bedtools Generates the per-base coverage of the reference alignment
Snippy Manages multiple steps in the haploid variant calling process
vcf-annotator Adds annotations from reference GenBank to the final VCF

Reference Mapping

Aligns the QC'd FASTQs to each sequence provided by the user.

Software Usage
Bedtools Generates the per-base coverage of the reference alignment
BWA Aligns QC'd FASTQ to a reference sequence
Samtools Converts alignment from SAM to BAM
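
Per-base coverage, as reported for the reference alignments above, is the number of aligned reads overlapping each position of the reference. The following Python sketch computes it from a list of alignment intervals. It is an illustration under stated assumptions: intervals are 0-based and half-open, gaps within an alignment are ignored, and the function name is hypothetical. It is not how Bedtools is implemented.

```python
def per_base_coverage(reference_length, alignments):
    """Return a list with the read depth at every reference position.

    alignments is an iterable of (start, end) pairs, 0-based, end exclusive.
    """
    # Difference array: +1 where an alignment starts, -1 where it ends.
    delta = [0] * (reference_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(reference_length, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    coverage = []
    depth = 0
    for change in delta[:reference_length]:
        depth += change
        coverage.append(depth)
    return coverage


# Two overlapping toy alignments on a 5 bp reference.
print(per_base_coverage(5, [(0, 3), (2, 5)]))
```

Positions with a depth of zero indicate regions of the user-provided sequence that no reads aligned to.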