Workflow Overview¶
Bactopia is an extensive workflow integrating numerous steps in bacterial genome analysis. Throughout the workflow, some steps are always enabled and others are enabled only when supplemental datasets are available. Each of the steps depicted in the image below is described in this section.
The software used directly in each step is also listed. Please check out the Acknowledgements section for the full list of software, as well as how to download and cite it.
Always Enabled Steps¶
The Always Enabled Steps are executed on every Bactopia run because they do not depend on external datasets.
Gather FASTQs¶
Specifies exactly where the input FASTQs/FASTAs are coming from. If you are using local inputs (e.g. --R1/--R2, --fastqs), it will verify they can be accessed.
If one or more accessions (--accession or --accessions) were given, the corresponding FASTQs (SRA/ENA) or assemblies (NCBI Assembly) are downloaded in this step. For every assembly, 2x250bp Illumina reads are simulated without insertions or deletions and with a minimum PHRED score of Q33.
Software | Usage |
---|---|
ART | Generate simulated Illumina reads from an assembly |
ena-dl | Download FASTQ files from ENA |
ncbi-genome-download | Download GenBank/RefSeq assemblies from NCBI Assembly database |
Validate FASTQs¶
Determines if the FASTQ file contains enough sequencing to continue processing. The --min_reads and --min_basepairs parameters adjust the minimum amount of sequencing required to continue processing. This step does not directly test the validity of the FASTQ format (although it would fail if the format is invalid).
Software | Usage |
---|---|
fastq-scan | Determine total read and basepairs of FASTQ |
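As a rough illustration of the check described above, the sketch below counts reads and basepairs in an uncompressed, four-lines-per-record FASTQ stream and compares the totals with two thresholds. It is not how fastq-scan or Bactopia implement this step. The function names are hypothetical, and the threshold values in the usage lines are made up; they only stand in for whatever is passed to --min_reads and --min_basepairs.

```python
import io


def count_reads_and_basepairs(handle):
    """Count reads and total basepairs in a 4-line-per-record FASTQ stream."""
    reads = 0
    basepairs = 0
    for line_number, line in enumerate(handle):
        # In a 4-line FASTQ record, the sequence is the second line (index 1).
        if line_number % 4 == 1:
            reads += 1
            basepairs += len(line.strip())
    return reads, basepairs


def has_enough_sequencing(reads, basepairs, min_reads, min_basepairs):
    """Return True only if both minimums are met."""
    return reads >= min_reads and basepairs >= min_basepairs


# Two tiny example records (illustrative data only).
example = io.StringIO("@r1\nACGTAC\n+\nIIIIII\n@r2\nGGCA\n+\nIIII\n")
reads, basepairs = count_reads_and_basepairs(example)
print(reads, basepairs)  # 2 10
print(has_enough_sequencing(reads, basepairs, min_reads=2, min_basepairs=10))  # True
```

Note that the sketch assumes well-formed records; like the step it illustrates, it does not validate the FASTQ format itself.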
Original Summary¶
Produces summary statistics (read lengths, quality scores, etc.) based on the original input FASTQs.
Software | Usage |
---|---|
FastQC | Generates an HTML report of original FASTQ summary statistics |
fastq-scan | Generates original FASTQ summary statistics in JSON format |
Genome Size¶
The genome size is used by various programs in the Bactopia workflow. By default, if no genome size is given, one is estimated using Mash. Otherwise, a specific genome size can be specified, or its use can be disabled completely, with the --genome_size parameter. See Genome Size Parameter to learn more about specifying the genome size.
Software | Usage |
---|---|
Mash | If not given, estimates genome size of sample |
Quality Control¶
The input FASTQs go through a few cleanup steps. First, Illumina-related adapters and phiX contaminants are removed. Then reads that fail to pass length and/or quality requirements are filtered out. If the genome size is available, sequencing errors are corrected and the total sequencing is reduced to a specified coverage. After this step, all downstream analyses are based on the QC'd FASTQ and the original is no longer used.
Software | Usage |
---|---|
BBTools | Removes Illumina adapters, phiX contaminants, filters reads based on length and quality score, and reduces inputs to a specified coverage. |
Lighter | Corrects sequencing errors |
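The coverage reduction mentioned above comes down to simple arithmetic: the target coverage multiplied by the genome size gives the number of bases to retain. The sketch below works through that calculation. The function names and the example numbers are hypothetical, and BBTools may carry out the reduction differently; this only shows why the step needs a genome size.

```python
def bases_to_keep(genome_size, target_coverage):
    """Number of bases that corresponds to the target coverage."""
    return genome_size * target_coverage


def sampling_fraction(total_bases, genome_size, target_coverage):
    """Fraction of the input to retain; 1.0 if already at or below target."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    wanted = bases_to_keep(genome_size, target_coverage)
    return min(1.0, wanted / total_bases)


# Illustrative numbers only: a 2.8 Mb genome sequenced to 300x,
# reduced to a hypothetical target of 100x.
genome_size = 2_800_000
total_bases = 840_000_000
print(total_bases / genome_size)  # 300.0 (current coverage)
print(bases_to_keep(genome_size, 100))  # 280000000
print(sampling_fraction(total_bases, genome_size, 100))
```

When the input is already below the target, the fraction is capped at 1.0 and nothing is discarded.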
QC Summary¶
Produces summary statistics (read lengths, quality scores, etc.) based on the final set of QC'd FASTQs.
Software | Usage |
---|---|
FastQC | Generates an HTML report of QC'd FASTQ summary statistics |
fastq-scan | Generates QC'd FASTQ summary statistics in JSON format |
Count 31-mers¶
All 31-basepair sequences (31-mers) are counted, and the singletons (31-mers counted only once) are filtered out.
Software | Usage |
---|---|
McCortex | Counts 31-mers in the input FASTQ |
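The idea of counting k-mers and discarding singletons can be sketched in a few lines. This is a simplified illustration, not McCortex's method: it holds everything in memory and treats a k-mer and its reverse complement as different sequences. The function names are hypothetical, and the usage lines use k=3 only to keep the example readable.

```python
from collections import Counter


def count_kmers(reads, k=31):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k produce an empty range and contribute nothing.
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def drop_singletons(counts):
    """Keep only k-mers observed more than once."""
    return {kmer: n for kmer, n in counts.items() if n > 1}


# Illustrative reads with k=3 for readability.
counts = count_kmers(["ACGTA", "ACGTT"], k=3)
print(drop_singletons(counts))  # {'ACG': 2, 'CGT': 2}
```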
Minmer Sketch¶
Minmer sketches and signatures are created from the QC'd FASTQs for multiple values of k. If datasets are available, the sketches/signatures are used for further downstream analysis.
Software | Usage |
---|---|
Mash | Produces a sketch (k=21,31) of the QC'd FASTQ |
Sourmash | Produces a signature (k=21,31,51) of the QC'd FASTQ |
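For readers unfamiliar with minmer sketches, the following is a minimal bottom-k MinHash sketch: hash every k-mer, keep the smallest hash values, and compare two sketches to estimate similarity. It is a teaching sketch under stated assumptions, not the Mash or Sourmash implementation. Those tools use their own hash functions and handle reverse complements; here blake2b from the standard library is swapped in purely to keep the example self-contained, and the function names, k, and sketch size are hypothetical.

```python
import hashlib
import heapq


def kmer_hashes(sequence, k):
    """Yield a 64-bit integer hash for every k-mer in the sequence."""
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k].encode("ascii")
        digest = hashlib.blake2b(kmer, digest_size=8).digest()
        yield int.from_bytes(digest, "big")


def bottom_sketch(sequences, k=21, size=1000):
    """Return the `size` smallest distinct k-mer hashes, in ascending order."""
    hashes = set()
    for sequence in sequences:
        hashes.update(kmer_hashes(sequence, k))
    return heapq.nsmallest(size, hashes)


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom sketches."""
    merged = heapq.nsmallest(size, set(sketch_a) | set(sketch_b))
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Illustrative sequences; small k and size keep the example short.
a = bottom_sketch(["ACGTACGTTGCA"], k=4, size=5)
b = bottom_sketch(["ACGTACGTTGCA"], k=4, size=5)
print(jaccard_estimate(a, b, size=5))  # 1.0 for identical inputs
```

Because only a fixed number of hashes is stored per sample, sketches stay small regardless of how much sequencing went in, which is what makes screening against large reference collections practical.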
De novo Assembly¶
The QC'd FASTQs are assembled using the Shovill pipeline. This allows for a seamless assembly process using MEGAHIT, SKESA, SPAdes or Velvet. Alternatively, if long reads are available to complement Illumina paired-end reads, hybrid assembly is available through Unicycler.
Software | Usage |
---|---|
assembly-scan | Generates summary statistics of the final assembly |
Shovill | Manages multiple steps in the Illumina assembly process |
Unicycler | Manages multiple steps in the hybrid assembly process |
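To give a sense of what assembly summary statistics look like, the sketch below computes contig count, total length, longest contig, and N50 from a list of contig lengths. These are common assembly metrics chosen for illustration; the exact fields reported by assembly-scan are not listed here, and the function name is hypothetical.

```python
def assembly_summary(contig_lengths):
    """Summarize an assembly from its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        # N50: length of the contig at which the running total, taken from
        # longest to shortest, first reaches half of the assembly length.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_length": total,
        "longest": lengths[0] if lengths else 0,
        "n50": n50,
    }


# Illustrative contig lengths only.
print(assembly_summary([100, 200, 300, 400]))
# {'contigs': 4, 'total_length': 1000, 'longest': 400, 'n50': 300}
```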
Assembly Quality Assessment¶
After assembly, the de novo assembly is assessed for its biological quality (e.g. completeness and contamination) using CheckM and its technical quality (e.g. misassemblies and errors) using QUAST.
Software | Usage |
---|---|
CheckM | Assess the biological quality of a de novo assembly based on presence of marker genes |
QUAST | Gives a summary on the technical (e.g. misassemblies etc) quality of a de novo assembly |
Genome Annotation¶
Genes are predicted and annotated from the assembled genome using Prokka. If available, a clustered RefSeq protein set is used for the first pass of annotation.
Software | Usage |
---|---|
Prokka | Predicts and annotates assembled genomes |
Antimicrobial Resistance¶
Searches for antimicrobial resistance genes and associated point mutations in the annotated gene and protein sequences. If datasets are available, local assemblies can also be used to predict antibiotic resistance.
Software | Usage |
---|---|
AMRFinderPlus | Predicts antimicrobial resistance based on genes and point mutations |
Dataset Enabled Steps¶
The remaining Dataset Enabled Steps are executed only when the supplemental datasets they require are available. There are many datasets available that Bactopia can take advantage of. To learn more about setting up these datasets, check out Build Datasets. These datasets can be broken into two groups, Public Datasets and User Datasets.
Public Datasets¶
Publicly available datasets can be used for further analysis.
Call Variants (Auto)¶
Variants are predicted using Snippy. The QC'd FASTQs are aligned to the nearest (based on Mash distance) RefSeq completed genome. By default, only the nearest genome is selected, but multiple genomes can be selected (--max_references) or this feature can be completely disabled (--disable_auto_variants).
Software | Usage |
---|---|
Bedtools | Generates the per-base coverage of the reference alignment |
NCBI Genome Download | Downloads the RefSeq completed genome |
Snippy | Manages multiple steps in the haploid variant calling process |
vcf-annotator | Adds annotations from reference GenBank to the final VCF |
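Selecting the nearest reference by Mash distance amounts to sorting candidates by distance and keeping the closest ones. The sketch below assumes the distances have already been computed and are held in a dictionary; the function name, the placeholder reference names, and the distance values are all hypothetical, and the max_references argument only mirrors the role of --max_references.

```python
def select_references(distances, max_references=1):
    """Return the names of the closest references, nearest first.

    distances: mapping of reference name -> Mash distance (smaller is closer).
    Ties are broken by name so the result is deterministic.
    """
    ranked = sorted(distances.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:max_references]]


# Placeholder names and made-up distances, for illustration only.
distances = {"ref_A": 0.012, "ref_B": 0.003, "ref_C": 0.045}
print(select_references(distances))  # ['ref_B']
print(select_references(distances, max_references=2))  # ['ref_B', 'ref_A']
```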
Minmer Query¶
Screens QC'd FASTQs and signatures against available Minmer Datasets.
Software | Usage |
---|---|
Mash | Screens against RefSeq and/or PLSDB sketches |
Sourmash | Screens signature against GenBank |
Sequence Type¶
Uses a PubMLST.org MLST schema to determine the sequence type of the sample.
Software | Usage |
---|---|
Ariba | Runs QC'd FASTQ against an MLST database |
BLAST | Aligns MLST loci against the assembled genome |
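Conceptually, an MLST schema pairs each known combination of allele numbers with a sequence type, so once the allele at each locus has been identified the final step is a lookup. The sketch below shows only that lookup. The function names, locus names, allele numbers, and sequence types are hypothetical and are not taken from any PubMLST schema, and neither Ariba nor BLAST is involved here.

```python
def build_profile_table(profiles):
    """Index schema profiles by their allele combination.

    profiles: list of dicts with an "ST" key plus one key per locus.
    """
    table = {}
    for profile in profiles:
        alleles = {locus: allele for locus, allele in profile.items() if locus != "ST"}
        table[frozenset(alleles.items())] = profile["ST"]
    return table


def lookup_sequence_type(table, sample_alleles):
    """Return the matching ST, or None when the combination is not in the schema."""
    return table.get(frozenset(sample_alleles.items()))


# Made-up two-locus schema, for illustration only.
schema = [
    {"ST": 1, "locus_a": 1, "locus_b": 1},
    {"ST": 2, "locus_a": 1, "locus_b": 3},
]
table = build_profile_table(schema)
print(lookup_sequence_type(table, {"locus_a": 1, "locus_b": 3}))  # 2
print(lookup_sequence_type(table, {"locus_a": 9, "locus_b": 9}))  # None
```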
User Datasets¶
Another option is for users to provide their own data to include in the analysis.
BLAST Alignment¶
Each gene, protein, or primer sequence provided by the user is aligned against the assembled genome.
Software | Usage |
---|---|
BLAST | Aligns reference sequences against the assembled genome |
Call Variants (User)¶
Uses the same procedure as Call Variants (Auto), except variants are called against each reference provided by the user.
Software | Usage |
---|---|
Bedtools | Generates the per-base coverage of the reference alignment |
Snippy | Manages multiple steps in the haploid variant calling process |
vcf-annotator | Adds annotations from reference GenBank to the final VCF |
Reference Mapping¶
Aligns the QC'd FASTQs to each sequence provided by the user.
Software | Usage |
---|---|
Bedtools | Generates the per-base coverage of the reference alignment |
BWA | Aligns QC'd FASTQ to a reference sequence |
Samtools | Converts alignment from SAM to BAM |
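Per-base coverage, as reported for the reference alignments above, is the number of aligned reads overlapping each reference position. The sketch below computes it from a list of alignment intervals using a difference array. It illustrates the concept only and does not read SAM/BAM files or call Bedtools; the function name and the 0-based, half-open interval convention are assumptions made for this example.

```python
def per_base_coverage(reference_length, alignments):
    """Return the depth at every reference position.

    alignments: iterable of (start, end) pairs, 0-based and half-open.
    """
    # Difference array: +1 where an alignment starts, -1 where it ends.
    diff = [0] * (reference_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(reference_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    coverage = []
    depth = 0
    for position in range(reference_length):
        depth += diff[position]
        coverage.append(depth)
    return coverage


# Two overlapping alignments on a 6 bp reference (illustrative data only).
print(per_base_coverage(6, [(0, 4), (2, 6)]))  # [1, 1, 2, 2, 1, 1]
```

The difference array keeps the work proportional to the reference length plus the number of alignments, rather than to the total number of aligned bases.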