Acknowledgements
Bactopia is truly a case of "standing upon the shoulders of giants". Bactopia currently integrates more than 157 datasets and software packages. Nearly every component utilized in Bactopia, from the workflow to the datasets to the software packages and even the framework of this site, was created by others and made freely accessible to the public.
I would like to personally extend my many thanks and gratitude to the authors of these software packages and public datasets. If you've made it this far, I owe you a beer 🍻 (or coffee ☕!) if we ever encounter one another in person. Really, thank you very much!
Please Cite Datasets and Tools
If you have used Bactopia in your work, please be sure to cite any datasets or software you may have used.
Funding
Support for this project came (in part) from an Emory Public Health Bioinformatics Fellowship funded by the CDC Emerging Infections Program (U50CK000485) PPHF/ACA: Enhancing Epidemiology and Laboratory Capacity, the Wyoming Public Health Division, and the Center for Applied Pathogen Epidemiology and Outbreak Control (CAPE).
Influences
nf-core
nf-core is a great group of individuals volunteering their time to create a set of curated Nextflow analysis pipelines. The nf-core Team has put together some amazing practices that I think really strengthen the Nextflow community as a whole!
I'm often asked: Will Bactopia ever be a part of nf-core?
The answer is: No, but...
Bactopia was adapted from Staphopia, which pre-dates the beginnings of nf-core. As both nf-core and Bactopia grew, it became clear that adding Bactopia to nf-core was going to be a difficult task. The last opportunity to do so was probably when Bactopia was converted to DSL2, but Bactopia Tools would likely never fit into the nf-core mold.
However, where possible, I have tried to implement nf-core practices into Bactopia. Some examples include:
- Argument parsing based on the nf-core library
- All Bactopia Tools are adapted from nf-core/modules
- Testing implemented to follow nf-core/modules
By implementing these practices, I believe Bactopia is a much better pipeline to use. For this I'm very grateful to the nf-core community! Thank you!
Ewels P, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. (2020)
Public Datasets
Below is a list of 19 public datasets that could have potentially been used through Bactopia or Bactopia Tools.
Ariba Reference Datasets
These datasets are available using Ariba's getref function. You can learn more about this function at Ariba's Wiki.
-
ARG-ANNOT
Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, Rolain J-M ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother 58, 212–220 (2014) -
CARD
Alcock BP, Raphenya AR, Lau TTY, Tsang KK, Bouchard M, Edalatmand A, Huynh W, Nguyen A-L V, Cheng AA, Liu S, Min SY, Miroshnichenko A, Tran H-K, Werfalli RE, Nasir JA, Oloni M, Speicher DJ, Florescu A, Singh B, Faltyn M, Hernandez-Koutoucheva A, Sharma AN, Bordeleau E, Pawlowski AC, Zubyk HL, Dooley D, Griffiths E, Maguire F, Winsor GL, Beiko RG, Brinkman FSL, Hsiao WWL, Domselaar GV, McArthur AG CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic acids research 48.D1, D517-D525 (2020) -
EcOH
Ingle DJ, Valcanis M, Kuzevski A, Tauschek M, Inouye M, Stinear T, Levine MM, Robins-Browne RM, Holt KE In silico serotyping of E. coli from short read data identifies limited novel O-loci but extensive diversity of O:H serotype combinations within and between pathogenic lineages. Microbial Genomics, 2(7), e000064. (2016) -
MEGARes
Lakin SM, Dean C, Noyes NR, Dettenwanger A, Ross AS, Doster E, Rovira P, Abdo Z, Jones KL, Ruiz J, Belk KE, Morley PS, Boucher C MEGARes: an antimicrobial resistance database for high throughput sequencing. Nucleic Acids Res. 45, D574–D580 (2017) -
MEGARes 2.0
Doster E, Lakin SM, Dean CJ, Wolfe C, Young JG, Boucher C, Belk KE, Noyes NR, Morley PS MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Research, 48(D1), D561–D569. (2020) -
NCBI Reference Gene Catalog
Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W Validating the NCBI AMRFinder Tool and Resistance Gene Database Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of NARMS Isolates. Antimicrob. Agents Chemother. (2019) -
PlasmidFinder
Carattoli A, Zankari E, García-Fernández A, Larsen MV, Lund O, Villa L, Aarestrup FM, Hasman H In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob. Agents Chemother. 58, 3895–3903 (2014) -
ResFinder
Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV Identification of acquired antimicrobial resistance genes. J. Antimicrob. Chemother. 67, 2640–2644 (2012) -
SRST2
Inouye M, Dashnow H, Raven L-A, Schultz MB, Pope BJ, Tomita T, Zobel J, Holt KE SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 6, 90 (2014) -
VFDB
Chen L, Zheng D, Liu B, Yang J, Jin Q VFDB 2016: hierarchical and refined dataset for big data analysis--10 years on. Nucleic Acids Res. 44, D694–7 (2016) -
VirulenceFinder
Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J. Clin. Microbiol. 52, 1501–1510 (2014)
Minmer Datasets
-
Mash Refseq (release 88) Sketch
Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM Mash Screen: high-throughput sequence containment estimation for genome discovery Genome Biol 20, 232 (2019) -
Sourmash Genbank LCA Signature
Brown CT, Irber L sourmash: a library for MinHash sketching of DNA. JOSS 1, 27 (2016)
Everything Else
-
eggNOG 5.0 Database
Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, Mende DR, Letunic I, Rattei T, Jensen LJ, von Mering C, Bork P eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019) -
Genome Taxonomy Database
Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, Hugenholtz P GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy Nucleic Acids Research gkab776 (2021) -
MOB-suite Database
Robertson J, Bessonov K, Schonfeld J, Nash JHE. Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microbial Genomics, 6(10)(2020) -
NCBI RefSeq Database
O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–45 (2016) -
PubMLST.org
Jolley KA, Bray JE, Maiden MCJ Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3, 124 (2018) -
SILVA rRNA Database
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–6 (2013)
Software Included In Bactopia
Below are 138 software packages used (directly and indirectly) by Bactopia. A link to the software page as well as the citation (if available) have been included.
-
Abricate
Mass screening of contigs for antimicrobial and virulence genes
Seemann T Abricate: mass screening of contigs for antimicrobial and virulence genes (GitHub) -
abriTAMR
A pipeline for running AMRfinderPlus and collating results into functional classes
Sherry NL, Horan KA, Ballard SA, Gonçalves da Silva A, Gorrie CL, Schultz MB, Stevens K, Valcanis M, Sait ML, Stinear TP, Howden BP, and Seemann T An ISO-certified genomics workflow for identification and surveillance of antimicrobial resistance. Nature Communications, 14(1), 60. (2023) -
AgrVATE
Rapid identification of Staphylococcus aureus agr locus type and agr operon variants.
Raghuram V. AgrVATE: Rapid identification of Staphylococcus aureus agr locus type and agr operon variants. (GitHub) -
AMRFinderPlus
Find acquired antimicrobial resistance genes and some point mutations in protein or assembled nucleotide sequences.
Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W Validating the NCBI AMRFinder Tool and Resistance Gene Database Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of NARMS Isolates. Antimicrob. Agents Chemother. (2019) -
any2fasta
Convert various sequence formats to FASTA
Seemann T any2fasta: Convert various sequence formats to FASTA (GitHub) -
Aragorn
Finds transfer RNA features (tRNA)
Laslett D, Canback B ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32(1):11-6 (2004) -
Ariba
Antimicrobial Resistance Identification By Assembly
Hunt M, Mather AE, Sánchez-Busó L, Page AJ, Parkhill J, Keane JA, Harris SR ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads. Microb Genom 3, e000131 (2017) -
ART
A set of simulation tools to generate synthetic next-generation sequencing reads
Huang W, Li L, Myers JR, Marth GT ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012) -
assembly-scan
Generate basic stats for an assembly.
Petit III RA assembly-scan: generate basic stats for an assembly (GitHub) -
Bakta
Rapid & standardized annotation of bacterial genomes & plasmids
Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A Bakta - rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics 7(11) (2021) -
Barrnap
Bacterial ribosomal RNA predictor
Seemann T Barrnap: Bacterial ribosomal RNA predictor (GitHub) -
BBTools
BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.
Bushnell B BBMap short read aligner, and other bioinformatic tools. (Link) -
BCFtools
Utilities for variant calling and manipulating VCFs and BCFs.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H Twelve years of SAMtools and BCFtools GigaScience Volume 10, Issue 2 (2021) -
Bedtools
A powerful toolset for genome arithmetic.
Quinlan AR, Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010) -
BLAST
Basic Local Alignment Search Tool
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009) -
Bowtie2
A fast and sensitive gapped read aligner
Langmead B, Salzberg SL Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9, 357–359 (2012) -
Bracken
Bracken is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample
Lu J, Breitwieser FP, Thielen P, and Salzberg SL Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 3, e104. (2017) -
BTyper3
In silico taxonomic classification of Bacillus cereus group genomes using whole-genome sequencing data
Carroll LM, Wiedmann M, Kovac J Proposal of a Taxonomic Nomenclature for the Bacillus cereus Group Which Reconciles Genomic Definitions of Bacterial Species with Clinical and Industrial Phenotypes. mBio, 11(1). (2020) -
BTyper3
In silico taxonomic classification of Bacillus cereus group genomes using whole-genome sequencing data
Carroll LM, Cheng RA, Kovac J No Assembly Required: Using BTyper3 to Assess the Congruency of a Proposed Taxonomic Framework for the Bacillus cereus Group With Historical Typing Methods. Frontiers in Microbiology, 11, 580691. (2020) -
BUSCO
Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCO)
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38(10), 4647–4654. (2021) -
BWA
Burrows-Wheeler Aligner for short-read alignment
Li H Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] (2013) -
CD-HIT
Accelerated for clustering the next-generation sequencing data
Li W, Godzik A Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006) -
CD-HIT-EST
Accelerated for clustering the next-generation sequencing data
Fu L, Niu B, Zhu Z, Wu S, Li W CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012) -
CheckM
Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043–1055 (2015) -
ClonalFrameML
Efficient Inference of Recombination in Whole Bacterial Genomes
Didelot X, Wilson DJ ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11(2) e1004041 (2015) -
csvtk
A cross-platform, efficient and practical CSV/TSV toolkit in Golang
Shen, W csvtk: A cross-platform, efficient and practical CSV/TSV toolkit in Golang. (GitHub) -
DIAMOND
Accelerated BLAST compatible local sequence aligner.
Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59–60 (2015) -
Dragonflye
Assemble bacterial isolate genomes from Nanopore reads.
Petit III RA Dragonflye: Assemble bacterial isolate genomes from Nanopore reads. (GitHub) -
ECTyper
In-silico prediction of Escherichia coli serotype
Laing C, Bessonov K, Sung S, La Rose C ECTyper - In silico prediction of Escherichia coli serotype (GitHub) -
eggNOG-mapper
Fast genome-wide functional annotation through orthology assignment
Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, Bork P Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017) -
emmtyper
emm Automatic Isolate Labeller
Tan A, Seemann T, Lacey D, Davies M, Mcintyre L, Frost H, Williamson D, Gonçalves da Silva A emmtyper - emm Automatic Isolate Labeller (GitHub) -
FastANI
Fast Whole-Genome Similarity (ANI) Estimation
Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018) -
FastQC
A quality control analysis tool for high throughput sequencing data.
Andrews S FastQC: a quality control tool for high throughput sequence data. (WebLink) -
fastq-dl
Download FASTQ files from SRA or ENA repositories.
Petit III RA fastq-dl: Download FASTQ files from SRA or ENA repositories. (GitHub) -
fastq-scan
Output FASTQ summary statistics in JSON format
Petit III RA fastq-scan: generate summary statistics of input FASTQ sequences. (GitHub) -
fastp
A tool designed to provide fast all-in-one preprocessing for FastQ files
Chen S, Zhou Y, Chen Y, and Gu J fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. (2018) -
FastTree
Approximately-maximum-likelihood phylogenetic trees
Price MN, Dehal PS, Arkin AP FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490 (2010) -
FLASH
A fast and accurate tool to merge paired-end reads.
Magoč T, Salzberg SL FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27.21 2957-2963 (2011) -
Flye
De novo assembler for single molecule sequencing reads using repeat graphs
Kolmogorov M, Yuan J, Lin Y, Pevzner P Assembly of Long Error-Prone Reads Using Repeat Graphs Nature Biotechnology (2019) -
freebayes
Bayesian haplotype-based genetic polymorphism discovery and genotyping
Garrison E, Marth G Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] (2012) -
GAMMA
Gene Allele Mutation Microbial Assessment
Stanton RA, Vlachos N, Halpin AL GAMMA: a tool for the rapid identification, classification, and annotation of translated gene matches from sequencing data. Bioinformatics (2021) -
GenoTyphi
Assign genotypes to Salmonella Typhi genomes based on Mykrobe results
Wong VK, Baker S, Connor TR, Pickard D, Page AJ, Dave J, Murphy N, Holliman R, Sefton A, Millar M, Dyson ZA, Dougan G, Holt KE, & International Typhoid Consortium. An extended genotyping framework for Salmonella enterica serovar Typhi, the cause of human typhoid Nature Communications 7, 12827. (2016) -
GNU Parallel
A shell tool for executing jobs in parallel
Tange O GNU Parallel (2018) -
GTDB-Tk
A toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes
Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics (2019) -
Gubbins
Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences
Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, Parkhill J, Harris SR Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research 43(3), e15. (2015) -
hicap
in silico typing of the H. influenzae cap locus
Watts SC, Holt KE hicap: in silico serotyping of the Haemophilus influenzae capsule locus. Journal of Clinical Microbiology JCM.00190-19 (2019) -
HMMER
Biosequence analysis using profile hidden Markov models
Eddy SR Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011) -
HpsuisSero
Rapid Haemophilus parasuis serotyping
Lui J HpsuisSero: Rapid Haemophilus parasuis serotyping (GitHub) -
Infernal
Searches DNA sequence databases for RNA structure and sequence similarities
Nawrocki EP, Eddy SR Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29(22), 2933-2935 (2013) -
IQ-TREE
Efficient phylogenomic software by maximum likelihood
Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol. 32:268-274 (2015) -
ModelFinder
Used for automatic model selection
Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS ModelFinder - Fast model selection for accurate phylogenetic estimates. Nat. Methods 14:587-589 (2017) -
UFBoot2
Used to conduct ultrafast bootstrapping
Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35:518–522 (2018) -
ISMapper
IS mapping software
Hawkey J, Hamidian M, Wick RR, Edwards DJ, Billman-Jacobe H, Hall RM, Holt KE ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data. BMC Genomics 16, 667 (2015) -
Kaptive
Surface polysaccharide loci for Klebsiella pneumoniae species complex and Acinetobacter baumannii genomes
Wyres KL, Wick RR, Gorrie C, Jenney A, Follador R, Thomson NR, Holt KE Identification of Klebsiella capsule synthesis loci from whole genome data. Microbial genomics 2(12) (2016) -
Kleborate
Genotyping tool for Klebsiella pneumoniae and its related species complex
Lam MMC, Wick RR, Watts SC, Cerdeira LT, Wyres KL, Holt KE A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex. Nat Commun 12, 4188 (2021) -
KMC
Fast and frugal disk based k-mer counter
Deorowicz S, Kokot M, Grabowski Sz, Debudaj-Grabysz A KMC 2: Fast and resource-frugal k-mer counting Bioinformatics 31(10):1569–1576 (2015) -
Kraken2
The second version of the Kraken taxonomic sequence classification system
Wood DE, Lu J, Langmead B Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. (2019) -
Krona
Interactively explore metagenomes and more from a web browser
Ondov BD, Bergman NH, and Phillippy AM Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12, 385. (2011) -
legsta
In silico Legionella pneumophila Sequence Based Typing
Seemann T legsta: In silico Legionella pneumophila Sequence Based Typing (GitHub) -
Lighter
Fast and memory-efficient sequencing error corrector
Song L, Florea L, Langmead B Lighter: Fast and Memory-efficient Sequencing Error Correction without Counting. Genome Biol. 15(11):509 (2014) -
LisSero
In silico serotype prediction for Listeria monocytogenes
Kwong J, Zhang J, Seemann T, Horan K, Gonçalves da Silva A LisSero - In silico serotype prediction for Listeria monocytogenes (GitHub) -
MAFFT
Multiple alignment program for amino acid or nucleotide sequences
Katoh K, Standley DM MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013) -
Mash
Fast genome and metagenome distance estimation using MinHash
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016) -
Mash
High-throughput sequence containment estimation
Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM Mash Screen: high-throughput sequence containment estimation for genome discovery Genome Biol 20, 232 (2019) -
Mashtree
Create a tree using Mash distances
Katz LS, Griswold T, Morrison S, Caravas J, Zhang S, den Bakker HC, Deng X, Carleton HA Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762 (2019) -
maskrc-svg
Masks recombination as detected by ClonalFrameML or Gubbins
Kwong J maskrc-svg - Masks recombination as detected by ClonalFrameML or Gubbins and draws an SVG. (GitHub) -
McCortex
De novo genome assembly and multisample variant calling
Turner I, Garimella KV, Iqbal Z, McVean G Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2018) -
mcroni
Scripts for finding and processing promoter variants upstream of mcr-1
Shaw L mcroni: Scripts for finding and processing promoter variants upstream of mcr-1 (GitHub) -
Medaka
Sequence correction provided by ONT Research
ONT Research Medaka: Sequence correction provided by ONT Research (GitHub) -
meningotype
In silico serotyping, finetyping and Bexsero antigen sequence typing of Neisseria meningitidis
Kwong JC, Gonçalves da Silva A, Stinear TP, Howden BP, & Seemann T meningotype: in silico typing for Neisseria meningitidis. (GitHub) -
MEGAHIT
Ultra-fast and memory-efficient (meta-)genome assembler
Li D, Liu C-M, Luo R, Sadakane K, Lam T-W MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31.10 1674-1676 (2015) -
mlst
Scan contig files against PubMLST typing schemes
Seemann T mlst: scan contig files against PubMLST typing schemes (GitHub) -
MIDAS
An integrated pipeline for estimating strain-level genomic variation from metagenomic data
Nayfach S, Rodriguez-Mueller B, Garud N, and Pollard KS An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Research, 26(11), 1612–1625. (2016) -
MinCED
Mining CRISPRs in Environmental Datasets
Skennerton C MinCED: Mining CRISPRs in Environmental Datasets (GitHub) -
Miniasm
Ultrafast de novo assembly for long noisy reads (though having no consensus step)
Li H Miniasm: Ultrafast de novo assembly for long noisy reads (GitHub) -
Minimap2
A versatile pairwise aligner for genomic and spliced nucleotide sequences
Li H Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100 (2018) -
MOB-suite
Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Robertson J, Nash JHE MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microbial Genomics 4(8). (2018) -
Mykrobe
Antibiotic resistance prediction in minutes
Hunt M, Bradley P, Lapierre SG, Heys S, Thomsit M, Hall MB, Malone KM, Wintringer P, Walker TM, Cirillo DM, Comas I, Farhat MR, Fowler P, Gardy J, Ismail N, Kohl TA, Mathys V, Merker M, Niemann S, Omar SV, Sintchenko V, Smith G, Supply P, Tahseen S, Wilcox M, Arandjelovic I, Peto TEA, Crook DW, Iqbal Z Antibiotic resistance prediction for Mycobacterium tuberculosis from genome sequence data with Mykrobe Wellcome Open Research 4, 191. (2019) -
NanoPlot
Plotting scripts for long read sequencing data
De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C NanoPack: visualizing and processing long-read sequencing data Bioinformatics Volume 34, Issue 15 (2018) -
Nanoq
Minimal but speedy quality control for nanopore reads in Rust
Steinig E Nanoq: Minimal but speedy quality control for nanopore reads in Rust (GitHub) -
ncbi-genome-download
Scripts to download genomes from the NCBI FTP servers
Blin K ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers (GitHub) -
Nextflow
A DSL for data-driven computational pipelines.
Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017) -
ngmaster
In silico multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST)
Kwong J, Gonçalves da Silva A, Schultz M, Seemann T ngmaster - In silico multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) (GitHub) -
nhmmer
DNA homology search with profile HMMs.
Wheeler TJ, Eddy SR nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013) -
Panaroo
An updated pipeline for pangenome investigation
Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biology 21(1), 180. (2020) -
pasty
in silico serogrouping of Pseudomonas aeruginosa isolates
Petit III RA pasty: in silico serogrouping of Pseudomonas aeruginosa isolates (GitHub) -
pbptyper
Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies
Petit III RA pbptyper: In silico Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies (GitHub) -
PhiSpy
Prediction of prophages from bacterial genomes
Akhter S, Aziz RK, and Edwards RA PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Research, 40(16), e126. (2012) -
phyloFlash
A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
Gruber-Vodicka HR, Seah BKB, Pruesse E phyloFlash: Rapid Small-Subunit rRNA Profiling and Targeted Assembly from Metagenomes mSystems 5 (2020) -
Pigz
A parallel implementation of gzip for modern multi-processor, multi-core machines.
Adler M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015) -
Pilon
An automated genome assembly improvement and variant detection tool
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one 9.11 e112963 (2014) -
PIRATE
A toolbox for pangenome analysis and threshold evaluation.
Bayliss SC, Thorpe HA, Coyle NM, Sheppard SK, Feil EJ PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience 8 (2019) -
PlasmidFinder
Identifies plasmids in total or partial sequenced isolates of bacteria
Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrobial Agents and Chemotherapy 58(7), 3895–3903. (2014) -
PneumoCaT
Pneumococcal Capsular Typing tool for NGS data
Kapatai G, Sheppard CL, Al-Shahib A, Litt DJ, Underwood AP, Harrison TG, and Fry NK Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline. PeerJ, 4, e2477. (2016) -
Porechop
adapter trimmer for Oxford Nanopore reads
Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 3(10):e000132 (2017) -
pplacer
Phylogenetic placement and downstream analysis
Matsen FA, Kodner RB, Armbrust EV pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010) -
Prodigal
Fast, reliable protein-coding gene prediction for prokaryotic genomes.
Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11.1 119 (2010) -
Prokka
Rapid prokaryotic genome annotation
Seemann T Prokka: rapid prokaryotic genome annotation Bioinformatics 30, 2068–2069 (2014) -
QUAST
Quality Assessment Tool for Genome Assemblies
Gurevich A, Saveliev V, Vyahhi N, Tesler G QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013) -
Racon
Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads
Vaser R, Sović I, Nagarajan N, Šikić M Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27, 737–746 (2017) -
Rasusa
Randomly subsample sequencing reads to a specified coverage
Hall MB Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019). -
Raven
De novo genome assembler for long uncorrected reads
Vaser R, Šikić M Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021) -
Resistance Gene Identifier (RGI)
Software to predict resistomes from protein or nucleotide data, based on homology and SNP models.
Alcock BP, Raphenya AR, Lau TTY, Tsang KK, Bouchard M, Edalatmand A, Huynh W, Nguyen A-L V, Cheng AA, Liu S, Min SY, Miroshnichenko A, Tran H-K, Werfalli RE, Nasir JA, Oloni M, Speicher DJ, Florescu A, Singh B, Faltyn M, Hernandez-Koutoucheva A, Sharma AN, Bordeleau E, Pawlowski AC, Zubyk HL, Dooley D, Griffiths E, Maguire F, Winsor GL, Beiko RG, Brinkman FSL, Hsiao WWL, Domselaar GV, McArthur AG CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic acids research 48.D1, D517-D525 (2020) -
RNAmmer
Consistent and rapid annotation of ribosomal RNA genes
Lagesen K, Hallin P, Rødland EA, Stærfeldt H-H, Rognes T, Ussery DW RNAmmer: consistent annotation of rRNA genes in genomic sequences. Nucleic Acids Res 35.9: 3100-3108 (2007) -
Roary
Rapid large-scale prokaryote pan genome analysis
Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015) -
samclip
Filter SAM file for soft and hard clipped alignments
Seemann T Samclip: Filter SAM file for soft and hard clipped alignments (GitHub) -
Samtools
Tools for manipulating next-generation sequencing data
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009) -
Scoary
Pan-genome wide association studies
Brynildsrud O, Bohlin J, Scheffer L, Eldholm V Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17:238 (2016) -
SeqSero2
Salmonella serotype prediction from genome sequencing data
Zhang S, Den-Bakker HC, Li S, Dinsmore BA, Lane C, Lauer AC, Fields PI, Deng X. SeqSero2: rapid and improved Salmonella serotype determination using whole genome sequencing data. Appl Environ Microbiology 85(23):e01746-19 (2019) -
Seqtk
A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
Li H Toolkit for processing sequences in FASTA/Q formats (GitHub) -
Seroba
k-mer based pipeline to identify the serotype of Streptococcus pneumoniae from Illumina NGS reads
Epping L, van Tonder AJ, Gladstone RA, The Global Pneumococcal Sequencing Consortium, Bentley SD, Page AJ, Keane JA SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data. Microbial Genomics, 4(7) (2018) -
ShigaTyper
Shigella serotype from Illumina or Oxford Nanopore reads
Wu Y, Lau HK, Lee T, Lau DK, Payne J In Silico Serotyping Based on Whole-Genome Sequencing Improves the Accuracy of Shigella Identification. Applied and Environmental Microbiology, 85(7). (2019) -
ShigEiFinder
Cluster informed Shigella and EIEC serotyping tool from Illumina reads and assemblies
Zhang X, Payne M, Nguyen T, Kaur S, Lan R Cluster-specific gene markers enhance Shigella and enteroinvasive Escherichia coli in silico serotyping. Microbial Genomics, 7(12). (2021) -
Shovill
Faster assembly of Illumina reads
Seemann T Shovill: De novo assembly pipeline for Illumina paired reads (GitHub) -
Shovill-SE
A fork of Shovill that includes support for single end reads.
Petit III RA Shovill-SE: A fork of Shovill that includes support for single end reads. (GitHub) -
SignalP
Finds signal peptide features in CDS
Petersen TN, Brunak S, von Heijne G, Nielsen H SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods 8.10: 785 (2011) -
SISTR
SISTR (Salmonella In Silico Typing Resource) command-line tool
Yoshida CE, Kruczkiewicz P, Laing CR, Lingohr EJ, Gannon VPJ, Nash JHE, Taboada EN The Salmonella In Silico Typing Resource (SISTR): An Open Web-Accessible Tool for Rapidly Typing and Subtyping Draft Salmonella Genome Assemblies. PloS One, 11(1), e0147101. (2016) -
SKESA
Strategic Kmer Extension for Scrupulous Assemblies
Souvorov A, Agarwala R, Lipman DJ SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology 19:153 (2018) -
Snippy
Rapid haploid variant calling and core genome alignment
Seemann T Snippy: fast bacterial variant calling from NGS reads (GitHub) -
SnpEff
Genomic variant annotations and functional effect prediction toolbox.
Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Douglas M A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6(2), 80-92 (2012) -
snp-dists
Pairwise SNP distance matrix from a FASTA sequence alignment
Seemann T snp-dists - Pairwise SNP distance matrix from a FASTA sequence alignment. (GitHub) -
SNP-sites
Rapidly extracts SNPs from a multi-FASTA alignment.
Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microbial Genomics 2.4 (2016) -
Sourmash
Compute and compare MinHash signatures for DNA data sets.
Brown CT, Irber L sourmash: a library for MinHash sketching of DNA. JOSS 1, 27 (2016) -
SPAdes
An assembly toolkit containing various assembly pipelines.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology 19.5 455-477 (2012) -
spaTyper
Computational method for finding spa types.
Sanchez-Herrero JF, and Sullivan M spaTyper: Staphylococcal protein A (spa) characterization pipeline. Zenodo. (2020) -
spaTyper Database
Database used by spaTyper
Harmsen D, Claus H, Witte W, Rothgänger J, Claus H, Turnwald D, and Vogel U Typing of methicillin-resistant Staphylococcus aureus in a university hospital setting using a novel software for spa-repeat determination and database management. J. Clin. Microbiol. 41:5442-5448 (2003) -
SRA Human Scrubber
An SRA tool that takes as input local fastq file from a clinical infection sample, identifies and removes any significant human read, and outputs the edited (cleaned) fastq file that can safely be used for SRA submission
Katz KS, Shutov O, Lapoint R, Kimelman M, Brister JR, and O’Sullivan C STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions. Genome Biology, 22(1), 270 (2021) -
SsuisSero
Rapid Streptococcus suis serotyping
Liu J SsuisSero: Rapid Streptococcus suis serotyping (GitHub) -
staphopia-sccmec
A standalone version of Staphopia's SCCmec typing method.
Petit III RA, Read TD Staphylococcus aureus viewed from the perspective of 40,000+ genomes. PeerJ 6, e5261 (2018) -
STECFinder
Clustering and Serotyping of Shigatoxin producing E. coli (STEC) using genomic cluster specific markers
Zhang X, Payne M, Kaur S, and Lan R Improved Genomic Identification, Clustering, and Serotyping of Shiga Toxin-Producing Escherichia coli Using Cluster/Serotype-Specific Gene Markers. Frontiers in Cellular and Infection Microbiology, 11, 772574. (2021) -
TBProfiler
Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type
Phelan JE, O’Sullivan DM, Machado D, Ramos J, Oppong YEA, Campino S, O’Grady J, McNerney R, Hibberd ML, Viveiros M, Huggett JF, Clark TG Integrating informatics tools and portable sequencing technology for rapid detection of resistance to anti-tuberculous drugs. Genome Med 11, 41 (2019) -
Trimmomatic
A flexible read trimming tool for Illumina NGS data
Bolger AM, Lohse M, Usadel B Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30.15 2114-2120 (2014) -
Unicycler
Hybrid assembly pipeline for bacterial genomes
Wick RR, Judd LM, Gorrie CL, Holt KE Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017) -
VCF-Annotator
Add biological annotations to variants in a VCF file.
Petit III RA VCF-Annotator: Add biological annotations to variants in a VCF file. (GitHub) -
Vcflib
A simple C++ library for parsing and manipulating VCF files
Garrison E Vcflib: A C++ library for parsing and manipulating VCF files (GitHub) -
Velvet
Short read de novo assembler using de Bruijn graphs
Zerbino DR, Birney E Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18.5 821-829 (2008) -
VSEARCH
Versatile open-source tool for metagenomics
Rognes T, Flouri T, Nichols B, Quince C, Mahé F VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016) -
vt
A tool set for short variant discovery in genetic sequence data.
Tan A, Abecasis GR, Kang HM Unified representation of genetic variants. Bioinformatics 31(13), 2202-2204 (2015)
Bactopia Citation¶
If you use Bactopia in your analysis, please cite the following.
Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)