Acknowledgements

Bactopia is truly a case of "standing upon the shoulders of giants". Bactopia currently integrates more than 157 datasets and software packages. Nearly every component utilized in Bactopia, from the workflow to the datasets to the software packages and even the framework of this site, was created by others and made freely accessible to the public.

I would like to personally extend my many thanks and gratitude to the authors of these software packages and public datasets. If you've made it this far, I owe you a beer 🍻 (or coffee ☕!) if we ever encounter one another in person. Really, thank you very much!

Please Cite Datasets and Tools

If you have used Bactopia in your work, please be sure to cite any datasets or software you may have used.

Funding

Support for this project came (in part) from an Emory Public Health Bioinformatics Fellowship funded by the CDC Emerging Infections Program (U50CK000485) PPHF/ACA: Enhancing Epidemiology and Laboratory Capacity, the Wyoming Public Health Division, and the Center for Applied Pathogen Epidemiology and Outbreak Control (CAPE).

Influences

nf-core

nf-core is a great group of individuals volunteering their time to create a set of curated Nextflow analysis pipelines. The nf-core Team has put together some amazing practices that I think really strengthen the Nextflow community as a whole!

I'm often asked: Will Bactopia ever be a part of nf-core?

The answer is: No, but...

Bactopia was adapted from Staphopia, which pre-dates the beginnings of nf-core. As both nf-core and Bactopia grew, it became clear that adding Bactopia to nf-core was going to be a difficult task. The last opportunity to do so was probably when Bactopia was converted to DSL2, but Bactopia Tools would likely never fit into the nf-core mold.

However, where possible, I have tried to implement nf-core practices into Bactopia. Some examples include:

  1. Argument parsing based on the nf-core library
  2. All Bactopia Tools are adapted from nf-core/modules
  3. Testing implemented to follow nf-core/modules

By implementing these practices, I believe Bactopia is a much better pipeline to use. For this I'm very grateful to the nf-core community! Thank you!

Ewels P, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. (2020)

Public Datasets

Below is a list of 19 public datasets that may have been used by Bactopia or Bactopia Tools.

Ariba Reference Datasets

These datasets are available using Ariba's getref function. You can learn more about this function at Ariba's Wiki.

  1. ARG-ANNOT
    Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, Rolain J-M ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother 58, 212–220 (2014)

  2. CARD
    Alcock BP, Raphenya AR, Lau TTY, Tsang KK, Bouchard M, Edalatmand A, Huynh W, Nguyen A-L V, Cheng AA, Liu S, Min SY, Miroshnichenko A, Tran H-K, Werfalli RE, Nasir JA, Oloni M, Speicher DJ, Florescu A, Singh B, Faltyn M, Hernandez-Koutoucheva A, Sharma AN, Bordeleau E, Pawlowski AC, Zubyk HL, Dooley D, Griffiths E, Maguire F, Winsor GL, Beiko RG, Brinkman FSL, Hsiao WWL, Domselaar GV, McArthur AG CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic acids research 48.D1, D517-D525 (2020)

  3. EcOH
    Ingle DJ, Valcanis M, Kuzevski A, Tauschek M, Inouye M, Stinear T, Levine MM, Robins-Browne RM, Holt KE In silico serotyping of E. coli from short read data identifies limited novel O-loci but extensive diversity of O:H serotype combinations within and between pathogenic lineages. Microbial Genomics, 2(7), e000064. (2016)

  4. MEGARes
    Lakin SM, Dean C, Noyes NR, Dettenwanger A, Ross AS, Doster E, Rovira P, Abdo Z, Jones KL, Ruiz J, Belk KE, Morley PS, Boucher C MEGARes: an antimicrobial resistance database for high throughput sequencing. Nucleic Acids Res. 45, D574–D580 (2017)

  5. MEGARes 2.0
    Doster E, Lakin SM, Dean CJ, Wolfe C, Young JG, Boucher C, Belk KE, Noyes NR, Morley PS MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Research, 48(D1), D561–D569. (2020)

  6. NCBI Reference Gene Catalog
    Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W Validating the NCBI AMRFinder Tool and Resistance Gene Database Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of NARMS Isolates. Antimicrob. Agents Chemother. (2019)

  7. PlasmidFinder
    Carattoli A, Zankari E, García-Fernández A, Larsen MV, Lund O, Villa L, Aarestrup FM, Hasman H In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob. Agents Chemother. 58, 3895–3903 (2014)

  8. ResFinder
    Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV Identification of acquired antimicrobial resistance genes. J. Antimicrob. Chemother. 67, 2640–2644 (2012)

  9. SRST2
    Inouye M, Dashnow H, Raven L-A, Schultz MB, Pope BJ, Tomita T, Zobel J, Holt KE SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 6, 90 (2014)

  10. VFDB
    Chen L, Zheng D, Liu B, Yang J, Jin Q VFDB 2016: hierarchical and refined dataset for big data analysis--10 years on. Nucleic Acids Res. 44, D694–7 (2016)

  11. VirulenceFinder
    Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J. Clin. Microbiol. 52, 1501–1510 (2014)

Minmer Datasets

  1. Mash Refseq (release 88) Sketch
    Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM Mash Screen: high-throughput sequence containment estimation for genome discovery Genome Biol 20, 232 (2019)

  2. Sourmash Genbank LCA Signature
    Brown CT, Irber L sourmash: a library for MinHash sketching of DNA. JOSS 1, 27 (2016)

Everything Else

  1. eggNOG 5.0 Database
    Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, Mende DR, Letunic I, Rattei T, Jensen LJ, von Mering C, Bork P eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019)

  2. Genome Taxonomy Database
    Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, Hugenholtz P GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy Nucleic Acids Research gkab776 (2021)

  3. MOB-suite Database
    Robertson J, Bessonov K, Schonfeld J, Nash JHE Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microbial Genomics, 6(10) (2020)

  4. NCBI RefSeq Database
    O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–45 (2016)

  5. PubMLST.org
    Jolley KA, Bray JE, Maiden MCJ Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3, 124 (2018)

  6. SILVA rRNA Database
    Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–6 (2013)

Software Included In Bactopia

Below are 138 software packages used (directly and indirectly) by Bactopia. A link to each software page and its citation (if available) have been included.

  1. Abricate
    Mass screening of contigs for antimicrobial and virulence genes
    Seemann T Abricate: mass screening of contigs for antimicrobial and virulence genes (GitHub)

  2. abriTAMR
    A pipeline for running AMRfinderPlus and collating results into functional classes
    Sherry NL, Horan KA, Ballard SA, Gonçalves da Silva A, Gorrie CL, Schultz MB, Stevens K, Valcanis M, Sait ML, Stinear TP, Howden BP, and Seemann T An ISO-certified genomics workflow for identification and surveillance of antimicrobial resistance. Nature Communications, 14(1), 60. (2023)

  3. AgrVATE
    Rapid identification of Staphylococcus aureus agr locus type and agr operon variants.
    Raghuram V. AgrVATE: Rapid identification of Staphylococcus aureus agr locus type and agr operon variants. (GitHub)

  4. AMRFinderPlus
    Find acquired antimicrobial resistance genes and some point mutations in protein or assembled nucleotide sequences.
    Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W Validating the NCBI AMRFinder Tool and Resistance Gene Database Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of NARMS Isolates. Antimicrob. Agents Chemother. (2019)

  5. any2fasta
    Convert various sequence formats to FASTA
    Seemann T any2fasta: Convert various sequence formats to FASTA (GitHub)

  6. Aragorn
    Finds transfer RNA features (tRNA)
    Laslett D, Canback B ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32(1):11-6 (2004)

  7. Ariba
    Antimicrobial Resistance Identification By Assembly
    Hunt M, Mather AE, Sánchez-Busó L, Page AJ, Parkhill J, Keane JA, Harris SR ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads. Microb Genom 3, e000131 (2017)

  8. ART
    A set of simulation tools to generate synthetic next-generation sequencing reads
    Huang W, Li L, Myers JR, Marth GT ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012)

  9. assembly-scan
    Generate basic stats for an assembly.
    Petit III RA assembly-scan: generate basic stats for an assembly (GitHub)

  10. Bakta
    Rapid & standardized annotation of bacterial genomes & plasmids
    Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A Bakta - rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics 7(11) (2021)

  11. Barrnap
    Bacterial ribosomal RNA predictor
    Seemann T Barrnap: Bacterial ribosomal RNA predictor (GitHub)

  12. BBTools
    BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.
    Bushnell B BBMap short read aligner, and other bioinformatic tools. (Link)

  13. BCFtools
    Utilities for variant calling and manipulating VCFs and BCFs.
    Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H Twelve years of SAMtools and BCFtools GigaScience Volume 10, Issue 2 (2021)

  14. Bedtools
    A powerful toolset for genome arithmetic.
    Quinlan AR, Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010)

  15. BLAST
    Basic Local Alignment Search Tool
    Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009)

  16. Bowtie2
    A fast and sensitive gapped read aligner
    Langmead B, Salzberg SL Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9, 357–359 (2012)

  17. Bracken
    Bracken is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample
    Lu J, Breitwieser FP, Thielen P, and Salzberg SL Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 3, e104. (2017)

  18. BTyper3
    In silico taxonomic classification of Bacillus cereus group genomes using whole-genome sequencing data
    Carroll LM, Wiedmann M, Kovac J Proposal of a Taxonomic Nomenclature for the Bacillus cereus Group Which Reconciles Genomic Definitions of Bacterial Species with Clinical and Industrial Phenotypes. mBio, 11(1). (2020)

  19. BTyper3
    In silico taxonomic classification of Bacillus cereus group genomes using whole-genome sequencing data
    Carroll LM, Cheng RA, Kovac J No Assembly Required: Using BTyper3 to Assess the Congruency of a Proposed Taxonomic Framework for the Bacillus cereus Group With Historical Typing Methods. Frontiers in Microbiology, 11, 580691. (2020)

  20. BUSCO
    Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCO)
    Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38(10), 4647–4654. (2021)

  21. BWA
    Burrows-Wheeler Aligner for short-read alignment
    Li H Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] (2013)

  22. CD-HIT
    Accelerated for clustering the next-generation sequencing data
    Li W, Godzik A Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006)

  23. CD-HIT-EST
    Accelerated for clustering the next-generation sequencing data
    Fu L, Niu B, Zhu Z, Wu S, Li W CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012)

  24. CheckM
    Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes
    Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043–1055 (2015)

  25. ClonalFrameML
    Efficient Inference of Recombination in Whole Bacterial Genomes
    Didelot X, Wilson DJ ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11(2) e1004041 (2015)

  26. csvtk
    A cross-platform, efficient and practical CSV/TSV toolkit in Golang
    Shen W csvtk: A cross-platform, efficient and practical CSV/TSV toolkit in Golang. (GitHub)

  27. DIAMOND
    Accelerated BLAST compatible local sequence aligner.
    Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59–60 (2015)

  28. Dragonflye
    Assemble bacterial isolate genomes from Nanopore reads.
    Petit III RA Dragonflye: Assemble bacterial isolate genomes from Nanopore reads. (GitHub)

  29. ECTyper
    In-silico prediction of Escherichia coli serotype
    Laing C, Bessonov K, Sung S, La Rose C ECTyper - In silico prediction of Escherichia coli serotype (GitHub)

  30. eggNOG-mapper
    Fast genome-wide functional annotation through orthology assignment
    Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, Bork P Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017)

  31. emmtyper
    emm Automatic Isolate Labeller
    Tan A, Seemann T, Lacey D, Davies M, Mcintyre L, Frost H, Williamson D, Gonçalves da Silva A emmtyper - emm Automatic Isolate Labeller (GitHub)

  32. FastANI
    Fast Whole-Genome Similarity (ANI) Estimation
    Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018)

  33. FastQC
    A quality control analysis tool for high throughput sequencing data.
    Andrews S FastQC: a quality control tool for high throughput sequence data. (WebLink)

  34. fastq-dl
    Download FASTQ files from SRA or ENA repositories.
    Petit III RA fastq-dl: Download FASTQ files from SRA or ENA repositories. (GitHub)

  35. fastq-scan
    Output FASTQ summary statistics in JSON format
    Petit III RA fastq-scan: generate summary statistics of input FASTQ sequences. (GitHub)

  36. fastp
    A tool designed to provide fast all-in-one preprocessing for FastQ files
    Chen S, Zhou Y, Chen Y, and Gu J fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. (2018)

  37. FastTree
    Approximately-maximum-likelihood phylogenetic trees
    Price MN, Dehal PS, Arkin AP FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490 (2010)

  38. FLASH
    A fast and accurate tool to merge paired-end reads.
    Magoč T, Salzberg SL FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27.21 2957-2963 (2011)

  39. Flye
    De novo assembler for single molecule sequencing reads using repeat graphs
    Kolmogorov M, Yuan J, Lin Y, Pevzner P Assembly of Long Error-Prone Reads Using Repeat Graphs Nature Biotechnology (2019)

  40. freebayes
    Bayesian haplotype-based genetic polymorphism discovery and genotyping
    Garrison E, Marth G Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] (2012)

  41. GAMMA
    Gene Allele Mutation Microbial Assessment
    Stanton RA, Vlachos N, Halpin AL GAMMA: a tool for the rapid identification, classification, and annotation of translated gene matches from sequencing data. Bioinformatics (2021)

  42. GenoTyphi
    Assign genotypes to Salmonella Typhi genomes based on Mykrobe results
    Wong VK, Baker S, Connor TR, Pickard D, Page AJ, Dave J, Murphy N, Holliman R, Sefton A, Millar M, Dyson ZA, Dougan G, Holt KE, & International Typhoid Consortium. An extended genotyping framework for Salmonella enterica serovar Typhi, the cause of human typhoid Nature Communications 7, 12827. (2016)

  43. GNU Parallel
    A shell tool for executing jobs in parallel
    Tange O GNU Parallel (2018)

  44. GTDB-Tk
    A toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes
    Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics (2019)

  45. Gubbins
    Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences
    Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, Parkhill J, Harris SR Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research 43(3), e15. (2015)

  46. hicap
    in silico typing of the H. influenzae cap locus
    Watts SC, Holt KE hicap: in silico serotyping of the Haemophilus influenzae capsule locus. Journal of Clinical Microbiology JCM.00190-19 (2019)

  47. HMMER
    Biosequence analysis using profile hidden Markov models
    Eddy SR Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011)

  48. HpsuisSero
    Rapid Haemophilus parasuis serotyping
    Lui J HpsuisSero: Rapid Haemophilus parasuis serotyping (GitHub)

  49. Infernal
    Searches DNA sequence databases for RNA structure and sequence similarities
    Nawrocki EP, Eddy SR Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29(22), 2933-2935 (2013)

  50. IQ-TREE
    Efficient phylogenomic software by maximum likelihood
    Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol. 32:268-274 (2015)

  51. ModelFinder
    Used for automatic model selection
    Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS ModelFinder - Fast model selection for accurate phylogenetic estimates. Nat. Methods 14:587-589 (2017)

  52. UFBoot2
    Used to conduct ultrafast bootstrapping
    Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35:518–522 (2018)

  53. ISMapper
    IS mapping software
    Hawkey J, Hamidian M, Wick RR, Edwards DJ, Billman-Jacobe H, Hall RM, Holt KE ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data. BMC Genomics 16, 667 (2015)

  54. Kaptive
    Surface polysaccharide loci for Klebsiella pneumoniae species complex and Acinetobacter baumannii genomes
    Wyres KL, Wick RR, Gorrie C, Jenney A, Follador R, Thomson NR, Holt KE Identification of Klebsiella capsule synthesis loci from whole genome data. Microbial genomics 2(12) (2016)

  55. Kleborate
    Genotyping tool for Klebsiella pneumoniae and its related species complex
    Lam MMC, Wick RR, Watts SC, Cerdeira LT, Wyres KL, Holt KE A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex. Nat Commun 12, 4188 (2021)

  56. KMC
    Fast and frugal disk based k-mer counter
    Deorowicz S, Kokot M, Grabowski Sz, Debudaj-Grabysz A KMC 2: Fast and resource-frugal k-mer counting Bioinformatics 31(10):1569–1576 (2015)

  57. Kraken2
    The second version of the Kraken taxonomic sequence classification system
    Wood DE, Lu J, Langmead B Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. (2019)

  58. Krona
    Interactively explore metagenomes and more from a web browser
    Ondov BD, Bergman NH, and Phillippy AM Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12, 385. (2011)

  59. legsta
    In silico Legionella pneumophila Sequence Based Typing
    Seemann T legsta: In silico Legionella pneumophila Sequence Based Typing (GitHub)

  60. Lighter
    Fast and memory-efficient sequencing error corrector
    Song L, Florea L, Langmead B Lighter: Fast and Memory-efficient Sequencing Error Correction without Counting. Genome Biol. 15(11):509 (2014)

  61. LisSero
    In silico serotype prediction for Listeria monocytogenes
    Kwong J, Zhang J, Seemann T, Horan K, Gonçalves da Silva A LisSero - In silico serotype prediction for Listeria monocytogenes (GitHub)

  62. MAFFT
    Multiple alignment program for amino acid or nucleotide sequences
    Katoh K, Standley DM MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013)

  63. Mash
    Fast genome and metagenome distance estimation using MinHash
    Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016)

  64. Mash
    High-throughput sequence containment estimation
    Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM Mash Screen: high-throughput sequence containment estimation for genome discovery Genome Biol 20, 232 (2019)

  65. Mashtree
    Create a tree using Mash distances
    Katz LS, Griswold T, Morrison S, Caravas J, Zhang S, den Bakker HC, Deng X, Carleton HA Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762 (2019)

  66. maskrc-svg
    Masks recombination as detected by ClonalFrameML or Gubbins
    Kwong J maskrc-svg - Masks recombination as detected by ClonalFrameML or Gubbins and draws an SVG. (GitHub)

  67. McCortex
    De novo genome assembly and multisample variant calling
    Turner I, Garimella KV, Iqbal Z, McVean G Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2018)

  68. mcroni
    Scripts for finding and processing promoter variants upstream of mcr-1
    Shaw L mcroni: Scripts for finding and processing promoter variants upstream of mcr-1 (GitHub)

  69. Medaka
    Sequence correction provided by ONT Research
    ONT Research Medaka: Sequence correction provided by ONT Research (GitHub)

  70. meningotype
    In silico serotyping, finetyping and Bexsero antigen sequence typing of Neisseria meningitidis
    Kwong JC, Gonçalves da Silva A, Stinear TP, Howden BP, & Seemann T meningotype: in silico typing for Neisseria meningitidis. (GitHub)

  71. MEGAHIT
    Ultra-fast and memory-efficient (meta-)genome assembler
    Li D, Liu C-M, Luo R, Sadakane K, Lam T-W MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31.10 1674-1676 (2015)

  72. mlst
    Scan contig files against PubMLST typing schemes
    Seemann T mlst: scan contig files against PubMLST typing schemes (GitHub)

  73. MIDAS
    An integrated pipeline for estimating strain-level genomic variation from metagenomic data
    Nayfach S, Rodriguez-Mueller B, Garud N, and Pollard KS An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Research, 26(11), 1612–1625. (2016)

  74. MinCED
    Mining CRISPRs in Environmental Datasets
    Skennerton C MinCED: Mining CRISPRs in Environmental Datasets (GitHub)

  75. Miniasm
    Ultrafast de novo assembly for long noisy reads (though having no consensus step)
    Li H Miniasm: Ultrafast de novo assembly for long noisy reads (GitHub)

  76. Minimap2
    A versatile pairwise aligner for genomic and spliced nucleotide sequences
    Li H Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100 (2018)

  77. MOB-suite
    Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
    Robertson J, Nash JHE MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microbial Genomics 4(8). (2018)

  78. Mykrobe
    Antibiotic resistance prediction in minutes
    Hunt M, Bradley P, Lapierre SG, Heys S, Thomsit M, Hall MB, Malone KM, Wintringer P, Walker TM, Cirillo DM, Comas I, Farhat MR, Fowler P, Gardy J, Ismail N, Kohl TA, Mathys V, Merker M, Niemann S, Omar SV, Sintchenko V, Smith G, Supply P, Tahseen S, Wilcox M, Arandjelovic I, Peto TEA, Crook DW, Iqbal Z Antibiotic resistance prediction for Mycobacterium tuberculosis from genome sequence data with Mykrobe Wellcome Open Research 4, 191. (2019)

  79. NanoPlot
    Plotting scripts for long read sequencing data
    De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C NanoPack: visualizing and processing long-read sequencing data Bioinformatics Volume 34, Issue 15 (2018)

  80. Nanoq
    Minimal but speedy quality control for nanopore reads in Rust
    Steinig E Nanoq: Minimal but speedy quality control for nanopore reads in Rust (GitHub)

  81. ncbi-genome-download
    Scripts to download genomes from the NCBI FTP servers
    Blin K ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers (GitHub)

  82. Nextflow
    A DSL for data-driven computational pipelines.
    Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017)

  83. ngmaster
    In silico multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST)
    Kwong J, Gonçalves da Silva A, Schultz M, Seemann T ngmaster - In silico multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) (GitHub)

  84. nhmmer
    DNA homology search with profile HMMs.
    Wheeler TJ, Eddy SR nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013)

  85. Panaroo
    An updated pipeline for pangenome investigation
    Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biology 21(1), 180. (2020)

  86. pasty
    in silico serogrouping of Pseudomonas aeruginosa isolates
    Petit III RA pasty: in silico serogrouping of Pseudomonas aeruginosa isolates (GitHub)

  87. pbptyper
    Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies
    Petit III RA pbptyper: In silico Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies (GitHub)

  88. PhiSpy
    Prediction of prophages from bacterial genomes
    Akhter S, Aziz RK, and Edwards RA PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Research, 40(16), e126. (2012)

  89. phyloFlash
    A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
    Gruber-Vodicka HR, Seah BKB, Pruesse E phyloFlash: Rapid Small-Subunit rRNA Profiling and Targeted Assembly from Metagenomes mSystems 5 (2020)

  90. Pigz
    A parallel implementation of gzip for modern multi-processor, multi-core machines.
    Adler M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015)

  91. Pilon
    An automated genome assembly improvement and variant detection tool
    Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one 9.11 e112963 (2014)

  92. PIRATE
    A toolbox for pangenome analysis and threshold evaluation.
    Bayliss SC, Thorpe HA, Coyle NM, Sheppard SK, Feil EJ PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience 8 (2019)

  93. PlasmidFinder
    Identifies plasmids in total or partial sequenced isolates of bacteria
    Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrobial Agents and Chemotherapy 58(7), 3895–3903. (2014)

  94. PneumoCaT
    Pneumococcal Capsular Typing tool for NGS data
    Kapatai G, Sheppard CL, Al-Shahib A, Litt DJ, Underwood AP, Harrison TG, and Fry NK Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline. PeerJ, 4, e2477. (2016)

  95. Porechop
    adapter trimmer for Oxford Nanopore reads
    Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 3(10):e000132 (2017)

  96. pplacer
    Phylogenetic placement and downstream analysis
    Matsen FA, Kodner RB, Armbrust EV pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010)

  97. Prodigal
    Fast, reliable protein-coding gene prediction for prokaryotic genomes.
    Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11.1 119 (2010)

  98. Prokka
    Rapid prokaryotic genome annotation
    Seemann T Prokka: rapid prokaryotic genome annotation Bioinformatics 30, 2068–2069 (2014)

  99. QUAST
    Quality Assessment Tool for Genome Assemblies
    Gurevich A, Saveliev V, Vyahhi N, Tesler G QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013)

  100. Racon
    Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads
    Vaser R, Sović I, Nagarajan N, Šikić M Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27, 737–746 (2017)

  101. Rasusa
    Randomly subsample sequencing reads to a specified coverage
    Hall MB Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019).

  102. Raven
    De novo genome assembler for long uncorrected reads
    Vaser R, Šikić M Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021)

  103. Resistance Gene Identifier (RGI)
    Software to predict resistomes from protein or nucleotide data, based on homology and SNP models.
    Alcock BP, Raphenya AR, Lau TTY, Tsang KK, Bouchard M, Edalatmand A, Huynh W, Nguyen A-L V, Cheng AA, Liu S, Min SY, Miroshnichenko A, Tran H-K, Werfalli RE, Nasir JA, Oloni M, Speicher DJ, Florescu A, Singh B, Faltyn M, Hernandez-Koutoucheva A, Sharma AN, Bordeleau E, Pawlowski AC, Zubyk HL, Dooley D, Griffiths E, Maguire F, Winsor GL, Beiko RG, Brinkman FSL, Hsiao WWL, Domselaar GV, McArthur AG CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic acids research 48.D1, D517-D525 (2020)

  104. RNAmmer
    Consistent and rapid annotation of ribosomal RNA genes
    Lagesen K, Hallin P, Rødland EA, Stærfeldt H-H, Rognes T, Ussery DW RNAmmer: consistent annotation of rRNA genes in genomic sequences. Nucleic Acids Res 35.9: 3100-3108 (2007)

  105. Roary
    Rapid large-scale prokaryote pan genome analysis
    Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015)

  106. samclip
    Filter SAM file for soft and hard clipped alignments
    Seemann T Samclip: Filter SAM file for soft and hard clipped alignments (GitHub)

  107. Samtools
    Tools for manipulating next-generation sequencing data
    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009)

  108. Scoary
    Pan-genome wide association studies
    Brynildsrud O, Bohlin J, Scheffer L, Eldholm V Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17:238 (2016)

  109. SeqSero2
    Salmonella serotype prediction from genome sequencing data
    Zhang S, Den-Bakker HC, Li S, Dinsmore BA, Lane C, Lauer AC, Fields PI, Deng X. SeqSero2: rapid and improved Salmonella serotype determination using whole genome sequencing data. Appl Environ Microbiology 85(23):e01746-19 (2019)

  110. Seqtk
    A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
    Li H Toolkit for processing sequences in FASTA/Q formats (GitHub)

  111. Seroba
    k-mer based pipeline to identify the serotype of Streptococcus pneumoniae from Illumina NGS reads
    Epping L, van Tonder AJ, Gladstone RA, The Global Pneumococcal Sequencing Consortium, Bentley SD, Page AJ, Keane JA SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data. Microbial Genomics, 4(7) (2018)

  112. ShigaTyper
    Shigella serotype from Illumina or Oxford Nanopore reads
    Wu Y, Lau HK, Lee T, Lau DK, Payne J In Silico Serotyping Based on Whole-Genome Sequencing Improves the Accuracy of Shigella Identification. Applied and Environmental Microbiology, 85(7). (2019)

  113. ShigEiFinder
    Cluster informed Shigella and EIEC serotyping tool from Illumina reads and assemblies
    Zhang X, Payne M, Nguyen T, Kaur S, Lan R Cluster-specific gene markers enhance Shigella and enteroinvasive Escherichia coli in silico serotyping. Microbial Genomics, 7(12). (2021)

  114. Shovill
    Faster assembly of Illumina reads
    Seemann T Shovill: De novo assembly pipeline for Illumina paired reads (GitHub)

  115. Shovill-SE
    A fork of Shovill that includes support for single end reads.
    Petit III RA Shovill-SE: A fork of Shovill that includes support for single end reads. (GitHub)

  116. SignalP
    Finds signal peptide features in CDS
    Petersen TN, Brunak S, von Heijne G, Nielsen H SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods 8.10: 785 (2011)

  117. SISTR
    SISTR (Salmonella In Silico Typing Resource) command-line tool
    Yoshida CE, Kruczkiewicz P, Laing CR, Lingohr EJ, Gannon VPJ, Nash JHE, Taboada EN The Salmonella In Silico Typing Resource (SISTR): An Open Web-Accessible Tool for Rapidly Typing and Subtyping Draft Salmonella Genome Assemblies. PloS One, 11(1), e0147101. (2016)

  118. SKESA
    Strategic Kmer Extension for Scrupulous Assemblies
    Souvorov A, Agarwala R, Lipman DJ SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology 19:153 (2018)

  119. Snippy
    Rapid haploid variant calling and core genome alignment
    Seemann T Snippy: fast bacterial variant calling from NGS reads (GitHub)

  120. SnpEff
    Genomic variant annotations and functional effect prediction toolbox.
    Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Douglas M A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6(2), 80-92 (2012)

  121. snp-dists
    Pairwise SNP distance matrix from a FASTA sequence alignment
    Seemann T snp-dists - Pairwise SNP distance matrix from a FASTA sequence alignment. (GitHub)

  122. SNP-sites
    Rapidly extracts SNPs from a multi-FASTA alignment.
    Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microbial Genomics 2.4 (2016)

  123. Sourmash
    Compute and compare MinHash signatures for DNA data sets.
    Brown CT, Irber L sourmash: a library for MinHash sketching of DNA. JOSS 1, 27 (2016)

  124. SPAdes
    An assembly toolkit containing various assembly pipelines.
    Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology 19.5 455-477 (2012)

  125. spaTyper
    Computational method for finding spa types.
    Sanchez-Herrero JF, and Sullivan M spaTyper: Staphylococcal protein A (spa) characterization pipeline. Zenodo. (2020)

  126. spaTyper Database
    Database used by spaTyper
    Harmsen D, Claus H, Witte W, Rothgänger J, Claus H, Turnwald D, and Vogel U Typing of methicillin-resistant Staphylococcus aureus in a university hospital setting using a novel software for spa-repeat determination and database management. J. Clin. Microbiol. 41:5442-5448 (2003)

  127. SRA Human Scrubber
    An SRA tool that takes a local FASTQ file from a clinical infection sample as input, identifies and removes any significant human reads, and outputs a cleaned FASTQ file that can safely be used for SRA submission
    Katz KS, Shutov O, Lapoint R, Kimelman M, Brister JR, and O’Sullivan C STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions. Genome Biology, 22(1), 270 (2021)

  128. SsuisSero
    Rapid Streptococcus suis serotyping
    Liu J SsuisSero: Rapid Streptococcus suis serotyping (GitHub)

  129. staphopia-sccmec
    A standalone version of Staphopia's SCCmec typing method.
    Petit III RA, Read TD Staphylococcus aureus viewed from the perspective of 40,000+ genomes. PeerJ 6, e5261 (2018)

  130. STECFinder
    Clustering and serotyping of Shiga toxin-producing E. coli (STEC) using genomic cluster-specific markers
    Zhang X, Payne M, Kaur S, and Lan R Improved Genomic Identification, Clustering, and Serotyping of Shiga Toxin-Producing Escherichia coli Using Cluster/Serotype-Specific Gene Markers. Frontiers in Cellular and Infection Microbiology, 11, 772574. (2021)

  131. TBProfiler
    Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type
    Phelan JE, O’Sullivan DM, Machado D, Ramos J, Oppong YEA, Campino S, O’Grady J, McNerney R, Hibberd ML, Viveiros M, Huggett JF, Clark TG Integrating informatics tools and portable sequencing technology for rapid detection of resistance to anti-tuberculous drugs. Genome Med 11, 41 (2019)

  132. Trimmomatic
    A flexible read trimming tool for Illumina NGS data
    Bolger AM, Lohse M, Usadel B Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30.15 2114-2120 (2014)

  133. Unicycler
    Hybrid assembly pipeline for bacterial genomes
    Wick RR, Judd LM, Gorrie CL, Holt KE Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017)

  134. VCF-Annotator
    Add biological annotations to variants in a VCF file.
    Petit III RA VCF-Annotator: Add biological annotations to variants in a VCF file. (GitHub)

  135. Vcflib
    A simple C++ library for parsing and manipulating VCF files
    Garrison E Vcflib: A C++ library for parsing and manipulating VCF files (GitHub)

  136. Velvet
    Short read de novo assembler using de Bruijn graphs
    Zerbino DR, Birney E Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18.5 821-829 (2008)

  137. VSEARCH
    Versatile open-source tool for metagenomics
    Rognes T, Flouri T, Nichols B, Quince C, Mahé F VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016)

  138. vt
    A tool set for short variant discovery in genetic sequence data.
    Tan A, Abecasis GR, Kang HM Unified representation of genetic variants. Bioinformatics 31(13), 2202-2204 (2015)

Bactopia Citation

If you use Bactopia in your analysis, please cite the following.

Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)