Build Datasets

Bactopia can make use of many existing public datasets, as well as private datasets. The process of downloading, building, and/or configuring these datasets for Bactopia has been automated.

Highly recommended to complete this step!

This step is completely optional, but it is highly recommended that you do not skip it. If you skip setting up public datasets, Bactopia will be limited to analyses such as quality control, assembly, and 31-mer counting.

Included Datasets

Some datasets included are applicable to all bacterial species and some are specific to a bacterial species. If specified at runtime, Bactopia will recognize the datasets and execute the appropriate analyses.

General

Ariba's getref Reference Datasets
Allows reference datasets (resistance, virulence, and plasmids) to be automatically downloaded and configured for use by Ariba.

RefSeq Mash Sketch
~100,000 genomes and plasmids from NCBI RefSeq, used to give an idea of what your sequencing data contains (e.g. are the sequences what you expected?).

GenBank Sourmash Signatures
~87,000 microbial genomes (including viral and fungal) from NCBI GenBank, also used to give an idea of what your sequencing data contains.

Species Specific

PubMLST.org MLST Schemas
Multi-locus sequence typing (MLST) allelic profiles and sequences for many different bacterial species (and even a few eukaryotes!).

Clustered RefSeq Proteins
For the given bacterial species, completed RefSeq genomes are downloaded and then the proteins are clustered and formatted for usage with Prokka.

Minmer Sketch of RefSeq Genomes
Using the completed genomes downloaded for clustering proteins, a Mash sketch and a Sourmash signature are created for these genomes. These sketches can then be used for automatic selection of reference genomes for variant calling.

Optional User Populated Folders
A few folders for things such as calling variants, insertion sequences and primers are created that the user can manually populate. More information is available below!

Setting Up

Included in Bactopia is the setup-datasets.py script (located in the bin folder) to automate the process of downloading and/or building these datasets.

Quick Start

bactopia datasets
This will set up the Ariba datasets (card and vfdb_core), the RefSeq Mash sketch, and the GenBank Sourmash Signatures in the newly created datasets folder. By default, datasets is used for the output directory, but this can be changed with --outdir.

A Single Bacterial Species

bactopia datasets --species "Haemophilus influenzae" --include_genus

Multiple Bacterial Species

You can also set up datasets for multiple bacterial species at a time. There are two options to do so.

Comma-Separated

At runtime, you can separate the different species with commas.

bactopia datasets --species "Haemophilus influenzae,Staphylococcus aureus" --include_genus

Text File

Alternatively, you can create a text file in which each line is the name of a species to set up.

For example, you could create a species.txt file and include the following species in it.

Haemophilus influenzae
Staphylococcus aureus
Mycobacterium tuberculosis

The new command becomes:

bactopia datasets --species species.txt --include_genus

This will set up the MLST schema (if available) and a protein cluster FASTA file for each species in species.txt.
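As a minimal sketch of the text-file approach, the snippet below writes species.txt with a here-document and counts its non-empty lines before the file is handed to bactopia datasets. The line count is only an illustrative sanity check, not something Bactopia requires. The final command is left as a comment so the snippet runs without downloading anything.

```shell
# Write one species name per line (the three names from the example above).
cat > species.txt <<'EOF'
Haemophilus influenzae
Staphylococcus aureus
Mycobacterium tuberculosis
EOF

# Sanity check: count non-empty lines so a stray blank line is easy to spot.
species_count=$(grep -c . species.txt)
echo "species.txt lists ${species_count} species"

# Then pass the file to Bactopia:
# bactopia datasets --species species.txt --include_genus
```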

Usage

usage: bactopia datasets [-h] [--outdir STR] [--skip_ariba] [--ariba STR] [--species STR] [--skip_mlst] [--skip_prokka] [--include_genus] [--assembly_level {all,complete,chromosome,scaffold,contig}] [--limit INT] [--accessions STR] [--identity FLOAT]
                         [--overlap FLOAT] [--max_memory INT] [--fast_cluster] [--skip_minmer] [--skip_amr] [--prodigal_tf STR] [--reference STR] [--mapping STR] [--genes STR] [--proteins STR] [--primers STR] [--force_optional] [--cpus INT]
                         [--clear_cache] [--force] [--force_ariba] [--force_mlst] [--force_prokka] [--force_minmer] [--force_amr] [--keep_files] [--available_datasets] [--depends] [--version] [--verbose] [--silent]
                         PUBMLST

bactopia datasets (v2.0.0) - Setup public datasets for Bactopia

positional arguments:
  PUBMLST               Bactopia config file with PubMLST schema mappings for Ariba.

optional arguments:
  -h, --help            show this help message and exit
  --outdir STR          Directory to write output. (Default ./datasets)

Ariba Reference Datasets:
  --skip_ariba          Skip setup of Ariba datasets
  --ariba STR           Comma separated list of Ariba datasets to download and setup. Available datasets include: argannot, card, ncbi, megares, plasmidfinder, resfinder, srst2_argannot, vfdb_core, vfdb_full, virulencefinder (Default:
                        "vfdb_core,card") Use --available_datasets to see the full list.

Bacterial Species:
  --species STR         Download available MLST schemas and completed genomes for a given species or a list of species in a text file.
  --skip_mlst           Skip setup of MLST schemas for each species

Custom Prokka Protein FASTA:
  --skip_prokka         Skip creation of a Prokka formatted fasta for each species
  --include_genus       Include all genus members in the Prokka proteins FASTA
  --assembly_level {all,complete,chromosome,scaffold,contig}
                        Assembly levels of genomes to download (Default: complete).
  --limit INT           If available completed genomes exceeds a given limit, a random subsample will be taken. (Default 1000)
  --accessions STR      A list of RefSeq accessions to download.
  --identity FLOAT      CD-HIT (-c) sequence identity threshold. (Default: 0.9)
  --overlap FLOAT       CD-HIT (-s) length difference cutoff. (Default: 0.8)
  --max_memory INT      CD-HIT (-M) memory limit (in MB). (Default: unlimited
  --fast_cluster        Use CD-HIT's (-g 0) fast clustering algorithm, instead of the accurate but slow algorithm.

Minmer Datasets:
  --skip_minmer         Skip download of pre-computed minmer datasets (mash, sourmash)

Antimicrobial Resistance Datasets:
  --skip_amr            Skip download of antimicrobial resistance databases (e.g. AMRFinder+)

Optional User Provided Datasets:
  --prodigal_tf STR     A pre-built Prodigal training file to add to the species annotation folder. Requires a single species (--species) and will replace existing training files.
  --reference STR       A reference genome (FASTA/GenBank (preferred)) file or directory to be added to the optional folder for variant calling. Requires a single species (--species).
  --mapping STR         A reference sequence (FASTA) file or directory to be added to the optional folder for mapping. Requires a single species (--species).
  --genes STR           A gene sequence (FASTA) file or directory to be added to the optional folder for BLAST. Requires a single species (--species).
  --proteins STR        A protein sequence (FASTA) file or directory to be added to the optional folder for BLAST. Requires a single species (--species).
  --primers STR         A primer sequence (FASTA) file or directory to be added to the optional folder for BLAST. Requires a single species (--species).
  --force_optional      Overwrite any existing files in the optional folders

Custom Options:
  --cpus INT            Number of cpus to use. (Default: 1)
  --clear_cache         Remove any existing cache.
  --force               Forcibly overwrite existing datasets.
  --force_ariba         Forcibly overwrite existing Ariba datasets.
  --force_mlst          Forcibly overwrite existing MLST datasets.
  --force_prokka        Forcibly overwrite existing Prokka datasets.
  --force_minmer        Forcibly overwrite existing minmer datasets.
  --force_amr           Forcibly overwrite existing antimicrobial resistance datasets.
  --keep_files          Keep all downloaded and intermediate files.
  --available_datasets  List Ariba reference datasets and MLST schemas available for setup.
  --depends             Verify dependencies are installed.

Adjust Verbosity:
  --version             show program's version number and exit
  --verbose             Print debug related text.
  --silent              Only critical errors will be printed.

example usage:
  bactopia datasets
  bactopia datasets --ariba 'vfdb_core'
  bactopia datasets --species 'Staphylococcus aureus' --include_genus

Useful Parameters

--clear_cache

To avoid querying PubMLST.org on every run, a list of available schemas is cached to $HOME/.bactopia/datasets.json. The cache expires after 15 days, but if a new species has been made available in the meantime, --clear_cache will force a new query of PubMLST.org.

--cpus

Increasing --cpus (it defaults to 1) is useful for speeding up the download and clustering steps.

--force*

If a dataset exists, it will only be overwritten if one of the --force parameters is used.

--include_genus

Completed RefSeq genomes are downloaded for a given species to be used for protein clustering. --include_genus will also download completed RefSeq genomes for each genus member.

--assembly_level

By default, only completed genomes are downloaded. --assembly_level allows you to set the minimum assembly level (e.g. complete, scaffold, or contig) to download.

--limit

For some species of bacteria there might be thousands of completed genomes available. For dataset creation, downloading thousands of completed genomes will be time consuming and will likely take up a significant amount of storage. To help in such cases, --limit can be used to limit the downloads to a random subset of genomes. The default value for --limit is 1000 genomes. In cases where --include_genus is used, the random subsample will always include at least one genome from the given --species value.

--accessions

In cases where a random subset of completed genomes is not ideal, you can provide your own curated list of genomes to download with --accessions. The file should have a single NCBI RefSeq Assembly accession (e.g. GCF_000008865) per line.
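A quick way to catch malformed entries before running --accessions is to check each line against the expected pattern. The sketch below is an illustration only. The file name accessions.txt is hypothetical, and the pattern assumes an accession is GCF_ followed by nine digits with an optional version suffix, whereas the text above only shows the unversioned form. The second input line is deliberately wrong (a GenBank-style GCA_ prefix) so the check has something to report.

```shell
accessions_file="accessions.txt"   # hypothetical file name

# Example input: the accession from the text above, plus a deliberately
# wrong GenBank-style (GCA_) line.
printf '%s\n' 'GCF_000008865' 'GCA_000008865' > "$accessions_file"

# Collect lines that do not look like a RefSeq assembly accession.
# "|| true" keeps the snippet going when grep finds nothing to report.
bad_lines=$(grep -Ev '^GCF_[0-9]{9}(\.[0-9]+)?$' "$accessions_file" || true)

if [ -n "$bad_lines" ]; then
    echo "Lines that are not RefSeq accessions:"
    echo "$bad_lines"
else
    echo "All accessions look valid"
fi
```

Once the file is clean, it can be passed along with a species, for example bactopia datasets --species "${SPECIES}" --accessions accessions.txt.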

--keep_files

Many intermediate files (e.g. completed genomes) are downloaded or created and then deleted during the building process. Use --keep_files to retain these files.

Tweaking CD-HIT

There are parameters (--identity, --overlap, --max_memory, and --fast_cluster) to tweak CD-HIT if you find it necessary. Please keep in mind, the only goal of the protein clustering step is to help speed up Prokka, by providing a decent set of proteins to annotate against first.

Datasets Folder Overview

After creating datasets you will have a directory structure that Bactopia recognizes. Based on the available datasets Bactopia will queue up the associated analyses.

Here is the directory structure for the Bactopia Datasets. Some of these include files from public datasets that can be used directly, but there are also other folders you can populate yourself to fit your needs.

${DATASET_FOLDER}
├── antimicrobial-resistance
├── ariba
├── minmer
└── species-specific
    └── ${SPECIES}
        ├── annotation
        │   ├── cdhit-stats.txt
        │   ├── genome_size.json
        │   ├── ncbi-metadata.txt
        │   ├── proteins.faa
        │   ├── proteins.faa.clstr
        │   └── proteins-updated.txt
        ├── minmer
        │   ├── minmer-updated.txt
        │   └── refseq-genomes.msh
        ├── mlst
        │   └── ${SCHEMA}
        │       ├── ariba.tar.gz
        │       ├── blastdb.tar.gz
        │       └── mlst-updated.txt
        └── optional
            ├── blast
            │   ├── genes
            │   │   └── ${NAME}.fasta
            │   ├── primers
            │   │   └── ${NAME}.fasta
            │   └── proteins
            │       └── ${NAME}.fasta
            ├── insertion-sequences
            │   └── ${NAME}.fasta
            ├── mapping-sequences
            │   └── ${NAME}.fasta
            └── reference-genomes
                └── ${NAME}.{gbk|fasta}
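To see at a glance which top-level dataset folders are present, a small loop over the names in the tree above is enough. This is a sketch under stated assumptions. The temporary directory and the two folders created in it are stand-ins so the snippet runs anywhere, and in practice DATASET_FOLDER would point at your real --outdir.

```shell
# Stand-in dataset folder with only two of the four top-level folders.
DATASET_FOLDER="$(mktemp -d)"
mkdir -p "$DATASET_FOLDER/ariba" "$DATASET_FOLDER/minmer"

# Report each top-level folder from the directory tree as found or missing.
missing=""
for name in antimicrobial-resistance ariba minmer species-specific; do
    if [ -d "$DATASET_FOLDER/$name" ]; then
        echo "found:   $name"
    else
        echo "missing: $name"
        missing="$missing $name"
    fi
done
```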

General Datasets

General datasets can be used for all bacterial samples. There are three general dataset folders: antimicrobial-resistance, ariba, and minmer.

The antimicrobial-resistance folder contains pre-formatted datasets available from NCBI's AMRFinderPlus.

The ariba folder contains pre-formatted datasets available from Ariba's getref Reference Datasets.

Finally, the minmer folder contains a RefSeq Mash Sketch and GenBank Sourmash Signatures of more than 100,000 genomes.

Changing files in antimicrobial-resistance, ariba, and minmer is not recommended

These directories are for general analysis and have been precomputed. Modifying these files may cause errors during analysis.

AMRFinder+ databases can sometimes require specific versions

Occasionally, the AMRFinder+ databases require a specific version of AMRFinder+ and things stop working. If this occurs, you can press forward using --skip_amr in Bactopia, but please also submit an issue for this.

Species Specific Datasets

Bactopia allows the datasets to be created for a specific species. The following sections outline the species specific datasets.

annotation

Completed RefSeq genomes are downloaded and then the proteins are clustered and formatted for usage with Prokka. The results from this clustering are stored in the annotation folder.

${DATASET_FOLDER}
└── species-specific
    └── ${SPECIES}
        └── annotation
            ├── cdhit-stats.txt
            ├── genome_size.json
            ├── ncbi-metadata.txt
            ├── prodigal.tf
            ├── proteins.faa
            ├── proteins.faa.clstr
            └── proteins-updated.txt

Filename              Description
cdhit-stats.txt       General statistics associated with CD-HIT clustering
genome_size.json      A list of genome size for each downloaded RefSeq genome
ncbi-metadata.txt     NCBI Assembly metadata associated with the downloaded RefSeq genomes
prodigal.tf           A pre-built species specific Prodigal training file provided with --prodigal_tf
proteins.faa          Set of Prokka formatted proteins
proteins.faa.clstr    Description of the clusters created by CD-HIT
proteins-updated.txt  Information on the last time the protein set was updated

You can add your curated protein set here

If you have a set of proteins you would like to use for annotation, you can name it proteins.faa and place it in the annotation folder. In order for your set of proteins to be used by Prokka, you must make sure you follow the Prokka FASTA database format.

An alternative is to use the --accessions parameter and give bactopia datasets the list of RefSeq accessions when the dataset is created. In doing so the custom protein set will be automatically formatted using the genomes you specified.
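Before dropping a hand-curated proteins.faa into the annotation folder, it can help to look at how its headers are laid out. The sketch below assumes the detailed header form described in the Prokka documentation (>SeqID EC_number~~~gene~~~product~~~COG, where empty fields still keep their ~~~ separators) and counts how many headers follow it. The IDs, gene name, and sequences are invented for illustration. Prokka also documents a plain ">SeqID product" form, so a header without separators is not necessarily an error.

```shell
workdir="$(mktemp -d)"
proteins="$workdir/proteins.faa"

# Toy protein FASTA: IDs, gene name, and sequences are made up.
cat > "$proteins" <<'EOF'
>protein_1 ~~~geneA~~~example product~~~
MKTAYIAKQR
>protein_2 example product without separators
MSLNELQKAV
EOF

# Count all headers, then those with the three "~~~" separators
# of the detailed form.
total=$(grep -c '^>' "$proteins")
detailed=$(grep '^>' "$proteins" | grep -Ec '^>[^ ]+ [^~]*~~~[^~]*~~~[^~]*~~~[^~]*$')
echo "$detailed of $total headers use the detailed ~~~ form"
```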

minmer

By default, a Mash sketch is created for the completed genomes downloaded for clustering proteins. These sketches are then used for automatic selection of reference genomes for variant calling.

${DATASET_FOLDER}
└── species-specific
    └── ${SPECIES}
        └── minmer
            ├── minmer-updated.txt
            └── refseq-genomes.msh

Filename            Description
minmer-updated.txt  Information on the last time the mash sketch was updated
refseq-genomes.msh  A Mash sketch (k=31) of the RefSeq completed genomes

You can add your curated RefSeq sketch here

You can replace refseq-genomes.msh with a custom set of RefSeq genomes to be used for automatic reference selection. The only requirements to do so are that only RefSeq genomes (start with GCF) are used and the mash sketch uses a k-mer length of 31 (-k 31). This will allow it to be compatible with Bactopia.

An alternative is to use the --accessions parameter and give bactopia datasets the list of RefSeq accessions when the dataset is created. In doing so the mash sketch will be automatically created.
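The two requirements for a custom sketch (RefSeq genomes only, k-mer length of 31) can be sketched as follows. This is an illustration under assumptions. It assumes the GCF prefix appears in the genome file names, and the empty files in a temporary folder are stand-ins for real genome FASTA files. The mash command is printed rather than executed, so the snippet runs without Mash installed.

```shell
genome_dir="$(mktemp -d)"   # stand-in for a folder of downloaded genome FASTA files
touch "$genome_dir/GCF_000008865.fna" "$genome_dir/GCA_000008865.fna"

# Keep only files whose names start with GCF (RefSeq assemblies).
refseq_genomes=$(find "$genome_dir" -type f -name 'GCF*' | sort)
refseq_count=$(printf '%s\n' "$refseq_genomes" | grep -c .)
echo "Sketching $refseq_count RefSeq genome(s)"

# Print the command instead of running it; drop the leading "echo" to execute.
# Mash appends .msh to the -o prefix, producing refseq-genomes.msh.
echo mash sketch -k 31 -o refseq-genomes $refseq_genomes
```

The file list is left unquoted on the last line so each path becomes its own argument, which assumes the paths contain no spaces.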

mlst

The mlst folder contains MLST schemas that have been formatted to be used by Ariba and BLAST.

${DATASET_FOLDER}
└── species-specific
    └── ${SPECIES}
        └── mlst
            └── ${SCHEMA}
                ├── ariba.tar.gz
                ├── blastdb.tar.gz
                └── mlst-updated.txt

Filename          Description
ariba.tar.gz      An Ariba formatted MLST dataset for a given schema
blastdb.tar.gz    A BLAST formatted MLST dataset for a given schema
mlst-updated.txt  Contains time stamp for the last time the MLST dataset was updated

How does Bactopia handle organisms with multiple MLST schemas?

In a few cases, an organism might have multiple MLST schemas available (Example: E. coli). In such cases, each MLST schema is downloaded and set up. Bactopia will also call sequence types against each schema.
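Since each schema gets its own folder under mlst, listing those folders shows which schemas were set up for a species. The sketch below builds a stand-in layout in a temporary directory. The species folder name example-species and the schema names schema_1 and schema_2 are hypothetical placeholders for ${SPECIES} and ${SCHEMA}.

```shell
DATASET_FOLDER="$(mktemp -d)"   # stand-in; use your real datasets folder
SPECIES="example-species"       # hypothetical species folder name
mlst_dir="$DATASET_FOLDER/species-specific/$SPECIES/mlst"

# Build a toy layout: one complete schema and one missing its BLAST archive.
mkdir -p "$mlst_dir/schema_1" "$mlst_dir/schema_2"
touch "$mlst_dir/schema_1/ariba.tar.gz" "$mlst_dir/schema_1/blastdb.tar.gz"
touch "$mlst_dir/schema_2/ariba.tar.gz"

# Walk the schema folders and report whether both archives are present.
schema_count=0
complete_count=0
for schema in "$mlst_dir"/*/; do
    [ -d "$schema" ] || continue
    name=$(basename "$schema")
    if [ -f "${schema}ariba.tar.gz" ] && [ -f "${schema}blastdb.tar.gz" ]; then
        echo "$name: ariba and blast archives present"
        complete_count=$((complete_count + 1))
    else
        echo "$name: incomplete"
    fi
    schema_count=$((schema_count + 1))
done
```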

Changing files in mlst is not recommended

The MLST schemas have been pre-formatted for your usage. There might be rare cases where you would like to provide your own schema. If so, it is recommended that you take a look at "What about MLST not hosted at pubmlst.org?" and then follow the directory structure for mlst.

optional

Built into the Bactopia dataset structure is the optional folder that you, the user, can populate for species specific analysis. These could include specific genes you might want BLASTed against your samples or a specific reference you want all your samples mapped to and variants called.

blast

${DATASET_FOLDER}
└── species-specific
    └── ${SPECIES}
        └── optional
            └── blast
                ├── genes
                │   └── ${NAME}.fasta
                ├── primers
                │   └── ${NAME}.fasta
                └── proteins
                    └── ${NAME}.fasta

In the blast directory there are three more directories!

The genes folder is where you can place gene sequences (nucleotides) in FASTA format to query against assemblies using blastn.

The primers folder is where you can place primer sequences (nucleotides) in FASTA format to query against assemblies using blastn, but with primer-specific parameters and cut-offs.

Finally, the proteins folder (as you probably guessed!) is where you can place protein sequences (amino acids) in FASTA format to query against assemblies using blastp.

mapping-sequences

${DATASET_FOLDER}
└── species-specific
    └── ${SPECIES}
        └── optional
            └── mapping-sequences
                └── ${NAME}.fasta

In the mapping-sequences directory you can place FASTA files of any nucleotide sequence you would like FASTQ reads to be mapped against using BWA. This can be useful if you are interested in whether a certain region or gene is covered.

reference-genomes

${DATASET_FOLDER}
└── species-specific
    └── ${SPECIES}
        └── optional
            └── reference-genomes
                └── ${NAME}.{gbk|fasta}

In the reference-genomes directory you can put a GenBank (preferred!) or FASTA file of a reference genome you would like variants to be called against using Snippy.
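Populating the optional folders by hand comes down to creating the directories from the trees above and copying FASTA or GenBank files into them. The sketch below does this in a temporary directory. The species folder name example-species and the toy sequence example_gene are hypothetical, so use the species folder Bactopia actually created and your own files.

```shell
DATASET_FOLDER="$(mktemp -d)"   # stand-in; use your real datasets folder
SPECIES="example-species"       # hypothetical species folder name
optional="$DATASET_FOLDER/species-specific/$SPECIES/optional"

# Create the user-populated folders shown in the directory trees above.
mkdir -p "$optional/blast/genes" "$optional/blast/primers" "$optional/blast/proteins" \
         "$optional/insertion-sequences" "$optional/mapping-sequences" \
         "$optional/reference-genomes"

# Place a toy nucleotide FASTA where reads would be mapped against it.
printf '>example_gene\nATGACCATGATTACG\n' > "$optional/mapping-sequences/example_gene.fasta"

# List what has been added so far.
find "$optional" -type f
```

According to the usage text, bactopia datasets can also copy files into these folders for a single species through --reference, --mapping, --genes, --proteins, and --primers.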