Overview of Bactopia Output¶

After a successful run, Bactopia will have produced numerous output files. Just how many output files depends on the input datasets used (e.g. none, general datasets, species specific datasets, user populated datasets).

Here is the complete directory structure that is possible (using all available dataset options) with Bactopia.

${SAMPLE_NAME}/
├── annotation
├── antimicrobial_resistance
├── ariba
├── assembly
├── blast
├── kmers
├── logs
├── mapping
├── minmers
├── mlst
├── quality-control
├── variants
└── ${SAMPLE_NAME}-genome-size.txt

For each type of analysis in Bactopia, a separate directory is created to hold the results. All samples processed by Bactopia will have this directory structure. The only difference is the usage of ${SAMPLE_NAME} as a prefix for naming some output files.

Directories¶

The following sections include a list of expected outputs as well as a brief description of each output file.

There are instances where additional files (e.g. --keep_all_files and --ariba_noclean) may be encountered. These files aren't described below, just the defaults. Also, using --compress will add a gz extension, but the original extension is maintained and its description still applies.

Developer Descriptions Take Priority

If a developer described their software's outputs, their description was used with a link back to the software's documentation (major thanks for taking the time to do that!). In some cases there may have been slight formatting modifications made. In any case, if descriptions are not original credit will be properly given to the source.

`annotation`¶

The annotation directory will contain the outputs from Prokka annotation. These outputs include FASTA (proteins and genes), GFF3, GenBank, and many more. By default the included Prokka databases are used for annotation. However, if a Species Specific Dataset was a created the RefSeq clustered proteins are used first for annotation.

File descriptions were directly taken from Prokka's Output Files section and slight modifications were made to the order of rows.

${SAMPLE_NAME}/
└── annotation
    ├── ${SAMPLE_NAME}.err
    ├── ${SAMPLE_NAME}.faa
    ├── ${SAMPLE_NAME}.ffn
    ├── ${SAMPLE_NAME}.fna
    ├── ${SAMPLE_NAME}.fsa
    ├── ${SAMPLE_NAME}.gbk
    ├── ${SAMPLE_NAME}.gff
    ├── ${SAMPLE_NAME}.log
    ├── ${SAMPLE_NAME}.sqn
    ├── ${SAMPLE_NAME}.tbl
    ├── ${SAMPLE_NAME}.tsv
    └── ${SAMPLE_NAME}.txt

Extension	Description
.err	Unacceptable annotations - the NCBI discrepancy report.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.fna	Nucleotide FASTA file of the input contig sequences.
.fsa	Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.gbk	This is a standard GenBank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-GenBank, with one record for each sequence.
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.log	Contains all the output that Prokka produced during its run. This is a record of what settings you used.
.sqn	An ASN1 format "Sequin" file for submission to GenBank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.tbl	Feature Table file, used by "tbl2asn" to create the .sqn file.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
.txt	Statistics relating to the annotated features found.

`antimicrobial_resistance`¶

The antimicrobial_resistance directory will contain the output from NCBI's AMRFinderPlus. The results of AMRFinderPlus using genes as and input, and proteins as an input are available. More information about the output format is available from the AMRFinderPlus Wiki.

${SAMPLE_NAME}/
└── antimicrobial_resistance/
    ├── ${SAMPLE_NAME}-gene-report.txt
    └── ${SAMPLE_NAME}-protein-report.txt

Extension	Description
-gene-report.txt	Results of using gene sequences as an input
-protein-report.txt	Results of using protein sequences as an input

`ariba`¶

The ariba directory will contain the results of any Ariba analysis (excluding MLST). Only the Ariba databases created during the dataset setup are used for analysis. For each Ariba database (e.g. card or vfdb), a separate folder with the name of the database is included in the ariba folder.

The file descriptions below were modified from Ariba's wiki entries for run and summary.

${SAMPLE_NAME}/
└── ariba
    └── ARIBA_DATABASE_NAME
        ├── assembled_genes.fa.gz
        ├── assembled_seqs.fa.gz
        ├── assemblies.fa.gz
        ├── debug.report.tsv
        ├── log.clusters.gz
        ├── report.tsv
        ├── summary.csv
        └── version_info.txt

Filename	Description
assembled_genes.fa.gz	A gzipped FASTA file of only assembled gene sequences (with extensions).
assembled_seqs.fa.gz	A gzipped FASTA of the assembled sequences (genes and non-coding).
assemblies.fa.gz	A gzipped FASTA file of the assemblies (complete, unedited, contigs).
debug.report.tsv	The complete list of clusters, including those that did not pass filtering.
log.clusters.gz	Detailed logging for the progress of each cluster.
report.tsv	A detailed report file of clusters which passed filtering.
summary.csv	A more condensed summary of the report.tsv
version_info.txt	Information on the versions of ARIBA and its dependencies at runtime.

`assembly`¶

The assembly folder contains the results of the sample's assembly.

standard¶

The standard assembly is managed by Shovill and by default SKESA is used for assembly. Alternative assemblers include SPAdes, MEGAHIT, and Velvet. Depending on the choice of assembler, additional output files (e.g. assembly graphs) may be given.

Files descriptions with some modifications were directly taken from Shovill's Output Files section as well as the FLASH usage.

${SAMPLE_NAME}/
└── assembly
    ├── cointigs.fa
    ├── flash.hist
    ├── flash.histogram
    ├── shovill.corrections
    ├── shovill.log
    ├── ${SAMPLE_NAME}.fna
    └── ${SAMPLE_NAME}.fna.json

Filename	Description
contigs.fa	Final assembly without renamed headers.
flash.hist	Numeric histogram of merged read lengths.
flash.histogram	Visual histogram of merged read lengths
shovill.log	Full log file for bug reporting
shovill.corrections	List of post-assembly corrections
${SAMPLE_NAME}.fna	The final assembly, with renamed header to include sample name
${SAMPLE_NAME}.fna.json	Summary statistics of the assembly

FASTA inputs are not reassembled by default

In the case where an assembly is given as an input, the only files that will be available are ${SAMPLE_NAME}.fna (the original unmodified assembly) and ${SAMPLE_NAME}.fna.json. If --reassemble is also given, then all the files seen above will be available.

hybrid¶

If long reads are available to supplement input paired-end Illumina reads, a hybrid assembly can be created using Unicycler.

Files descriptions with some modifications were directly taken from Unicycler's Output Files.

${SAMPLE_NAME}/
└── assembly
    ├── 001_best_spades_graph.gfa
    ├── 002_overlaps_removed.gfa
    ├── 003_long_read_assembly.gfa
    ├── 004_bridges_applied.gfa
    ├── 005_final_clean.gfa
    ├── 006_polished.gfa
    ├── 007_rotated.gfa
    ├── assembly.fasta
    ├── assembly.gfa
    ├── ${SAMPLE_NAME}.fna
    ├── ${SAMPLE_NAME}.fna.json
    └── unicycler.log

Filename	Description
001_best_spades_graph.gfa	The best SPAdes short-read assembly graph, with a bit of graph clean-up
002_overlaps_removed.gfa	Overlap-free version of the SPAdes graph, with some more graph clean-up
003_long_read_assembly.gfa	Assembly graph after long read assembly
004_bridges_applied.gfa	Bridges applied, before any cleaning or merging
005_final_clean.gfa	Assembly graph after more redundant contigs removed
006_polished.gfa	Assembly graph after a round of Pilon polishing
007_rotated.gfa	Assembly graph after ircular replicons rotated and/or flipped to a start position
assembly.fasta	The final assembly in FASTA format (same contigs names as in assembly.gfa)
assembly.gfa	The final assembly in GFA v1 graph format
${SAMPLE_NAME}.fna	The final assembly with renamed header to include sample name
${SAMPLE_NAME}.fna.json	Summary statistics of the assembly
unicycler.log	Unicycler log file (same info as stdout)

quality reports¶

Each assembly will have its biological and technical quality assessed with CheckM and QUAST. This assessment is done no matter the input type (paired, single, hybrid, or assembly).

Files descriptions with some modifications were directly taken from CheckM's Usage and QUAST's Output Files.

assembly/
├── checkm
│   ├── bins/
│   ├── checkm-genes.aln
│   ├── checkm.log
│   ├── checkm-results.txt
│   ├── lineage.ms
│   └── storage/
└── quast
    ├── basic_stats/
    ├── icarus.html
    ├── icarus_viewers
    │   └── contig_size_viewer.html
    ├── predicted_genes
    │   ├── GCF_003431365_glimmer_genes.gff.gz
    │   └── GCF_003431365_glimmer.stderr
    ├── quast.log
    ├── report.{html|pdf|tex|tsv|txt}
    ├── transposed_report.tex
    ├── transposed_report.tsv
    └── transposed_report.txt

CheckM Outputs

Filename	Description
bins/	A folder with inputs (e.g. proteins) for processing by CheckM
checkm-genes.aln	Alignment of multi-copy genes and their AAI identity
checkm.log	The output log of CheckM
checkm-results.txt	Final results of CheckM's lineage_wf
lineage.ms	Output file describing marker set for each bin
storage/	A folder with intermediate results from CheckM processing

QUAST Outputs

Filename	Description
basic_stats	A folder with plots of assembly metrics (e.g. GC content, NGx, Nx)
icarus.html	Icarus main menu with links to interactive viewers.
icarus_viewers/	Additional reports for Icarus
predicted_genes/	Predicted gene output
quast.log	Detailed information about the QUAST run
report.{html\|pdf }	Assessement summary including all tables and plots
report.{tex\|tsv\|txt}	Assessment summary in multiple different formats
transposed_report.{tex\|tsv\|txt}	Transposed version of the assessment summary

`blast`¶

The blast directory contains the BLAST results and a BLAST database of the sample assembly.

Each of the User Populated BLAST Sequences (gene, primer, or protein) is BLASTed against the sample assembly. Also if setup, annotated genes are BLASTed against the PLSDB BLAST database.

By default, results are returned in tabular format.

${SAMPLE_NAME}/
└── blast
    ├── blastdb
    │   ├── ${SAMPLE_NAME}.nhr
    │   ├── ${SAMPLE_NAME}.nin
    │   └── ${SAMPLE_NAME}.nsq
    ├── genes
    │   └── ${INPUT_NAME}.txt
    ├── primers
    │   └── ${INPUT_NAME}.txt
    ├── proteins
    │   └── ${INPUT_NAME}.txt
    └── ${SAMPLE_NAME}-plsdb.txt

Extension	Description
.nhr	Sample assembly BLAST database header file
.nin	Sample assembly BLAST database index file
.nsq	Sample assembly BLAST database sequence file
-plsdb.txt	The BLAST results against the PLSDB BALST database assembly.
.txt	The BLAST results of user input sequence(s) against the sample assembly.

`genome-size`¶

For every sample ${SAMPLE_NAME}-genome-size.txt file is created. This file contains the genome size that was used for analysis. Genome size is used throughout Bactopia for various tasks including error correction, subsampling, and assembly.

By default, the genome size is estimated with Mash, but users have the option to provide their own value or completely disable genome size related features. Learn more about changing the genome size at Basic Usage - Genome Size

`kmers`¶

The kmers directory contains McCortex 31-mer counts of the cleaned up FASTQ sequences.

${SAMPLE_NAME}/
└── kmers
    └── ${SAMPLE_NAME}.ctx

Extension	Description
.ctx	A Cortex graph file of the 31-mer counts

`logs`¶

The logs folder will contain useful files for debugging or reviewing what was executed. For each process (e.g. annotation or assembly) the STDOUT and STDERR is log, as well as the time of execution and program versions. These outputs are completely optional, and can be disabled using --skip_logs at runtime.

${SAMPLE_NAME}/
└── logs/
    ├── ${PROCESS_NAME}
    │   ├── ${PROCESS_NAME}.err
    │   ├── ${PROCESS_NAME}.out
    │   ├── ${PROCESS_NAME}.sh
    │   ├── ${PROCESS_NAME}.trace
    │   ├── ${PROCESS_NAME}.versions
    │   ├── ${PROGRAM}.err
    │   └── ${PROGRAM}.out
    └── bactopia.versions

Filename	Description
${PROCESS_NAME}.err	Any STDERR captured by the process.
${PROCESS_NAME}.out	Any STDERR captured by the process.
${PROCESS_NAME}.sh	The shell script that process used.
${PROCESS_NAME}.trace	Compute resource usage by the process (this will not always be available)
${PROCESS_NAME}.versions	Date and program versions used by the process
${PROGRAM}.err	STDERR captured for a specific program.
${PROGRAM}.err	STDOUT captured for a specific program.
bactopia.versions	Date and Bactopia/Nextflow versions used

Example `versions`¶

Here is an example of the bactopia.versions file.

# Timestamp
2020-11-11T11:12:31-05:00
# Bactopia Version
bactopia 1.4.11
# Nextflow Version
nextflow 20.07.1

All the .versions files will follow this format. The first line is always # Timestamp followed by the output of date. Then each line beginning with # will represent a new program and its version.

`mapping`¶

The mapping-sequences directory contains BWA (bwa-mem) mapping results for each of the User Populated Mapping Sequences.

${SAMPLE_NAME}/
└── mapping
    └── ${MAPPING_INPUT}
        ├── ${MAPPING_INPUT}.bam
        └── ${MAPPING_INPUT}.coverage.txt

Extension	Description
.bam	The alignments in BAM format.
.coverage.txt	The per-base coverage of the mapping results

`minmers`¶

The minmers directory contains Mash and Sourmash sketches of the cleaned FASTQs. If setup, it also contains the results of queries against RefSeq, GenBank and PLSDB

${SAMPLE_NAME}/
└── minmers
    ├── ${SAMPLE_NAME}-genbank-k21.txt
    ├── ${SAMPLE_NAME}-genbank-k31.txt
    ├── ${SAMPLE_NAME}-genbank-k51.txt
    ├── ${SAMPLE_NAME}-k21.msh
    ├── ${SAMPLE_NAME}-k31.msh
    ├── ${SAMPLE_NAME}-plsdb-k21.txt
    ├── ${SAMPLE_NAME}-refseq-k21.txt
    └── ${SAMPLE_NAME}.sig

Extension	Description
-genbank-k(21\|31\|51).txt	Sourmash LCA Gather results against Sourmash GenBank Signature (k=21,31,51)
-k(21\|31).msh	A Mash sketch (k=21,31) of the sample
-plsdb-k21.txt	Mash Screen results against PLSDB Mash Sketch
-refseq-k21.txt	Mash Screen results against Mash Refseq Sketch
.sig	A Sourmash signature (k=21,31,51) of the sample

`mlst`¶

If a Species Specific Dataset has been set up, the mlst directory will contain Ariba and BLAST results for a PubMLST.org schema. For most organisms there is only one MLST schema available, and it will be labeled as default. In cases where a organism has multiple schemas available they will be named following pubMLST's naming.

${SAMPLE_NAME}/
└── mlst
    └── ${SCHEMA}
       ├── ariba
       │   ├── assembled_genes.fa.gz
       │   ├── assembled_seqs.fa.gz
       │   ├── assemblies.fa.gz
       │   ├── debug.report.tsv
       │   ├── log.clusters.gz
       │   ├── mlst_report.details.tsv
       │   ├── mlst_report.tsv
       │   ├── report.tsv
       │   └── version_info.txt
       └── blast
           └── ${SAMPLE_NAME}-blast.json

Filename	Description
assembled_genes.fa.gz	A gzipped FASTA file of only assembled gene sequences (with extensions).
assembled_seqs.fa.gz	A gzipped FASTA of the assembled sequences (genes and non-coding).
assemblies.fa.gz	A gzipped FASTA file of the assemblies (complete, unedited, contigs).
debug.report.tsv	The complete list of clusters, including those that did not pass filtering.
log.clusters.gz	Detailed logging for the progress of each cluster.
mlst_report.details.tsv	A more detailed summary of the allele calls.
mlst_report.tsv	A summary of the allele calls and identified sequence type.
report.tsv	A detailed report file of clusters which passed filtering.
summary.csv	A more condensed summary of the report.tsv
version_info.txt	Information on the versions of ARIBA and its dependencies at runtime.
-blast.json	Allele calls and identified sequence type based on BLAST

`quality-control`¶

The quality-control directory contains the cleaned up FASTQs (BBTools and Lighter) and summary statitics (FastQC and Fastq-Scan) before and after cleanup.

${SAMPLE_NAME}/
└── quality-control
    ├── logs
    │   ├── bbduk-adapter.log
    │   └── bbduk-phix.log
    ├── ${SAMPLE_NAME}(|_R1|_R2).fastq.gz
    └── summary-(original|final)
        ├── ${SAMPLE_NAME}(|_R1|_R2)-(original|final)_fastqc.html
        ├── ${SAMPLE_NAME}(|_R1|_R2)-(original|final)_fastqc.zip
        └── ${SAMPLE_NAME}(|_R1|_R2)-(original|final).json

Extension	Description
-adapter.log	A description of how many reads were filtered during the adapter removal step
-phix.log	A description of how many reads were filtered during the PhiX removal step
.fastq.gz	The cleaned up FASTQ(s), `_R1` and `_R2` for paired-end reads, and an empty string for single-end reads.
_fastqc.html	The FastQC html report of the original and final FASTQ(s)
_fastqc.zip	The zipped FastQC full report of the original and final FASTQ(s)
.json	Summary statistics (e.g. qualities and read lengths) of the original and final FASTQ(s)

`variants`¶

The variants directory contains the results of Snippy variant calls against one or more reference genomes. There are two subdirectories auto and user.

The auto directory includes variants calls against automatically selected reference genome(s) based on nearest Mash distance to RefSeq completed genomes. This process only happens if a Species Specific Dataset was a created. By default, only a single reference genome (nearest) is selected. This feature can be disabled (--disable_auto_variants) or the number of genomes changed (--max_references INT).

The user directory contains variant calls against for each of the User Populated Reference Genomes.

The following description of files was directly taken from Snippy's Output Files section. Slight modifications were made to the order of rows.

${SAMPLE_NAME}/
└── variants
    └── (auto|user)
        └── ${REFERENCE_NAME}
            ├── ${SAMPLE_NAME}.aligned.fa
            ├── ${SAMPLE_NAME}.annotated.vcf
            ├── ${SAMPLE_NAME}.bam
            ├── ${SAMPLE_NAME}.bam.bai
            ├── ${SAMPLE_NAME}.bed
            ├── ${SAMPLE_NAME}.consensus.fa
            ├── ${SAMPLE_NAME}.consensus.subs.fa
            ├── ${SAMPLE_NAME}.consensus.subs.masked.fa
            ├── ${SAMPLE_NAME}.coverage.txt
            ├── ${SAMPLE_NAME}.csv
            ├── ${SAMPLE_NAME}.filt.vcf
            ├── ${SAMPLE_NAME}.gff
            ├── ${SAMPLE_NAME}.html
            ├── ${SAMPLE_NAME}.log
            ├── ${SAMPLE_NAME}.raw.vcf
            ├── ${SAMPLE_NAME}.subs.vcf
            ├── ${SAMPLE_NAME}.tab
            ├── ${SAMPLE_NAME}.txt
            └── ${SAMPLE_NAME}.vcf

Extension	Description
.aligned.fa	A version of the reference but with `-` at position with `depth=0` and `N` for `0 < depth < --mincov` (does not have variants)
.annotated.vcf	The final variant calls with additional annotations from Reference genome's GenBank file
.bam	The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai	Index for the .bam file
.bed	The variants in BED format
.consensus.fa	A version of the reference genome with all variants instantiated
.consensus.subs.fa	A version of the reference genome with only substitution variants instantiated
.consensus.subs.masked.fa	A version of the reference genome with only substitution variants instantiated and low-coverage regions masked
.coverage.txt	The per-base coverage of each position in the reference genome
.csv	A comma-separated version of the .tab file
.filt.vcf	The filtered variant calls from Freebayes
.gff	The variants in GFF3 format
.html	A HTML version of the .tab file
.log	A log file with the commands run and their outputs
.raw.vcf	The unfiltered variant calls from Freebayes
.subs.vcf	Only substitution variants from the final annotated variants
.tab	A simple tab-separated summary of all the variants
.txt	A summary of the Snippy run.
.vcf	The final annotated variants in VCF format

Overview of Bactopia Output¶

Directories¶

annotation¶

antimicrobial_resistance¶

ariba¶

assembly¶

standard¶

hybrid¶

quality reports¶

blast¶

genome-size¶

kmers¶

logs¶

Example versions¶

mapping¶

minmers¶

mlst¶

quality-control¶

variants¶