Overview of Bactopia Output
After a successful run, Bactopia will have produced numerous output files. Just how many output files depends on the input datasets used (e.g. none, general datasets, species specific datasets, user populated datasets).
Here is the complete directory structure that is possible (using all available dataset options) with Bactopia.
${SAMPLE_NAME}/
├── annotation
├── antimicrobial_resistance
├── ariba
├── assembly
├── blast
├── kmers
├── logs
├── mapping
├── minmers
├── mlst
├── quality-control
├── variants
└── ${SAMPLE_NAME}-genome-size.txt
For each type of analysis in Bactopia, a separate directory is created to hold the results. All samples processed by Bactopia will have this directory structure. The only difference is the usage of ${SAMPLE_NAME}
as a prefix for naming some output files.
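Since the number of directories depends on which datasets were used, a quick check of what is present for a sample can be handy. The following is a minimal sketch, not part of Bactopia itself: the helper name `summarize_sample` and the `outdir` argument are hypothetical, and the directory names are copied from the listing above.

```python
from pathlib import Path

# Directory names copied from the listing above.
EXPECTED_DIRS = [
    "annotation", "antimicrobial_resistance", "ariba", "assembly", "blast",
    "kmers", "logs", "mapping", "minmers", "mlst", "quality-control", "variants",
]


def summarize_sample(outdir, sample):
    """Return (present, missing) analysis directories for one sample.

    Hypothetical helper: `outdir` is the folder that holds the
    ${SAMPLE_NAME} directories.
    """
    root = Path(outdir) / sample
    present = [name for name in EXPECTED_DIRS if (root / name).is_dir()]
    missing = [name for name in EXPECTED_DIRS if name not in present]
    return present, missing
```

A missing directory is not necessarily an error; it may only mean the matching dataset was not set up.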
Directories
The following sections include a list of expected outputs as well as a brief description of each output file.
There are instances where additional files may be encountered (e.g. when using --keep_all_files
or --ariba_noclean
). Those files are not described below; only the default outputs are. Also, using --compress
will add a gz extension, but the original extension is maintained and its description still applies.
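Because --compress only appends a gz extension, downstream scripts can treat both forms the same way. Here is a minimal sketch under that assumption; the helper `open_output` is hypothetical and which files are actually compressed is not stated here.

```python
import gzip
from pathlib import Path


def open_output(path):
    """Open an output file as text, with or without a trailing .gz.

    Hypothetical helper: if `path` itself is missing but `path` + ".gz"
    exists, the gzipped file is opened instead.
    """
    path = Path(path)
    gz_path = path.with_name(path.name + ".gz")
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    if not path.exists() and gz_path.exists():
        return gzip.open(gz_path, "rt")
    return open(path, "rt")
```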
Developer Descriptions Take Priority
If a developer described their software's outputs, their description was used with a link back to the software's documentation (major thanks for taking the time to do that!). In some cases, slight formatting modifications may have been made. In any case, if a description is not original, credit is given to the source.
annotation
The annotation
directory will contain the outputs from Prokka annotation. These outputs include FASTA (proteins and genes), GFF3, GenBank, and many more. By default, the included Prokka databases are used for annotation. However, if a Species Specific Dataset was created, the RefSeq clustered proteins are used first for annotation.
File descriptions were directly taken from Prokka's Output Files section and slight modifications were made to the order of rows.
${SAMPLE_NAME}/
└── annotation
├── ${SAMPLE_NAME}.err
├── ${SAMPLE_NAME}.faa
├── ${SAMPLE_NAME}.ffn
├── ${SAMPLE_NAME}.fna
├── ${SAMPLE_NAME}.fsa
├── ${SAMPLE_NAME}.gbk
├── ${SAMPLE_NAME}.gff
├── ${SAMPLE_NAME}.log
├── ${SAMPLE_NAME}.sqn
├── ${SAMPLE_NAME}.tbl
├── ${SAMPLE_NAME}.tsv
└── ${SAMPLE_NAME}.txt
Extension | Description |
---|---|
.err | Unacceptable annotations - the NCBI discrepancy report. |
.faa | Protein FASTA file of the translated CDS sequences. |
.ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) |
.fna | Nucleotide FASTA file of the input contig sequences. |
.fsa | Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
.gbk | This is a standard GenBank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-GenBank, with one record for each sequence. |
.gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
.log | Contains all the output that Prokka produced during its run. This is a record of what settings you used. |
.sqn | An ASN1 format "Sequin" file for submission to GenBank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
.tbl | Feature Table file, used by "tbl2asn" to create the .sqn file. |
.tsv | Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product |
.txt | Statistics relating to the annotated features found. |
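As an illustration of working with the .tsv file, the sketch below counts features by type. It assumes the file starts with a header row that contains an `ftype` column, as in the column list above; the function name `count_feature_types` is hypothetical.

```python
import csv
from collections import Counter


def count_feature_types(tsv_path):
    """Count annotated features by the value in the `ftype` column.

    Assumes a tab-separated file whose first row is a header
    that includes `ftype`.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["ftype"] for row in reader)
```

For example, `count_feature_types("annotation/SAMPLE.tsv")` would return a mapping of feature types (CDS, tRNA, and so on) to their counts, where `SAMPLE` stands in for the sample name.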
antimicrobial_resistance
The antimicrobial_resistance
directory will contain the output from NCBI's AMRFinderPlus. Results are available for two runs of AMRFinderPlus: one using genes as input and one using proteins as input. More information about the output format is available from the AMRFinderPlus Wiki.
${SAMPLE_NAME}/
└── antimicrobial_resistance/
├── ${SAMPLE_NAME}-gene-report.txt
└── ${SAMPLE_NAME}-protein-report.txt
Extension | Description |
---|---|
-gene-report.txt | Results of using gene sequences as an input |
-protein-report.txt | Results of using protein sequences as an input |
ariba
The ariba
directory will contain the results of any Ariba analysis (excluding MLST). Only the Ariba databases created during the dataset setup are used for analysis. For each Ariba database (e.g. card
or vfdb
), a separate folder with the name of the database is included in the ariba
folder.
The file descriptions below were modified from Ariba's wiki entries for run
and summary
.
${SAMPLE_NAME}/
└── ariba
└── ARIBA_DATABASE_NAME
├── assembled_genes.fa.gz
├── assembled_seqs.fa.gz
├── assemblies.fa.gz
├── debug.report.tsv
├── log.clusters.gz
├── report.tsv
├── summary.csv
└── version_info.txt
Filename | Description |
---|---|
assembled_genes.fa.gz | A gzipped FASTA file of only assembled gene sequences (with extensions). |
assembled_seqs.fa.gz | A gzipped FASTA of the assembled sequences (genes and non-coding). |
assemblies.fa.gz | A gzipped FASTA file of the assemblies (complete, unedited, contigs). |
debug.report.tsv | The complete list of clusters, including those that did not pass filtering. |
log.clusters.gz | Detailed logging for the progress of each cluster. |
report.tsv | A detailed report file of clusters which passed filtering. |
summary.csv | A more condensed summary of the report.tsv |
version_info.txt | Information on the versions of ARIBA and its dependencies at runtime. |
assembly
The assembly
folder contains the results of the sample's assembly.
standard
The standard assembly is managed by Shovill and by default SKESA is used for assembly. Alternative assemblers include SPAdes, MEGAHIT, and Velvet. Depending on the choice of assembler, additional output files (e.g. assembly graphs) may be given.
File descriptions were taken, with some modifications, from Shovill's Output Files section as well as the FLASH usage.
${SAMPLE_NAME}/
└── assembly
├── contigs.fa
├── flash.hist
├── flash.histogram
├── shovill.corrections
├── shovill.log
├── ${SAMPLE_NAME}.fna
└── ${SAMPLE_NAME}.fna.json
Filename | Description |
---|---|
contigs.fa | Final assembly without renamed headers. |
flash.hist | Numeric histogram of merged read lengths. |
flash.histogram | Visual histogram of merged read lengths |
shovill.log | Full log file for bug reporting |
shovill.corrections | List of post-assembly corrections |
${SAMPLE_NAME}.fna | The final assembly, with renamed header to include sample name |
${SAMPLE_NAME}.fna.json | Summary statistics of the assembly |
FASTA inputs are not reassembled by default
In the case where an assembly is given as an input, the only files that will be available are ${SAMPLE_NAME}.fna
(the original unmodified assembly) and ${SAMPLE_NAME}.fna.json
. If --reassemble
is also given, then all the files seen above will be available.
hybrid
If long reads are available to supplement input paired-end Illumina reads, a hybrid assembly can be created using Unicycler.
File descriptions were taken, with some modifications, from Unicycler's Output Files.
${SAMPLE_NAME}/
└── assembly
├── 001_best_spades_graph.gfa
├── 002_overlaps_removed.gfa
├── 003_long_read_assembly.gfa
├── 004_bridges_applied.gfa
├── 005_final_clean.gfa
├── 006_polished.gfa
├── 007_rotated.gfa
├── assembly.fasta
├── assembly.gfa
├── ${SAMPLE_NAME}.fna
├── ${SAMPLE_NAME}.fna.json
└── unicycler.log
Filename | Description |
---|---|
001_best_spades_graph.gfa | The best SPAdes short-read assembly graph, with a bit of graph clean-up |
002_overlaps_removed.gfa | Overlap-free version of the SPAdes graph, with some more graph clean-up |
003_long_read_assembly.gfa | Assembly graph after long read assembly |
004_bridges_applied.gfa | Bridges applied, before any cleaning or merging |
005_final_clean.gfa | Assembly graph after more redundant contigs removed |
006_polished.gfa | Assembly graph after a round of Pilon polishing |
007_rotated.gfa | Assembly graph after circular replicons rotated and/or flipped to a start position |
assembly.fasta | The final assembly in FASTA format (same contigs names as in assembly.gfa) |
assembly.gfa | The final assembly in GFA v1 graph format |
${SAMPLE_NAME}.fna | The final assembly with renamed header to include sample name |
${SAMPLE_NAME}.fna.json | Summary statistics of the assembly |
unicycler.log | Unicycler log file (same info as stdout) |
quality reports
Each assembly will have its biological and technical quality assessed with CheckM and QUAST. This assessment is done no matter the input type (paired, single, hybrid, or assembly).
File descriptions were taken, with some modifications, from CheckM's Usage and QUAST's Output Files.
assembly/
├── checkm
│ ├── bins/
│ ├── checkm-genes.aln
│ ├── checkm.log
│ ├── checkm-results.txt
│ ├── lineage.ms
│ └── storage/
└── quast
├── basic_stats/
├── icarus.html
├── icarus_viewers
│ └── contig_size_viewer.html
├── predicted_genes
│ ├── GCF_003431365_glimmer_genes.gff.gz
│ └── GCF_003431365_glimmer.stderr
├── quast.log
├── report.{html|pdf|tex|tsv|txt}
├── transposed_report.tex
├── transposed_report.tsv
└── transposed_report.txt
CheckM Outputs
Filename | Description |
---|---|
bins/ | A folder with inputs (e.g. proteins) for processing by CheckM |
checkm-genes.aln | Alignment of multi-copy genes and their AAI identity |
checkm.log | The output log of CheckM |
checkm-results.txt | Final results of CheckM's lineage_wf |
lineage.ms | Output file describing marker set for each bin |
storage/ | A folder with intermediate results from CheckM processing |
QUAST Outputs
Filename | Description |
---|---|
basic_stats | A folder with plots of assembly metrics (e.g. GC content, NGx, Nx) |
icarus.html | Icarus main menu with links to interactive viewers. |
icarus_viewers/ | Additional reports for Icarus |
predicted_genes/ | Predicted gene output |
quast.log | Detailed information about the QUAST run |
report.{html|pdf} | Assessment summary including all tables and plots |
report.{tex|tsv|txt} | Assessment summary in multiple different formats |
transposed_report.{tex|tsv|txt} | Transposed version of the assessment summary |
blast
The blast
directory contains the BLAST results and a BLAST database of the sample assembly.
Each of the User Populated BLAST Sequences (gene, primer, or protein) is BLASTed against the sample assembly. Also, if set up, annotated genes are BLASTed against the PLSDB BLAST database.
By default, results are returned in tabular format.
${SAMPLE_NAME}/
└── blast
├── blastdb
│ ├── ${SAMPLE_NAME}.nhr
│ ├── ${SAMPLE_NAME}.nin
│ └── ${SAMPLE_NAME}.nsq
├── genes
│ └── ${INPUT_NAME}.txt
├── primers
│ └── ${INPUT_NAME}.txt
├── proteins
│ └── ${INPUT_NAME}.txt
└── ${SAMPLE_NAME}-plsdb.txt
Extension | Description |
---|---|
.nhr | Sample assembly BLAST database header file |
.nin | Sample assembly BLAST database index file |
.nsq | Sample assembly BLAST database sequence file |
-plsdb.txt | The BLAST results against the PLSDB BLAST database. |
.txt | The BLAST results of user input sequence(s) against the sample assembly. |
genome-size
For every sample, a ${SAMPLE_NAME}-genome-size.txt
file is created. This file contains the genome size that was used for analysis. Genome size is used throughout Bactopia for various tasks including error correction, subsampling, and assembly.
By default, the genome size is estimated with Mash, but users have the option to provide their own value or completely disable genome size related features. Learn more about changing the genome size at Basic Usage - Genome Size.
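If the genome size is needed in a downstream script, it can be read back from this file. The sketch below assumes the file holds a single integer and nothing else, which is not confirmed here; the helper `read_genome_size` and the `outdir` argument are hypothetical.

```python
from pathlib import Path


def read_genome_size(outdir, sample):
    """Read the genome size used for a sample.

    Assumption: the file contains one integer (surrounding whitespace
    is ignored). A ValueError is raised if that is not the case.
    """
    path = Path(outdir) / sample / f"{sample}-genome-size.txt"
    return int(path.read_text().strip())
```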
kmers
The kmers
directory contains McCortex 31-mer counts of the cleaned up FASTQ sequences.
${SAMPLE_NAME}/
└── kmers
└── ${SAMPLE_NAME}.ctx
Extension | Description |
---|---|
.ctx | A Cortex graph file of the 31-mer counts |
logs
The logs
folder will contain useful files for debugging or reviewing what was executed. For each process (e.g. annotation or assembly), the STDOUT and STDERR are logged, as well as the time of execution and program versions. These outputs are completely optional, and can be disabled using --skip_logs
at runtime.
${SAMPLE_NAME}/
└── logs/
├── ${PROCESS_NAME}
│ ├── ${PROCESS_NAME}.err
│ ├── ${PROCESS_NAME}.out
│ ├── ${PROCESS_NAME}.sh
│ ├── ${PROCESS_NAME}.trace
│ ├── ${PROCESS_NAME}.versions
│ ├── ${PROGRAM}.err
│ └── ${PROGRAM}.out
└── bactopia.versions
Filename | Description |
---|---|
${PROCESS_NAME}.err | Any STDERR captured by the process. |
${PROCESS_NAME}.out | Any STDOUT captured by the process. |
${PROCESS_NAME}.sh | The shell script that process used. |
${PROCESS_NAME}.trace | Compute resource usage by the process (this will not always be available) |
${PROCESS_NAME}.versions | Date and program versions used by the process |
${PROGRAM}.err | STDERR captured for a specific program. |
${PROGRAM}.out | STDOUT captured for a specific program. |
bactopia.versions | Date and Bactopia/Nextflow versions used |
Example versions
Here is an example of the bactopia.versions
file.
# Timestamp
2020-11-11T11:12:31-05:00
# Bactopia Version
bactopia 1.4.11
# Nextflow Version
nextflow 20.07.1
All the .versions
files will follow this format. The first line is always # Timestamp
followed by the output of date
. Then each line beginning with #
will represent a new program and its version.
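The format described above is simple enough to parse in a few lines. The following is a minimal sketch; the function name `parse_versions` is hypothetical, and it only relies on the rule that each line beginning with # starts a new entry.

```python
def parse_versions(text):
    """Map each '# Header' line to the non-empty lines that follow it."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new entry starts; drop the leading '#' and whitespace.
            current = line.lstrip("#").strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


example = """# Timestamp
2020-11-11T11:12:31-05:00
# Bactopia Version
bactopia 1.4.11
# Nextflow Version
nextflow 20.07.1
"""

versions = parse_versions(example)
print(versions["Bactopia Version"])  # ['bactopia 1.4.11']
```

Each value is kept as a list because nothing here rules out a program reporting its version over more than one line.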
mapping
The mapping
directory contains BWA (bwa-mem) mapping results for each of the User Populated Mapping Sequences.
${SAMPLE_NAME}/
└── mapping
└── ${MAPPING_INPUT}
├── ${MAPPING_INPUT}.bam
└── ${MAPPING_INPUT}.coverage.txt
Extension | Description |
---|---|
.bam | The alignments in BAM format. |
.coverage.txt | The per-base coverage of the mapping results |
minmers
The minmers
directory contains Mash and Sourmash sketches of the cleaned FASTQs. If set up, it also contains the results of queries against RefSeq, GenBank, and PLSDB.
${SAMPLE_NAME}/
└── minmers
├── ${SAMPLE_NAME}-genbank-k21.txt
├── ${SAMPLE_NAME}-genbank-k31.txt
├── ${SAMPLE_NAME}-genbank-k51.txt
├── ${SAMPLE_NAME}-k21.msh
├── ${SAMPLE_NAME}-k31.msh
├── ${SAMPLE_NAME}-plsdb-k21.txt
├── ${SAMPLE_NAME}-refseq-k21.txt
└── ${SAMPLE_NAME}.sig
Extension | Description |
---|---|
-genbank-k(21|31|51).txt | Sourmash LCA Gather results against Sourmash GenBank Signature (k=21,31,51) |
-k(21|31).msh | A Mash sketch (k=21,31) of the sample |
-plsdb-k21.txt | Mash Screen results against PLSDB Mash Sketch |
-refseq-k21.txt | Mash Screen results against Mash Refseq Sketch |
.sig | A Sourmash signature (k=21,31,51) of the sample |
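The suffixes in the table can be used to sort files in this directory by what produced them. Below is a minimal sketch: the patterns mirror the table rows, while the labels (e.g. `mash-sketch`) and the function name `classify_minmer_file` are made up for illustration.

```python
import re

# (pattern, label) pairs; patterns mirror the table, labels are hypothetical.
PATTERNS = [
    (re.compile(r"-genbank-k(21|31|51)\.txt$"), "sourmash-genbank"),
    (re.compile(r"-plsdb-k21\.txt$"), "mash-screen-plsdb"),
    (re.compile(r"-refseq-k21\.txt$"), "mash-screen-refseq"),
    (re.compile(r"-k(21|31)\.msh$"), "mash-sketch"),
    (re.compile(r"\.sig$"), "sourmash-signature"),
]


def classify_minmer_file(filename):
    """Return a label for a minmers output file, or None if unrecognized."""
    for pattern, label in PATTERNS:
        if pattern.search(filename):
            return label
    return None
```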
mlst
If a Species Specific Dataset has been set up, the mlst
directory will contain Ariba and BLAST results for a PubMLST.org schema. For most organisms there is only one MLST schema available, and it will be labeled as default
. In cases where an organism has multiple schemas available, they will be named following PubMLST's naming.
${SAMPLE_NAME}/
└── mlst
└── ${SCHEMA}
├── ariba
│ ├── assembled_genes.fa.gz
│ ├── assembled_seqs.fa.gz
│ ├── assemblies.fa.gz
│ ├── debug.report.tsv
│ ├── log.clusters.gz
│ ├── mlst_report.details.tsv
│ ├── mlst_report.tsv
│ ├── report.tsv
│ └── version_info.txt
└── blast
└── ${SAMPLE_NAME}-blast.json
Filename | Description |
---|---|
assembled_genes.fa.gz | A gzipped FASTA file of only assembled gene sequences (with extensions). |
assembled_seqs.fa.gz | A gzipped FASTA of the assembled sequences (genes and non-coding). |
assemblies.fa.gz | A gzipped FASTA file of the assemblies (complete, unedited, contigs). |
debug.report.tsv | The complete list of clusters, including those that did not pass filtering. |
log.clusters.gz | Detailed logging for the progress of each cluster. |
mlst_report.details.tsv | A more detailed summary of the allele calls. |
mlst_report.tsv | A summary of the allele calls and identified sequence type. |
report.tsv | A detailed report file of clusters which passed filtering. |
summary.csv | A more condensed summary of the report.tsv |
version_info.txt | Information on the versions of ARIBA and its dependencies at runtime. |
-blast.json | Allele calls and identified sequence type based on BLAST |
quality-control
The quality-control
directory contains the cleaned up FASTQs (BBTools and Lighter) and summary statistics (FastQC and Fastq-Scan) before and after cleanup.
${SAMPLE_NAME}/
└── quality-control
├── logs
│ ├── bbduk-adapter.log
│ └── bbduk-phix.log
├── ${SAMPLE_NAME}(|_R1|_R2).fastq.gz
└── summary-(original|final)
├── ${SAMPLE_NAME}(|_R1|_R2)-(original|final)_fastqc.html
├── ${SAMPLE_NAME}(|_R1|_R2)-(original|final)_fastqc.zip
└── ${SAMPLE_NAME}(|_R1|_R2)-(original|final).json
Extension | Description |
---|---|
-adapter.log | A description of how many reads were filtered during the adapter removal step |
-phix.log | A description of how many reads were filtered during the PhiX removal step |
.fastq.gz | The cleaned up FASTQ(s), _R1 and _R2 for paired-end reads, and an empty string for single-end reads. |
_fastqc.html | The FastQC html report of the original and final FASTQ(s) |
_fastqc.zip | The zipped FastQC full report of the original and final FASTQ(s) |
.json | Summary statistics (e.g. qualities and read lengths) of the original and final FASTQ(s) |
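The `(|_R1|_R2)` pattern above expands differently for single-end and paired-end samples. A minimal sketch of that expansion, with a hypothetical helper name:

```python
def cleaned_fastq_names(sample, paired):
    """Return the expected cleaned FASTQ file names for a sample.

    Paired-end reads get _R1 and _R2; single-end reads get no suffix.
    """
    suffixes = ["_R1", "_R2"] if paired else [""]
    return [f"{sample}{suffix}.fastq.gz" for suffix in suffixes]
```

The same suffix logic applies to the FastQC and JSON summaries, with `-original` or `-final` added before the extension.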
variants
The variants
directory contains the results of Snippy variant calls against one or more reference genomes. There are two subdirectories: auto
and user
.
The auto
directory includes variant calls against automatically selected reference genome(s) based on nearest Mash distance to RefSeq completed genomes. This process only happens if a Species Specific Dataset was created. By default, only a single reference genome (nearest) is selected. This feature can be disabled (--disable_auto_variants
) or the number of genomes changed (--max_references INT
).
The user
directory contains variant calls against each of the User Populated Reference Genomes.
The following description of files was directly taken from Snippy's Output Files section. Slight modifications were made to the order of rows.
${SAMPLE_NAME}/
└── variants
└── (auto|user)
└── ${REFERENCE_NAME}
├── ${SAMPLE_NAME}.aligned.fa
├── ${SAMPLE_NAME}.annotated.vcf
├── ${SAMPLE_NAME}.bam
├── ${SAMPLE_NAME}.bam.bai
├── ${SAMPLE_NAME}.bed
├── ${SAMPLE_NAME}.consensus.fa
├── ${SAMPLE_NAME}.consensus.subs.fa
├── ${SAMPLE_NAME}.consensus.subs.masked.fa
├── ${SAMPLE_NAME}.coverage.txt
├── ${SAMPLE_NAME}.csv
├── ${SAMPLE_NAME}.filt.vcf
├── ${SAMPLE_NAME}.gff
├── ${SAMPLE_NAME}.html
├── ${SAMPLE_NAME}.log
├── ${SAMPLE_NAME}.raw.vcf
├── ${SAMPLE_NAME}.subs.vcf
├── ${SAMPLE_NAME}.tab
├── ${SAMPLE_NAME}.txt
└── ${SAMPLE_NAME}.vcf
Extension | Description |
---|---|
.aligned.fa | A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (does not have variants) |
.annotated.vcf | The final variant calls with additional annotations from Reference genome's GenBank file |
.bam | The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates. |
.bam.bai | Index for the .bam file |
.bed | The variants in BED format |
.consensus.fa | A version of the reference genome with all variants instantiated |
.consensus.subs.fa | A version of the reference genome with only substitution variants instantiated |
.consensus.subs.masked.fa | A version of the reference genome with only substitution variants instantiated and low-coverage regions masked |
.coverage.txt | The per-base coverage of each position in the reference genome |
.csv | A comma-separated version of the .tab file |
.filt.vcf | The filtered variant calls from Freebayes |
.gff | The variants in GFF3 format |
.html | A HTML version of the .tab file |
.log | A log file with the commands run and their outputs |
.raw.vcf | The unfiltered variant calls from Freebayes |
.subs.vcf | Only substitution variants from the final annotated variants |
.tab | A simple tab-separated summary of all the variants |
.txt | A summary of the Snippy run. |
.vcf | The final annotated variants in VCF format |
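Since variant calls are nested by source (auto or user) and then by reference, collecting the final VCFs for a sample takes a small walk over the tree. The following is a minimal sketch that assumes the layout shown above and uncompressed outputs; `find_final_vcfs` and `outdir` are hypothetical names.

```python
from pathlib import Path


def find_final_vcfs(outdir, sample):
    """Map (source, reference_name) to the final .vcf for a sample.

    `source` is "auto" or "user". Missing directories are skipped.
    """
    variants = Path(outdir) / sample / "variants"
    found = {}
    for source in ("auto", "user"):
        source_dir = variants / source
        if not source_dir.is_dir():
            continue
        for ref_dir in sorted(source_dir.iterdir()):
            vcf = ref_dir / f"{sample}.vcf"
            if vcf.is_file():
                found[(source, ref_dir.name)] = vcf
    return found
```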