Bactopia Tool - pangenome

The pangenome subworkflow allows you to create a pan-genome of your samples with PIRATE, Panaroo, or Roary.

You can further supplement your pan-genome by including completed genomes. This is possible using the --species or --accessions parameters. If used, ncbi-genome-download will download the completed genomes available from RefSeq. Any downloaded genomes will be annotated with Prokka to create compatible GFF3 files.
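
As a minimal sketch of preparing input for --accessions, the snippet below writes a file with one accession per line and checks that every line looks like an NCBI Assembly accession. The accessions shown are placeholders, not real assemblies, and the pattern check is an illustrative assumption rather than something Bactopia performs.

```shell
# Placeholder accessions (not real assemblies); replace with your own.
printf '%s\n' GCF_000000001.1 GCF_000000002.1 > accessions.txt

# Count lines that do NOT look like an assembly accession:
# a GCA_ or GCF_ prefix, digits, a dot, and a version number.
bad_lines=$(awk '!/^GC[AF]_[0-9]+\.[0-9]+$/ { n++ } END { print n + 0 }' accessions.txt)
echo "lines with an unexpected format: ${bad_lines}"

# The file could then be passed to the workflow:
#   bactopia --wf pangenome --bactopia /path/to/your/bactopia/results --accessions accessions.txt
```

A count of 0 means every line matched the expected pattern.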

A phylogeny, based on the core-genome alignment, will be created by IQ-Tree. Optionally a recombination-masked core-genome alignment can be created with ClonalFrameML and maskrc-svg.

Finally, the core-genome pair-wise SNP distances between samples are calculated with snp-dists, and pan-genome wide association studies can optionally be conducted using Scoary.

Example Usage

bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt
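
The includes.txt file used above is a plain text file with one sample name per line. A minimal sketch of creating one, assuming hypothetical sample names (sample01 to sample03) that would need to match the names in your Bactopia results:

```shell
# Hypothetical sample names; replace with names from your Bactopia results.
printf '%s\n' sample01 sample02 sample03 > includes.txt

# Sanity check: count non-empty lines (one sample per line is expected).
n_samples=$(grep -c . includes.txt)
echo "includes.txt lists ${n_samples} samples"

# The file can then be passed to the workflow, as in the example above:
#   bactopia --wf pangenome --bactopia /path/to/your/bactopia/results --include includes.txt
```

The same layout applies to a file given to --exclude.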

Output Overview

Below is the default output structure for the pangenome tool. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
└── bactopia-runs
    └── pangenome-<TIMESTAMP>
        ├── clonalframeml
        │   ├── core-genome.ML_sequence.fasta
        │   ├── core-genome.em.txt
        │   ├── core-genome.emsim.txt
        │   ├── core-genome.importation_status.txt
        │   ├── core-genome.labelled_tree.newick
        │   ├── core-genome.position_cross_reference.txt
        │   └── logs
        │       ├── nf-clonalframeml.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── core-genome.aln.gz
        ├── core-genome.distance.tsv
        ├── core-genome.iqtree
        ├── core-genome.masked.aln.gz
        ├── iqtree
        │   ├── core-genome.alninfo
        │   ├── core-genome.bionj
        │   ├── core-genome.ckp.gz
        │   ├── core-genome.contree
        │   ├── core-genome.mldist
        │   ├── core-genome.splits.nex
        │   ├── core-genome.treefile
        │   ├── core-genome.ufboot
        │   └── logs
        │       ├── core-genome.log
        │       ├── nf-iqtree.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── iqtree-fast
        │   ├── logs
        │   │   ├── nf-iqtree-fast.{begin,err,log,out,run,sh,trace}
        │   │   ├── start-tree.log
        │   │   └── versions.yml
        │   ├── start-tree.bionj
        │   ├── start-tree.ckp.gz
        │   ├── start-tree.iqtree
        │   ├── start-tree.mldist
        │   ├── start-tree.model.gz
        │   └── start-tree.treefile
        ├── nf-reports
        │   ├── pangenome-dag.dot
        │   ├── pangenome-report.html
        │   ├── pangenome-timeline.html
        │   └── pangenome-trace.txt
        ├── panaroo
        │   ├── aligned_gene_sequences
        │   ├── alignment_entropy.csv
        │   ├── combined_DNA_CDS.fasta
        │   ├── combined_protein_CDS.fasta
        │   ├── combined_protein_cdhit_out.txt
        │   ├── combined_protein_cdhit_out.txt.clstr
        │   ├── core_alignment_filtered_header.embl
        │   ├── core_alignment_header.embl
        │   ├── core_gene_alignment_filtered.aln
        │   ├── final_graph.gml
        │   ├── gene_data.csv
        │   ├── gene_presence_absence.Rtab
        │   ├── gene_presence_absence.csv
        │   ├── gene_presence_absence_roary.csv
        │   ├── logs
        │   │   ├── nf-panaroo.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── pan_genome_reference.fa
        │   ├── pre_filt_graph.gml
        │   ├── struct_presence_absence.Rtab
        │   └── summary_statistics.txt
        ├── pirate
        │   ├── PIRATE.gene_families.ordered.tsv
        │   ├── PIRATE.gene_families.tsv
        │   ├── PIRATE.genomes_per_allele.tsv
        │   ├── PIRATE.pangenome_summary.txt
        │   ├── PIRATE.unique_alleles.tsv
        │   ├── binary_presence_absence.fasta.gz
        │   ├── binary_presence_absence.nwk
        │   ├── cluster_alleles.tab
        │   ├── co-ords
        │   │   └── <SAMPLE_NAME>.co-ords.tab
        │   ├── core_alignment.fasta.gz
        │   ├── core_alignment.gff
        │   ├── feature_sequences
        │   │   └── <GENE_FAMILY>.{aa|nucleotide}.fasta.gz
        │   ├── gene_presence_absence.csv
        │   ├── genome2loci.tab
        │   ├── genome_list.txt
        │   ├── loci_list.tab
        │   ├── loci_paralog_categories.tab
        │   ├── logs
        │   │   ├── nf-pirate.{begin,err,log,out,run,sh,trace}
        │   │   ├── results
        │   │   │   ├── PIRATE.log
        │   │   │   ├── link_clusters.log
        │   │   │   └── split_groups.log
        │   │   └── versions.yml
        │   ├── modified_gffs
        │   ├── pan_sequences.fasta.gz
        │   ├── pangenome.connected_blocks.tsv
        │   ├── pangenome.edges
        │   ├── pangenome.gfa
        │   ├── pangenome.order.tsv
        │   ├── pangenome.reversed.tsv
        │   ├── pangenome.syntenic_blocks.tsv
        │   ├── pangenome.temp
        │   ├── pangenome_alignment.fasta.gz
        │   ├── pangenome_alignment.gff
        │   ├── pangenome_iterations
        │   │   ├── pan_sequences.{50|60|70|80|90|95|98}.reclustered.reinflated
        │   │   ├── pan_sequences.blast.output
        │   │   ├── pan_sequences.cdhit_clusters
        │   │   ├── pan_sequences.core_clusters.tab
        │   │   ├── pan_sequences.mcl_log.txt
        │   │   └── pan_sequences.representative.fasta.gz
        │   ├── pangenome_log.txt
        │   ├── paralog_clusters.tab
        │   ├── representative_sequences.faa
        │   └── representative_sequences.ffn
        ├── roary
        │   ├── accessory.header.embl
        │   ├── accessory.tab
        │   ├── accessory_binary_genes.fa.gz
        │   ├── accessory_binary_genes.fa.newick
        │   ├── accessory_graph.dot
        │   ├── blast_identity_frequency.Rtab
        │   ├── clustered_proteins
        │   ├── core_accessory.header.embl
        │   ├── core_accessory.tab
        │   ├── core_accessory_graph.dot
        │   ├── core_alignment_header.embl
        │   ├── gene_presence_absence.Rtab
        │   ├── gene_presence_absence.csv
        │   ├── logs
        │   │   ├── nf-roary.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── number_of_conserved_genes.Rtab
        │   ├── number_of_genes_in_pan_genome.Rtab
        │   ├── number_of_new_genes.Rtab
        │   ├── number_of_unique_genes.Rtab
        │   ├── pan_genome_reference.fa.gz
        │   └── summary_statistics.txt
        └── snpdists
            └── logs
                ├── nf-snpdists.{begin,err,log,out,run,sh,trace}
                └── versions.yml

Results

Main Results

Below are the main results of the pangenome Bactopia Tool.

Filename Description
core-genome.aln.gz A multiple sequence alignment FASTA of the core genome
core-genome.distance.tsv Core genome pair-wise SNP distance for each sample
core-genome.iqtree Full result of the IQ-TREE core genome phylogeny
core-genome.masked.aln.gz A core-genome alignment with the recombination masked
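
As a rough illustration of working with core-genome.distance.tsv, the sketch below lists sample pairs whose SNP distance is at or below a cutoff. It assumes the default snp-dists matrix layout (a header row of sample names, then one row per sample), and it builds a toy matrix first so that it runs on its own. The file name toy.distance.tsv, the sample names, and the cutoff are all hypothetical.

```shell
# Toy matrix in the layout snp-dists writes by default (assumption):
# a header row of sample names, then one row per sample.
printf 'snp-dists\tsampleA\tsampleB\tsampleC\n' > toy.distance.tsv
printf 'sampleA\t0\t3\t25\n' >> toy.distance.tsv
printf 'sampleB\t3\t0\t27\n' >> toy.distance.tsv
printf 'sampleC\t25\t27\t0\n' >> toy.distance.tsv

# Report each unordered pair once (upper triangle only) when its
# distance is at or below the cutoff.
max_snps=10   # hypothetical cutoff
close_pairs=$(awk -F'\t' -v max="$max_snps" '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
  { for (i = NR + 1; i <= NF; i++) if ($i <= max) print $1 "\t" name[i] "\t" $i }
' toy.distance.tsv)
echo "$close_pairs"
```

To use it on real output, point awk at core-genome.distance.tsv instead of the toy file.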

ClonalFrameML

Below is a description of the ClonalFrameML results. For more details about ClonalFrameML outputs see ClonalFrameML - Outputs.

Filename Description
core-genome.ML_sequence.fasta The sequence reconstructed by maximum likelihood for all internal nodes of the phylogeny, as well as for all missing data in the input sequences
core-genome.em.txt The point estimates for R/theta, nu, delta and the branch lengths
core-genome.emsim.txt The bootstrapped values for the three parameters R/theta, nu and delta
core-genome.importation_status.txt The list of reconstructed recombination events
core-genome.labelled_tree.newick The output tree with all nodes labelled so that they can be referred to in other files
core-genome.position_cross_reference.txt A vector of comma-separated values indicating for each location in the input sequence file the corresponding position in the sequences in the output ML_sequence.fasta file

IQ-TREE

Below is a description of the IQ-TREE results. If ClonalFrameML is executed, a fast tree is created first and given the prefix start-tree, while the final tree has the prefix core-genome. For more details about IQ-TREE outputs see IQ-TREE - Outputs.

Filename Description
core-genome.alninfo Alignment site statistics
{core-genome,start-tree}.bionj A neighbor joining tree produced by BIONJ
{core-genome,start-tree}.ckp.gz IQ-TREE writes a checkpoint file
core-genome.contree Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
{core-genome,start-tree}.mldist Contains the likelihood distances
{core-genome,start-tree}.model.gz Information about all models tested
core-genome.splits.nex Support values in percentage for all splits (bipartitions), computed as the occurrence frequencies in the bootstrap trees
{core-genome,start-tree}.treefile Maximum likelihood tree in NEWICK format, can be visualized with treeviewer programs
core-genome.ufboot Trees created during the bootstrap steps

PIRATE

Below is a description of the PIRATE results. For more details about PIRATE outputs see PIRATE - Output files.

Available by default

By default PIRATE is used to create the pan-genome. If --use_panaroo or --use_roary is given, PIRATE outputs will not be available; only the Panaroo or Roary outputs will be produced.
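
The choice of tool comes down to a single flag. A minimal sketch of mapping a tool name to that flag and printing the resulting command for review; the variable names (tool, tool_flag, pangenome_cmd) are illustrative only and not part of Bactopia:

```shell
# Pick one of: pirate (default), panaroo, roary.
tool="panaroo"

case "$tool" in
  pirate)  tool_flag="" ;;               # default, no extra flag needed
  panaroo) tool_flag="--use_panaroo" ;;
  roary)   tool_flag="--use_roary" ;;
  *)       tool_flag=""; echo "Unknown tool '$tool', falling back to PIRATE" >&2 ;;
esac

# Print the resulting command for review before running it.
pangenome_cmd="bactopia --wf pangenome --bactopia /path/to/your/bactopia/results $tool_flag"
echo "$pangenome_cmd"
```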

Filename Description
PIRATE.gene_families.ordered.tsv Tabular summary of all gene families ordered on syntenic regions in the pangenome graph
PIRATE.gene_families.tsv Tabular summary of all gene families
PIRATE.genomes_per_allele.tsv A list of genomes associated with each allele
PIRATE.pangenome_summary.txt Short summary of the number and frequency of genes in the pangenome
PIRATE.unique_alleles.tsv Tabular summary of all unique alleles of each gene family
binary_presence_absence.{fasta.gz,nwk} A tree (.nwk) generated by fasttree from binary gene_family presence-absence data and the fasta file used to create it
cluster_alleles.tab List of alleles in paralogous clusters
co-ords/${SAMPLE_NAME}.co-ords.tab Gene feature co-ordinates for each sample
core_alignment.fasta.gz Gene-by-gene nucleotide alignments of the core genome created using MAFFT
core_alignment.gff Annotation containing the position of the gene family within the core genome alignment
feature_sequences/${GENE_FAMILY}.{aa|nucleotide}.fasta Amino acid and nucleotide sequences for each gene family
gene_presence_absence.csv Lists each gene and which samples it is present in
genome2loci.tab List of loci for each genome
genome_list.txt List of genomes in the analysis
loci_list.tab List of loci and their associated genomes
loci_paralog_categories.tab Concatenation of classified paralogs
modified_gffs/${SAMPLE_NAME}.gff GFF3 files which have been standardised for PIRATE
pan_sequences.fasta.gz All representative sequences in the pangenome
pangenome.connected_blocks.tsv List of connected blocks in the pangenome graph
pangenome.edges List of classified edges in the pangenome graph
pangenome.gfa GFA network file representing all unique connections between gene families
pangenome.order.tsv List of gene families sorted by their order on the pangenome graph
pangenome.reversed.tsv List of reversed blocks in the pangenome graph
pangenome.syntenic_blocks.tsv List of syntenic blocks in the pangenome graph
pangenome.temp Temporary file used by PIRATE
pangenome_alignment.fasta.gz Gene-by-gene nucleotide alignments of the full pangenome created using MAFFT
pangenome_alignment.gff Annotation containing the position of the gene family within the pangenome alignment
pangenome_iterations/pan_sequences.{50|60|70|80|90|95|98}.reclustered.reinflated List of clusters for each reinflation threshold
pangenome_iterations/pan_sequences.blast.output BLAST output of sequences against representatives and self hits.
pangenome_iterations/pan_sequences.cdhit_clusters A list of CDHIT representative clusters
pangenome_iterations/pan_sequences.core_clusters.tab A list of core clusters.
pangenome_iterations/pan_sequences.mcl_log.txt A log file from mcxdeblast and mcl
pangenome_iterations/pan_sequences.representative.fasta FASTA file with sequences for each representative cluster
pangenome_log.txt Log file from PIRATE
paralog_clusters.tab List of paralogous clusters
representative_sequences.{faa,ffn} Representative protein and gene sequences for each gene family

Panaroo

Below is a description of the Panaroo results. For more details about Panaroo outputs see Panaroo Documentation.

Only available when --use_panaroo is given

By default PIRATE is used to create the pan-genome, unless --use_panaroo is given.

Filename Description
aligned_gene_sequences A directory of per-gene alignments
combined_DNA_CDS.fasta.gz All nucleotide sequences for the annotated genes
combined_protein_CDS.fasta.gz All protein sequences for the annotated proteins
combined_protein_cdhit_out.txt Log output from CD-HIT
combined_protein_cdhit_out.txt.clstr Cluster information from CD-HIT
core_alignment_header.embl The core/pan-genome alignment in EMBL format
core_gene_alignment.aln.gz The core/pan-genome alignment in FASTA format
final_graph.gml The final pan-genome graph generated by Panaroo
gene_data.csv CSV linking each gene sequence and annotation to the internal representations
gene_presence_absence.Rtab A binary tab-separated version of the gene_presence_absence.csv
gene_presence_absence.csv Lists each gene and which samples it is present in
gene_presence_absence_roary.csv Lists each gene and which samples it is present in, in the same format as Roary
pan_genome_reference.fa.gz FASTA file which contains a single representative nucleotide sequence from each of the clusters in the pan genome (core and accessory)
pre_filt_graph.gml An intermediate pan-genome graph generated by Panaroo
struct_presence_absence.Rtab A csv file which lists the presence and absence of different genomic rearrangement events
summary_statistics.txt Number of genes in the core and accessory

Roary

Below is a description of the Roary results. For more details about Roary outputs see Roary Documentation.

Only available when --use_roary is given

By default PIRATE is used to create the pan-genome, unless --use_roary is given.

Filename Description
accessory.header.embl EMBL formatted file of accessory genes
accessory.tab Tab-delimited formatted file of accessory genes
accessory_binary_genes.fa A FASTA file with binary presence and absence of accessory genes
accessory_binary_genes.fa.newick A tree created using the binary presence and absence of accessory genes
accessory_graph.dot A graph in DOT format of how genes are linked together at the contig level in the accessory genome
blast_identity_frequency.Rtab Blast results for percentage identity graph
clustered_proteins Groups file where each line lists the sequences in a cluster
core_accessory.header.embl EMBL formatted file of core genes
core_accessory.tab Tab-delimited formatted file of core genes
core_accessory_graph.dot A graph in DOT format of how genes are linked together at the contig level in the pan genome
core_alignment_header.embl EMBL formatted file of core genome alignment
gene_presence_absence.csv Lists each gene and which samples it is present in
gene_presence_absence.Rtab Tab delimited binary matrix with the presence and absence of each gene in each sample
number_of_conserved_genes.Rtab Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_genes_in_pan_genome.Rtab Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_new_genes.Rtab Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_unique_genes.Rtab Graphs on how the pan genome varies as genomes are added (in random orders)
pan_genome_reference.fa.gz FASTA file which contains a single representative nucleotide sequence from each of the clusters in the pan genome (core and accessory)
summary_statistics.txt Number of genes in the core and accessory
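
The gene_presence_absence.Rtab matrix lends itself to quick summaries. As a hedged sketch, the snippet below counts genes present in every sample versus those missing from at least one. It assumes the first column holds the gene name and the remaining columns hold 0/1 values per sample, and it uses a strict 100% definition of core, which differs from the 99% default of --cd. The toy file and its gene and sample names are hypothetical.

```shell
# Toy binary matrix: first column is the gene, remaining columns are samples.
printf 'Gene\ts1\ts2\ts3\n' > toy.Rtab
printf 'geneA\t1\t1\t1\n' >> toy.Rtab
printf 'geneB\t1\t0\t1\n' >> toy.Rtab
printf 'geneC\t0\t0\t1\n' >> toy.Rtab
printf 'geneD\t1\t1\t1\n' >> toy.Rtab

# A gene counts as core here only when it is present in all samples.
core_summary=$(awk -F'\t' '
  NR > 1 {
    n = 0
    for (i = 2; i <= NF; i++) n += $i
    if (n == NF - 1) core++; else accessory++
  }
  END { printf "core=%d accessory=%d\n", core, accessory }
' toy.Rtab)
echo "$core_summary"
```

For real data, replace toy.Rtab with the gene_presence_absence.Rtab file from the roary or panaroo folder.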

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension Description
.begin An empty file used to designate the process started
.err Contains STDERR outputs from the process
.log Contains both STDERR and STDOUT outputs from the process
.out Contains STDOUT outputs from the process
.run The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh The script executed by bash for the process
.trace The Nextflow Trace report for the process
versions.yml A YAML formatted file with program versions
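
When troubleshooting, it can help to find which processes wrote anything to STDERR. A minimal sketch, assuming a hypothetical run directory named bactopia-runs/pangenome-example; it creates a toy layout first so that it runs on its own:

```shell
# Toy layout mirroring the logs folders described above.
run_dir="bactopia-runs/pangenome-example"   # hypothetical run directory
mkdir -p "$run_dir/pirate/logs" "$run_dir/iqtree/logs"
: > "$run_dir/pirate/logs/nf-pirate.err"                       # empty: nothing on STDERR
echo "example warning" > "$run_dir/iqtree/logs/nf-iqtree.err"  # non-empty

# List only the .err files that actually contain output.
nonempty_err=$(find "$run_dir" -type f -name '*.err' -size +0)
echo "$nonempty_err"
```

Pointing run_dir at a real pangenome-<TIMESTAMP> folder gives the same listing for an actual run.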

Nextflow Reports

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename Description
pangenome-dag.dot The Nextflow DAG visualisation
pangenome-report.html The Nextflow Execution Report
pangenome-timeline.html The Nextflow Timeline Report
pangenome-trace.txt The Nextflow Trace report

Program Versions

At the end of each run, all of the versions.yml files are merged into the files below.

Filename Description
software_versions.yml A complete list of programs and versions used by each process
software_versions_mqc.yml A complete list of programs and versions formatted for MultiQC

Parameters

Required Parameters

Define where the pipeline should find input data and save output data.

Parameter Description
--bactopia The path to bactopia results to use as inputs
Type: string

Filtering Parameters

Use these parameters to specify which samples to include or exclude.

Parameter Description
--include A text file containing sample names (one per line) to include in the analysis
Type: string
--exclude A text file containing sample names (one per line) to exclude from the analysis
Type: string

ClonalFrameML Parameters

Parameter Description
--emsim Number of simulations to estimate uncertainty in the EM results
Type: integer, Default: 100
--clonal_opts Extra ClonalFrameML options in quotes
Type: string
--skip_recombination Skip ClonalFrameML execution in subworkflows
Type: boolean

IQ-TREE Parameters

Parameter Description
--iqtree_model Substitution model name
Type: string, Default: HKY
--bb Ultrafast bootstrap replicates
Type: integer, Default: 1000
--alrt SH-like approximate likelihood ratio test replicates
Type: integer, Default: 1000
--asr Ancestral state reconstruction by empirical Bayes
Type: boolean
--iqtree_opts Extra IQ-TREE options in quotes.
Type: string
--skip_phylogeny Skip IQ-TREE execution in subworkflows
Type: boolean

NCBI Genome Download Parameters

Parameter Description
--species Name of the species to download assemblies
Type: string
--accession An NCBI Assembly accession to be downloaded
Type: string
--accessions A file of NCBI Assembly accessions (one per line) to be downloaded
Type: string
--format Comma separated list of formats to download
Type: string, Default: fasta
--section NCBI section to download
Type: string, Default: refseq
--assembly_level Comma separated list of assembly levels to download
Type: string, Default: complete
--kingdom Comma separated list of taxonomic groups to download
Type: string, Default: bacteria
--limit Limit the number of assemblies to download
Type: string

PIRATE Parameters

Parameter Description
--steps Percent identity thresholds to use for pangenome construction
Type: string, Default: 50,60,70,80,90,95,98
--features Comma-delimited features to use for pangenome construction
Type: string, Default: CDS
--para_off Switch off paralog identification
Type: boolean
--z Retain all PIRATE intermediate files
Type: boolean
--pan_opt Additional arguments to pass to pangenome construction.
Type: string

Prokka Parameters

Parameter Description
--proteins FASTA file of trusted proteins to first annotate from
Type: string
--prodigal_tf Training file to use for Prodigal
Type: string
--compliant Force Genbank/ENA/DDBJ compliance
Type: boolean
--centre Sequencing centre ID
Type: string, Default: Bactopia
--prokka_coverage Minimum coverage on query protein
Type: integer, Default: 80
--prokka_evalue Similarity e-value cut-off
Type: string, Default: 1e-09
--prokka_opts Extra Prokka options in quotes.
Type: string

Panaroo Parameters

Parameter Description
--use_panaroo Use Panaroo instead of PIRATE in the 'pangenome' subworkflow
Type: boolean
--panaroo_mode The stringency mode at which to run panaroo
Type: string, Default: strict
--panaroo_alignment Output alignments of core genes or all genes
Type: string, Default: core
--panaroo_aligner Aligner to use for core/pan genome alignment
Type: string, Default: mafft
--panaroo_core_threshold Core-genome sample threshold
Type: number, Default: 0.95
--panaroo_threshold Sequence identity threshold
Type: number, Default: 0.98
--panaroo_family_threshold Protein family sequence identity threshold
Type: number, Default: 0.7
--len_dif_percent Length difference cutoff
Type: number, Default: 0.98
--merge_paralogs Do not split paralogs
Type: boolean
--panaroo_opts Additional options to pass to panaroo
Type: string

Roary Parameters

Parameter Description
--use_prank Use PRANK instead of MAFFT for the core gene alignment
Type: boolean
--use_roary Use Roary instead of PIRATE in the 'pangenome' subworkflow
Type: boolean
--i Minimum percentage identity for blastp
Type: integer, Default: 95
--cd Percentage of isolates a gene must be in to be core
Type: integer, Default: 99
--g Maximum number of clusters
Type: integer, Default: 50000
--s Do not split paralogs
Type: boolean
--ap Allow paralogs in core alignment
Type: boolean
--iv MCL inflation value
Type: number, Default: 1.5

Scoary Parameters

Parameter Description
--traits Input trait table (CSV) to test for associations
Type: string
--p_value_cutoff For statistical tests, genes with higher p-values will not be reported
Type: number, Default: 0.05
--correction Apply the indicated filtration measure.
Type: string, Default: I
--permute Perform N number of permutations of the significant results post-analysis
Type: integer
--start_col On which column in the gene presence/absence file do individual strain info start
Type: integer, Default: 15
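
A minimal sketch of a trait table for --traits, assuming the layout Scoary documents for its input: a header row of trait names, sample names in the first column, and 0/1 values (NA for missing). The sample names and the trait name below are hypothetical, and the value check is only an illustrative sanity test.

```shell
# Hypothetical samples and trait; replace with your own data.
cat > traits.csv <<'EOF'
Name,resistant
sample01,1
sample02,0
sample03,1
EOF

# Count data cells that are not 0, 1 or NA.
bad_cells=$(awk -F',' '
  NR > 1 { for (i = 2; i <= NF; i++) if ($i != "0" && $i != "1" && $i != "NA") n++ }
  END { print n + 0 }
' traits.csv)
echo "unexpected trait values: ${bad_cells}"

# The table could then be passed to the workflow:
#   bactopia --wf pangenome --bactopia /path/to/your/bactopia/results --traits traits.csv
```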

SNP-Dists Parameters

Parameter Description
--a Count all differences not just [AGTC]
Type: boolean
--b Keep top left corner cell
Type: boolean
--csv Output CSV instead of TSV
Type: boolean
--k Keep case, don't uppercase all letters
Type: boolean

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter Description
--outdir Base directory to write results to
Type: string, Default: ./
--run_name Name of the directory to hold results
Type: string, Default: bactopia
--skip_compression Output files will not be compressed
Type: boolean
--datasets The path to cache datasets to
Type: string
--keep_all_files Keeps all analysis files created
Type: boolean

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter Description
--max_retry Maximum times to retry a process before allowing it to fail.
Type: integer, Default: 3
--max_cpus Maximum number of CPUs that can be requested for any single job.
Type: integer, Default: 4
--max_memory Maximum amount of memory (in GB) that can be requested for any single job.
Type: integer, Default: 32
--max_time Maximum amount of time (in minutes) that can be requested for any single job.
Type: integer, Default: 120
--max_downloads Maximum number of samples to download at a time
Type: integer, Default: 3

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter Description
--nfconfig A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set.
Type: string
--publish_dir_mode Method used to save pipeline results to output directory.
Type: string, Default: copy
--infodir Directory to keep pipeline Nextflow logs and reports.
Type: string, Default: ${params.outdir}/pipeline_info
--force Nextflow will overwrite existing output files.
Type: boolean
--cleanup_workdir After Bactopia is successfully executed, the work directory will be deleted.
Type: boolean

Nextflow Profile Parameters

Parameters to fine-tune your Nextflow setup.

Parameter Description
--condadir Directory Nextflow should use for Conda environments
Type: string
--registry Docker registry to pull containers from.
Type: string, Default: dockerhub
--datasets_cache Directory where downloaded datasets should be stored.
Type: string, Default: <BACTOPIA_DIR>/data/datasets
--singularity_cache Directory where remote Singularity images are stored.
Type: string
--singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
Type: boolean
--force_rebuild Force overwrite of existing pre-built environments.
Type: boolean
--queue Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM)
Type: string, Default: general,high-memory
--cluster_opts Additional options to pass to the executor. (e.g. SLURM: '--account=my_acct_name')
Type: string
--disable_scratch All intermediate files created on worker nodes will be transferred to the head node.
Type: boolean

Helpful Parameters

Uncommonly used parameters that might be useful.

Parameter Description
--monochrome_logs Do not use coloured log outputs.
Type: boolean
--nfdir Print directory Nextflow has pulled Bactopia to
Type: boolean
--sleep_time The amount of time (seconds) Nextflow will wait after setting up datasets before execution.
Type: integer, Default: 5
--validate_params Boolean whether to validate parameters against the schema at runtime
Type: boolean, Default: True
--help Display help text.
Type: boolean
--wf Specify which workflow or Bactopia Tool to execute
Type: string, Default: bactopia
--list_wfs List the available workflows and Bactopia Tools to use with '--wf'
Type: boolean
--show_hidden_params Show all params when using --help
Type: boolean
--help_all An alias for --help --show_hidden_params
Type: boolean
--version Display version text.
Type: boolean

Citations

If you use Bactopia and pangenome in your analysis, please cite the following.