Bactopia Tool - `pangenome`

The pangenome subworkflow allows you to create a pan-genome of your samples (with PIRATE, Panaroo, or Roary).

You can further supplement your pan-genome by including completed genomes. This is possible using the --species or --accessions parameters. If used, ncbi-genome-download will download the completed genomes available from RefSeq. Any downloaded genomes will be annotated with Prokka to create compatible GFF3 files.

A phylogeny, based on the core-genome alignment, will be created by IQ-Tree. Optionally a recombination-masked core-genome alignment can be created with ClonalFrameML and maskrc-svg.

Finally, core-genome pair-wise SNP distances between samples are calculated with snp-dists, and additional pan-genome wide association studies can be conducted using Scoary.

Example Usage

bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results
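
To also pull in completed RefSeq genomes, the same command can carry `--species` and `--limit`. The following is a minimal sketch rather than a canonical recipe: the `pangenome_cmd` wrapper, the `DRY_RUN` variable, the species name, and the limit of 10 are illustrative assumptions. By default the wrapper only prints the command so it can be reviewed before anything is launched.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not part of Bactopia): assemble the pangenome command.
#   $1 = Bactopia results directory
#   $2 = species name passed to --species
#   $3 = maximum number of assemblies passed to --limit
pangenome_cmd() {
  set -- bactopia --wf pangenome --bactopia "$1" --species "$2" --limit "$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): print the command instead of executing it.
    printf '%s\n' "$*"
  else
    # Real run: each element stays a separate, correctly quoted argument.
    "$@"
  fi
}

pangenome_cmd /path/to/your/bactopia/results "Staphylococcus aureus" 10
```

Setting `DRY_RUN=0` in the environment makes the wrapper execute the command instead of printing it.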

Output Overview

Below is the default output structure for the pangenome tool. Where possible the file descriptions below were modified from a tool's description.

<BACTOPIA_DIR>
└── bactopia-runs
    └── pangenome-<TIMESTAMP>
        ├── clonalframeml
        │   ├── core-genome.ML_sequence.fasta
        │   ├── core-genome.em.txt
        │   ├── core-genome.emsim.txt
        │   ├── core-genome.importation_status.txt
        │   ├── core-genome.labelled_tree.newick
        │   ├── core-genome.position_cross_reference.txt
        │   └── logs
        │       ├── nf-clonalframeml.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── core-genome.aln.gz
        ├── core-genome.distance.tsv
        ├── core-genome.iqtree
        ├── core-genome.masked.aln.gz
        ├── iqtree
        │   ├── core-genome.alninfo
        │   ├── core-genome.bionj
        │   ├── core-genome.ckp.gz
        │   ├── core-genome.contree
        │   ├── core-genome.mldist
        │   ├── core-genome.splits.nex
        │   ├── core-genome.treefile
        │   ├── core-genome.ufboot
        │   └── logs
        │       ├── core-genome.log
        │       ├── nf-iqtree.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── iqtree-fast
        │   ├── logs
        │   │   ├── nf-iqtree-fast.{begin,err,log,out,run,sh,trace}
        │   │   ├── start-tree.log
        │   │   └── versions.yml
        │   ├── start-tree.bionj
        │   ├── start-tree.ckp.gz
        │   ├── start-tree.iqtree
        │   ├── start-tree.mldist
        │   ├── start-tree.model.gz
        │   └── start-tree.treefile
        ├── nf-reports
        │   ├── pangenome-dag.dot
        │   ├── pangenome-report.html
        │   ├── pangenome-timeline.html
        │   └── pangenome-trace.txt
        ├── panaroo
        │   ├── aligned_gene_sequences
        │   ├── alignment_entropy.csv
        │   ├── combined_DNA_CDS.fasta
        │   ├── combined_protein_CDS.fasta
        │   ├── combined_protein_cdhit_out.txt
        │   ├── combined_protein_cdhit_out.txt.clstr
        │   ├── core_alignment_filtered_header.embl
        │   ├── core_alignment_header.embl
        │   ├── core_gene_alignment_filtered.aln
        │   ├── final_graph.gml
        │   ├── gene_data.csv
        │   ├── gene_presence_absence.Rtab
        │   ├── gene_presence_absence.csv
        │   ├── gene_presence_absence_roary.csv
        │   ├── logs
        │   │   ├── nf-panaroo.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── pan_genome_reference.fa
        │   ├── pre_filt_graph.gml
        │   ├── struct_presence_absence.Rtab
        │   └── summary_statistics.txt
        ├── pirate
        │   ├── PIRATE.gene_families.ordered.tsv
        │   ├── PIRATE.gene_families.tsv
        │   ├── PIRATE.genomes_per_allele.tsv
        │   ├── PIRATE.pangenome_summary.txt
        │   ├── PIRATE.unique_alleles.tsv
        │   ├── binary_presence_absence.fasta.gz
        │   ├── binary_presence_absence.nwk
        │   ├── cluster_alleles.tab
        │   ├── co-ords
        │   │   └── <SAMPLE_NAME>.co-ords.tab
        │   ├── core_alignment.fasta.gz
        │   ├── core_alignment.gff
        │   ├── feature_sequences
        │   │   └── <GENE_FAMILY>.{aa|nucleotide}.fasta.gz
        │   ├── gene_presence_absence.csv
        │   ├── genome2loci.tab
        │   ├── genome_list.txt
        │   ├── loci_list.tab
        │   ├── loci_paralog_categories.tab
        │   ├── logs
        │   │   ├── nf-pirate.{begin,err,log,out,run,sh,trace}
        │   │   ├── results
        │   │   │   ├── PIRATE.log
        │   │   │   ├── link_clusters.log
        │   │   │   └── split_groups.log
        │   │   └── versions.yml
        │   ├── modified_gffs
        │   ├── pan_sequences.fasta.gz
        │   ├── pangenome.connected_blocks.tsv
        │   ├── pangenome.edges
        │   ├── pangenome.gfa
        │   ├── pangenome.order.tsv
        │   ├── pangenome.reversed.tsv
        │   ├── pangenome.syntenic_blocks.tsv
        │   ├── pangenome.temp
        │   ├── pangenome_alignment.fasta.gz
        │   ├── pangenome_alignment.gff
        │   ├── pangenome_iterations
        │   │   ├── pan_sequences.{50|60|70|80|90|95|98}.reclustered.reinflated
        │   │   ├── pan_sequences.blast.output
        │   │   ├── pan_sequences.cdhit_clusters
        │   │   ├── pan_sequences.core_clusters.tab
        │   │   ├── pan_sequences.mcl_log.txt
        │   │   └── pan_sequences.representative.fasta.gz
        │   ├── pangenome_log.txt
        │   ├── paralog_clusters.tab
        │   ├── representative_sequences.faa
        │   └── representative_sequences.ffn
        ├── roary
        │   ├── accessory.header.embl
        │   ├── accessory.tab
        │   ├── accessory_binary_genes.fa.gz
        │   ├── accessory_binary_genes.fa.newick
        │   ├── accessory_graph.dot
        │   ├── blast_identity_frequency.Rtab
        │   ├── clustered_proteins
        │   ├── core_accessory.header.embl
        │   ├── core_accessory.tab
        │   ├── core_accessory_graph.dot
        │   ├── core_alignment_header.embl
        │   ├── gene_presence_absence.Rtab
        │   ├── gene_presence_absence.csv
        │   ├── logs
        │   │   ├── nf-roary.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── number_of_conserved_genes.Rtab
        │   ├── number_of_genes_in_pan_genome.Rtab
        │   ├── number_of_new_genes.Rtab
        │   ├── number_of_unique_genes.Rtab
        │   ├── pan_genome_reference.fa.gz
        │   └── summary_statistics.txt
        └── snpdists
            └── logs
                ├── nf-snpdists.{begin,err,log,out,run,sh,trace}
                └── versions.yml

Results

Main Results

Below are the main results of the pangenome Bactopia Tool.

Filename	Description
core-genome.aln.gz	A multiple sequence alignment FASTA of the core genome
core-genome.distance.tsv	Core genome pair-wise SNP distance for each sample
core-genome.iqtree	Full result of the IQ-TREE core genome phylogeny
core-genome.masked.aln.gz	A core-genome alignment with the recombination masked
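
As a quick way to explore `core-genome.distance.tsv`, the sketch below lists sample pairs whose SNP distance falls at or below a cutoff. It assumes the file is a square, tab-separated matrix with sample names in the header row and first column, in the same order. The `close_pairs` function, the sample names, the distances, and the cutoff of 20 are all made up for illustration.

```shell
#!/bin/sh
set -eu

# Print every unordered sample pair (upper triangle of the matrix) whose
# distance is <= the cutoff.
#   $1 = distance matrix (TSV), $2 = maximum distance to report
close_pairs() {
  awk -F'\t' -v max="$2" '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }  # column labels
    { for (i = NR + 1; i <= NF; i++)                          # skip diagonal and lower triangle
        if ($i + 0 <= max + 0) print $1 "\t" name[i] "\t" $i }
  ' "$1"
}

# Tiny stand-in for core-genome.distance.tsv (hypothetical values).
# The content of the top-left corner cell is ignored by close_pairs.
demo=$(mktemp)
{
  printf 'snp-dists\tsampleA\tsampleB\tsampleC\n'
  printf 'sampleA\t0\t12\t340\n'
  printf 'sampleB\t12\t0\t355\n'
  printf 'sampleC\t340\t355\t0\n'
} > "$demo"

close_pairs "$demo" 20
rm -f "$demo"
```

Because only the upper triangle is read, each pair is reported once. If the distances were written with `--csv`, the field separator would need to change accordingly.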

ClonalFrameML

Below is a description of the ClonalFrameML results. For more details about ClonalFrameML outputs see ClonalFrameML - Outputs.

Filename	Description
core-genome.ML_sequence.fasta	The sequence reconstructed by maximum likelihood for all internal nodes of the phylogeny, as well as for all missing data in the input sequences
core-genome.em.txt	The point estimates for R/theta, nu, delta and the branch lengths
core-genome.emsim.txt	The bootstrapped values for the three parameters R/theta, nu and delta
core-genome.importation_status.txt	The list of reconstructed recombination events
core-genome.labelled_tree.newick	The output tree with all nodes labelled so that they can be referred to in other files
core-genome.position_cross_reference.txt	A vector of comma-separated values indicating for each location in the input sequence file the corresponding position in the sequences in the output ML_sequence.fasta file

IQ-TREE

Below is a description of the IQ-TREE results. If ClonalFrameML is executed, a fast tree is created and given the prefix start-tree; the final tree has the prefix core-genome. For more details about IQ-TREE outputs see IQ-TREE - Outputs.

Filename	Description
core-genome.alninfo	Alignment site statistics
{core-genome,start-tree}.bionj	A neighbor joining tree produced by BIONJ
{core-genome,start-tree}.ckp.gz	IQ-TREE writes a checkpoint file
core-genome.contree	Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
{core-genome,start-tree}.mldist	Contains the likelihood distances
{core-genome,start-tree}.model.gz	Information about all models tested
core-genome.splits.nex	Support values in percentage for all splits (bipartitions), computed as the occurrence frequencies in the bootstrap trees
{core-genome,start-tree}.treefile	Maximum likelihood tree in NEWICK format, can be visualized with treeviewer programs
core-genome.ufboot	Trees created during the bootstrap steps

PIRATE

Below is a description of the PIRATE results. For more details about PIRATE outputs see PIRATE - Output files.

Available by default

By default PIRATE is used to create the pan-genome. If --use_panaroo or --use_roary is given, PIRATE outputs will not be available; only the Panaroo or Roary outputs will be produced.

Filename	Description
PIRATE.gene_families.ordered.tsv	Tabular summary of all gene families ordered on syntenic regions in the pangenome graph
PIRATE.gene_families.tsv	Tabular summary of all gene families
PIRATE.genomes_per_allele.tsv	A list of genomes associated with each allele
PIRATE.pangenome_summary.txt	Short summary of the number and frequency of genes in the pangenome
PIRATE.unique_alleles.tsv	Tabular summary of all unique alleles of each gene family
binary_presence_absence.{fasta.gz,nwk}	A tree (.nwk) generated by fasttree from binary gene_family presence-absence data and the fasta file used to create it
cluster_alleles.tab	List of alleles in paralogous clusters
co-ords/${SAMPLE_NAME}.co-ords.tab	Gene feature co-ordinates for each sample
core_alignment.fasta.gz	Gene-by-gene nucleotide alignments of the core genome created using MAFFT
core_alignment.gff	Annotation containing the position of the gene family within the core genome alignment
feature_sequences/${GENE_FAMILY}.{aa|nucleotide}.fasta	Amino acid and nucleotide sequences for each gene family
gene_presence_absence.csv	Lists each gene and which samples it is present in
genome2loci.tab	List of loci for each genome
genome_list.txt	List of genomes in the analysis
loci_list.tab	List of loci and their associated genomes
loci_paralog_categories.tab	Concatenation of classified paralogs
modified_gffs/${SAMPLE_NAME}.gff	GFF3 files which have been standardised for PIRATE
pan_sequences.fasta.gz	All representative sequences in the pangenome
pangenome.connected_blocks.tsv	List of connected blocks in the pangenome graph
pangenome.edges	List of classified edges in the pangenome graph
pangenome.gfa	GFA network file representing all unique connections between gene families
pangenome.order.tsv	Sorted list gene_families file on pangenome graph
pangenome.reversed.tsv	List of reversed blocks in the pangenome graph
pangenome.syntenic_blocks.tsv	List of syntenic blocks in the pangenome graph
pangenome.temp	Temporary file used by PIRATE
pangenome_alignment.fasta.gz	Gene-by-gene nucleotide alignments of the full pangenome created using MAFFT
pangenome_alignment.gff	Annotation containing the position of the gene family within the pangenome alignment
pangenome_iterations/pan_sequences.{50|60|70|80|90|95|98}.reclustered.reinflated	Clusters at each percent identity threshold (see `--steps`)
pangenome_iterations/pan_sequences.blast.output	BLAST output of sequences against representatives and self hits.
pangenome_iterations/pan_sequences.cdhit_clusters	A list of CDHIT representative clusters
pangenome_iterations/pan_sequences.core_clusters.tab	A list of core clusters.
pangenome_iterations/pan_sequences.mcl_log.txt	A log file from `mcxdeblast` and `mcl`
pangenome_iterations/pan_sequences.representative.fasta	FASTA file with sequences for each representative cluster
pangenome_log.txt	Log file from PIRATE
paralog_clusters.tab	List of paralogous clusters
representative_sequences.{faa,ffn}	Representative protein and gene sequences for each gene family

Panaroo

Below is a description of the Panaroo results. For more details about Panaroo outputs see Panaroo Documentation.

Only available when --use_panaroo is given

By default PIRATE is used to create the pan-genome, unless --use_panaroo is given.

Filename	Description
aligned_gene_sequences	A directory of per-gene alignments
combined_DNA_CDS.fasta.gz	All nucleotide sequences for the annotated genes
combined_protein_CDS.fasta.gz	All protein sequences for the annotated proteins
combined_protein_cdhit_out.txt	Log output from CD-HIT
combined_protein_cdhit_out.txt.clstr	Cluster information from CD-HIT
core_alignment_header.embl	The core/pan-genome alignment in EMBL format
core_gene_alignment.aln.gz	The core/pan-genome alignment in FASTA format
final_graph.gml	The final pan-genome graph generated by Panaroo
gene_data.csv	CSV linking each gene sequence and annotation to the internal representations
gene_presence_absence.Rtab	A binary tab separated version of the `gene_presence_absence.csv`
gene_presence_absence.csv	Lists each gene and which samples it is present in
gene_presence_absence_roary.csv	Lists each gene and which samples it is present in, in the same format as Roary
pan_genome_reference.fa.gz	FASTA file which contains a single representative nucleotide sequence from each of the clusters in the pan genome (core and accessory)
pre_filt_graph.gml	An intermediate pan-genome graph generated by Panaroo
struct_presence_absence.Rtab	A csv file which lists the presence and absence of different genomic rearrangement events
summary_statistics.txt	Number of genes in the core and accessory

Roary

Below is a description of the Roary results. For more details about Roary outputs see Roary Documentation.

Only available when --use_roary is given

By default PIRATE is used to create the pan-genome, unless --use_roary is given.

Filename	Description
accessory.header.embl	EMBL formatted file of accessory genes
accessory.tab	Tab-delimited formatted file of accessory genes
accessory_binary_genes.fa	A FASTA file with binary presence and absence of accessory genes
accessory_binary_genes.fa.newick	A tree created using the binary presence and absence of accessory genes
accessory_graph.dot	A graph in DOT format of how genes are linked together at the contig level in the accessory genome
blast_identity_frequency.Rtab	Blast results for percentage identity graph
clustered_proteins	Groups file where each line lists the sequences in a cluster
core_accessory.header.embl	EMBL formatted file of core genes
core_accessory.tab	Tab-delimited formatted file of core genes
core_accessory_graph.dot	A graph in DOT format of how genes are linked together at the contig level in the pan genome
core_alignment_header.embl	EMBL formatted file of core genome alignment
gene_presence_absence.csv	Lists each gene and which samples it is present in
gene_presence_absence.Rtab	Tab delimited binary matrix with the presence and absence of each gene in each sample
number_of_conserved_genes.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_genes_in_pan_genome.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_new_genes.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_unique_genes.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
pan_genome_reference.fa.gz	FASTA file which contains a single representative nucleotide sequence from each of the clusters in the pan genome (core and accessory)
summary_statistics.txt	Number of genes in the core and accessory
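
The binary `gene_presence_absence.Rtab` matrix (the Panaroo output lists a file of the same name) lends itself to quick cross-checks. The sketch below counts genes present in at least a given fraction of samples. It assumes a tab-separated layout with a header row (`Gene` followed by sample names) and one row per gene holding 1 or 0 per sample. The `count_core` function and the demo values are hypothetical, and the count is not meant to reproduce the tools' own core definitions.

```shell
#!/bin/sh
set -eu

# Count genes present in at least a given fraction of samples.
#   $1 = presence/absence matrix (Rtab), $2 = minimum fraction (0-1)
count_core() {
  awk -F'\t' -v min="$2" '
    NR == 1 { samples = NF - 1; next }            # header: Gene + sample names
    { present = 0
      for (i = 2; i <= NF; i++) present += $i      # sum the 0/1 flags
      if (present / samples >= min + 0) core++ }
    END { print core + 0 }                         # prints 0 if nothing qualified
  ' "$1"
}

# Tiny stand-in for gene_presence_absence.Rtab (hypothetical genes and samples).
demo=$(mktemp)
{
  printf 'Gene\tsampleA\tsampleB\tsampleC\tsampleD\n'
  printf 'geneA\t1\t1\t1\t1\n'
  printf 'geneB\t1\t1\t1\t0\n'
  printf 'geneC\t0\t1\t0\t0\n'
} > "$demo"

count_core "$demo" 0.95
rm -f "$demo"
```

Passing a lower fraction, for example `0.75`, widens the set to genes missing from some samples.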

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension	Description
.begin	An empty file used to designate the process started
.err	Contains STDERR outputs from the process
.log	Contains both STDERR and STDOUT outputs from the process
.out	Contains STDOUT outputs from the process
.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh	The script executed by bash for the process
.trace	The Nextflow Trace report for the process
versions.yml	A YAML formatted file with program versions
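
When reviewing a run, a reasonable first pass is to look for processes that wrote anything to STDERR. The sketch below lists non-empty `.err` files found in any `logs` folder under a run directory. The `nonempty_err` function and the throwaway directory are illustrative assumptions, and a non-empty `.err` file does not by itself mean the process failed.

```shell
#!/bin/sh
set -eu

# List non-empty .err files inside any logs/ folder below a run directory.
#   $1 = run directory (for example a pangenome-<TIMESTAMP> folder)
nonempty_err() {
  find "$1" -type f -path '*/logs/*' -name '*.err' -size +0c | sort
}

# Throwaway directory imitating two process log folders.
run_dir=$(mktemp -d)
mkdir -p "$run_dir/iqtree/logs" "$run_dir/snpdists/logs"
: > "$run_dir/iqtree/logs/nf-iqtree.err"                               # empty STDERR
printf 'example warning\n' > "$run_dir/snpdists/logs/nf-snpdists.err"  # non-empty STDERR

nonempty_err "$run_dir"
rm -rf "$run_dir"
```

The matching `.log` file in the same folder holds both STDOUT and STDERR, which is usually the next thing to read.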

Nextflow Reports

These Nextflow reports provide a great summary of your run. These can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename	Description
pangenome-dag.dot	The Nextflow DAG visualisation
pangenome-report.html	The Nextflow Execution Report
pangenome-timeline.html	The Nextflow Timeline Report
pangenome-trace.txt	The Nextflow Trace report

Program Versions

At the end of each run, the versions.yml files are merged into the files below.

Filename	Description
software_versions.yml	A complete list of programs and versions used by each process
software_versions_mqc.yml	A complete list of programs and versions formatted for MultiQC

Parameters

Required Parameters

Define where the pipeline should find input data and save output data.

Parameter	Description
`--bactopia`	The path to bactopia results to use as inputs Type: `string`

Filtering Parameters

Use these parameters to specify which samples to include or exclude.

Parameter	Description
`--include`	A text file containing sample names (one per line) to include in the analysis Type: `string`
`--exclude`	A text file containing sample names (one per line) to exclude from the analysis Type: `string`
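
Both filters expect a plain text file with one sample name per line. As a minimal sketch, the snippet below writes such a file and prints a matching command. The `write_sample_list` helper and the sample names are hypothetical, and the command is only echoed, not executed.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: write one sample name per line to a file.
#   $1 = output file, remaining arguments = sample names
write_sample_list() {
  out=$1
  shift
  printf '%s\n' "$@" > "$out"
}

exclude_file=$(mktemp)
write_sample_list "$exclude_file" sample01 sample07 sample12

# Sanity check: number of non-empty lines written.
echo "Samples listed: $(grep -c . "$exclude_file")"

# Command to review and run manually.
echo "bactopia --wf pangenome --bactopia /path/to/your/bactopia/results --exclude $exclude_file"
```

The same file layout applies to `--include`.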

ClonalFrameML Parameters

Parameter	Description
`--emsim`	Number of simulations to estimate uncertainty in the EM results Type: `integer`, Default: `100`
`--clonal_opts`	Extra ClonalFrameML options in quotes Type: `string`
`--skip_recombination`	Skip ClonalFrameML execution in subworkflows Type: `boolean`

IQ-TREE Parameters

Parameter	Description
`--iqtree_model`	Substitution model name Type: `string`, Default: `HKY`
`--bb`	Ultrafast bootstrap replicates Type: `integer`, Default: `1000`
`--alrt`	SH-like approximate likelihood ratio test replicates Type: `integer`, Default: `1000`
`--asr`	Ancestral state reconstruction by empirical Bayes Type: `boolean`
`--iqtree_opts`	Extra IQ-TREE options in quotes. Type: `string`
`--skip_phylogeny`	Skip IQ-TREE execution in subworkflows Type: `boolean`

NCBI Genome Download Parameters

Parameter	Description
`--species`	Name of the species to download assemblies Type: `string`
`--accession`	An NCBI Assembly accession to be downloaded Type: `string`
`--accessions`	A file of NCBI Assembly accessions (one per line) to be downloaded Type: `string`
`--format`	Comma separated list of formats to download Type: `string`, Default: `fasta`
`--section`	NCBI section to download Type: `string`, Default: `refseq`
`--assembly_level`	Comma separated list of assembly levels to download Type: `string`, Default: `complete`
`--kingdom`	Comma separated list of kingdoms to download Type: `string`, Default: `bacteria`
`--limit`	Limit the number of assemblies to download Type: `string`

PIRATE Parameters

Parameter	Description
`--use_pirate`	Use PIRATE instead of panaroo in the 'pangenome' subworkflow Type: `boolean`
`--steps`	Percent identity thresholds to use for pangenome construction Type: `string`, Default: `50,60,70,80,90,95,98`
`--features`	Comma-delimited features to use for pangenome construction Type: `string`, Default: `CDS`
`--para_off`	Switch off paralog identification Type: `boolean`
`--z`	Retain all PIRATE intermediate files Type: `boolean`
`--pan_opt`	Additional arguments to pass to pangenome construction. Type: `string`

Prokka Parameters

Parameter	Description
`--proteins`	FASTA file of trusted proteins to first annotate from Type: `string`
`--prodigal_tf`	Training file to use for Prodigal Type: `string`
`--compliant`	Force Genbank/ENA/DDBJ compliance Type: `boolean`
`--centre`	Sequencing centre ID Type: `string`, Default: `Bactopia`
`--prokka_coverage`	Minimum coverage on query protein Type: `integer`, Default: `80`
`--prokka_evalue`	Similarity e-value cut-off Type: `string`, Default: `1e-09`
`--prokka_opts`	Extra Prokka options in quotes. Type: `string`

Panaroo Parameters

Parameter	Description
`--panaroo_mode`	The stringency mode at which to run panaroo Type: `string`, Default: `strict`
`--panaroo_alignment`	Output alignments of core genes or all genes Type: `string`, Default: `core`
`--panaroo_aligner`	Aligner to use for core/pan genome alignment Type: `string`, Default: `mafft`
`--panaroo_core_threshold`	Core-genome sample threshold Type: `number`, Default: `0.95`
`--panaroo_threshold`	Sequence identity threshold Type: `number`, Default: `0.98`
`--panaroo_family_threshold`	Protein family sequence identity threshold Type: `number`, Default: `0.7`
`--len_dif_percent`	Length difference cutoff Type: `number`, Default: `0.98`
`--merge_paralogs`	Do not split paralogs Type: `boolean`
`--panaroo_opts`	Additional options to pass to panaroo Type: `string`

Roary Parameters

Parameter	Description
`--use_prank`	Use PRANK instead of MAFFT for core gene alignment Type: `boolean`
`--use_roary`	Use Roary instead of PIRATE in the 'pangenome' subworkflow Type: `boolean`
`--i`	Minimum percentage identity for blastp Type: `integer`, Default: `95`
`--cd`	Percentage of isolates a gene must be in to be core Type: `integer`, Default: `99`
`--g`	Maximum number of clusters Type: `integer`, Default: `50000`
`--s`	Do not split paralogs Type: `boolean`
`--ap`	Allow paralogs in core alignment Type: `boolean`
`--iv`	MCL inflation value Type: `number`, Default: `1.5`

Scoary Parameters

Parameter	Description
`--traits`	Input trait table (CSV) to test for associations Type: `string`
`--p_value_cutoff`	For statistical tests, genes with higher p-values will not be reported Type: `number`, Default: `0.05`
`--correction`	Apply the indicated filtration measure. Type: `string`, Default: `I`
`--permute`	Perform N number of permutations of the significant results post-analysis Type: `integer`
`--start_col`	On which column in the gene presence/absence file do individual strain info start Type: `integer`, Default: `15`

SNP-Dists Parameters

Parameter	Description
`--a`	Count all differences not just [AGTC] Type: `boolean`
`--b`	Keep top left corner cell Type: `boolean`
`--csv`	Output CSV instead of TSV Type: `boolean`
`--k`	Keep case, don't uppercase all letters Type: `boolean`

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter	Description
`--outdir`	Base directory to write results to Type: `string`, Default: `bactopia`
`--skip_compression`	Output files will not be compressed Type: `boolean`
`--datasets`	The path to cache datasets to Type: `string`
`--keep_all_files`	Keeps all analysis files created Type: `boolean`

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter	Description
`--max_retry`	Maximum times to retry a process before allowing it to fail. Type: `integer`, Default: `3`
`--max_cpus`	Maximum number of CPUs that can be requested for any single job. Type: `integer`, Default: `4`
`--max_memory`	Maximum amount of memory that can be requested for any single job. Type: `string`, Default: `128.GB`
`--max_time`	Maximum amount of time that can be requested for any single job. Type: `string`, Default: `240.h`
`--max_downloads`	Maximum number of samples to download at a time Type: `integer`, Default: `3`

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter	Description
`--nfconfig`	A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set. Type: `string`
`--publish_dir_mode`	Method used to save pipeline results to output directory. Type: `string`, Default: `copy`
`--infodir`	Directory to keep pipeline Nextflow logs and reports. Type: `string`, Default: `${params.outdir}/pipeline_info`
`--force`	Nextflow will overwrite existing output files. Type: `boolean`
`--cleanup_workdir`	After Bactopia is successfully executed, the `work` directory will be deleted. Type: `boolean`

Institutional config options

Parameters used to describe centralized config profiles. These should not be edited.

Parameter	Description
`--custom_config_version`	Git commit id for Institutional configs. Type: `string`, Default: `master`
`--custom_config_base`	Base directory for Institutional configs. Type: `string`, Default: `https://raw.githubusercontent.com/nf-core/configs/master`
`--config_profile_name`	Institutional config name. Type: `string`
`--config_profile_description`	Institutional config description. Type: `string`
`--config_profile_contact`	Institutional config contact information. Type: `string`
`--config_profile_url`	Institutional config URL link. Type: `string`

Nextflow Profile Parameters