Bactopia Tool - `pangenome`¶

The pangenome subworkflow allows you to create a pan-genome of your samples with PIRATE, Panaroo, or Roary.

You can further supplement your pan-genome by including completed genomes. This is possible using the --species or --accessions parameters. If used, ncbi-genome-download will download the completed genomes available from RefSeq. Any downloaded genomes will be annotated with Prokka to create compatible GFF3 files.

A phylogeny, based on the core-genome alignment, will be created by IQ-TREE. Optionally, a recombination-masked core-genome alignment can be created with ClonalFrameML and maskrc-svg.

Finally, the core-genome pairwise SNP distances between samples are calculated with snp-dists, and additional pan-genome-wide association studies can be conducted using Scoary.

Example Usage¶

bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt
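
The `--include` file used above is plain text with one sample name per line. As a minimal sketch, assuming three hypothetical sample names (`sample01` to `sample03`) that would need to match names in your own Bactopia results, the file can be built and sanity-checked before launching. The `bactopia` command is printed rather than executed.

```shell
# Hypothetical sample names; replace them with names from your own Bactopia run.
printf '%s\n' sample01 sample02 sample03 > includes.txt

# Sanity check: count the non-empty lines (one sample name per line).
n_samples=$(grep -c . includes.txt)
echo "includes.txt lists ${n_samples} samples"

# Printed rather than executed; drop the leading "echo" to launch the run.
echo bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt
```

Checking the line count first is a cheap way to catch an empty or truncated list before a long run starts.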

Output Overview¶

Below is the default output structure for the pangenome tool. Where possible, the file descriptions below were adapted from each tool's own description.

pangenome/
├── clonalframeml
│   ├── core-genome.ML_sequence.fasta
│   ├── core-genome.em.txt
│   ├── core-genome.emsim.txt
│   ├── core-genome.importation_status.txt
│   ├── core-genome.labelled_tree.newick
│   └── core-genome.position_cross_reference.txt
├── {iqtree,iqtree-fast}
│   ├── core-genome.alninfo
│   ├── {core-genome,start-tree}.bionj
│   ├── {core-genome,start-tree}.ckp.gz
│   ├── core-genome.contree
│   ├── {core-genome,start-tree}.mldist
│   ├── {core-genome,start-tree}.model.gz
│   ├── core-genome.splits.nex
│   ├── {core-genome,start-tree}.treefile
│   └── core-genome.ufboot
├── logs
│   ├── clonalframeml
│   │   ├── nf-clonalframeml.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   ├── custom_dumpsoftwareversions
│   │   ├── nf-custom_dumpsoftwareversions.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   ├── {iqtree,iqtree-fast}
│   │   ├── core-genome.log
│   │   ├── nf-iqtree.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   ├── pirate
│   │   ├── nf-pirate.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   ├── roary
│   │   ├── nf-roary.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   └── snpdists
│       ├── nf-snpdists.{begin,err,log,out,run,sh,trace}
│       └── versions.yml
├── nf-reports
│   ├── pangenome-dag.dot
│   ├── pangenome-report.html
│   ├── pangenome-timeline.html
│   └── pangenome-trace.txt
├── panaroo
│   ├── aligned_gene_sequences
│   ├── combined_DNA_CDS.fasta.gz
│   ├── combined_protein_CDS.fasta.gz
│   ├── combined_protein_cdhit_out.txt
│   ├── combined_protein_cdhit_out.txt.clstr
│   ├── core_alignment_header.embl
│   ├── core_gene_alignment.aln
│   ├── final_graph.gml
│   ├── gene_data.csv
│   ├── gene_presence_absence.Rtab
│   ├── gene_presence_absence.csv
│   ├── gene_presence_absence_roary.csv
│   ├── pan_genome_reference.fa
│   ├── pre_filt_graph.gml
│   ├── struct_presence_absence.Rtab
│   └── summary_statistics.txt
├── pirate
│   ├── PIRATE.gene_families.ordered.tsv
│   ├── PIRATE.gene_families.tsv
│   ├── PIRATE.genomes_per_allele.tsv
│   ├── PIRATE.pangenome_summary.txt
│   ├── PIRATE.unique_alleles.tsv
│   ├── binary_presence_absence.fasta.gz
│   ├── binary_presence_absence.nwk
│   ├── cluster_alleles.tab
│   ├── co-ords
│   │   └── <SAMPLE_NAME>.co-ords.tab
│   ├── core_alignment.fasta.gz
│   ├── core_alignment.gff
│   ├── feature_sequences
│   │   └── <GENE_FAMILY>.{aa|nucleotide}.fasta
│   ├── gene_presence_absence.csv
│   ├── genome2loci.tab
│   ├── genome_list.txt
│   ├── loci_list.tab
│   ├── loci_paralog_categories.tab
│   ├── modified_gffs
│   │   └── <SAMPLE_NAME>.gff
│   ├── pan_sequences.fasta.gz
│   ├── pangenome.connected_blocks.tsv
│   ├── pangenome.edges
│   ├── pangenome.gfa
│   ├── pangenome.order.tsv
│   ├── pangenome.reversed.tsv
│   ├── pangenome.syntenic_blocks.tsv
│   ├── pangenome.temp
│   ├── pangenome_alignment.fasta.gz
│   ├── pangenome_alignment.gff
│   ├── pangenome_iterations
│   │   ├── pan_sequences.{50|60|70|80|90|95|98}.reclustered.reinflated
│   │   ├── pan_sequences.blast.output
│   │   ├── pan_sequences.cdhit_clusters
│   │   ├── pan_sequences.core_clusters.tab
│   │   ├── pan_sequences.mcl_log.txt
│   │   └── pan_sequences.representative.fasta
│   ├── paralog_clusters.tab
│   ├── representative_sequences.faa
│   └── representative_sequences.ffn
├── roary
│   ├── accessory.header.embl
│   ├── accessory.tab
│   ├── accessory_binary_genes.fa.gz
│   ├── accessory_binary_genes.fa.newick
│   ├── accessory_graph.dot
│   ├── blast_identity_frequency.Rtab
│   ├── clustered_proteins
│   ├── core_accessory.header.embl
│   ├── core_accessory.tab
│   ├── core_accessory_graph.dot
│   ├── core_alignment_header.embl
│   ├── gene_presence_absence.Rtab
│   ├── gene_presence_absence.csv
│   ├── number_of_conserved_genes.Rtab
│   ├── number_of_genes_in_pan_genome.Rtab
│   ├── number_of_new_genes.Rtab
│   ├── number_of_unique_genes.Rtab
│   ├── pan_genome_reference.fa.gz
│   └── summary_statistics.txt
├── core-genome.aln.gz
├── core-genome.distance.tsv
├── core-genome.iqtree
├── core-genome.masked.aln.gz
├── software_versions.yml
└── software_versions_mqc.yml

Results¶

Top Level¶

Below are results that are in the base directory.

Filename	Description
core-genome.aln.gz	A multiple sequence alignment FASTA of the core genome
core-genome.distance.tsv	Core-genome pairwise SNP distances between samples
core-genome.iqtree	Full result of the IQ-TREE core genome phylogeny
core-genome.masked.aln.gz	A core-genome alignment with recombination masked
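
The distance matrix can be scanned for closely related samples with standard tools. The sketch below assumes the layout snp-dists normally writes (a header row of sample names, then one row per sample) and uses a made-up three-sample matrix, `example.distance.tsv`, in place of `core-genome.distance.tsv`. The threshold of 20 SNPs is an arbitrary illustration, not a recommendation.

```shell
# Made-up 3-sample matrix standing in for core-genome.distance.tsv.
printf 'snp-dists\tsampleA\tsampleB\tsampleC\n' >  example.distance.tsv
printf 'sampleA\t0\t12\t340\n'                  >> example.distance.tsv
printf 'sampleB\t12\t0\t352\n'                  >> example.distance.tsv
printf 'sampleC\t340\t352\t0\n'                 >> example.distance.tsv

# Report each unordered pair once (upper triangle) when distance <= threshold.
threshold=20
close_pairs=$(awk -F'\t' -v max="$threshold" '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
  { for (i = NR + 1; i <= NF; i++) if ($i <= max) print $1 "\t" name[i] "\t" $i }
' example.distance.tsv)
echo "$close_pairs"
```

Because the matrix is symmetric, only columns to the right of the diagonal are read, so each pair is listed once.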

ClonalFrameML¶

Below is a description of the ClonalFrameML results. For more details about ClonalFrameML outputs see ClonalFrameML - Outputs.

Filename	Description
core-genome.ML_sequence.fasta	The sequence reconstructed by maximum likelihood for all internal nodes of the phylogeny, as well as for all missing data in the input sequences
core-genome.em.txt	The point estimates for R/theta, nu, delta and the branch lengths
core-genome.emsim.txt	The bootstrapped values for the three parameters R/theta, nu and delta
core-genome.importation_status.txt	The list of reconstructed recombination events
core-genome.labelled_tree.newick	The output tree with all nodes labelled so that they can be referred to in other files
core-genome.position_cross_reference.txt	A vector of comma-separated values indicating for each location in the input sequence file the corresponding position in the sequences in the output ML_sequence.fasta file

IQ-TREE¶

Below is a description of the IQ-TREE results. If ClonalFrameML is executed, a fast tree is first created with the prefix start-tree, and the final tree has the prefix core-genome. For more details about IQ-TREE outputs see IQ-TREE - Outputs.

Filename	Description
core-genome.alninfo	Alignment site statistics
{core-genome,start-tree}.bionj	A neighbor joining tree produced by BIONJ
{core-genome,start-tree}.ckp.gz	IQ-TREE writes a checkpoint file
core-genome.contree	Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
{core-genome,start-tree}.mldist	Contains the likelihood distances
{core-genome,start-tree}.model.gz	Information about all models tested
core-genome.splits.nex	Support values in percentage for all splits (bipartitions), computed as the occurrence frequencies in the bootstrap trees
{core-genome,start-tree}.treefile	Maximum likelihood tree in NEWICK format, can be visualized with treeviewer programs
core-genome.ufboot	Trees created during the bootstrap steps
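
A quick way to confirm that a tree contains the expected number of samples is to count its leaves. The sketch below uses a made-up four-sample Newick string in a hypothetical `example.treefile` rather than a real IQ-TREE output, and relies on the fact that a Newick tree has one more leaf than it has commas, assuming no sample name contains a comma.

```shell
# Made-up four-sample tree standing in for core-genome.treefile.
printf '%s\n' '(sampleA:0.1,sampleB:0.2,(sampleC:0.3,sampleD:0.4)95:0.05);' > example.treefile

# In a Newick string, leaves = commas + 1 (assuming no commas inside labels).
n_leaves=$(( $(tr -cd ',' < example.treefile | wc -c) + 1 ))
echo "example.treefile contains ${n_leaves} samples"
```

Comparing this count with the number of lines in the `--include` file is one way to spot samples that dropped out of the analysis.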

PIRATE¶

Below is a description of the PIRATE results. For more details about PIRATE outputs see PIRATE - Output files.

Available by default

By default PIRATE is used to create the pan-genome. If --use_panaroo or --use_roary is given, PIRATE outputs will not be available; only the Panaroo or Roary outputs will be.

Filename	Description
PIRATE.gene_families.ordered.tsv	Tabular summary of all gene families ordered on syntenic regions in the pangenome graph
PIRATE.gene_families.tsv	Tabular summary of all gene families
PIRATE.genomes_per_allele.tsv	A list of genomes associated with each allele
PIRATE.pangenome_summary.txt	Short summary of the number and frequency of genes in the pangenome
PIRATE.unique_alleles.tsv	Tabular summary of all unique alleles of each gene family
binary_presence_absence.{fasta.gz,nwk}	A tree (.nwk) generated by fasttree from binary gene_family presence-absence data and the fasta file used to create it
cluster_alleles.tab	List of alleles in paralogous clusters
co-ords/${SAMPLE_NAME}.co-ords.tab	Gene feature co-ordinates for each sample
core_alignment.fasta.gz	Gene-by-gene nucleotide alignments of the core genome created using MAFFT
core_alignment.gff	Annotation containing the position of the gene family within the core genome alignment
feature_sequences/${GENE_FAMILY}.{aa|nucleotide}.fasta	Amino acid and nucleotide sequences for each gene family
gene_presence_absence.csv	Lists each gene and which samples it is present in
genome2loci.tab	List of loci for each genome
genome_list.txt	List of genomes in the analysis
loci_list.tab	List of loci and their associated genomes
loci_paralog_categories.tab	Concatenation of classified paralogs
modified_gffs/${SAMPLE_NAME}.gff	GFF3 files which have been standardised for PIRATE
pan_sequences.fasta.gz	All representative sequences in the pangenome
pangenome.connected_blocks.tsv	List of connected blocks in the pangenome graph
pangenome.edges	List of classified edges in the pangenome graph
pangenome.gfa	GFA network file representing all unique connections between gene families
pangenome.order.tsv	Gene families file sorted by order in the pangenome graph
pangenome.reversed.tsv	List of reversed blocks in the pangenome graph
pangenome.syntenic_blocks.tsv	List of syntenic blocks in the pangenome graph
pangenome_alignment.fasta.gz	Gene-by-gene nucleotide alignments of the full pangenome created using MAFFT
pangenome_alignment.gff	Annotation containing the position of the gene family within the pangenome alignment
pangenome_iterations/pan_sequences.{50|60|70|80|90|95|98}.reclustered.reinflated	List of clusters for each reinflation threshold
pangenome_iterations/pan_sequences.blast.output	BLAST output of sequences against representatives and self hits.
pangenome_iterations/pan_sequences.cdhit_clusters	A list of CDHIT representative clusters
pangenome_iterations/pan_sequences.core_clusters.tab	A list of core clusters.
pangenome_iterations/pan_sequences.mcl_log.txt	A log file from `mcxdeblast` and `mcl`
pangenome_iterations/pan_sequences.representative.fasta	FASTA file with sequences for each representative cluster
paralog_clusters.tab	List of paralogous clusters
representative_sequences.faa	Representative protein sequences for each gene family
representative_sequences.ffn	Representative gene sequences for each gene family
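
Since PIRATE is the default and the other two tools are each enabled by a single flag, the choice can be sketched as a small wrapper. The `tool` and `tool_flag` variable names are hypothetical, and the `bactopia` command is printed rather than executed.

```shell
# Pick the pan-genome tool: pirate (default), panaroo, or roary.
tool=panaroo

case "$tool" in
  pirate)  tool_flag="" ;;               # default, no extra flag needed
  panaroo) tool_flag="--use_panaroo" ;;
  roary)   tool_flag="--use_roary" ;;
  *) echo "unknown tool: $tool" >&2; tool_flag="" ;;
esac

# Printed rather than executed; drop the leading "echo" to launch the run.
# $tool_flag is left unquoted on purpose so an empty value adds no argument.
echo bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results \
  $tool_flag
```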

Panaroo¶

Below is a description of the Panaroo results. For more details about Panaroo outputs see Panaroo Documentation.

Only available when --use_panaroo is given

By default PIRATE is used to create the pan-genome, unless --use_panaroo is given.

Filename	Description
aligned_gene_sequences	A directory of per-gene alignments
combined_DNA_CDS.fasta.gz	Nucleotide sequences of all annotated genes
combined_protein_CDS.fasta.gz	Protein sequences of all annotated proteins
combined_protein_cdhit_out.txt	Log output from CD-HIT
combined_protein_cdhit_out.txt.clstr	Cluster information from CD-HIT
core_alignment_header.embl	The core/pan-genome alignment in EMBL format
core_gene_alignment.aln.gz	The core/pan-genome alignment in FASTA format
final_graph.gml	The final pan-genome graph generated by Panaroo
gene_data.csv	CSV linking each gene sequence and annotation to the internal representations
gene_presence_absence.Rtab	A binary tab-separated version of `gene_presence_absence.csv`
gene_presence_absence.csv	Lists each gene and which samples it is present in
gene_presence_absence_roary.csv	Lists each gene and which samples it is present in, in the same format as Roary
pan_genome_reference.fa.gz	FASTA file which contains a single representative nucleotide sequence from each of the clusters in the pan genome (core and accessory)
pre_filt_graph.gml	An intermediate pan-genome graph generated by Panaroo
struct_presence_absence.Rtab	A file listing the presence and absence of different genomic rearrangement events
summary_statistics.txt	Number of genes in the core and accessory

Roary¶

Below is a description of the Roary results. For more details about Roary outputs see Roary Documentation.

Only available when --use_roary is given

By default PIRATE is used to create the pan-genome, unless --use_roary is given.

Filename	Description
accessory.header.embl	Tab/EMBL formatted file of accessory genes
accessory.tab	Tab/EMBL formatted file of accessory genes
accessory_binary_genes.fa	A FASTA file with binary presence and absence of accessory genes
accessory_binary_genes.fa.newick	A tree created using the binary presence and absence of accessory genes
accessory_graph.dot	A graph in DOT format of how genes are linked together at the contig level in the accessory genome
blast_identity_frequency.Rtab	BLAST results for the percentage identity graph
clustered_proteins	Groups file where each line lists the sequences in a cluster
core_accessory.header.embl	Tab/EMBL formatted file of core genes
core_accessory.tab	Tab/EMBL formatted file of core genes
core_accessory_graph.dot	A graph in DOT format of how genes are linked together at the contig level in the pan genome
core_alignment_header.embl	Tab/EMBL formatted file of core genome alignment
gene_presence_absence.csv	Lists each gene and which samples it is present in
gene_presence_absence.Rtab	Tab delimited binary matrix with the presence and absence of each gene in each sample
number_of_conserved_genes.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_genes_in_pan_genome.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_new_genes.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
number_of_unique_genes.Rtab	Graphs on how the pan genome varies as genomes are added (in random orders)
pan_genome_reference.fa.gz	FASTA file which contains a single representative nucleotide sequence from each of the clusters in the pan genome (core and accessory)
summary_statistics.txt	Number of genes in the core and accessory

Audit Trail¶

Below are files that can assist you in understanding which parameters and program versions were used.

Logs¶

Each process that is executed will have a logs folder containing helpful files for you to review if the need ever arises.

Filename	Description
nf-<PROCESS_NAME>.begin	An empty file used to designate the process started
nf-<PROCESS_NAME>.err	Contains STDERR outputs from the process
nf-<PROCESS_NAME>.log	Contains both STDERR and STDOUT outputs from the process
nf-<PROCESS_NAME>.out	Contains STDOUT outputs from the process
nf-<PROCESS_NAME>.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
nf-<PROCESS_NAME>.sh	The script executed by bash for the process
nf-<PROCESS_NAME>.trace	The Nextflow Trace report for the process
versions.yml	A YAML formatted file with program versions
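
When reviewing a run, it is often enough to look only at the STDERR files that actually contain output. The sketch below builds a mock log layout under a hypothetical `demo` directory, mirroring the `logs/<PROCESS_NAME>/` structure described above, and lists the non-empty `.err` files. On real results the same `find` call would be pointed at `pangenome/logs`.

```shell
# Mock log layout (hypothetical "demo" directory) mirroring pangenome/logs/<PROCESS_NAME>/.
mkdir -p demo/logs/pirate demo/logs/snpdists
: > demo/logs/pirate/nf-pirate.err                        # empty STDERR: nothing to review
echo "example warning" > demo/logs/snpdists/nf-snpdists.err

# List only the STDERR files that contain output (-size +0 skips empty files).
noisy_logs=$(find demo/logs -name 'nf-*.err' -size +0)
echo "$noisy_logs"
```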

Nextflow Reports¶

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename	Description
pangenome-dag.dot	The Nextflow DAG visualisation
pangenome-report.html	The Nextflow Execution Report
pangenome-timeline.html	The Nextflow Timeline Report
pangenome-trace.txt	The Nextflow Trace report

Program Versions¶

At the end of each run, all of the versions.yml files are merged into the files below.

Filename	Description
software_versions.yml	A complete list of programs and versions used by each process
software_versions_mqc.yml	A complete list of programs and versions formatted for MultiQC

Parameters¶

Required Parameters¶

Define where the pipeline should find input data and save output data.

Parameter	Description	Default
`--bactopia`	The path to bactopia results to use as inputs

Filtering Parameters¶

Use these parameters to specify which samples to include or exclude.

Parameter	Description	Default
`--include`	A text file containing sample names (one per line) to include in the analysis
`--exclude`	A text file containing sample names (one per line) to exclude from the analysis

ClonalFrameML Parameters¶

Parameter	Description	Default
`--emsim`	Number of simulations to estimate uncertainty in the EM results	100
`--clonal_opts`	Extra ClonalFrameML options in quotes
`--skip_recombination`	Skip ClonalFrameML execution in subworkflows	False

IQ-TREE Parameters¶

Parameter	Description	Default
`--iqtree_model`	Substitution model name	HKY
`--bb`	Ultrafast bootstrap replicates	1000
`--alrt`	SH-like approximate likelihood ratio test replicates	1000
`--asr`	Ancestral state reconstruction by empirical Bayes	False
`--iqtree_opts`	Extra IQ-TREE options in quotes.
`--skip_phylogeny`	Skip IQ-TREE execution in subworkflows	False

NCBI Genome Download Parameters¶

Parameter	Description	Default
`--species`	Name of the species to download assemblies
`--accession`	An NCBI Assembly accession to be downloaded
`--accessions`	A file of NCBI Assembly accessions (one per line) to be downloaded
`--format`	Comma separated list of formats to download	fasta
`--section`	NCBI section to download	refseq
`--assembly_level`	Comma separated list of assembly levels to download	complete
`--kingdom`	Comma separated list of taxonomic groups to download	bacteria
`--limit`	Limit the number of assemblies to download
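
The `--accessions` file is a plain list with one NCBI Assembly accession per line. As a minimal sketch, the file below uses placeholder accessions (they are not real assemblies) and checks that every line follows the usual `GCA_`/`GCF_` pattern of nine digits and a version number before the run is launched. The `bactopia` command is printed rather than executed.

```shell
# Placeholder accessions for illustration only (not real assemblies); one per line.
printf '%s\n' GCF_000000001.1 GCF_000000002.1 > accessions.txt

# Count lines that do not look like an Assembly accession
# (GCA_ or GCF_, nine digits, a dot, and a version number).
n_bad=$(grep -Evc '^GC[AF]_[0-9]{9}\.[0-9]+$' accessions.txt || true)
echo "${n_bad} malformed accession(s)"

# Printed rather than executed; drop the leading "echo" to launch the run.
echo bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results \
  --accessions accessions.txt
```

The `|| true` keeps the check from aborting a strict (`set -e`) script, because `grep -c` exits non-zero when it counts no lines.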

PIRATE Parameters¶

Parameter	Description	Default
`--steps`	Percent identity thresholds to use for pangenome construction	50,60,70,80,90,95,98
`--features`	Comma-delimited features to use for pangenome construction	CDS
`--para_off`	Switch off paralog identification	False
`--z`	Retain all PIRATE intermediate files	False
`--pan_opt`	Additional arguments to pass to pangenome construction.

Prokka Parameters¶

Parameter	Description	Default
`--proteins`	FASTA file of trusted proteins to first annotate from
`--prodigal_tf`	Training file to use for Prodigal
`--compliant`	Force GenBank/ENA/DDBJ compliance	False
`--centre`	Sequencing centre ID	Bactopia
`--prokka_coverage`	Minimum coverage on query protein	80
`--prokka_evalue`	Similarity e-value cut-off	1e-09
`--prokka_opts`	Extra Prokka options in quotes.

Panaroo Parameters¶

Parameter	Description	Default
`--use_panaroo`	Use Panaroo instead of PIRATE in the 'pangenome' subworkflow	False
`--panaroo_mode`	The stringency mode at which to run panaroo	strict
`--panaroo_alignment`	Output alignments of core genes or all genes	core
`--panaroo_aligner`	Aligner to use for core/pan genome alignment	mafft
`--panaroo_core_threshold`	Core-genome sample threshold	0.95
`--panaroo_threshold`	Sequence identity threshold	0.98
`--panaroo_family_threshold`	Protein family sequence identity threshold	0.7
`--len_dif_percent`	Length difference cutoff	0.98
`--merge_paralogs`	Do not split paralogs	False
`--panaroo_opts`	Additional options to pass to panaroo

Roary Parameters¶

Parameter	Description	Default
`--use_prank`	Use PRANK instead of MAFFT for core gene alignment	False
`--use_roary`	Use Roary instead of PIRATE in the 'pangenome' subworkflow	False
`--i`	Minimum percentage identity for blastp	95
`--cd`	Percentage of isolates a gene must be in to be core	99
`--g`	Maximum number of clusters	50000
`--s`	Do not split paralogs	False
`--ap`	Allow paralogs in core alignment	False
`--iv`	MCL inflation value	1.5

Scoary Parameters¶

Parameter	Description	Default
`--traits`	Input trait table (CSV) to test for associations
`--p_value_cutoff`	For statistical tests, genes with higher p-values will not be reported	0.05
`--correction`	Apply the indicated filtration measure.	I
`--permute`	Perform N number of permutations of the significant results post-analysis	0
`--start_col`	Column in the gene presence/absence file where individual strain information starts	15
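
The `--traits` table is a CSV. The sketch below assumes the layout Scoary generally expects: sample names in the first column and one column of 0/1 values per trait. The sample names and the `resistant` trait are hypothetical. The check is deliberately strict, so relax it if your Scoary version accepts missing-value markers. The `bactopia` command is printed rather than executed.

```shell
# Hypothetical trait table: sample names in the first column, one 0/1 column per trait.
cat > traits.csv <<'EOF'
Name,resistant
sample01,1
sample02,0
sample03,1
EOF

# Count trait cells that are neither 0 nor 1 (header row skipped).
n_invalid=$(awk -F',' '
  NR > 1 { for (i = 2; i <= NF; i++) if ($i != "0" && $i != "1") bad++ }
  END { print bad + 0 }
' traits.csv)
echo "${n_invalid} non-binary value(s)"

# Printed rather than executed; drop the leading "echo" to launch the run.
echo bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results \
  --traits traits.csv
```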

SNP-Dists Parameters¶

Parameter	Description	Default
`--a`	Count all differences not just [AGTC]	False
`--b`	Keep top left corner cell	False
`--csv`	Output CSV instead of TSV	False
`--k`	Keep case, don't uppercase all letters	False

Optional Parameters¶

These optional parameters can be useful in certain settings.

Parameter	Description	Default
`--outdir`	Base directory to write results to	./
`--run_name`	Name of the directory to hold results	bactopia
`--skip_compression`	Output files will not be compressed	False
`--keep_all_files`	Keeps all analysis files created	False

Max Job Request Parameters¶

Set the top limit for requested resources for any single job.

Parameter	Description	Default
`--max_retry`	Maximum times to retry a process before allowing it to fail.	3
`--max_cpus`	Maximum number of CPUs that can be requested for any single job.	4
`--max_memory`	Maximum amount of memory (in GB) that can be requested for any single job.	32
`--max_time`	Maximum amount of time (in minutes) that can be requested for any single job.	120
`--max_downloads`	Maximum number of samples to download at a time	3

Nextflow Configuration Parameters¶

Parameters to fine-tune your Nextflow setup.

Parameter	Description	Default
`--nfconfig`	A Nextflow compatible config file for custom profiles. It is loaded last and will overwrite existing variables if set.
`--publish_dir_mode`	Method used to save pipeline results to output directory.	copy
`--infodir`	Directory to keep pipeline Nextflow logs and reports.	${params.outdir}/pipeline_info
`--force`	Nextflow will overwrite existing output files.	False
`--cleanup_workdir`	After Bactopia is successfully executed, the `work` directory will be deleted.	False

Nextflow Profile Parameters¶