Bactopia Tool - `snippy`¶

The snippy subworkflow allows you to call SNPs and InDels against a reference with Snippy. From the called SNPs/InDels, a core-SNP alignment is created with snippy-core.

A phylogeny, based on the core-SNP alignment, is then created by IQ-TREE. Optionally, a recombination-masked core-SNP alignment can be created with Gubbins.

Finally, the pair-wise SNP distance between each pair of samples is calculated with snp-dists.

Example Usage¶

bactopia --wf snippy \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt

Output Overview¶

Below is the default output structure for the snippy tool. Where possible, the file descriptions below were modified from each tool's own description.

snippy/
├── gubbins
│   ├── core-snp.branch_base_reconstruction.embl.gz
│   ├── core-snp.filtered_polymorphic_sites.fasta.gz
│   ├── core-snp.filtered_polymorphic_sites.phylip
│   ├── core-snp.final_tree.tre
│   ├── core-snp.node_labelled.final_tree.tre
│   ├── core-snp.per_branch_statistics.csv
│   ├── core-snp.recombination_predictions.embl.gz
│   ├── core-snp.recombination_predictions.gff.gz
│   └── core-snp.summary_of_snp_distribution.vcf.gz
├── iqtree
│   ├── core-snp.alninfo
│   ├── core-snp.bionj
│   ├── core-snp.ckp.gz
│   ├── core-snp.contree
│   ├── core-snp.mldist
│   ├── core-snp.splits.nex
│   ├── core-snp.treefile
│   └── core-snp.ufboot
├── logs
│   ├── gubbins
│   │   ├── core-snp.log
│   │   ├── nf-gubbins.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   ├── iqtree
│   │   ├── core-snp.log
│   │   ├── nf-iqtree.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   ├── snippy-core
│   │   ├── nf-snippy-core.{begin,err,log,out,run,sh,trace}
│   │   └── versions.yml
│   └── snpdists
│       ├── nf-snpdists.{begin,err,log,out,run,sh,trace}
│       └── versions.yml
├── nf-reports
│   ├── snippy-dag.dot
│   ├── snippy-report.html
│   ├── snippy-timeline.html
│   └── snippy-trace.txt
├── snippy
│   └── <SAMPLE_NAME>
│       ├── logs
│       │   └── snippy
│       │       ├── nf-snippy.{begin,err,log,out,run,sh,trace}
│       │       ├── <SAMPLE_NAME>.log
│       │       └── versions.yml
│       ├── <SAMPLE_NAME>.aligned.fa.gz
│       ├── <SAMPLE_NAME>.annotated.vcf.gz
│       ├── <SAMPLE_NAME>.bam
│       ├── <SAMPLE_NAME>.bam.bai
│       ├── <SAMPLE_NAME>.bed.gz
│       ├── <SAMPLE_NAME>.consensus.fa.gz
│       ├── <SAMPLE_NAME>.consensus.subs.fa.gz
│       ├── <SAMPLE_NAME>.consensus.subs.masked.fa.gz
│       ├── <SAMPLE_NAME>.coverage.txt.gz
│       ├── <SAMPLE_NAME>.csv.gz
│       ├── <SAMPLE_NAME>.filt.vcf.gz
│       ├── <SAMPLE_NAME>.gff.gz
│       ├── <SAMPLE_NAME>.html
│       ├── <SAMPLE_NAME>.raw.vcf.gz
│       ├── <SAMPLE_NAME>.subs.vcf.gz
│       ├── <SAMPLE_NAME>.tab
│       ├── <SAMPLE_NAME>.txt
│       └── <SAMPLE_NAME>.vcf.gz
├── snippy-core
│   ├── core-snp.aln.gz
│   ├── core-snp.tab.gz
│   ├── core-snp.txt
│   └── core-snp.vcf.gz
├── core-snp-clean.full.aln.gz
├── core-snp.distance.tsv
├── core-snp.full.aln.gz
├── core-snp.iqtree
└── core-snp.masked.aln.gz
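
As a quick sanity check after a run, the top-level files in the tree above can be tested for existence. The following is a minimal sketch, not part of Bactopia: `missing_outputs` is a hypothetical helper, and the file list assumes default settings (for example, names would differ if `--skip_compression` were used).

```shell
# Hypothetical helper (not part of Bactopia): print the name of every
# expected top-level output that is missing from the snippy/ directory.
# $1 = path to the snippy/ output directory.
# The file names are taken from the output tree above (default settings).
missing_outputs() {
  for f in core-snp.full.aln.gz core-snp-clean.full.aln.gz core-snp.distance.tsv; do
    [ -e "$1/$f" ] || printf '%s\n' "$f"
  done
}

# Usage (path is a placeholder):
#   missing_outputs /path/to/snippy
```

An empty result means all three files were found.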

Results¶

Top Level¶

Below are results that are in the base directory.

Filename	Description
core-snp-clean.full.aln.gz	Same as `core-snp.full.aln.gz` with unusual characters replaced with `N`
core-snp.distance.tsv	Core genome pair-wise SNP distance for each sample
core-snp.full.aln.gz	A whole genome SNP alignment (includes invariant sites)
core-snp.iqtree	Full result of the IQ-TREE core genome phylogeny
core-snp.masked.aln.gz	A core-SNP alignment with the recombination masked
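
The distance matrix in `core-snp.distance.tsv` can be scanned for closely related samples. The sketch below assumes the usual snp-dists layout: a square, tab-separated matrix whose first row and first column hold the sample names (the top-left cell is ignored). `close_pairs` and the threshold are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: list sample pairs whose SNP distance is at or
# below a threshold. Assumes a square tab-separated matrix with sample
# names in the first row and first column.
# $1 = distance matrix, $2 = maximum distance to report.
close_pairs() {
  awk -F'\t' -v max="$2" '
    # Header row: remember the sample name for each column.
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    # Data rows: only walk the upper triangle so each pair prints once.
    {
      for (i = NR + 1; i <= NF; i++)
        if (($i + 0) <= (max + 0)) print $1 "\t" name[i] "\t" $i
    }
  ' "$1"
}

# Usage (threshold is a placeholder):
#   close_pairs core-snp.distance.tsv 10
```

Each output line holds the two sample names and their distance, separated by tabs.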

Gubbins¶

Below is a description of the Gubbins results. For more details about Gubbins outputs see Gubbins - Outputs.

Filename	Description
core-snp.branch_base_reconstruction.embl.gz	Base substitution reconstruction in EMBL format
core-snp.filtered_polymorphic_sites.fasta.gz	FASTA format alignment of filtered polymorphic sites
core-snp.filtered_polymorphic_sites.phylip	Phylip format alignment of filtered polymorphic sites
core-snp.final_tree.tre	Final phylogeny in Newick format (branch lengths are in point mutations)
core-snp.node_labelled.final_tree.tre	Final phylogeny in Newick format but with internal node labels
core-snp.per_branch_statistics.csv	Per-branch reporting of the base substitutions inside and outside recombination events
core-snp.recombination_predictions.embl.gz	Recombination predictions in EMBL file format
core-snp.recombination_predictions.gff.gz	Recombination predictions in GFF file format
core-snp.summary_of_snp_distribution.vcf.gz	VCF file summarising the distribution of point mutations

IQ-TREE¶

Below is a description of the IQ-TREE results. If ClonalFrameML is executed, a fast tree is created and given the prefix start-tree; the final tree has the prefix core-genome. For more details about IQ-TREE outputs see IQ-TREE - Outputs.

Filename	Description
core-snp.alninfo	Alignment site statistics
core-snp.bionj	A neighbor joining tree produced by BIONJ
core-snp.ckp.gz	IQ-TREE writes a checkpoint file
core-snp.contree	Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
core-snp.mldist	Contains the likelihood distances
core-snp.splits.nex	Support values in percentage for all splits (bipartitions), computed as the occurrence frequencies in the bootstrap trees
core-snp.treefile	Maximum likelihood tree in NEWICK format, can be visualized with treeviewer programs
core-snp.ufboot	Trees created during the bootstrap steps

Snippy¶

Below is a description of the per-sample Snippy results. For more details about Snippy outputs see Snippy - Outputs.

Filename	Description
<SAMPLE_NAME>.aligned.fa.gz	A version of the reference but with `-` at positions with `depth=0` and `N` for `0 < depth < --mincov` (does not have variants)
<SAMPLE_NAME>.annotated.vcf.gz	The final variant calls with additional annotations from the reference genome's GenBank file
<SAMPLE_NAME>.bam	The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates
<SAMPLE_NAME>.bam.bai	Index for the .bam file
<SAMPLE_NAME>.bed.gz	The variants in BED format
<SAMPLE_NAME>.consensus.fa.gz	A version of the reference genome with all variants instantiated
<SAMPLE_NAME>.consensus.subs.fa.gz	A version of the reference genome with only substitution variants instantiated
<SAMPLE_NAME>.consensus.subs.masked.fa.gz	A version of the reference genome with only substitution variants instantiated and low-coverage regions masked
<SAMPLE_NAME>.coverage.txt.gz	The per-base coverage of each position in the reference genome
<SAMPLE_NAME>.csv.gz	A comma-separated version of the .tab file
<SAMPLE_NAME>.filt.vcf.gz	The filtered variant calls from Freebayes
<SAMPLE_NAME>.gff.gz	The variants in GFF3 format
<SAMPLE_NAME>.html	An HTML version of the .tab file
<SAMPLE_NAME>.raw.vcf.gz	The unfiltered variant calls from Freebayes
<SAMPLE_NAME>.subs.vcf.gz	Only substitution variants from the final annotated variants
<SAMPLE_NAME>.tab	A simple tab-separated summary of all the variants
<SAMPLE_NAME>.txt	A summary of the Snippy run
<SAMPLE_NAME>.vcf.gz	The final annotated variants in VCF format
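
For a quick per-sample overview, the variants in `<SAMPLE_NAME>.tab` can be tallied by type. This is a minimal sketch under two assumptions that are not stated in the table above: the file starts with a header row, and the third tab-separated column holds the variant type (for example `snp` or `del`). `variant_types` is a hypothetical helper name.

```shell
# Hypothetical helper: count variants per type in a Snippy .tab file.
# Assumes a header row and that column 3 is the variant type.
# $1 = path to <SAMPLE_NAME>.tab
variant_types() {
  awk -F'\t' '
    NR > 1 && NF >= 3 { n[$3]++ }               # skip header, tally column 3
    END { for (t in n) print t "\t" n[t] }      # one "type<TAB>count" per line
  ' "$1" | sort
}

# Usage (file name is a placeholder):
#   variant_types sample01.tab
```

If the column order in your Snippy version differs, adjust the column index accordingly.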

Snippy-Core¶

Below is a description of the Snippy-Core results. For more details about Snippy-Core outputs see Snippy-Core - Outputs.

Filename	Description
core-snp.aln.gz	A core SNP alignment in FASTA format
core-snp.tab.gz	Tab-separated columnar list of core SNP sites with alleles but NO annotations
core-snp.txt	Tab-separated columnar list of alignment/core-size statistics
core-snp.vcf.gz	Multi-sample VCF file with genotype GT tags for all discovered alleles

Audit Trail¶

Below are files that can assist you in understanding which parameters and program versions were used.

Logs¶

Each process that is executed will have a logs folder containing helpful files for you to review if the need ever arises.

Filename	Description
nf-<PROCESS_NAME>.begin	An empty file used to designate the process started
nf-<PROCESS_NAME>.err	Contains STDERR outputs from the process
nf-<PROCESS_NAME>.log	Contains both STDERR and STDOUT outputs from the process
nf-<PROCESS_NAME>.out	Contains STDOUT outputs from the process
nf-<PROCESS_NAME>.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
nf-<PROCESS_NAME>.sh	The script executed by bash for the process
nf-<PROCESS_NAME>.trace	The Nextflow Trace report for the process
versions.yml	A YAML formatted file with program versions
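
When reviewing a run, a reasonable starting point is the set of `nf-<PROCESS_NAME>.err` files that actually contain output. The sketch below uses only standard `find` options; `nonempty_err_logs` is a hypothetical helper name. Note that a non-empty STDERR file does not by itself mean the process failed, since many tools write progress messages there.

```shell
# Hypothetical helper: list every nf-*.err file under a directory that
# is not empty, sorted by path.
# $1 = directory to search (for example the snippy/ output directory).
nonempty_err_logs() {
  find "$1" -type f -name 'nf-*.err' -size +0 | sort
}

# Usage (path is a placeholder):
#   nonempty_err_logs /path/to/snippy
```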

Nextflow Reports¶

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename	Description
snippy-dag.dot	The Nextflow DAG visualisation
snippy-report.html	The Nextflow Execution Report
snippy-timeline.html	The Nextflow Timeline Report
snippy-trace.txt	The Nextflow Trace report

Program Versions¶

At the end of each run, the versions.yml files are merged into the files below.

Filename	Description
software_versions.yml	A complete list of programs and versions used by each process
software_versions_mqc.yml	A complete list of programs and versions formatted for MultiQC

Parameters¶

Required Parameters¶

Define where the pipeline should find input data and save output data.

Parameter	Description	Default
`--bactopia`	The path to bactopia results to use as inputs

Filtering Parameters¶

Use these parameters to specify which samples to include or exclude.

Parameter	Description	Default
`--include`	A text file containing sample names (one per line) to include in the analysis
`--exclude`	A text file containing sample names (one per line) to exclude from the analysis
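
Both options expect a plain text file with one sample name per line. A minimal sketch for producing such a file is shown below; `write_sample_list` and the sample names are hypothetical, and the helper expects at least one name.

```shell
# Hypothetical helper: write sample names to a file, one per line,
# sorted and with duplicates removed.
# $1 = output file, remaining arguments = sample names (at least one).
write_sample_list() {
  out="$1"
  shift
  printf '%s\n' "$@" | sort -u > "$out"
}

# Usage (sample names are placeholders):
#   write_sample_list includes.txt sample01 sample02 sample03
```

The resulting file can then be passed to `--include` or `--exclude`.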

Snippy Parameters¶

Parameter	Description	Default
`--reference`	Reference genome in GenBank format
`--mapqual`	Minimum read mapping quality to consider	60
`--basequal`	Minimum base quality to consider	13
`--mincov`	Minimum site depth for calling alleles	10
`--minfrac`	Minimum proportion for variant evidence (0=AUTO)	0
`--minqual`	Minimum QUALITY in VCF column 6	100
`--maxsoft`	Maximum soft clipping to allow	10
`--bwaopt`	Extra BWA MEM options, eg. -x pacbio
`--fbopt`	Extra Freebayes options, eg. --theta 1E-6 --read-snp-limit 2
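
To review a command line before launching it, the parameters above can be assembled and printed first. The sketch below only prints the command and does not run Bactopia; `snippy_cmd` is a hypothetical helper, the paths are placeholders, and it uses only options listed on this page. Paths containing spaces are not handled.

```shell
# Hypothetical helper: print (do not run) a bactopia command line for
# the snippy workflow.
# $1 = Bactopia results directory, $2 = GenBank reference,
# $3 = minimum site depth (optional, defaults to 10 as in the table above).
snippy_cmd() {
  results_dir="$1"
  reference="$2"
  mincov="${3:-10}"
  printf 'bactopia --wf snippy --bactopia %s --reference %s --mincov %s\n' \
    "$results_dir" "$reference" "$mincov"
}

# Usage (paths are placeholders):
#   snippy_cmd /path/to/your/bactopia/results reference.gbk 20
```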

Snippy-Core Parameters¶

Parameter	Description	Default
`--maxhap`	Largest haplotype to decompose	100

Gubbins Parameters¶

Parameter	Description	Default
`--iterations`	Maximum number of iterations	5
`--min_snps`	Min SNPs to identify a recombination block	3
`--min_window_size`	Minimum window size	100
`--max_window_size`	Maximum window size	10000
`--filter_percentage`	Filter out taxa with more than this percentage of gaps	25.0
`--remove_identical_sequences`	Remove identical sequences	False
`--gubbin_opts`	Extra Gubbins options in quotes
`--skip_recombination`	Skip Gubbins execution in subworkflows	False

IQ-TREE Parameters¶

Parameter	Description	Default
`--iqtree_model`	Substitution model name	HKY
`--bb`	Ultrafast bootstrap replicates	1000
`--alrt`	SH-like approximate likelihood ratio test replicates	1000
`--asr`	Ancestral state reconstruction by empirical Bayes	False
`--iqtree_opts`	Extra IQ-TREE options in quotes.
`--skip_phylogeny`	Skip IQ-TREE execution in subworkflows	False

SNP-Dists Parameters¶

Parameter	Description	Default
`--a`	Count all differences not just [AGTC]	False
`--b`	Keep top left corner cell	False
`--csv`	Output CSV instead of TSV	False
`--k`	Keep case, don't uppercase all letters	False

Optional Parameters¶

These optional parameters can be useful in certain settings.

Parameter	Description	Default
`--outdir`	Base directory to write results to	./
`--run_name`	Name of the directory to hold results	bactopia
`--skip_compression`	Output files will not be compressed	False
`--keep_all_files`	Keeps all analysis files created	False

Max Job Request Parameters¶

Set the top limit for requested resources for any single job.

Parameter	Description	Default
`--max_retry`	Maximum times to retry a process before allowing it to fail.	3
`--max_cpus`	Maximum number of CPUs that can be requested for any single job.	4
`--max_memory`	Maximum amount of memory (in GB) that can be requested for any single job.	32
`--max_time`	Maximum amount of time (in minutes) that can be requested for any single job.	120
`--max_downloads`	Maximum number of samples to download at a time	3

Nextflow Configuration Parameters¶

Parameters to fine-tune your Nextflow setup.

Parameter	Description	Default
`--nfconfig`	A Nextflow-compatible config file for custom profiles. It is loaded last and will overwrite existing variables if set.
`--publish_dir_mode`	Method used to save pipeline results to output directory.	copy
`--infodir`	Directory to keep pipeline Nextflow logs and reports.	${params.outdir}/pipeline_info
`--force`	Nextflow will overwrite existing output files.	False
`--cleanup_workdir`	After Bactopia is successfully executed, the `work` directory will be deleted.	False

Nextflow Profile Parameters¶