Bactopia Tool - `snippy`

The snippy subworkflow allows you to call SNPs and InDels against a reference with Snippy. With the called SNPs/InDels, a core-SNP alignment is then created with snippy-core.

A phylogeny, based on the core-SNP alignment, will be created by IQ-Tree. Optionally a recombination-masked core-SNP alignment can be created with Gubbins.

Finally, the pairwise SNP distances between samples are calculated with snp-dists.

Example Usage

bactopia --wf snippy \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt

Output Overview

Below is the default output structure for the snippy tool. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── tools
│       └── snippy
│           └── <REFERENCE_NAME>
│               ├── logs
│               │   ├── nf-snippy.{begin,err,log,out,run,sh,trace}
│               │   ├── <SAMPLE_NAME>.log
│               │   └── versions.yml
│               ├── <SAMPLE_NAME>.aligned.fa.gz
│               ├── <SAMPLE_NAME>.annotated.vcf.gz
│               ├── <SAMPLE_NAME>.bam
│               ├── <SAMPLE_NAME>.bam.bai
│               ├── <SAMPLE_NAME>.bed.gz
│               ├── <SAMPLE_NAME>.consensus.fa.gz
│               ├── <SAMPLE_NAME>.consensus.subs.fa.gz
│               ├── <SAMPLE_NAME>.consensus.subs.masked.fa.gz
│               ├── <SAMPLE_NAME>.coverage.txt.gz
│               ├── <SAMPLE_NAME>.csv.gz
│               ├── <SAMPLE_NAME>.filt.vcf.gz
│               ├── <SAMPLE_NAME>.gff.gz
│               ├── <SAMPLE_NAME>.html
│               ├── <SAMPLE_NAME>.raw.vcf.gz
│               ├── <SAMPLE_NAME>.subs.vcf.gz
│               ├── <SAMPLE_NAME>.tab
│               ├── <SAMPLE_NAME>.txt
│               └── <SAMPLE_NAME>.vcf.gz
└── bactopia-runs
    └── snippy-<TIMESTAMP>
        ├── core-snp-clean.full.aln.gz
        ├── core-snp.full.aln.gz
        ├── gubbins
        │   ├── core-snp.branch_base_reconstruction.embl.gz
        │   ├── core-snp.filtered_polymorphic_sites.fasta.gz
        │   ├── core-snp.filtered_polymorphic_sites.phylip
        │   ├── core-snp.final_tree.tre
        │   ├── core-snp.node_labelled.final_tree.tre
        │   ├── core-snp.per_branch_statistics.csv
        │   ├── core-snp.recombination_predictions.embl.gz
        │   ├── core-snp.recombination_predictions.gff.gz
        │   ├── core-snp.summary_of_snp_distribution.vcf.gz
        │   └── logs
        │       ├── core-snp.log
        │       ├── nf-gubbins.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── iqtree
        │   ├── core-snp.alninfo
        │   ├── core-snp.bionj
        │   ├── core-snp.ckp.gz
        │   ├── core-snp.contree
        │   ├── core-snp.iqtree
        │   ├── core-snp.mldist
        │   ├── core-snp.splits.nex
        │   ├── core-snp.treefile
        │   ├── core-snp.ufboot
        │   └── logs
        │       ├── core-snp.log
        │       ├── nf-iqtree.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── nf-reports
        │   ├── snippy-dag.dot
        │   ├── snippy-report.html
        │   ├── snippy-timeline.html
        │   └── snippy-trace.txt
        ├── snippy-core
        │   ├── core-snp.aln.gz
        │   ├── core-snp.tab.gz
        │   ├── core-snp.txt
        │   ├── core-snp.vcf.gz
        │   └── logs
        │       ├── nf-snippy-core.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        └── snpdists
            ├── core-snp.distance.tsv
            └── logs
                ├── nf-snpdists.{begin,err,log,out,run,sh,trace}
                └── versions.yml
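
The layout above can be walked programmatically. The following is a minimal sketch, assuming only the directory structure shown in the tree; the function name `find_snippy_outputs` and the demo sample and reference names are hypothetical and not part of Bactopia.

```python
from pathlib import Path
import tempfile


def find_snippy_outputs(bactopia_dir, suffix=".vcf.gz"):
    """Map (sample, reference) to the per-sample file ending in `suffix`."""
    found = {}
    # Layout: <BACTOPIA_DIR>/<SAMPLE_NAME>/tools/snippy/<REFERENCE_NAME>/
    for ref_dir in Path(bactopia_dir).glob("*/tools/snippy/*"):
        sample = ref_dir.parents[2].name
        candidate = ref_dir / f"{sample}{suffix}"
        if candidate.is_file():
            found[(sample, ref_dir.name)] = candidate
    return found


# Demo on a throwaway directory tree (sample and reference names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("sampleA", "sampleB"):
        ref_dir = Path(tmp) / sample / "tools" / "snippy" / "ref1"
        ref_dir.mkdir(parents=True)
        (ref_dir / f"{sample}.vcf.gz").touch()
    (Path(tmp) / "bactopia-runs" / "snippy-demo").mkdir(parents=True)
    demo_keys = sorted(find_snippy_outputs(tmp))

print(demo_keys)
```

Passing a different `suffix`, such as `.tab` or `.txt`, collects the other per-sample files listed in the tree.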

Results

Main Results

Below are the main results from the snippy Bactopia Tool.

Filename	Description
core-snp-clean.full.aln.gz	Same as `core-snp.full.aln.gz` with unusual characters replaced with `N`
core-snp.distance.tsv	Core genome pairwise SNP distances between samples
core-snp.full.aln.gz	A whole genome SNP alignment (includes invariant sites)
core-snp.iqtree	Full result of the IQ-TREE core genome phylogeny
core-genome.masked.aln.gz	A core-SNP alignment with the recombination masked
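
As a rough illustration of how `core-snp.distance.tsv` might be used, the sketch below lists sample pairs within a SNP threshold. It assumes the usual snp-dists layout: a square tab-separated matrix whose header row and first column hold sample names in the same order, with a non-sample corner cell. The function name `close_pairs`, the threshold, and the demo values are hypothetical.

```python
import csv
import io


def close_pairs(tsv_text, max_snps):
    """Return (sample1, sample2, distance) for pairs at or below max_snps."""
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r]
    names = rows[0][1:]  # first header cell is the corner cell, not a sample
    pairs = []
    for i, row in enumerate(rows[1:]):
        # Only read the upper triangle so each pair is reported once.
        for j in range(i + 1, len(names)):
            dist = int(row[j + 1])
            if dist <= max_snps:
                pairs.append((row[0], names[j], dist))
    return pairs


# Made-up matrix for three samples.
demo_matrix = (
    "corner\tsampleA\tsampleB\tsampleC\n"
    "sampleA\t0\t3\t40\n"
    "sampleB\t3\t0\t41\n"
    "sampleC\t40\t41\t0\n"
)
demo_pairs = close_pairs(demo_matrix, max_snps=10)
print(demo_pairs)
```

For a real run, the text would come from reading `core-snp.distance.tsv` in the `snpdists` folder.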

Gubbins

Below is a description of the Gubbins results. For more details about Gubbins outputs see Gubbins - Outputs.

Filename	Description
core-snp.branch_base_reconstruction.embl.gz	Base substitution reconstruction in EMBL format
core-snp.filtered_polymorphic_sites.fasta.gz	FASTA format alignment of filtered polymorphic sites
core-snp.filtered_polymorphic_sites.phylip	Phylip format alignment of filtered polymorphic sites
core-snp.final_tree.tre	Final phylogeny in Newick format (branch lengths are in point mutations)
core-snp.node_labelled.final_tree.tre	Final phylogeny in Newick format but with internal node labels
core-snp.per_branch_statistics.csv	Per-branch reporting of the base substitutions inside and outside recombination events
core-snp.recombination_predictions.embl.gz	Recombination predictions in EMBL file format
core-snp.recombination_predictions.gff.gz	Recombination predictions in GFF file format
core-snp.summary_of_snp_distribution.vcf.gz	VCF file summarising the distribution of point mutations
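
One way to get a feel for how much of the alignment Gubbins flagged is to total the positions covered by the features in `core-snp.recombination_predictions.gff.gz`. The sketch below relies only on the standard GFF column layout (sequence ID in column 1, start and end in columns 4 and 5) and merges overlapping features so that a position predicted on several branches is counted once. The function name `recombinant_bases` and the demo records are hypothetical; real Gubbins records carry additional attributes that are ignored here.

```python
import gzip
import io


def recombinant_bases(handle):
    """Count positions covered by at least one GFF feature."""
    intervals = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        intervals.append((fields[0], int(fields[3]), int(fields[4])))

    total = 0
    current = None  # (seqid, start, end) of the interval being merged
    for seqid, start, end in sorted(intervals):
        if current and seqid == current[0] and start <= current[2] + 1:
            current = (seqid, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1] + 1
            current = (seqid, start, end)
    if current:
        total += current[2] - current[1] + 1
    return total


# Made-up GFF content, compressed in memory to mimic the .gff.gz file.
demo_gff = "\n".join([
    "##gff-version 3",
    "chrom1\tdemo\tregion\t100\t200\t.\t.\t.\t.",
    "chrom1\tdemo\tregion\t150\t300\t.\t.\t.\t.",
    "chrom1\tdemo\tregion\t1000\t1009\t.\t.\t.\t.",
]) + "\n"
with gzip.open(io.BytesIO(gzip.compress(demo_gff.encode())), "rt") as handle:
    demo_total = recombinant_bases(handle)

print(demo_total)  # 201 merged positions + 10 = 211
```

With a real file, `gzip.open(path, "rt")` can be passed the path to the GFF instead of the in-memory buffer.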

IQ-TREE

Below is a description of the IQ-TREE results. If ClonalFrameML is executed, a fast tree is created and given the prefix start-tree, while the final tree has the prefix core-genome. For more details about IQ-TREE outputs see IQ-TREE - Outputs.

Filename	Description
core-snp.alninfo	Alignment site statistics
core-snp.bionj	A neighbor joining tree produced by BIONJ
core-snp.ckp.gz	Checkpoint file written by IQ-TREE
core-snp.contree	Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
core-snp.mldist	Contains the likelihood distances
core-snp.splits.nex	Support values in percentage for all splits (bipartitions), computed as the occurrence frequencies in the bootstrap trees
core-snp.treefile	Maximum likelihood tree in NEWICK format, can be visualized with treeviewer programs
core-snp.ufboot	Trees created during the bootstrap steps
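
A quick sanity check on `core-snp.treefile` is to confirm that every expected sample appears as a tip. The sketch below is a deliberately simple regular-expression approach, assuming unquoted tip labels without parentheses, commas, colons, or semicolons; a dedicated tree library would be more robust. The function names and the demo tree are hypothetical, and the `98.5/100` internal label only illustrates a combined SH-aLRT/ultrafast bootstrap support value.

```python
import re


def newick_tips(newick):
    """Extract tip labels from a Newick string with simple, unquoted names."""
    # A tip label directly follows "(" or ","; labels after ")" are internal.
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def missing_from_tree(expected_samples, newick):
    """Return the expected samples that are not tips of the tree."""
    return sorted(set(expected_samples) - set(newick_tips(newick)))


# Made-up tree for three samples.
demo_tree = "((sampleA:0.1,sampleB:0.2)98.5/100:0.05,sampleC:0.3);"
demo_tips = newick_tips(demo_tree)
demo_missing = missing_from_tree(
    ["sampleA", "sampleB", "sampleC", "sampleD"], demo_tree
)
print(demo_tips, demo_missing)
```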

Snippy

Below is a description of the per-sample Snippy results. For more details about Snippy outputs see Snippy - Outputs.

Filename	Description
<SAMPLE_NAME>.aligned.fa.gz	A version of the reference but with `-` at positions with `depth=0` and `N` for `0 < depth < --mincov` (does not have variants)
<SAMPLE_NAME>.annotated.vcf.gz	The final variant calls with additional annotations from Reference genome's GenBank file
<SAMPLE_NAME>.bam	The alignments in BAM format. Includes unmapped, multimapped reads. Excludes duplicates
<SAMPLE_NAME>.bam.bai	Index for the .bam file
<SAMPLE_NAME>.bed.gz	The variants in BED format
<SAMPLE_NAME>.consensus.fa.gz	A version of the reference genome with all variants instantiated
<SAMPLE_NAME>.consensus.subs.fa.gz	A version of the reference genome with only substitution variants instantiated
<SAMPLE_NAME>.consensus.subs.masked.fa.gz	A version of the reference genome with only substitution variants instantiated and low-coverage regions masked
<SAMPLE_NAME>.coverage.txt.gz	The per-base coverage of each position in the reference genome
<SAMPLE_NAME>.csv.gz	A comma-separated version of the .tab file
<SAMPLE_NAME>.filt.vcf.gz	The filtered variant calls from Freebayes
<SAMPLE_NAME>.gff.gz	The variants in GFF3 format
<SAMPLE_NAME>.html	An HTML version of the .tab file
<SAMPLE_NAME>.raw.vcf.gz	The unfiltered variant calls from Freebayes
<SAMPLE_NAME>.subs.vcf.gz	Only substitution variants from the final annotated variants
<SAMPLE_NAME>.tab	A simple tab-separated summary of all the variants
<SAMPLE_NAME>.txt	A summary of the Snippy run
<SAMPLE_NAME>.vcf.gz	The final annotated variants in VCF format
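
The `<SAMPLE_NAME>.tab` summary lends itself to quick tallies. The sketch below counts variants by type, assuming the file has a header row containing a `TYPE` column as described in the Snippy documentation; check the header of your own files before relying on this. The function name and the demo rows are hypothetical, and only a subset of columns is shown.

```python
import csv
import io
from collections import Counter


def count_variant_types(tab_text):
    """Tally the TYPE column of a Snippy-style tab-separated summary."""
    reader = csv.DictReader(
        io.StringIO(tab_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row["TYPE"] for row in reader)


# Made-up rows with a reduced set of columns.
demo_tab = (
    "CHROM\tPOS\tTYPE\tREF\tALT\n"
    "chrom1\t1045\tsnp\tA\tG\n"
    "chrom1\t2210\tsnp\tC\tT\n"
    "chrom1\t5120\tdel\tGA\tG\n"
)
demo_counts = count_variant_types(demo_tab)
print(dict(demo_counts))
```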

Snippy-Core

Below is a description of the Snippy-Core results. For more details about Snippy-Core outputs see Snippy-Core - Outputs.

Filename	Description
core-snp.aln.gz	A core SNP alignment in FASTA format
core-snp.tab.gz	Tab-separated columnar list of core SNP sites with alleles but NO annotations
core-snp.txt	Tab-separated columnar list of alignment/core-size statistics
core-snp.vcf.gz	Multi-sample VCF file with genotype GT tags for all discovered alleles
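
The statistics in `core-snp.txt` can help spot samples that align poorly to the reference and therefore shrink the core. The sketch below assumes the file has a header row with `ID`, `LENGTH`, and `ALIGNED` columns, as described in the snippy-core documentation; verify the header of your own file. The function name, the 0.9 cutoff, and the demo values are hypothetical.

```python
import csv
import io


def poorly_aligned(core_txt, min_fraction=0.9):
    """Return IDs whose ALIGNED / LENGTH ratio falls below min_fraction."""
    reader = csv.DictReader(io.StringIO(core_txt), delimiter="\t")
    flagged = []
    for row in reader:
        fraction = int(row["ALIGNED"]) / int(row["LENGTH"])
        if fraction < min_fraction:
            flagged.append(row["ID"])
    return flagged


# Made-up statistics with a reduced set of columns.
demo_stats = (
    "ID\tLENGTH\tALIGNED\n"
    "sampleA\t1000\t950\n"
    "sampleB\t1000\t800\n"
)
demo_flagged = poorly_aligned(demo_stats)
print(demo_flagged)
```

Samples flagged this way are candidates for the `--exclude` list on a subsequent run.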

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension	Description
.begin	An empty file used to designate the process started
.err	Contains STDERR outputs from the process
.log	Contains both STDERR and STDOUT outputs from the process
.out	Contains STDOUT outputs from the process
.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh	The script executed by bash for the process
.trace	The Nextflow Trace report for the process
versions.yml	A YAML formatted file with program versions

Nextflow Reports

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename	Description
snippy-dag.dot	The Nextflow DAG visualisation
snippy-report.html	The Nextflow Execution Report
snippy-timeline.html	The Nextflow Timeline Report
snippy-trace.txt	The Nextflow Trace report
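
The trace report is plain tab-separated text, so it can be screened with a few lines of code. The sketch below lists tasks that did not finish cleanly, assuming the trace has a header row with `name` and `status` columns (the Nextflow defaults) and treating `COMPLETED` and `CACHED` as successful. The function name and the demo task names are hypothetical.

```python
import csv
import io


def unfinished_tasks(trace_text, ok_statuses=("COMPLETED", "CACHED")):
    """Return (name, status) for tasks whose status is not in ok_statuses."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["status"])
        for row in reader
        if row["status"] not in ok_statuses
    ]


# Made-up trace rows with a reduced set of columns.
demo_trace = (
    "task_id\tname\tstatus\n"
    "1\tsnippy (sampleA)\tCOMPLETED\n"
    "2\tsnippy (sampleB)\tFAILED\n"
    "3\tsnippy (sampleC)\tCACHED\n"
)
demo_unfinished = unfinished_tasks(demo_trace)
print(demo_unfinished)
```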

Program Versions

At the end of each run, all of the versions.yml files are merged into the files below.

Filename	Description
software_versions.yml	A complete list of programs and versions used by each process
software_versions_mqc.yml	A complete list of programs and versions formatted for MultiQC

Parameters

Required Parameters

Define where the pipeline should find input data and save output data.

Parameter	Description
`--bactopia`	The path to bactopia results to use as inputs Type: `string`

Filtering Parameters

Use these parameters to specify which samples to include or exclude.

Parameter	Description
`--include`	A text file containing sample names (one per line) to include in the analysis Type: `string`
`--exclude`	A text file containing sample names (one per line) to exclude from the analysis Type: `string`
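
Both files are simply one sample name per line. The sketch below writes such a file from the sample folders in a Bactopia results directory. It assumes every subdirectory other than `bactopia-runs` is a sample, which may not hold for every results folder; the function name and the demo sample names are hypothetical.

```python
from pathlib import Path
import tempfile


def write_sample_list(bactopia_dir, out_path, skip=("bactopia-runs",)):
    """Write one sample name per line and return the names written."""
    samples = sorted(
        p.name
        for p in Path(bactopia_dir).iterdir()
        if p.is_dir() and p.name not in skip
    )
    Path(out_path).write_text("".join(f"{name}\n" for name in samples))
    return samples


# Demo on a throwaway results directory (sample names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleB", "sampleA", "bactopia-runs"):
        (Path(tmp) / name).mkdir()
    include_file = Path(tmp) / "includes.txt"
    demo_samples = write_sample_list(tmp, include_file)
    demo_content = include_file.read_text()

print(demo_content, end="")
```

The resulting file can be edited by hand before being passed to `--include` or `--exclude`.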

Snippy Parameters

Parameter	Description
`--reference`	Reference genome in GenBank format Type: `string`
`--mapqual`	Minimum read mapping quality to consider Type: `integer`, Default: `60`
`--basequal`	Minimum base quality to consider Type: `integer`, Default: `13`
`--mincov`	Minimum site depth for calling alleles Type: `integer`, Default: `10`
`--minfrac`	Minimum proportion for variant evidence (0=AUTO) Type: `integer`
`--minqual`	Minimum QUALITY in VCF column 6 Type: `integer`, Default: `100`
`--maxsoft`	Maximum soft clipping to allow Type: `integer`, Default: `10`
`--bwaopt`	Extra BWA MEM options, eg. -x pacbio Type: `string`
`--fbopt`	Extra Freebayes options, eg. --theta 1E-6 --read-snp-limit 2 Type: `string`
`--snippy_opts`	Extra options in quotes for Snippy Type: `string`

Snippy-Core Parameters

Parameter	Description
`--maxhap`	Largest haplotype to decompose Type: `integer`, Default: `100`
`--mask`	BED file of sites to mask Type: `string`
`--mask_char`	Masking character Type: `string`, Default: `X`
`--snippy_core_opts`	Extra options in quotes for snippy-core Type: `string`

Gubbins Parameters

Parameter	Description
`--iterations`	Maximum number of iterations Type: `integer`, Default: `5`
`--min_snps`	Min SNPs to identify a recombination block Type: `integer`, Default: `3`
`--min_window_size`	Minimum window size Type: `integer`, Default: `100`
`--max_window_size`	Maximum window size Type: `integer`, Default: `10000`
`--filter_percentage`	Filter out taxa with more than this percentage of gaps Type: `number`, Default: `25.0`
`--remove_identical_sequences`	Remove identical sequences Type: `boolean`
`--gubbin_opts`	Extra Gubbins options in quotes Type: `string`
`--skip_recombination`	Skip Gubbins execution in subworkflows Type: `boolean`

IQ-TREE Parameters

Parameter	Description
`--iqtree_model`	Substitution model name Type: `string`, Default: `HKY`
`--bb`	Ultrafast bootstrap replicates Type: `integer`, Default: `1000`
`--alrt`	SH-like approximate likelihood ratio test replicates Type: `integer`, Default: `1000`
`--asr`	Ancestral state reconstruction by empirical Bayes Type: `boolean`
`--iqtree_opts`	Extra IQ-TREE options in quotes. Type: `string`
`--skip_phylogeny`	Skip IQ-TREE execution in subworkflows Type: `boolean`

SNP-Dists Parameters

Parameter	Description
`--a`	Count all differences not just [AGTC] Type: `boolean`
`--b`	Keep top left corner cell Type: `boolean`
`--csv`	Output CSV instead of TSV Type: `boolean`
`--k`	Keep case, don't uppercase all letters Type: `boolean`

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter	Description
`--outdir`	Base directory to write results to Type: `string`, Default: `./`
`--run_name`	Name of the directory to hold results Type: `string`, Default: `bactopia`
`--skip_compression`	Output files will not be compressed Type: `boolean`
`--datasets`	The path to cache datasets to Type: `string`
`--keep_all_files`	Keeps all analysis files created Type: `boolean`

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter	Description
`--max_retry`	Maximum times to retry a process before allowing it to fail. Type: `integer`, Default: `3`
`--max_cpus`	Maximum number of CPUs that can be requested for any single job. Type: `integer`, Default: `4`
`--max_memory`	Maximum amount of memory (in GB) that can be requested for any single job. Type: `integer`, Default: `32`
`--max_time`	Maximum amount of time (in minutes) that can be requested for any single job. Type: `integer`, Default: `120`
`--max_downloads`	Maximum number of samples to download at a time Type: `integer`, Default: `3`

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter	Description
`--nfconfig`	A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set. Type: `string`
`--publish_dir_mode`	Method used to save pipeline results to output directory. Type: `string`, Default: `copy`
`--infodir`	Directory to keep pipeline Nextflow logs and reports. Type: `string`, Default: `${params.outdir}/pipeline_info`
`--force`	Nextflow will overwrite existing output files. Type: `boolean`
`--cleanup_workdir`	After Bactopia is successfully executed, the `work` directory will be deleted. Type: `boolean`

Nextflow Profile Parameters