Bactopia Tool - snippy

The snippy subworkflow allows you to call SNPs and InDels against a reference with Snippy. With the called SNPs/InDels, a core-SNP alignment is created with snippy-core.

A phylogeny, based on the core-SNP alignment, will be created by IQ-TREE. Optionally, a recombination-masked core-SNP alignment can be created with Gubbins.

Finally, the pairwise SNP distance between each pair of samples is calculated with snp-dists.

Example Usage

bactopia --wf snippy \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt

Output Overview

Below is the default output structure for the snippy tool. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── tools
│       └── snippy
│           └── <REFERENCE_NAME>
│               ├── logs
│               │   ├── nf-snippy.{begin,err,log,out,run,sh,trace}
│               │   ├── <SAMPLE_NAME>.log
│               │   └── versions.yml
│               ├── <SAMPLE_NAME>.aligned.fa.gz
│               ├── <SAMPLE_NAME>.annotated.vcf.gz
│               ├── <SAMPLE_NAME>.bam
│               ├── <SAMPLE_NAME>.bam.bai
│               ├── <SAMPLE_NAME>.bed.gz
│               ├── <SAMPLE_NAME>.consensus.fa.gz
│               ├── <SAMPLE_NAME>.consensus.subs.fa.gz
│               ├── <SAMPLE_NAME>.consensus.subs.masked.fa.gz
│               ├── <SAMPLE_NAME>.coverage.txt.gz
│               ├── <SAMPLE_NAME>.csv.gz
│               ├── <SAMPLE_NAME>.filt.vcf.gz
│               ├── <SAMPLE_NAME>.gff.gz
│               ├── <SAMPLE_NAME>.html
│               ├── <SAMPLE_NAME>.raw.vcf.gz
│               ├── <SAMPLE_NAME>.subs.vcf.gz
│               ├── <SAMPLE_NAME>.tab
│               ├── <SAMPLE_NAME>.txt
│               └── <SAMPLE_NAME>.vcf.gz
└── bactopia-runs
    └── snippy-<TIMESTAMP>
        ├── core-snp-clean.full.aln.gz
        ├── core-snp.full.aln.gz
        ├── gubbins
        │   ├── core-snp.branch_base_reconstruction.embl.gz
        │   ├── core-snp.filtered_polymorphic_sites.fasta.gz
        │   ├── core-snp.filtered_polymorphic_sites.phylip
        │   ├── core-snp.final_tree.tre
        │   ├── core-snp.node_labelled.final_tree.tre
        │   ├── core-snp.per_branch_statistics.csv
        │   ├── core-snp.recombination_predictions.embl.gz
        │   ├── core-snp.recombination_predictions.gff.gz
        │   ├── core-snp.summary_of_snp_distribution.vcf.gz
        │   └── logs
        │       ├── core-snp.log
        │       ├── nf-gubbins.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── iqtree
        │   ├── core-snp.alninfo
        │   ├── core-snp.bionj
        │   ├── core-snp.ckp.gz
        │   ├── core-snp.contree
        │   ├── core-snp.iqtree
        │   ├── core-snp.mldist
        │   ├── core-snp.splits.nex
        │   ├── core-snp.treefile
        │   ├── core-snp.ufboot
        │   └── logs
        │       ├── core-snp.log
        │       ├── nf-iqtree.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── nf-reports
        │   ├── snippy-dag.dot
        │   ├── snippy-report.html
        │   ├── snippy-timeline.html
        │   └── snippy-trace.txt
        ├── snippy-core
        │   ├── core-snp.aln.gz
        │   ├── core-snp.tab.gz
        │   ├── core-snp.txt
        │   ├── core-snp.vcf.gz
        │   └── logs
        │       ├── nf-snippy-core.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        └── snpdists
            ├── core-snp.distance.tsv
            └── logs
                ├── nf-snpdists.{begin,err,log,out,run,sh,trace}
                └── versions.yml

Results

Main Results

Below are the main results from the snippy Bactopia Tool.

Filename Description
core-snp-clean.full.aln.gz Same as core-snp.full.aln.gz with unusual characters replaced with N
core-snp.distance.tsv Core genome pairwise SNP distances between samples
core-snp.full.aln.gz A whole genome SNP alignment (includes invariant sites)
core-genome.iqtree Full result of the IQ-TREE core genome phylogeny
core-genome.masked.aln.gz A core-SNP alignment with the recombination masked
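
The distance file can be scanned for closely related samples. The sketch below assumes the default square-matrix layout of snp-dists (a header row of sample names, then one row per sample). It runs on a small mock file with made-up names and values rather than a real core-snp.distance.tsv, and the threshold of 10 SNPs is an arbitrary illustration.

```shell
# Mock stand-in for core-snp.distance.tsv (sample names and distances are made up).
{
  printf 'snp-dists\tsampleA\tsampleB\tsampleC\n'
  printf 'sampleA\t0\t4\t120\n'
  printf 'sampleB\t4\t0\t118\n'
  printf 'sampleC\t120\t118\t0\n'
} > mock.distance.tsv

# Arbitrary example threshold, in SNPs.
max_dist=10

# Walk the upper triangle of the matrix and report each pair at or below the threshold.
awk -F'\t' -v max="$max_dist" '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
  { for (i = NR + 1; i <= NF; i++) if ($i <= max) print $1 "\t" name[i] "\t" $i }
' mock.distance.tsv > close_pairs.tsv

cat close_pairs.tsv
```

To use it on real output, point awk at the core-snp.distance.tsv file under the snpdists directory instead of the mock file.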

Gubbins

Below is a description of the Gubbins results. For more details about Gubbins outputs see Gubbins - Outputs.

Filename Description
core-snp.branch_base_reconstruction.embl.gz Base substitution reconstruction in EMBL format
core-snp.filtered_polymorphic_sites.fasta.gz FASTA format alignment of filtered polymorphic sites
core-snp.filtered_polymorphic_sites.phylip Phylip format alignment of filtered polymorphic sites
core-snp.final_tree.tre Final phylogeny in Newick format (branch lengths are in point mutations)
core-snp.node_labelled.final_tree.tre Final phylogeny in Newick format but with internal node labels
core-snp.per_branch_statistics.csv Per-branch reporting of the base substitutions inside and outside recombination events
core-snp.recombination_predictions.embl.gz Recombination predictions in EMBL file format
core-snp.recombination_predictions.gff.gz Recombination predictions in GFF file format
core-snp.summary_of_snp_distribution.vcf.gz VCF file summarising the distribution of point mutations

IQ-TREE

Below is a description of the IQ-TREE results. If ClonalFrameML is executed, a fast tree is created and given the prefix start-tree; the final tree has the prefix core-genome. For more details about IQ-TREE outputs see IQ-TREE - Outputs.

Filename Description
core-snp.alninfo Alignment site statistics
core-snp.bionj A neighbor joining tree produced by BIONJ
core-snp.ckp.gz IQ-TREE writes a checkpoint file
core-snp.contree Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
core-snp.mldist Contains the likelihood distances
core-snp.splits.nex Support values in percentage for all splits (bipartitions), computed as the occurrence frequencies in the bootstrap trees
core-snp.treefile Maximum likelihood tree in NEWICK format, can be visualized with treeviewer programs
core-snp.ufboot Trees created during the bootstrap steps

Snippy

Below is a description of the per-sample Snippy results. For more details about Snippy outputs see Snippy - Outputs.

Filename Description
<SAMPLE_NAME>.aligned.fa.gz A version of the reference but with - at positions with depth=0 and N for 0 < depth < --mincov (does not have variants)
<SAMPLE_NAME>.annotated.vcf.gz The final variant calls with additional annotations from the reference genome's GenBank file
<SAMPLE_NAME>.bam The alignments in BAM format. Includes unmapped, multimapped reads. Excludes duplicates
<SAMPLE_NAME>.bam.bai Index for the .bam file
<SAMPLE_NAME>.bed.gz The variants in BED format
<SAMPLE_NAME>.consensus.fa.gz A version of the reference genome with all variants instantiated
<SAMPLE_NAME>.consensus.subs.fa.gz A version of the reference genome with only substitution variants instantiated
<SAMPLE_NAME>.consensus.subs.masked.fa.gz A version of the reference genome with only substitution variants instantiated and low-coverage regions masked
<SAMPLE_NAME>.coverage.txt.gz The per-base coverage of each position in the reference genome
<SAMPLE_NAME>.csv.gz A comma-separated version of the .tab file
<SAMPLE_NAME>.filt.vcf.gz The filtered variant calls from Freebayes
<SAMPLE_NAME>.gff.gz The variants in GFF3 format
<SAMPLE_NAME>.html An HTML version of the .tab file
<SAMPLE_NAME>.raw.vcf.gz The unfiltered variant calls from Freebayes
<SAMPLE_NAME>.subs.vcf.gz Only substitution variants from the final annotated variants
<SAMPLE_NAME>.tab A simple tab-separated summary of all the variants
<SAMPLE_NAME>.txt A summary of the Snippy run
<SAMPLE_NAME>.vcf.gz The final annotated variants in VCF format
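
As a quick check of a sample's variant summary, the .tab file can be tallied by variant type. This is a minimal sketch that assumes the column order described in the Snippy documentation, where the third column is TYPE (for example snp, del, ins). The file below is a mock with made-up rows, truncated to the first six columns.

```shell
# Mock stand-in for <SAMPLE_NAME>.tab (rows are made up; only six columns are shown).
{
  printf 'CHROM\tPOS\tTYPE\tREF\tALT\tEVIDENCE\n'
  printf 'contig1\t1042\tsnp\tA\tG\tG:35 A:0\n'
  printf 'contig1\t2210\tdel\tCAT\tC\tC:28 CAT:1\n'
  printf 'contig2\t87\tsnp\tT\tC\tC:41 T:0\n'
} > mock_sample.tab

# Count variants per TYPE (column 3), skipping the header row.
awk -F'\t' 'NR > 1 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' mock_sample.tab \
  | sort > variant_type_counts.tsv

cat variant_type_counts.tsv
```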

Snippy-Core

Below is a description of the Snippy-Core results. For more details about Snippy-Core outputs see Snippy-Core - Outputs.

Filename Description
core-snp.aln.gz A core SNP alignment in FASTA format
core-snp.tab.gz Tab-separated columnar list of core SNP sites with alleles but NO annotations
core-snp.txt Tab-separated columnar list of alignment/core-size statistics
core-snp.vcf.gz Multi-sample VCF file with genotype GT tags for all discovered alleles

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension Description
.begin An empty file used to designate the process started
.err Contains STDERR outputs from the process
.log Contains both STDERR and STDOUT outputs from the process
.out Contains STDOUT outputs from the process
.run The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh The script executed by bash for the process
.trace The Nextflow Trace report for the process
versions.yml A YAML formatted file with program versions
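
When reviewing a run, one option is to list the .err files that are not empty. The sketch below builds a tiny mock of the logs layout (the directory name mock-run and the file contents are made up) and uses find to pick them out. Note that many tools write progress messages to STDERR, so a non-empty .err file does not necessarily mean a process failed.

```shell
# Build a tiny mock of the logs layout (file contents are made up).
mkdir -p mock-run/snippy-core/logs mock-run/iqtree/logs
: > mock-run/snippy-core/logs/nf-snippy-core.err
echo 'example warning' > mock-run/iqtree/logs/nf-iqtree.err

# List .err files inside any logs folder that contain at least one byte.
find mock-run -type f -path '*/logs/*' -name '*.err' -size +0 > nonempty_err.txt

cat nonempty_err.txt
```

For a real run, replace mock-run with the relevant snippy-<TIMESTAMP> directory.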

Nextflow Reports

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename Description
snippy-dag.dot The Nextflow DAG visualisation
snippy-report.html The Nextflow Execution Report
snippy-timeline.html The Nextflow Timeline Report
snippy-trace.txt The Nextflow Trace report

Program Versions

At the end of each run, all of the versions.yml files are merged into the files below.

Filename Description
software_versions.yml A complete list of programs and versions used by each process
software_versions_mqc.yml A complete list of programs and versions formatted for MultiQC

Parameters

Required Parameters

Define where the pipeline should find input data and save output data.

Parameter Description
--bactopia The path to bactopia results to use as inputs
Type: string

Filtering Parameters

Use these parameters to specify which samples to include or exclude.

Parameter Description
--include A text file containing sample names (one per line) to include in the analysis
Type: string
--exclude A text file containing sample names (one per line) to exclude from the analysis
Type: string
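
The include and exclude files are plain text with one sample name per line. A minimal sketch, assuming three hypothetical sample names that would need to match sample directories in your Bactopia results:

```shell
# Hypothetical sample names; replace them with names from your own results.
printf '%s\n' sample01 sample02 sample03 > includes.txt

# Confirm the file holds one name per line.
cat includes.txt
```

The resulting file can then be passed with --include includes.txt, as in the Example Usage section.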

Snippy Parameters

Parameter Description
--reference Reference genome in GenBank format
Type: string
--mapqual Minimum read mapping quality to consider
Type: integer, Default: 60
--basequal Minimum base quality to consider
Type: integer, Default: 13
--mincov Minimum site depth for calling alleles
Type: integer, Default: 10
--minfrac Minimum proportion for variant evidence (0=AUTO)
Type: integer
--minqual Minimum QUALITY in VCF column 6
Type: integer, Default: 100
--maxsoft Maximum soft clipping to allow
Type: integer, Default: 10
--bwaopt Extra BWA MEM options, eg. -x pacbio
Type: string
--fbopt Extra Freebayes options, eg. --theta 1E-6 --read-snp-limit 2
Type: string
--snippy_opts Extra options in quotes for Snippy
Type: string
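
The Snippy parameters above are passed on the same command line as --wf snippy. The sketch below only assembles and prints a command for review; the reference file name and the --mincov and --minqual values are hypothetical choices for illustration, not recommendations.

```shell
# Hypothetical paths and values; adjust them to your data.
set -- --wf snippy \
  --bactopia /path/to/your/bactopia/results \
  --reference reference.gbk \
  --mincov 20 \
  --minqual 50

# Dry run: print the full command for review and keep a copy.
echo "bactopia $*" | tee snippy_cmd.txt

# When it looks right, run it with the same arguments:
# bactopia "$@"
```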

Snippy-Core Parameters

Parameter Description
--maxhap Largest haplotype to decompose
Type: integer, Default: 100
--mask BED file of sites to mask
Type: string
--mask_char Masking character
Type: string, Default: X
--snippy_core_opts Extra options in quotes for snippy-core
Type: string

Gubbins Parameters

Parameter Description
--iterations Maximum number of iterations
Type: integer, Default: 5
--min_snps Min SNPs to identify a recombination block
Type: integer, Default: 3
--min_window_size Minimum window size
Type: integer, Default: 100
--max_window_size Maximum window size
Type: integer, Default: 10000
--filter_percentage Filter out taxa with more than this percentage of gaps
Type: number, Default: 25.0
--remove_identical_sequences Remove identical sequences
Type: boolean
--gubbin_opts Extra Gubbins options in quotes
Type: string
--skip_recombination Skip Gubbins execution in subworkflows
Type: boolean

IQ-TREE Parameters

Parameter Description
--iqtree_model Substitution model name
Type: string, Default: HKY
--bb Ultrafast bootstrap replicates
Type: integer, Default: 1000
--alrt SH-like approximate likelihood ratio test replicates
Type: integer, Default: 1000
--asr Ancestral state reconstruction by empirical Bayes
Type: boolean
--iqtree_opts Extra IQ-TREE options in quotes.
Type: string
--skip_phylogeny Skip IQ-TREE execution in subworkflows
Type: boolean

SNP-Dists Parameters

Parameter Description
--a Count all differences not just [AGTC]
Type: boolean
--b Keep top left corner cell
Type: boolean
--csv Output CSV instead of TSV
Type: boolean
--k Keep case, don't uppercase all letters
Type: boolean

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter Description
--outdir Base directory to write results to
Type: string, Default: ./
--run_name Name of the directory to hold results
Type: string, Default: bactopia
--skip_compression Output files will not be compressed
Type: boolean
--datasets The path to cache datasets to
Type: string
--keep_all_files Keeps all analysis files created
Type: boolean

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter Description
--max_retry Maximum times to retry a process before allowing it to fail.
Type: integer, Default: 3
--max_cpus Maximum number of CPUs that can be requested for any single job.
Type: integer, Default: 4
--max_memory Maximum amount of memory (in GB) that can be requested for any single job.
Type: integer, Default: 32
--max_time Maximum amount of time (in minutes) that can be requested for any single job.
Type: integer, Default: 120
--max_downloads Maximum number of samples to download at a time
Type: integer, Default: 3

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter Description
--nfconfig A Nextflow compatible config file for custom profiles. It is loaded last and will overwrite existing variables if set.
Type: string
--publish_dir_mode Method used to save pipeline results to output directory.
Type: string, Default: copy
--infodir Directory to keep pipeline Nextflow logs and reports.
Type: string, Default: ${params.outdir}/pipeline_info
--force Nextflow will overwrite existing output files.
Type: boolean
--cleanup_workdir After Bactopia is successfully executed, the work directory will be deleted.
Type: boolean

Nextflow Profile Parameters

Parameters to fine-tune your Nextflow setup.

Parameter Description
--condadir Directory Nextflow should use for Conda environments
Type: string
--registry Docker registry to pull containers from.
Type: string, Default: dockerhub
--datasets_cache Directory where downloaded datasets should be stored.
Type: string, Default: <BACTOPIA_DIR>/data/datasets
--singularity_cache Directory where remote Singularity images are stored.
Type: string
--singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
Type: boolean
--force_rebuild Force overwrite of existing pre-built environments.
Type: boolean
--queue Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM)
Type: string, Default: general,high-memory
--cluster_opts Additional options to pass to the executor (e.g. SLURM: '--account=my_acct_name')
Type: string
--disable_scratch All intermediate files created on worker nodes will be transferred to the head node.
Type: boolean

Helpful Parameters

Uncommonly used parameters that might be useful.

Parameter Description
--monochrome_logs Do not use coloured log outputs.
Type: boolean
--nfdir Print directory Nextflow has pulled Bactopia to
Type: boolean
--sleep_time The amount of time (seconds) Nextflow will wait after setting up datasets before execution.
Type: integer, Default: 5
--validate_params Boolean whether to validate parameters against the schema at runtime
Type: boolean, Default: True
--help Display help text.
Type: boolean
--wf Specify which workflow or Bactopia Tool to execute
Type: string, Default: bactopia
--list_wfs List the available workflows and Bactopia Tools to use with '--wf'
Type: boolean
--show_hidden_params Show all params when using --help
Type: boolean
--help_all An alias for --help --show_hidden_params
Type: boolean
--version Display version text.
Type: boolean

Citations

If you use Bactopia and snippy in your analysis, please cite the following.