Assembler
The assembler module uses a variety of assembly tools to create an assembly of Illumina and Oxford Nanopore reads. The tools used are:
Tool | Description |
---|---|
Dragonflye | Assembly of Oxford Nanopore reads, as well as hybrid assembly with short-read polishing |
Shovill | Assembly of Illumina paired-end reads |
Shovill-SE | Assembly of Illumina single-end reads |
Unicycler | Hybrid assembly, using short-reads first then long-reads |
Summary statistics for each assembly are generated using assembly-scan.
Output Overview
Below is the default output structure for the assembler step in Bactopia. Where possible, the file descriptions below were modified from a tool's description.
<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── main
│       └── assembler
│           ├── flash.{hist|histogram}
│           ├── flye-info.txt
│           ├── logs
│           │   ├── {dragonflye|shovill|unicycler}.log
│           │   ├── nf-assembler.{begin,err,log,out,run,sh,trace}
│           │   └── versions.yml
│           ├── <SAMPLE_NAME>.fna.gz
│           ├── <SAMPLE_NAME>.tsv
│           ├── <SAMPLE_NAME>-assembly-error.txt
│           ├── shovill.corrections
│           ├── {flye|miniasm|raven|unicycler}-unpolished.fasta.gz
│           └── {flye|megahit|miniasm|raven|spades|unicycler|velvet}-unpolished.gfa.gz
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        ├── merged-results
        │   ├── assembly-scan.tsv
        │   └── logs
        │       └── assembly-scan-concat
        │           ├── nf-merged-results.{begin,err,log,out,run,sh,trace}
        │           └── versions.yml
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            ├── bactopia-timeline.html
            └── bactopia-trace.txt
Directory structure might be different
Depending on the options used at runtime, the assembler directory structure might be different, but the output descriptions below still apply.
Results
Merged Results
Below are results that are concatenated into a single file.
Filename | Description |
---|---|
assembly-scan.tsv | Assembly statistics for all samples |
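The merged file can be loaded with any TSV reader. Below is a minimal sketch in Python, assuming the file has a header row and sits at the location shown in the output overview. The function names are hypothetical, and no particular column names are assumed, so check the header of your own assembly-scan.tsv before relying on specific fields.

```python
import csv
from pathlib import Path


def merged_results_path(bactopia_dir, run_name):
    """Build the path to assembly-scan.tsv for one bactopia-<TIMESTAMP> run."""
    return (
        Path(bactopia_dir)
        / "bactopia-runs"
        / run_name
        / "merged-results"
        / "assembly-scan.tsv"
    )


def read_assembly_stats(tsv_path):
    """Read a tab-delimited file with a header row into a list of dicts."""
    with Path(tsv_path).open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each row is returned as a dictionary keyed by the header, with all values left as strings, so numeric columns need to be converted before sorting or filtering.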
Dragonflye
Below is a description of the per-sample results for Oxford Nanopore reads using Dragonflye.
Filename | Description |
---|---|
<SAMPLE_NAME>.fna.gz | The final assembly produced by Dragonflye |
<SAMPLE_NAME>.tsv | A tab-delimited file containing assembly statistics |
flye-info.txt | A log file containing information about the Flye assembly |
{flye|miniasm|raven}-unpolished.fasta.gz | Raw unprocessed assembly produced by the selected assembler |
{flye|miniasm|raven}-unpolished.gfa.gz | Raw unprocessed assembly graph produced by the selected assembler |
Shovill
Below is a description of the per-sample results for Illumina reads using Shovill or Shovill-SE.
Filename | Description |
---|---|
<SAMPLE_NAME>.fna.gz | The final assembly produced by Shovill or Shovill-SE |
<SAMPLE_NAME>.tsv | A tab-delimited file containing assembly statistics |
flash.hist | (Paired-End Only) Numeric histogram of merged read lengths |
flash.histogram | (Paired-End Only) Visual histogram of merged read lengths |
{megahit|spades|velvet}-unpolished.gfa.gz | Raw unprocessed assembly graph produced by the selected assembler |
shovill.corrections | List of post-assembly corrections made by Shovill |
Hybrid Assembly (Unicycler)
Below is a description of the per-sample results for a hybrid assembly using Unicycler (--hybrid). When using Unicycler, the short-reads are assembled first, then the long-reads are used to polish the assembly.
Filename | Description |
---|---|
<SAMPLE_NAME>.fna.gz | The final assembly produced by Unicycler |
<SAMPLE_NAME>.tsv | A tab-delimited file containing assembly statistics |
unicycler-unpolished.fasta.gz | Raw unprocessed assembly produced by Unicycler |
unicycler-unpolished.gfa.gz | Raw unprocessed assembly graph produced by Unicycler |
Hybrid Assembly (Short Read Polishing)
Below is a description of the per-sample results for a hybrid assembly using Dragonflye (--short_polish). When using Dragonflye, the long-reads are assembled first, then the short-reads are used to polish the assembly.
Prefer --short_polish over --hybrid with recent ONT sequencing
Using Unicycler (--hybrid) to create a hybrid assembly works great when you have low-coverage, noisy long-reads. However, if you are using recent ONT sequencing, you likely have high coverage, and the --short_polish method is going to yield better results (and be faster!) than --hybrid.
Filename | Description |
---|---|
<SAMPLE_NAME>.fna.gz | The final assembly produced by Dragonflye |
<SAMPLE_NAME>.tsv | A tab-delimited file containing assembly statistics |
flye-info.txt | A log file containing information about the Flye assembly |
{flye|miniasm|raven}-unpolished.fasta.gz | Raw unprocessed assembly produced by the selected assembler |
{flye|miniasm|raven}-unpolished.gfa.gz | Raw unprocessed assembly graph produced by the selected assembler |
Failed Quality Checks
Built into Bactopia are a few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. By excluding these samples, complete pipeline failures are prevented.
Extension | Description |
---|---|
-assembly-error.txt | Sample failed an assembly quality check and was excluded from further analysis |
Poor samples are excluded to prevent downstream failures
Samples that fail any of the QC checks will be excluded from further analysis. Those samples will generate a *-error.txt file with the error message. Excluding these samples prevents downstream failures that cause the whole workflow to fail.
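After a run, the excluded samples can be listed by searching for these error files. The following is a minimal sketch in Python, assuming the default directory layout shown in the output overview; the function name `find_failed_assemblies` is hypothetical and not part of Bactopia.

```python
from pathlib import Path


def find_failed_assemblies(bactopia_dir):
    """Map sample name -> error message for samples with an assembly error file."""
    failed = {}
    pattern = "*/main/assembler/*-assembly-error.txt"
    for error_file in sorted(Path(bactopia_dir).glob(pattern)):
        # The sample directory sits three levels above the error file:
        # <SAMPLE_NAME>/main/assembler/<SAMPLE_NAME>-assembly-error.txt
        sample = error_file.parents[2].name
        failed[sample] = error_file.read_text().strip()
    return failed
```

Samples without an error file are simply absent from the returned dictionary, so an empty result means no sample was excluded at the assembly step.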
Example Error: Assembled Successfully, but 0 Contigs
If a sample assembles successfully, but 0 contigs are formed, the sample will be excluded from further analysis.
Example Text from <SAMPLE_NAME>-assembly-error.txt
<SAMPLE_NAME> assembled successfully, but 0 contigs were formed. Please investigate
<SAMPLE_NAME> to determine a cause (e.g. metagenomic, contaminants, etc...) for this
outcome. Further assembly-based analysis of <SAMPLE_NAME> will be discontinued.
Example Error: Assembled Successfully, but Poor Assembly Size
If your sample assembles successfully, but the assembly size is less than the minimum allowed genome size, the sample will be excluded from further analysis. You can adjust this minimum size using the --min_genome_size parameter.
Example Text from <SAMPLE_NAME>-assembly-error.txt
<SAMPLE_NAME> assembled size (000 bp) is less than the minimum allowed genome
size (000 bp). If this is unexpected, please investigate <SAMPLE_NAME> to
determine a cause (e.g. metagenomic, contaminants, etc...) for the poor assembly.
Otherwise, adjust the --min_genome_size parameter to fit your need. Further
assembly based analysis of <SAMPLE_NAME> will be discontinued.
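To investigate a borderline sample, the two checks described above can be approximated outside the pipeline. This is only an illustrative Python sketch, not Bactopia's own implementation: the function names are hypothetical, and `min_genome_size` simply mirrors the meaning of the --min_genome_size parameter.

```python
import gzip


def assembly_summary(fasta_gz):
    """Return (contig_count, total_bases) for a gzipped FASTA file."""
    contigs = 0
    total = 0
    with gzip.open(fasta_gz, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                contigs += 1  # each header line starts a new contig
            else:
                total += len(line)  # sequence lines add to the assembly size
    return contigs, total


def check_assembly(fasta_gz, min_genome_size):
    """Return None if the assembly passes, else a short reason string."""
    contigs, total = assembly_summary(fasta_gz)
    if contigs == 0:
        return "0 contigs were formed"
    if total < min_genome_size:
        return (
            f"assembled size ({total} bp) is less than "
            f"the minimum ({min_genome_size} bp)"
        )
    return None
```

Running `check_assembly` on a sample's <SAMPLE_NAME>.fna.gz with different thresholds can help decide whether adjusting --min_genome_size is reasonable for your organism.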
Audit Trail
Below are files that can assist you in understanding which parameters and program versions were used.
Logs
Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.
Extension | Description |
---|---|
.begin | An empty file used to designate the process started |
.err | Contains STDERR outputs from the process |
.log | Contains both STDERR and STDOUT outputs from the process |
.out | Contains STDOUT outputs from the process |
.run | The script Nextflow uses to stage/unstage files and queue processes based on given profile |
.sh | The script executed by bash for the process |
.trace | The Nextflow Trace report for the process |
versions.yml | A YAML formatted file with program versions |
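When troubleshooting a sample, it can help to gather these files in one step. The sketch below is a hypothetical Python helper, assuming the default layout from the output overview, where the assembler step writes its files as nf-assembler.<extension> inside the logs folder.

```python
from pathlib import Path

# Extensions listed in the table above (versions.yml is handled separately).
LOG_EXTENSIONS = ("begin", "err", "log", "out", "run", "sh", "trace")


def assembler_log_files(bactopia_dir, sample):
    """Return {extension: Path} for the nf-assembler.* files that exist."""
    logs_dir = Path(bactopia_dir) / sample / "main" / "assembler" / "logs"
    found = {}
    for ext in LOG_EXTENSIONS:
        candidate = logs_dir / f"nf-assembler.{ext}"
        if candidate.is_file():
            found[ext] = candidate
    return found
```

A missing `begin` entry would suggest the process never started for that sample, while the `err` entry is usually the first file to read after a failure.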
Parameters
Assembler
Parameter | Description |
---|---|
--shovill_assembler | Assembler to be used by Shovill. Type: string, Default: skesa |
--dragonflye_assembler | Assembler to be used by Dragonflye. Type: string, Default: flye |
--use_unicycler | Use Unicycler for paired-end assembly. Type: boolean |
--min_contig_len | Minimum contig length <0=AUTO>. Type: integer, Default: 500 |
--min_contig_cov | Minimum contig coverage <0=AUTO>. Type: integer, Default: 2 |
--contig_namefmt | Format of contig FASTA IDs in 'printf' style. Type: string |
--shovill_opts | Extra assembler options in quotes for Shovill. Type: string |
--shovill_kmers | K-mers to use. Type: string |
--dragonflye_opts | Extra assembler options in quotes for Dragonflye. Type: string |
--trim | Enable adaptor trimming. Type: boolean |
--no_stitch | Disable read stitching for paired-end reads. Type: boolean |
--no_corr | Disable post-assembly correction. Type: boolean |
--unicycler_mode | Bridging mode used by Unicycler. Type: string, Default: normal |
--min_polish_size | Contigs shorter than this value (bp) will not be polished using Pilon. Type: integer, Default: 10000 |
--min_component_size | Graph components smaller than this size (bp) will be removed from the final graph. Type: integer, Default: 1000 |
--min_dead_end_size | Graph dead ends smaller than this size (bp) will be removed from the final graph. Type: integer, Default: 1000 |
--nanohq | For Flye, use --nano-hq instead of --nano-raw. Type: boolean |
--medaka_model | The model to use for Medaka polishing. Type: string |
--medaka_rounds | The number of Medaka polishing rounds to conduct. Type: integer |
--racon_rounds | The number of Racon polishing rounds to conduct. Type: integer, Default: 1 |
--no_polish | Skip the assembly polishing step. Type: boolean |
--no_miniasm | Skip miniasm+Racon bridging. Type: boolean |
--no_rotate | Do not rotate completed replicons to start at a standard gene. Type: boolean |
--reassemble | If reads were simulated, they will be used to create a new assembly. Type: boolean |
--polypolish_rounds | Number of polishing rounds to conduct with Polypolish for short read polishing. Type: integer, Default: 1 |
--pilon_rounds | Number of polishing rounds to conduct with Pilon for short read polishing. Type: integer |
Citations
If you use Bactopia and assembler in your analysis, please cite the following.
- Bactopia
  Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
- any2fasta
  Seemann T any2fasta: Convert various sequence formats to FASTA (GitHub)
- assembly-scan
  Petit III RA assembly-scan: generate basic stats for an assembly (GitHub)
- BWA
  Li H Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] (2013)
- csvtk
  Shen W csvtk: A cross-platform, efficient and practical CSV/TSV toolkit in Golang. (GitHub)
- Dragonflye
  Petit III RA Dragonflye: Assemble bacterial isolate genomes from Nanopore reads. (GitHub)
- FLASH
  Magoč T, Salzberg SL FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27.21 2957-2963 (2011)
- Flye
  Kolmogorov M, Yuan J, Lin Y, Pevzner P Assembly of Long Error-Prone Reads Using Repeat Graphs Nature Biotechnology (2019)
- Medaka
  ONT Research Medaka: Sequence correction provided by ONT Research (GitHub)
- MEGAHIT
  Li D, Liu C-M, Luo R, Sadakane K, Lam T-W MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31.10 1674-1676 (2015)
- Miniasm
  Li H Miniasm: Ultrafast de novo assembly for long noisy reads (GitHub)
- Minimap2
  Li H Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100 (2018)
- Nanoq
  Steinig E Nanoq: Minimal but speedy quality control for nanopore reads in Rust (GitHub)
- Pigz
  Adler M pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015)
- Pilon
  Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one 9.11 e112963 (2014)
- Racon
  Vaser R, Sović I, Nagarajan N, Šikić M Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27, 737–746 (2017)
- Rasusa
  Hall MB Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019)
- Raven
  Vaser R, Šikić M Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021)
- samclip
  Seemann T Samclip: Filter SAM file for soft and hard clipped alignments (GitHub)
- Samtools
  Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009)
- Shovill
  Seemann T Shovill: De novo assembly pipeline for Illumina paired reads (GitHub)
- Shovill-SE
  Petit III RA Shovill-SE: A fork of Shovill that includes support for single end reads. (GitHub)
- SKESA
  Souvorov A, Agarwala R, Lipman DJ SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology 19:153 (2018)
- SPAdes
  Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology 19.5 455-477 (2012)
- Unicycler
  Wick RR, Judd LM, Gorrie CL, Holt KE Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017)
- Velvet
  Zerbino DR, Birney E Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18.5 821-829 (2008)