Assembler

The assembler module uses several assembly tools to assemble Illumina and/or Oxford Nanopore reads. The tools used are:

Tool Description
Dragonflye Assembly of Oxford Nanopore reads, as well as hybrid assembly with short-read polishing
Shovill Assembly of Illumina paired-end reads
Shovill-SE Assembly of Illumina single-end reads
Unicycler Hybrid assembly, using short-reads first then long-reads

Summary statistics for each assembly are generated using assembly-scan.

Output Overview

Below is the default output structure for the assembler step in Bactopia. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── main
│       └── assembler
│           ├── flash.{hist|histogram}
│           ├── flye-info.txt
│           ├── logs
│           │   ├── {dragonflye|shovill|unicycler}.log
│           │   ├── nf-assembler.{begin,err,log,out,run,sh,trace}
│           │   └── versions.yml
│           ├── <SAMPLE_NAME>.fna.gz
│           ├── <SAMPLE_NAME>.tsv
│           ├── <SAMPLE_NAME>-assembly-error.txt
│           ├── shovill.corrections
│           ├── {flye|miniasm|raven|unicycler}-unpolished.fasta.gz
│           └── {flye|megahit|miniasm|raven|spades|unicycler|velvet}-unpolished.gfa.gz
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        ├── merged-results
        │   ├── assembly-scan.tsv
        │   └── logs
        │       └── assembly-scan-concat
        │           ├── nf-merged-results.{begin,err,log,out,run,sh,trace}
        │           └── versions.yml
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            ├── bactopia-timeline.html
            └── bactopia-trace.txt

Directory structure might be different

Depending on the options used at runtime, the assembler directory structure might be different, but the output descriptions below still apply.
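
As an illustration of the default layout above, here is a minimal Python sketch that collects the final assembly for each sample. The function name `find_assemblies` is hypothetical and not part of Bactopia, and the sketch assumes the default directory structure shown in the tree.

```python
from pathlib import Path


def find_assemblies(bactopia_dir):
    """Map sample name -> path of its final assembly (<SAMPLE_NAME>.fna.gz).

    Hypothetical helper; assumes the default Bactopia output layout.
    """
    assemblies = {}
    # Default layout: <BACTOPIA_DIR>/<SAMPLE_NAME>/main/assembler/<SAMPLE_NAME>.fna.gz
    for fna in sorted(Path(bactopia_dir).glob("*/main/assembler/*.fna.gz")):
        sample = fna.parents[2].name  # assembler -> main -> <SAMPLE_NAME>
        # Keep only the file named after the sample directory
        if fna.name == f"{sample}.fna.gz":
            assemblies[sample] = fna
    return assemblies
```

Because the glob requires the `main/assembler` subdirectories, the `bactopia-runs` folder is skipped without any special handling.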

Results

Merged Results

Below are results that are concatenated into a single file.

Filename Description
assembly-scan.tsv Assembly statistics for all samples
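
The merged file can be inspected with a few lines of Python. The sketch below assumes the TSV has a header row with a `sample` column and numeric statistic columns such as `n50_contig_length`. Those column names are assumptions about the assembly-scan output, so check the header of your own file first. The function names are hypothetical.

```python
import csv


def load_assembly_stats(tsv_path):
    """Read the merged assembly-scan.tsv into a list of per-sample dicts."""
    with open(tsv_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def samples_below(rows, column, threshold):
    """Return names of samples whose numeric `column` is below `threshold`.

    Assumes a 'sample' column holds the sample name (an assumption).
    """
    return [row["sample"] for row in rows if float(row[column]) < threshold]
```

For example, `samples_below(load_assembly_stats("assembly-scan.tsv"), "n50_contig_length", 10000)` would list samples with an N50 under 10 kb, provided that column exists in your file.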

Dragonflye

Below is a description of the per-sample results for Oxford Nanopore reads using Dragonflye.

Filename Description
<SAMPLE_NAME>.fna.gz The final assembly produced by Dragonflye
<SAMPLE_NAME>.tsv A tab-delimited file containing assembly statistics
flye-info.txt A log file containing information about the Flye assembly
{flye|miniasm|raven}-unpolished.fasta.gz Raw unprocessed assembly produced by the used assembler
{flye|miniasm|raven}-unpolished.gfa.gz Raw unprocessed assembly graph produced by the used assembler

Shovill

Below is a description of the per-sample results for Illumina reads using Shovill or Shovill-SE.

Filename Description
<SAMPLE_NAME>.fna.gz The final assembly produced by Shovill
<SAMPLE_NAME>.tsv A tab-delimited file containing assembly statistics
flash.hist (Paired-End Only) Numeric histogram of merged read lengths
flash.histogram (Paired-End Only) Visual histogram of merged read lengths
{megahit|spades|velvet}-unpolished.gfa.gz Raw unprocessed assembly graph produced by the used assembler
shovill.corrections List of post-assembly corrections made by Shovill

Hybrid Assembly (Unicycler)

Below is a description of the per-sample results for a hybrid assembly using Unicycler (--hybrid). When using Unicycler, the short-reads are assembled first, then the long-reads are used to bridge and resolve the resulting assembly graph.

Filename Description
<SAMPLE_NAME>.fna.gz The final assembly produced by Unicycler
<SAMPLE_NAME>.tsv A tab-delimited file containing assembly statistics
unicycler-unpolished.fasta.gz Raw unprocessed assembly produced by Unicycler
unicycler-unpolished.gfa.gz Raw unprocessed assembly graph produced by Unicycler

Hybrid Assembly (Short Read Polishing)

Below is a description of the per-sample results for a hybrid assembly using Dragonflye (--short_polish). When using Dragonflye, the long-reads are assembled first, then the short-reads are used to polish the assembly.

Prefer --short_polish over --hybrid with recent ONT sequencing

Using Unicycler (--hybrid) to create a hybrid assembly works well when you have low-coverage, noisy long-reads. However, if you are using recent ONT sequencing, you likely have high coverage, and the --short_polish method is likely to yield better results (and be faster!) than --hybrid.

Filename Description
<SAMPLE_NAME>.fna.gz The final assembly produced by Dragonflye
<SAMPLE_NAME>.tsv A tab-delimited file containing assembly statistics
flye-info.txt A log file containing information about the Flye assembly
{flye|miniasm|raven}-unpolished.fasta.gz Raw unprocessed assembly produced by the used assembler
{flye|miniasm|raven}-unpolished.gfa.gz Raw unprocessed assembly graph produced by the used assembler

Failed Quality Checks

Built into Bactopia are a few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. By excluding these samples, complete pipeline failures are prevented.

Extension Description
-assembly-error.txt Sample failed an assembly quality check and was excluded from further analysis

Poor samples are excluded to prevent downstream failures

Samples that fail any of the QC checks will be excluded from further analysis. Those samples will generate a *-error.txt file with the error message. Excluding these samples prevents downstream failures that cause the whole workflow to fail.
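
To see which samples were excluded at the assembler step, you can search for these error files. The following is a minimal sketch that assumes the default output layout. `failed_assemblies` is a hypothetical helper, not part of Bactopia.

```python
from pathlib import Path


def failed_assemblies(bactopia_dir):
    """Map sample name -> error message for samples excluded at the assembler step.

    Hypothetical helper; assumes error files live in
    <BACTOPIA_DIR>/<SAMPLE_NAME>/main/assembler/.
    """
    suffix = "-assembly-error.txt"
    failures = {}
    for err in sorted(Path(bactopia_dir).glob(f"*/main/assembler/*{suffix}")):
        sample = err.name[: -len(suffix)]  # strip the suffix to recover the sample name
        failures[sample] = err.read_text().strip()
    return failures
```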

Example Error: Assembled Successfully, but 0 Contigs

If a sample assembles successfully, but 0 contigs are formed, the sample will be excluded from further analysis.

Example Text from <SAMPLE_NAME>-assembly-error.txt
<SAMPLE_NAME> assembled successfully, but 0 contigs were formed. Please investigate <SAMPLE_NAME> to determine a cause (e.g. metagenomic, contaminants, etc...) for this outcome. Further assembly-based analysis of <SAMPLE_NAME> will be discontinued.

Example Error: Assembled Successfully, but Poor Assembly Size

If your sample assembles successfully, but the assembly size is less than the minimum allowed genome size, the sample will be excluded from further analysis. You can adjust this minimum size using the --min_genome_size parameter.

Example Text from <SAMPLE_NAME>-assembly-error.txt
<SAMPLE_NAME> assembled size (000 bp) is less than the minimum allowed genome size (000 bp). If this is unexpected, please investigate <SAMPLE_NAME> to determine a cause (e.g. metagenomic, contaminants, etc...) for the poor assembly. Otherwise, adjust the --min_genome_size parameter to fit your need. Further assembly based analysis of <SAMPLE_NAME> will be discontinued.
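
The two checks described above can be approximated outside the pipeline. The sketch below is not Bactopia's implementation. It only mirrors the logic (zero contigs, or total size below a minimum) for a gzipped FASTA file, and the function names are hypothetical.

```python
import gzip


def assembly_size(fna_gz):
    """Return (contig_count, total_bp) for a gzipped FASTA file."""
    contigs, total_bp = 0, 0
    with gzip.open(fna_gz, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                contigs += 1  # each header line starts a new contig
            else:
                total_bp += len(line.strip())
    return contigs, total_bp


def check_assembly(fna_gz, min_genome_size):
    """Return None if both checks pass, otherwise a short reason string."""
    contigs, total_bp = assembly_size(fna_gz)
    if contigs == 0:
        return "0 contigs were formed"
    if total_bp < min_genome_size:
        return (f"assembled size ({total_bp} bp) is less than the minimum "
                f"allowed genome size ({min_genome_size} bp)")
    return None
```

Here `min_genome_size` plays the role of the value you would pass to --min_genome_size.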

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension Description
.begin An empty file used to designate the process started
.err Contains STDERR outputs from the process
.log Contains both STDERR and STDOUT outputs from the process
.out Contains STDOUT outputs from the process
.run The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh The script executed by bash for the process
.trace The Nextflow Trace report for the process
versions.yml A YAML formatted file with program versions

Parameters

Assembler

Parameter Description
--shovill_assembler Assembler to be used by Shovill
Type: string, Default: skesa
--dragonflye_assembler Assembler to be used by Dragonflye
Type: string, Default: flye
--use_unicycler Use unicycler for paired end assembly
Type: boolean
--min_contig_len Minimum contig length <0=AUTO>
Type: integer, Default: 500
--min_contig_cov Minimum contig coverage <0=AUTO>
Type: integer, Default: 2
--contig_namefmt Format of contig FASTA IDs in 'printf' style
Type: string
--shovill_opts Extra assembler options in quotes for Shovill
Type: string
--shovill_kmers K-mers to use
Type: string
--dragonflye_opts Extra assembler options in quotes for Dragonflye
Type: string
--trim Enable adaptor trimming
Type: boolean
--no_stitch Disable read stitching for paired-end reads
Type: boolean
--no_corr Disable post-assembly correction
Type: boolean
--unicycler_mode Bridging mode used by Unicycler
Type: string, Default: normal
--min_polish_size Contigs shorter than this value (bp) will not be polished using Pilon
Type: integer, Default: 10000
--min_component_size Graph components smaller than this size (bp) will be removed from the final graph
Type: integer, Default: 1000
--min_dead_end_size Graph dead ends smaller than this size (bp) will be removed from the final graph
Type: integer, Default: 1000
--nanohq For Flye, use --nano-hq instead of --nano-raw
Type: boolean
--medaka_model The model to use for Medaka polishing
Type: string
--medaka_rounds The number of Medaka polishing rounds to conduct
Type: integer
--racon_rounds The number of Racon polishing rounds to conduct
Type: integer, Default: 1
--no_polish Skip the assembly polishing step
Type: boolean
--no_miniasm Skip miniasm+Racon bridging
Type: boolean
--no_rotate Do not rotate completed replicons to start at a standard gene
Type: boolean
--reassemble If reads were simulated, they will be used to create a new assembly.
Type: boolean
--polypolish_rounds Number of polishing rounds to conduct with Polypolish for short read polishing
Type: integer, Default: 1
--pilon_rounds Number of polishing rounds to conduct with Pilon for short read polishing
Type: integer
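
If you launch runs from a script, the parameters above can be assembled programmatically. The sketch below only builds an argument list from a dictionary. `build_assembler_args` is a hypothetical helper, and it assumes that boolean parameters are passed as bare flags while all other parameters are passed as `--name value`.

```python
def build_assembler_args(params):
    """Turn {name: value} into a flat list of command-line arguments.

    Hypothetical helper; parameter names are taken from the table above.
    """
    args = []
    for name, value in params.items():
        if value is True:
            args.append(f"--{name}")  # boolean flags take no value
        elif value is False or value is None:
            continue  # unset flags are omitted
        else:
            args.extend([f"--{name}", str(value)])
    return args
```

For example, `build_assembler_args({"shovill_assembler": "skesa", "no_polish": True})` yields `["--shovill_assembler", "skesa", "--no_polish"]`, which can be appended to the rest of your command.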

Citations

If you use Bactopia and the assembler module in your analysis, please cite the following.