Gather

The main purpose of the gather step is to get all the samples into a single place. This includes downloading samples from ENA/SRA or NCBI Assembly. The tools used are:

Tool Description
art For simulating error-free reads for an input assembly
fastq-dl Downloading FASTQ files from ENA/SRA
ncbi-genome-download Downloading FASTA files from NCBI Assembly

This gather step also does basic QC checks to help prevent downstream failures.

Output Overview

Below is the default output structure for the gather step in Bactopia. Where possible, the file descriptions below were adapted from each tool's own description.

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── main
│       └── gather
│           ├── logs
│           │   ├── nf-gather.{begin,err,log,out,run,sh,trace}
│           │   └── versions.yml
│           ├── <SAMPLE_NAME>-gzip-error.txt
│           ├── <SAMPLE_NAME>-low-basepair-proportion-error.txt
│           ├── <SAMPLE_NAME>-low-read-count-error.txt
│           ├── <SAMPLE_NAME>-low-sequence-depth-error.txt
│           └── <SAMPLE_NAME>-meta.tsv
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        ├── merged-results
        │   ├── logs
        │   │   └── meta-concat
        │   │       ├── nf-merged-results.{begin,err,log,out,run,sh,trace}
        │   │       └── versions.yml
        │   └── meta.tsv
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            ├── bactopia-timeline.html
            └── bactopia-trace.txt

Results

Merged Results

Below are results that are concatenated into a single file.

Filename Description
meta.tsv A tab-delimited file with bactopia metadata for all samples

gather

Below is a description of the per-sample results from the gather subworkflow.

Extension Description
-meta.tsv A tab-delimited file with bactopia metadata for each sample
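
As a rough illustration of how the per-sample files relate to the merged meta.tsv, the sketch below reads every <SAMPLE_NAME>-meta.tsv under a Bactopia directory using only the Python standard library. The function names are hypothetical, the directory layout is taken from the output overview above, and no particular column names are assumed since they are not listed here.

```python
import csv
from pathlib import Path


def read_meta(path):
    """Read one tab-delimited meta.tsv file into a list of dicts (one per row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def collect_sample_meta(bactopia_dir):
    """Collect rows from every <SAMPLE_NAME>/main/gather/<SAMPLE_NAME>-meta.tsv.

    The glob pattern follows the default output structure; the merged
    meta.tsv under bactopia-runs is deliberately not matched.
    """
    rows = []
    for meta in sorted(Path(bactopia_dir).glob("*/main/gather/*-meta.tsv")):
        rows.extend(read_meta(meta))
    return rows
```

Calling collect_sample_meta("<BACTOPIA_DIR>") returns one dictionary per sample row, keyed by whatever header the files contain.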

Failed Quality Checks

Built into Bactopia are a few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. Excluding these samples prevents a single poor sample from failing the whole pipeline.

Extension Description
-gzip-error.txt Sample failed Gzip checks and excluded from further analysis
-low-basepair-proportion-error.txt Sample failed basepair proportion checks and excluded from further analysis
-low-read-count-error.txt Sample failed read count checks and excluded from further analysis
-low-sequence-depth-error.txt Sample failed sequenced basepair checks and excluded from further analysis

Poor samples are excluded to prevent downstream failures

Samples that fail any of the QC checks will be excluded from further analysis. Those samples will generate a *-error.txt file with the error message. Excluding these samples prevents downstream failures that cause the whole workflow to fail.
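
To see which samples were excluded after a run, you could scan the output directory for *-error.txt files. The following is a minimal sketch, assuming the default output structure shown earlier; find_failed_samples is a hypothetical helper, not part of Bactopia.

```python
from pathlib import Path


def find_failed_samples(bactopia_dir):
    """Map each failed sample to {check name: error message}.

    Assumes error files live at <SAMPLE_NAME>/main/gather/<SAMPLE_NAME>-<check>-error.txt,
    as in the default output structure.
    """
    failures = {}
    for error_file in sorted(Path(bactopia_dir).glob("*/main/gather/*-error.txt")):
        # parents[0] is gather, parents[1] is main, parents[2] is the sample folder
        sample = error_file.parents[2].name
        # Strip the "<SAMPLE_NAME>-" prefix and the "-error.txt" suffix
        check = error_file.name[len(sample) + 1 : -len("-error.txt")]
        failures.setdefault(sample, {})[check] = error_file.read_text().strip()
    return failures
```

A sample that passed every check simply does not appear in the returned dictionary.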

Example Error: Input FASTQ(s) failed Gzip checks

If the input FASTQ(s) fail the Gzip test, the sample will be excluded from further analysis.

Example Text from <SAMPLE_NAME>-gzip-error.txt
<SAMPLE_NAME> FASTQs failed Gzip tests. Please check the input FASTQs. Further analysis is discontinued.

Example Error: Input FASTQs have a disproportionate number of reads

If the paired-end FASTQs for a sample differ disproportionately between R1 and R2, the sample will be excluded from further analysis. You can adjust the minimum allowed proportion using the --min_proportion parameter.

Example Text from <SAMPLE_NAME>-low-basepair-proportion-error.txt
<SAMPLE_NAME> FASTQs failed to meet the minimum shared basepairs (X). They shared Y basepairs, with R1 having A bp and R2 having B bp. Further analysis is discontinued.

Example Error: Input FASTQ(s) have too few reads

If the input FASTQ(s) for a sample have fewer than the minimum required reads, the sample will be excluded from further analysis. You can adjust this minimum read count using the --min_reads parameter.

Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required minimum Y read count. Further analysis is discontinued.

Example Error: Input FASTQ(s) have too few sequenced basepairs

If the input FASTQ(s) for a sample fail to meet the minimum number of sequenced basepairs, the sample will be excluded from further analysis. You can adjust this minimum basepair count using the --min_basepairs parameter.

Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp. Further analysis is discontinued.
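
The threshold checks described above can be sketched roughly as follows. This is not Bactopia's implementation: the function name, the use of strict less-than comparisons, the summing of reads and basepairs across R1 and R2, and the proportion being computed as smaller-over-larger basepair count are all assumptions made for illustration. Only the default values come from the parameter list in this document.

```python
def failed_checks(r1_reads, r1_bp, r2_reads=0, r2_bp=0, paired=False,
                  min_reads=7472, min_basepairs=2241820, min_proportion=0.5):
    """Return the names of the checks a sample would fail (empty list = pass).

    Hypothetical helper; the comparison rules are assumptions, see lead-in.
    """
    failures = []
    total_reads = r1_reads + r2_reads
    total_bp = r1_bp + r2_bp

    if total_reads < min_reads:
        failures.append("low-read-count")
    if total_bp < min_basepairs:
        failures.append("low-sequence-depth")
    if paired:
        # Assumed definition: ratio of the smaller mate file to the larger one
        larger = max(r1_bp, r2_bp)
        proportion = min(r1_bp, r2_bp) / larger if larger else 0.0
        if proportion < min_proportion:
            failures.append("low-basepair-proportion")
    return failures
```

For example, a paired-end sample with 5,000 reads and 750,000 bp in each file would pass the read count and proportion checks under these assumptions, but fall short of the default --min_basepairs value.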

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension Description
.begin An empty file used to designate the process started
.err Contains STDERR outputs from the process
.log Contains both STDERR and STDOUT outputs from the process
.out Contains STDOUT outputs from the process
.run The script Nextflow uses to stage/unstage files and queue processes based on the given profile
.sh The script executed by bash for the process
.trace The Nextflow Trace report for the process
versions.yml A YAML-formatted file with program versions

Parameters

Gather

Parameter Description
--skip_fastq_check Skip minimum requirement checks for input FASTQs
Type: boolean
--min_basepairs The minimum number of basepairs required to continue downstream analyses.
Type: integer, Default: 2241820
--min_reads The minimum number of reads required to continue downstream analyses.
Type: integer, Default: 7472
--min_coverage The minimum amount of coverage required to continue downstream analyses.
Type: integer, Default: 10
--min_proportion The minimum proportion of basepairs for paired-end reads to continue downstream analyses.
Type: number, Default: 0.5
--min_genome_size The minimum estimated genome size allowed for the input sequence to continue downstream analyses.
Type: integer, Default: 100000
--max_genome_size The maximum estimated genome size allowed for the input sequence to continue downstream analyses.
Type: integer, Default: 18040666
--attempts Maximum times to attempt downloads
Type: integer, Default: 3
--use_ena Download FASTQs from ENA
Type: boolean
--no_cache Skip caching the assembly summary file from ncbi-genome-download
Type: boolean
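
If you drive Bactopia from a script, the gather parameters above can be turned into command-line arguments with a small helper. The sketch below is illustrative only: gather_args is a hypothetical function, and it assumes boolean parameters are passed as bare flags while the others take a single value. It produces only the gather-related portion of a command, not a complete Bactopia invocation.

```python
# Parameter names copied from the table above; True marks boolean flags.
GATHER_PARAMS = {
    "skip_fastq_check": True,
    "min_basepairs": False,
    "min_reads": False,
    "min_coverage": False,
    "min_proportion": False,
    "min_genome_size": False,
    "max_genome_size": False,
    "attempts": False,
    "use_ena": True,
    "no_cache": True,
}


def gather_args(options):
    """Convert {parameter: value} into a list of command-line arguments."""
    args = []
    for name, value in options.items():
        if name not in GATHER_PARAMS:
            raise ValueError(f"Unknown gather parameter: {name}")
        if GATHER_PARAMS[name]:
            # Boolean parameter: emit the bare flag only when enabled
            if value:
                args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args
```

The resulting list can be appended to whatever base command you already use to launch a run.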

Citations

If you use Bactopia and gather in your analysis, please cite the following.