Gather
The main purpose of the gather
step is to get all the samples into a single place. This
includes downloading samples from ENA/SRA or NCBI Assembly. The tools used are:
Tool | Description |
---|---|
art | For simulating error-free reads for an input assembly |
fastq-dl | Downloading FASTQ files from ENA/SRA |
ncbi-genome-download | Downloading FASTA files from NCBI Assembly |
This gather
step also does basic QC checks to help prevent downstream failures.
Output Overview¶
Below is the default output structure for the gather
step in Bactopia. Where
possible the file descriptions below were modified from a tools description.
<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│ └── main
│ └── gather
│ ├── logs
│ │ ├── nf-gather.{begin,err,log,out,run,sh,trace}
│ │ └── versions.yml
│ ├── <SAMPLE_NAME>-gzip-error.txt
│ ├── <SAMPLE_NAME>-low-basepair-proportion-error.txt
│ ├── <SAMPLE_NAME>-low-read-count-error.txt
│ ├── <SAMPLE_NAME>-low-sequence-depth-error.txt
│ └── <SAMPLE_NAME>-meta.tsv
└── bactopia-runs
└── bactopia-<TIMESTAMP>
├── merged-results
│ ├── logs
│ │ └── meta-concat
│ │ ├── nf-merged-results.{begin,err,log,out,run,sh,trace}
│ │ └── versions.yml
│ └── meta.tsv
└── nf-reports
├── bactopia-dag.dot
├── bactopia-report.html
├── bactopia-timeline.html
└── bactopia-trace.txt
Results¶
Merged Results¶
Below are results that are concatenated into a single file.
Filename | Description |
---|---|
meta.tsv | A tab-delimited file with bactopia metadata for all samples |
gather¶
Below is a description of the per-sample results from the gather
subworkflow.
Extension | Description |
---|---|
-meta.tsv | A tab-delimited file with bactopia metadata for each sample |
Failed Quality Checks¶
Built into Bactopia are few basic quality checks to help prevent downstream failures. If a sample fails one of these checks, it will be excluded from further analysis. By excluding these samples, complete pipeline failures are prevented.
Extension | Description |
---|---|
-gzip-error.txt | Sample failed Gzip checks and excluded from further analysis |
-low-basepair-proportion-error.txt | Sample failed basepair proportion checks and excluded from further analysis |
-low-read-count-error.txt | Sample failed read count checks and excluded from further analysis |
-low-sequence-depth-error.txt | Sample failed sequenced basepair checks and excluded from further analysis |
Poor samples are excluded to prevent downstream failures
Samples that fail any of the QC checks will be excluded from further analysis.
Those samples will generate a *-error.txt
file with the error message. Excluding
these samples prevents downstream failures that cause the whole workflow to fail.
Example Error: Input FASTQ(s) failed Gzip checks
If input FASTQ(s) fail to pass Gzip test, the sample will be excluded from further analysis.
Example Text from <SAMPLE_NAME>-gzip-error.txt
<SAMPLE_NAME> FASTQs failed Gzip tests. Please check the input FASTQs. Further
analysis is discontinued.
Example Error: Input FASTQs have disproportionate number of reads
If input FASTQ(s) for a sample have disproportionately different number of reads
between the two pairs, the sample will be excluded from further analysis. You can
adjust this minimum read count using the --min_proportion
parameter.
Example Text from <SAMPLE_NAME>-low-basepair-proportion-error.txt
<SAMPLE_NAME> FASTQs failed to meet the minimum shared basepairs (X``). They
shared
Ybasepairs, with R1 having
Abp and R2 having
B` bp. Further
analysis is discontinued.
Example Error: Input FASTQ(s) has too few reads
If input FASTQ(s) for a sample have less than the minimum required reads, the
sample will be excluded from further analysis. You can adjust this minimum read
count using the --min_reads
parameter.
Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X
total reads. This does not exceed the required
minimum Y
read count. Further analysis is discontinued.
Example Error: Input FASTQ(s) has too little sequenced basepairs
If input FASTQ(s) for a sample fails to meet the minimum number of sequenced
basepairs, the sample will be excluded from further analysis. You can
adjust this minimum read count using the --min_basepairs
parameter.
Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X
total basepairs. This does not exceed the
required minimum Y
bp. Further analysis is discontinued.
Audit Trail¶
Below are files that can assist you in understanding which parameters and program versions were used.
Logs¶
Each process that is executed will have a folder named logs
. In this folder are helpful
files for you to review if the need ever arises.
Extension | Description |
---|---|
.begin | An empty file used to designate the process started |
.err | Contains STDERR outputs from the process |
.log | Contains both STDERR and STDOUT outputs from the process |
.out | Contains STDOUT outputs from the process |
.run | The script Nextflow uses to stage/unstage files and queue processes based on given profile |
.sh | The script executed by bash for the process |
.trace | The Nextflow Trace report for the process |
versions.yml | A YAML formatted file with program versions |
Parameters¶
Gather¶
Parameter | Description |
---|---|
--skip_fastq_check |
Skip minimum requirement checks for input FASTQs Type: boolean |
--min_basepairs |
The minimum amount of basepairs required to continue downstream analyses. Type: integer , Default: 2241820 |
--min_reads |
The minimum amount of reads required to continue downstream analyses. Type: integer , Default: 7472 |
--min_coverage |
The minimum amount of coverage required to continue downstream analyses. Type: integer , Default: 10 |
--min_proportion |
The minimum proportion of basepairs for paired-end reads to continue downstream analyses. Type: number , Default: 0.5 |
--min_genome_size |
The minimum estimated genome size allowed for the input sequence to continue downstream analyses. Type: integer , Default: 100000 |
--max_genome_size |
The maximum estimated genome size allowed for the input sequence to continue downstream analyses. Type: integer , Default: 18040666 |
--attempts |
Maximum times to attempt downloads Type: integer , Default: 3 |
--use_ena |
Download FASTQs from ENA Type: boolean |
--no_cache |
Skip caching the assembly summary file from ncbi-genome-download Type: boolean |
Citations¶
If you use Bactopia and gather
in your analysis, please cite the following.
-
Bactopia
Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020) -
ART
Huang W, Li L, Myers JR, Marth GT ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012) -
fastq-dl
Petit III RA fastq-dl: Download FASTQ files from SRA or ENA repositories. (GitHub) -
fastq-scan
Petit III RA fastq-scan: generate summary statistics of input FASTQ sequences. (GitHub) -
ncbi-genome-download
Blin K ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers (GitHub) -
Pigz
Adler M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015)