Bactopia Tool - checkm
The checkm module uses CheckM to assess the quality of microbial
genomes recovered from isolates, single cells, and metagenomes.
Example Usage
bactopia --wf checkm \
  --bactopia /path/to/your/bactopia/results \
  --include includes.txt
Output Overview
Below is the default output structure for the checkm tool. Where possible, the
file descriptions below were adapted from the tool's own description.
<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── tools
│       └── checkm
│           ├── <SAMPLE_NAME>-genes.aln
│           ├── <SAMPLE_NAME>-results.txt
│           ├── bins/
│           ├── lineage.ms
│           ├── logs
│           │   ├── checkm.log
│           │   ├── nf-checkm.{begin,err,log,out,run,sh,trace}
│           │   └── versions.yml
│           └── storage/
└── bactopia-runs
    └── checkm-<TIMESTAMP>
        ├── merged-results
        │   ├── checkm.tsv
        │   └── logs
        │       └── checkm-concat
        │           ├── nf-merged-results.{begin,err,log,out,run,sh,trace}
        │           └── versions.yml
        └── nf-reports
            ├── checkm-dag.dot
            ├── checkm-report.html
            ├── checkm-timeline.html
            └── checkm-trace.txt
Results
Merged Results
Below are results that are concatenated into a single file.
| Filename | Description | 
|---|---|
| checkm.tsv | A merged TSV file with checkm results from all samples | 
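As a minimal sketch of how the merged table might be used, the snippet below keeps samples that pass a completeness cutoff and a contamination cutoff. The column names `Completeness` and `Contamination`, the sample name sitting in the first column, the cutoffs, and the sample rows are all assumptions for illustration; check the header of your own checkm.tsv before relying on them.

```shell
#!/usr/bin/env sh
# Hypothetical stand-in for
# bactopia-runs/checkm-<TIMESTAMP>/merged-results/checkm.tsv.
# Column names and values are illustrative assumptions.
tsv=$(mktemp)
printf 'Bin Id\tCompleteness\tContamination\n' > "$tsv"
printf 'sample01\t99.50\t0.20\n' >> "$tsv"
printf 'sample02\t85.00\t7.10\n' >> "$tsv"
printf 'sample03\t97.20\t1.50\n' >> "$tsv"

# Usage: filter_checkm FILE MIN_COMPLETENESS MAX_CONTAMINATION
# Prints the first column of every row passing both cutoffs.
# Columns are located by header name, so their position does not matter.
filter_checkm() {
  awk -F'\t' -v min_comp="$2" -v max_cont="$3" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ($(col["Completeness"]) + 0 >= min_comp + 0) &&
    ($(col["Contamination"]) + 0 <= max_cont + 0) { print $1 }
  ' "$1"
}

passing=$(filter_checkm "$tsv" 95 5)
echo "$passing"
rm -f "$tsv"
```

Looking columns up by name rather than by position keeps the filter working if the merged file gains or reorders columns.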
CheckM
Below is a description of the per-sample results from CheckM.
| Filename | Description |
|---|---|
| <SAMPLE_NAME>-genes.aln | Alignment of multi-copy genes and their AAI identity |
| <SAMPLE_NAME>-results.txt | Final results of CheckM's lineage_wf |
| bins/ | A folder with inputs (e.g. proteins) for processing by CheckM | 
| lineage.ms | Output file describing marker set for each bin | 
| storage/ | A folder with intermediate results from CheckM processing | 
Audit Trail
Below are files that can assist you in understanding which parameters and program versions were used.
Logs
Each process that is executed will have a folder named logs. In this folder are helpful
files for you to review if the need ever arises.
| Extension | Description | 
|---|---|
| .begin | An empty file used to designate the process started | 
| .err | Contains STDERR outputs from the process | 
| .log | Contains both STDERR and STDOUT outputs from the process | 
| .out | Contains STDOUT outputs from the process | 
| .run | The script Nextflow uses to stage/unstage files and queue processes based on given profile | 
| .sh | The script executed by bash for the process | 
| .trace | The Nextflow Trace report for the process | 
| versions.yml | A YAML formatted file with program versions | 
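One way to put these files to use is a quick scan for samples whose CheckM process wrote anything to STDERR. The sketch below builds a throwaway directory that mimics the per-sample layout from the output overview; the sample names, the warning text, and the helper name `samples_with_stderr` are hypothetical. A non-empty .err file is not necessarily an error, so treat the result as a list of logs worth reading.

```shell
#!/usr/bin/env sh
# Throwaway directory mimicking <BACTOPIA_DIR>/<SAMPLE_NAME>/tools/checkm/logs.
# Sample names and the warning text are placeholders.
results=$(mktemp -d)
for sample in sample01 sample02; do
  mkdir -p "$results/$sample/tools/checkm/logs"
done
: > "$results/sample01/tools/checkm/logs/nf-checkm.err"
echo "placeholder warning" > "$results/sample02/tools/checkm/logs/nf-checkm.err"

# Print the names of samples whose nf-checkm.err is non-empty.
samples_with_stderr() {
  # -size +0 matches files that contain at least one byte.
  find "$1" -type f -name 'nf-checkm.err' -size +0 |
    while IFS= read -r err_file; do
      # Strip the fixed path suffix to recover the sample folder.
      sample_dir=${err_file%/tools/checkm/logs/nf-checkm.err}
      basename "$sample_dir"
    done | sort
}

flagged=$(samples_with_stderr "$results")
echo "$flagged"
rm -rf "$results"
```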
Nextflow Reports
These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.
| Filename | Description | 
|---|---|
| checkm-dag.dot | The Nextflow DAG visualisation | 
| checkm-report.html | The Nextflow Execution Report | 
| checkm-timeline.html | The Nextflow Timeline Report | 
| checkm-trace.txt | The Nextflow Trace report | 
Program Versions
At the end of each run, the versions.yml files are merged into the files below.
| Filename | Description | 
|---|---|
| software_versions.yml | A complete list of programs and versions used by each process | 
| software_versions_mqc.yml | A complete list of programs and versions formatted for MultiQC | 
Parameters
Required Parameters
Define where the pipeline should find input data and save output data.
| Parameter | Description |
|---|---|
| --bactopia | The path to Bactopia results to use as inputs. Type: string |
Filtering Parameters
Use these parameters to specify which samples to include or exclude.
| Parameter | Description |
|---|---|
| --include | A text file containing sample names (one per line) to include in the analysis. Type: string |
| --exclude | A text file containing sample names (one per line) to exclude from the analysis. Type: string |
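An include or exclude file is a plain-text list with one sample name per line. As a rough sketch, such a list could be generated from the sample folders of an existing run. The temporary directory, the sample names, and the helper name `list_samples` are placeholders, and the sketch assumes that every top-level folder other than bactopia-runs is a sample, as in the output structure shown earlier.

```shell
#!/usr/bin/env sh
# Throwaway directory that mimics a Bactopia results folder.
# Sample names here are placeholders, not real data.
results=$(mktemp -d)
mkdir -p "$results/sample01" "$results/sample02" "$results/bactopia-runs"

# Print one top-level folder name per line, skipping bactopia-runs,
# which holds run-level outputs rather than a sample.
list_samples() {
  for dir in "$1"/*/; do
    name=$(basename "$dir")
    [ "$name" = "bactopia-runs" ] || echo "$name"
  done
}

list_samples "$results" > "$results/includes.txt"
includes=$(cat "$results/includes.txt")
echo "$includes"
rm -rf "$results"
```

The resulting file can be trimmed by hand and then passed to --include or --exclude.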
CheckM Parameters
| Parameter | Description |
|---|---|
| --checkm_unique | Minimum number of unique phylogenetic markers required to use lineage-specific marker set. Type: integer, Default: 10 |
| --checkm_multi | Maximum number of multi-copy phylogenetic markers before defaulting to domain-level marker set. Type: integer, Default: 10 |
| --aai_strain | AAI threshold used to identify strain heterogeneity. Type: number, Default: 0.9 |
| --checkm_length | Percent overlap between target and query. Type: number, Default: 0.7 |
| --full_tree | Use the full tree (requires ~40GB of memory) for determining lineage of each bin. Type: boolean |
| --skip_pseudogene_correction | Skip identification and filtering of pseudogenes. Type: boolean |
| --ignore_thresholds | Ignore model-specific score thresholds. Type: boolean |
| --checkm_ali | Generate HMMER alignment file for each bin. Type: boolean |
| --checkm_nt | Generate nucleotide gene sequences for each bin. Type: boolean |
| --force_domain | Use domain-level sets for all bins. Type: boolean |
| --no_refinement | Do not perform lineage-specific marker set refinement. Type: boolean |
| --individual_markers | Treat markers as independent. Type: boolean |
| --skip_adj_correction | Do not exclude adjacent marker genes when estimating contamination. Type: boolean |
Optional Parameters
These optional parameters can be useful in certain settings.
| Parameter | Description |
|---|---|
| --outdir | Base directory to write results to. Type: string, Default: ./ |
| --run_name | Name of the directory to hold results. Type: string, Default: bactopia |
| --skip_compression | Output files will not be compressed. Type: boolean |
| --datasets | The path to cache datasets to. Type: string |
| --keep_all_files | Keeps all analysis files created. Type: boolean |
Max Job Request Parameters
Set the top limit for requested resources for any single job.
| Parameter | Description |
|---|---|
| --max_retry | Maximum times to retry a process before allowing it to fail. Type: integer, Default: 3 |
| --max_cpus | Maximum number of CPUs that can be requested for any single job. Type: integer, Default: 4 |
| --max_memory | Maximum amount of memory (in GB) that can be requested for any single job. Type: integer, Default: 32 |
| --max_time | Maximum amount of time (in minutes) that can be requested for any single job. Type: integer, Default: 120 |
| --max_downloads | Maximum number of samples to download at a time. Type: integer, Default: 3 |
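These limits interact with the CheckM options: --full_tree is documented as needing ~40GB of memory, which is above the default --max_memory of 32. The sketch below shows one way the two might be combined. The value 48, the `dry_run` variable, and the `launch_checkm` helper are illustrative assumptions, and whether raising the cap is sufficient depends on how the workflow requests memory, which is not described here.

```shell
#!/usr/bin/env sh
# With dry_run=echo the command is only printed.
# Set dry_run to an empty string to actually launch the workflow.
dry_run=echo

# Usage: launch_checkm /path/to/your/bactopia/results
launch_checkm() {
  # 48 is an illustrative cap chosen to sit above the ~40GB
  # that --full_tree is documented to require.
  $dry_run bactopia --wf checkm \
    --bactopia "$1" \
    --full_tree \
    --max_memory 48
}

launch_checkm /path/to/your/bactopia/results
```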
Nextflow Configuration Parameters
Parameters to fine-tune your Nextflow setup.
| Parameter | Description |
|---|---|
| --nfconfig | A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set. Type: string |
| --publish_dir_mode | Method used to save pipeline results to output directory. Type: string, Default: copy |
| --infodir | Directory to keep pipeline Nextflow logs and reports. Type: string, Default: ${params.outdir}/pipeline_info |
| --force | Nextflow will overwrite existing output files. Type: boolean |
| --cleanup_workdir | After Bactopia is successfully executed, the work directory will be deleted. Type: boolean |
Nextflow Profile Parameters
Parameters to fine-tune your Nextflow setup.
| Parameter | Description |
|---|---|
| --condadir | Directory Nextflow should use for Conda environments. Type: string |
| --registry | Docker registry to pull containers from. Type: string, Default: dockerhub |
| --datasets_cache | Directory where downloaded datasets should be stored. Type: string, Default: <BACTOPIA_DIR>/data/datasets |
| --singularity_cache | Directory where remote Singularity images are stored. Type: string |
| --singularity_pull_docker_container | Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Type: boolean |
| --force_rebuild | Force overwrite of existing pre-built environments. Type: boolean |
| --queue | Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM). Type: string, Default: general,high-memory |
| --cluster_opts | Additional options to pass to the executor (e.g. SLURM: '--account=my_acct_name'). Type: string |
| --disable_scratch | All intermediate files created on worker nodes will be transferred to the head node. Type: boolean |
Helpful Parameters
Uncommonly used parameters that might be useful.
| Parameter | Description |
|---|---|
| --monochrome_logs | Do not use coloured log outputs. Type: boolean |
| --nfdir | Print directory Nextflow has pulled Bactopia to. Type: boolean |
| --sleep_time | The amount of time (seconds) Nextflow will wait after setting up datasets before execution. Type: integer, Default: 5 |
| --validate_params | Boolean whether to validate parameters against the schema at runtime. Type: boolean, Default: True |
| --help | Display help text. Type: boolean |
| --wf | Specify which workflow or Bactopia Tool to execute. Type: string, Default: bactopia |
| --list_wfs | List the available workflows and Bactopia Tools to use with '--wf'. Type: boolean |
| --show_hidden_params | Show all params when using --help. Type: boolean |
| --help_all | An alias for --help --show_hidden_params. Type: boolean |
| --version | Display version text. Type: boolean |
Citations
If you use Bactopia and checkm in your analysis, please cite the following.
- Bactopia
  Petit III RA, Read TD. Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
- CheckM
  Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043–1055 (2015)
- csvtk
  Shen, W. csvtk: A cross-platform, efficient and practical CSV/TSV toolkit in Golang. (GitHub)