A scalable workflow for detecting differential genomic distances through rapid k-mer matching. Query sequences are analyzed by extracting prefix and suffix k-mers, which are then mapped across large genome collections to identify regions with unexpected distances between sequence boundaries.
- 1. Introduction
- 2. Dependencies
- 3. Installation
- 4. Usage
- 5. Citation
- 6. Issues
- 7. Changelog
- 8. License
- 9. Contacts
- 10. Acknowledgements
This workflow implements a scalable k-mer based approach for detecting differential distances in genomic sequences across massive genome collections. The method extracts prefix and suffix k-mers from query sequences and maps them to genome databases using BWA FastMap. By comparing the observed distances between k-mer pairs to their expected distances in the original query sequence, the pipeline identifies regions where sequence boundaries differ.
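The boundary k-mer idea can be illustrated with a minimal sketch. It assumes the prefix k-mer is the first k bases of the query, the suffix k-mer is the last k bases, and the expected distance is the offset between their start positions. The function name `boundary_kmers`, the example query, and the omission of the gap parameter are illustrative assumptions, not the workflow's actual code.

```python
from typing import NamedTuple


class BoundaryKmers(NamedTuple):
    prefix: str             # first k bases of the query
    suffix: str             # last k bases of the query
    expected_distance: int  # offset between the two k-mer start positions


def boundary_kmers(sequence: str, k: int) -> BoundaryKmers:
    """Return the prefix/suffix k-mers of a query and their expected spacing.

    Hypothetical helper: the gap parameter of the workflow is not modeled here.
    """
    seq = sequence.upper()
    if k <= 0:
        raise ValueError("k must be positive")
    if len(seq) < 2 * k:
        raise ValueError("query is too short for non-overlapping k-mers")
    return BoundaryKmers(seq[:k], seq[-k:], len(seq) - k)


query = "ATGGCGTACGTTAGCCTGAA"  # hypothetical 20 bp query
result = boundary_kmers(query, k=5)
print(result)
```

In a genome where the query is intact, the two k-mers would be expected to map `expected_distance` bases apart; a larger or smaller spacing is what the workflow flags as a differential distance.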
The approach scales to very large databases. In the paper that introduces it, we first apply it to AllTheBacteria (2.4M+ uniformly QC'd bacterial isolate genomes); however, you can supply any sequence data (e.g., metagenomic sequences).
Users provide:
- Genome collections as files-of-files (.txt) in the input/ directory, pointing to tar.xz compressed genome batches
- Query sequences as FASTA files (.ffn) representing the sequences to search for
The workflow then:
- Extracts prefix and suffix k-mers from each query sequence (configurable k-mer length and gap distance)
- Maps these k-mers to genome collections using BWA FastMap
- Calculates distances between prefix-suffix pairs within each genome
- Compares observed distances to the expected distance in the original query sequence
- Clusters genomes by similar distance patterns using DBSCAN
- Identifies sequences showing differential distances across the genome collection
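The distance comparison and clustering steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the hit positions and genome names are made up, the deviation is taken as the absolute spacing minus the expected spacing (ignoring strand), and `dbscan_1d` is a small textbook DBSCAN on scalar values written for this example. The workflow's own implementation and its DBSCAN parameters may differ.

```python
from typing import Dict, List, Optional, Tuple


def distance_deviation(prefix_pos: int, suffix_pos: int, expected: int) -> int:
    """Observed minus expected spacing between the prefix and suffix hits."""
    return abs(suffix_pos - prefix_pos) - expected


def dbscan_1d(values: List[float], eps: float, min_samples: int) -> List[int]:
    """Textbook DBSCAN on scalars; returns one label per value, -1 = noise."""
    n = len(values)
    # Each point counts as its own neighbour, so min_samples includes the point.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return [label if label is not None else -1 for label in labels]


expected = 900  # hypothetical expected distance for one query
hits: Dict[str, Tuple[int, int]] = {  # genome -> (prefix position, suffix position)
    "genome_A": (1000, 1900),
    "genome_B": (5000, 5902),
    "genome_C": (200, 1101),
    "genome_D": (3000, 5400),
    "genome_E": (700, 3105),
    "genome_F": (10, 9000),
}
deviations = {g: distance_deviation(p, s, expected) for g, (p, s) in hits.items()}
labels = dbscan_1d(list(deviations.values()), eps=10, min_samples=2)
for (genome, dev), label in zip(deviations.items(), labels):
    print(genome, dev, label)
```

With these made-up values, genomes whose spacing is close to the expected one fall into one cluster, genomes sharing a similar insertion size fall into another, and an isolated outlier is labeled as noise (-1).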
- Conda (unless the use of Conda is switched off in the configuration) and ideally also Mamba (>= 0.20.0)
- GNU Make
- Python (>=3.7)
- Snakemake (>=6.2.0)
These can be installed with Conda by

    conda install -c conda-forge -c bioconda -c defaults \
      make "python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"

Other dependencies are installed automatically by Snakemake when they are requested. The specifications of the individual environments can be found in workflow/envs/.

All dependencies across all protocols can also be installed at once by make conda.
Clone and enter the repository by

    git clone https://github.com/aryakaul/prefixsuffix-kmer
    cd prefixsuffix-kmer

Alternatively, the repository can also be installed using cURL by

    mkdir prefixsuffix-kmer
    cd prefixsuffix-kmer
    curl -L https://github.com/aryakaul/prefixsuffix-kmer/tarball/main \
      | tar xzvf - --strip-components=1
- Step 1: Provide lists of input genomes.
  For every batch, create a txt list of input genomes in the input/ directory (i.e., as input/{batch_name}.txt). Use either absolute paths (recommended) or paths relative to the root of the GitHub repository (not relative to the txt files). Such a list can be generated, for instance, with find:

      $ find ~/dir_with_my_genomes -name '*.tar.xz' > input/my_first_batch.txt

  The supported input format is a tar.xz-compressed collection of .fa files. For example,

      $ head -4 input/my_first_batch.txt
      ~/dir_with_my_genomes/staphylococcus_aureus__01.tar.xz
      ~/dir_with_my_genomes/staphylococcus_aureus__02.tar.xz
      ~/dir_with_my_genomes/escherichia_coli__01.tar.xz
      ~/dir_with_my_genomes/escherichia_coli__02.tar.xz
      $ tar -tf ~/dir_with_my_genomes/staphylococcus_aureus__01.tar.xz | head -4
      staphylococcus_aureus__01/SAMN001.fa
      staphylococcus_aureus__01/SAMN002.fa
      staphylococcus_aureus__01/SAMN003.fa
      staphylococcus_aureus__01/SAMN004.fa

  You can find different large collections of genomes in this format (including the 661k collection) here: MOF collections

- Step 2: Provide genes.
  The gene files should be named input/{genes}.ffn and should be in FASTA format. You can provide multiple files.

- Step 3 (optional): Adjust configuration.
  By editing config.yaml, it is possible to specify the value of k and other parameters.

- Step 4: Run the pipeline.
  Run the pipeline with make; this invokes Snakemake with the corresponding parameters.

- Step 5: Retrieve the output files.
  All output files will be located in output/.
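For users who prefer to prepare the input list from Python rather than with find, the following sketch writes a files-of-files list and peeks into one archive. The helper names `write_batch_list` and `list_fasta_members`, as well as the demo file names, are hypothetical; only the standard library is used, and the demo builds a tiny throwaway archive in a temporary directory so that it runs anywhere.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def write_batch_list(genome_dir: Path, batch_file: Path) -> List[Path]:
    """Write absolute paths of all .tar.xz archives under genome_dir, one per line."""
    archives = sorted(p.resolve() for p in genome_dir.rglob("*.tar.xz"))
    batch_file.parent.mkdir(parents=True, exist_ok=True)
    batch_file.write_text("".join(f"{p}\n" for p in archives))
    return archives


def list_fasta_members(archive: Path, limit: int = 4) -> List[str]:
    """Return up to `limit` .fa member names from a tar.xz archive."""
    names: List[str] = []
    with tarfile.open(archive, "r:xz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".fa"):
                names.append(member.name)
                if len(names) == limit:
                    break
    return names


# Self-contained demo on a throwaway directory (all names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    genome = root / "SAMN001.fa"
    genome.write_text(">contig1\nACGT\n")
    with tarfile.open(root / "demo_batch__01.tar.xz", "w:xz") as tar:
        tar.add(genome, arcname="demo_batch__01/SAMN001.fa")

    batch_file = root / "input" / "demo_batch.txt"
    archives = write_batch_list(root, batch_file)
    listed = batch_file.read_text().splitlines()
    members = list_fasta_members(archives[0])
    print(listed)
    print(members)
```

In a real run, `genome_dir` would point at the directory holding the genome batches and `batch_file` at input/{batch_name}.txt in the repository root.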
The workflow can be configured via the config.yaml file, and
all options are documented directly there. The configurable functionality includes:
- switching off Conda,
- k, the k-mer length used for prefix/suffix matching,
- g, the gap distance between the start and end of the gene.
The pipeline is executed via GNU Make, which handles all parameters and passes them to Snakemake.
Here's a list of all implemented commands (to be executed as make {command}):
######################
## General commands ##
######################
all Run everything
help Print help messages
conda Create the conda environments
clean Clean all output archives and files with statistics
cleanall Clean everything but Conda, Snakemake, and input files
cleanallall Clean completely everything
###############
## Reporting ##
###############
viewconf View configuration without comments
reports Create html report
####################
## For developers ##
####################
test Run the workflow on test data
format Reformat all source code
checkformat Check source code format

Tests can be run by make test.
If you use this workflow, please cite: Novel genes arise from genomic deletions across the bacterial tree of life
Please report problems via GitHub issues.
See Releases for the changelog.
The structure and format of this pipeline and its documentation were heavily inspired by and modeled after Miniphy. Check it out!
