Prefix-Suffix K-mer Matching

A scalable workflow for detecting differential genomic distances through rapid k-mer matching. Query sequences are analyzed by extracting prefix and suffix k-mers, which are then mapped across large genome collections to identify regions with unexpected distances between sequence boundaries.

1. Introduction

This workflow implements a scalable k-mer based approach for detecting differential distances in genomic sequences across massive genome collections. The method extracts prefix and suffix k-mers from query sequences and maps them to genome databases using BWA FastMap. By comparing the observed distances between k-mer pairs to their expected distances in the original query sequence, the pipeline identifies regions where sequence boundaries differ.

The approach scales to very large databases. In the paper where we introduce it, we first apply it to AllTheBacteria (2.4M+ uniformly QC'd bacterial isolate genomes); however, you can supply any sequence data (i.e. metagenomic sequences)

How it works

Users provide:

Genome collections as files-of-files (.txt) in the input/ directory, pointing to tar.xz compressed genome batches
Query sequences as FASTA files (.ffn) representing the sequences to search for

The workflow then:

Extracts prefix and suffix k-mers from each query sequence (configurable k-mer length and gap distance)
Maps these k-mers to genome collections using BWA fastmap
Calculates distances between prefix-suffix pairs within each genome
Compares observed distances to the expected distance in the original query sequence
Clusters genomes by similar distance patterns using DBSCAN
Identifies sequences showing differential distances across the genome collection

2. Dependencies

Conda (unless the use of Conda is switched off in the configuration) and ideally also Mamba (>= 0.20.0)
GNU Make
Python (>=3.7)
Snakemake (>=6.2.0)

These can be installed by Conda by

bash conda install -c conda-forge -c bioconda -c defaults \
  make "python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"

Other dependencies are installed automatically by Snakemake when they are requested. The specifications of individual environments can be found in workflow/envs/, and they contain:

All dependencies across all protocols can also be installed at once by make conda.

3. Installation

Clone and enter the repository by

git clone https://github.com/aryakaul/prefixsuffix-kmer
cd prefixsuffix-kmer

Alternatively, the repository can also be installed using cURL by

mkdir prefixsuffix-kmer
cd prefixsuffix-kmer
curl -L https://github.com/aryakaul/prefixsuffix-kmer/tarball/main \
    | tar xvf - --strip-components=1

4. Usage

4a. Basic example

Step 1: Provide lists of input genomes.
For every batch, create a txt list of input genomes in the input/ directory (i.e., as input/{batch_name}.txt. Use either absolute paths (recommended), or paths relative to the root of the Github repository (not relative to the txt files).

Such a list can be generated, for instance, by find by

$ find ~/dir_with_my_genomes -name '*.tar.gz' > input/my_first_batch.txt

The supported input file format is a collection of .fa files tar.xz compressed. For example,

$ head -4 input/my_first_batch.txt
~/dir_with_my_genomes/staphylococcus_aureus__01.tar.xz
~/dir_with_my_genomes/staphylococcus_aureus__02.tar.xz
~/dir_with_my_genomes/escherichia_coli__01.tar.xz
~/dir_with_my_genomes/escherichia_coli__02.tar.xz

$ tar -tf ~/dir_with_my_genomes/staphylococcus_aureus__01.tar.xz | head -4
staphylococcus_aureus__01/SAMN001.fa
staphylococcus_aureus__01/SAMN002.fa
staphylococcus_aureus__01/SAMN003.fa
staphylococcus_aureus__01/SAMN004.fa

You can find different large collections of genomes in this format (including the 661k collection) here: MOF collections

Step 2: Provide genes.
The gene files should be named input/{genes}.ffn, and should be in FASTA format. You can provide multiple files.
Step 3 (optional): Adjust configuration.
By editing config.yaml it is possible to specify value of k and other parameters.
Step 4: Run the pipeline.
Run the pipeline by make; this is run by Snakemake with the corresponding parameters.
Step 5: Retrieve the output files.
All output files will be located in output/.

4b. Adjusting configuration

The workflow can be configured via the config.yaml file, and all options are documented directly there. The configurable functionality includes:

switching off Conda,
k for prefix/suffix k-mer matching
g for gap distance between start and end of the gene

4c. List of workflow commands

The pipeline is executed via GNU Make, which handles all parameters and passes them to Snakemake. Here's a list of all implemented commands (to be executed as make {command}):

######################
## General commands ##
######################
    all                  Run everything
    help                 Print help messages
    conda                Create the conda environments
    clean                Clean all output archives and files with statistics
    cleanall             Clean everything but Conda, Snakemake, and input files
    cleanallall          Clean completely everything
###############
## Reporting ##
###############
    viewconf             View configuration without comments
    reports              Create html report
####################
## For developers ##
####################
    test                 Run the workflow on test data
    format               Reformat all source code
    checkformat          Check source code format

4d. Troubleshooting

Tests can be run by make test.

5. Citation

If you use this workflow, please cite: Novel genes arise from genomic deletions across the bacterial tree of life

6. Issues

Please use Github issues.

7. Changelog

See Releases.

8. License

GPL3

9. Contacts

10. Acknowledgements

Structure and format for this pipeline, and documentation was heavily inspired and modeled after Miniphy! Check it out!

Name		Name	Last commit message	Last commit date
Latest commit History 158 Commits
.conda		.conda
.github/workflows		.github/workflows
.test		.test
assets		assets
post-process		post-process
slurmprofile		slurmprofile
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config.yaml		config.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prefix-Suffix K-mer Matching

Contents

1. Introduction

How it works

2. Dependencies

3. Installation

4. Usage

4a. Basic example

4b. Adjusting configuration

4c. List of workflow commands

4d. Troubleshooting

5. Citation

6. Issues

7. Changelog

8. License

9. Contacts

10. Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Prefix-Suffix K-mer Matching

Contents

1. Introduction

How it works

2. Dependencies

3. Installation

4. Usage

4a. Basic example

4b. Adjusting configuration

4c. List of workflow commands

4d. Troubleshooting

5. Citation

6. Issues

7. Changelog

8. License

9. Contacts

10. Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages