Today I am picking up an old but influential paper. Cited over 400 times, the paper “Reducing storage requirements for biological sequence comparison” by Roberts et al. (2004) has had considerable impact on the sequencing community. If you are using modern aligners, you relied on the ideas published in that paper.

One notable paper citing this reference is the publication of Minimap2. It is the first citation in its method section, and as such definitely worth a read.

The Core Idea

The core idea in the paper is that to compare sequences in the age of next generation sequencing (NGS) and large datasets one should use a smart way of reducing data to avoid having to compare each sequence to all other sequences.

To solve that issue the authors present the concept of minimizer.

Today I will implement that concept and see if I can identify meaningful minimizers.

Minimizers, what are they?

Minimizers are a relatively simple idea. The key problem they try to solve is to provide good seeds, meaning locations where two sequences are identical, to kick-start an alignment of those two sequences.

The most naive way of solving this would be to compute all k-mers of both sequences, find the common k-mers and try to align the sequences starting at each k-mer. But those k-mer databases would become huge.

So to reduce that solution space, instead of storing all k-mers, minimizers are the “smallest” k-mers in a window. As long as the sequences share stretches of identical nucleotides large enough, they will also share the odd “smallest” k-mer. In reality one could also store the “biggest” k-mer. It really does not matter as long as the index and the query operation computes the order of the k-mers the same way.

All that is left to do for each minimizer is to store its location in the sequence so that one can use it as a seed for an alignment if it is found in the query.

A Simple Implementation

import hashlib

from pydantic import BaseModel


class Kmer(BaseModel):
    kmer: str

    def __hash__(self):
        # Hashing function I am using to sort k-mers
        # Sorting lexicographically will lead to
        # uninformative k-mers such as AAAA
        return int(hashlib.md5(self.kmer.encode()).hexdigest(), 16)

    def __len__(self):
        return len(self.kmer)

    def __str__(self):
        return self.kmer


class Minimizer(BaseModel):
    kmer: Kmer
    sequence_id: str
    position: int

    def __hash__(self):
        return int(
            hashlib.md5(f"{self.kmer}{self.position}".encode()).hexdigest(), 16
        )

    def __lt__(self, other):
        return hash(self.kmer) < hash(other.kmer)

    def __eq__(self, other):
        if not isinstance(other, Minimizer):
            raise ValueError("Can only compare to other Minimizer")
        return self.kmer == other.kmer and self.position == other.position
    
    def __str__(self):
        return f"'{self.kmer}' @ {self.sequence_id}: {self.position}"


minimizer = Minimizer(kmer=Kmer(kmer="ATCG"), sequence_id="seq_1", position=99)

These classes of Kmer and Minimizer implement the basic functionality that I need next, when I want to find the smallest Minimizer in a sequence. The way I implemented it, I can sort a list of Minimizers based on the hash.

kmers = ["A", "B", "C"]
minimizers = [
    Minimizer(kmer=Kmer(kmer=kmer), position=position, sequence_id="1")
    for position, kmer in enumerate(kmers)
]
sorted_minimizers = sorted(minimizers)
print(f"The k-mer with the lowest value is: '{sorted_minimizers[0].kmer}'")
The k-mer with the lowest value is: 'B'

That’s working well. Now I can get the Minimizers for a sequence along its windows and store them:

def windows(sequence: str, window_size: int):
    for i in range(len(sequence) - window_size + 1):
        yield i


def get_minimizer(
    sequence_id: str, sequence: str, k: int, offset: int
) -> Minimizer:
    all_kmers = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i : i + k]
        all_kmers.append(
            Minimizer(
                kmer=Kmer(kmer=kmer),
                sequence_id=sequence_id,
                position=offset + i,
            )
        )
    all_kmers.sort()

    return all_kmers[0]


def minimizers(
    sequence_id: str, sequence: str, w: int, k: int, unique: bool = True
) -> list[dict]:
    minimizers = []

    for offset, start in enumerate(windows(sequence, w)):
        window = sequence[start : start + w]
        minimizers.append(get_minimizer(sequence_id, window, k, offset))

    if unique:
        minimizers = list(set(minimizers))
    return minimizers

These basic functions is all I really need to get the Minimizers of a sequence given a window size and a k.

example_sequence = "Hello World, this is a sequence."

k = 7
window_size = 12

sequence_minimizers = minimizers(
    sequence_id="seq_1",
    sequence=example_sequence,
    w=window_size,
    k=k,
)
sequence_minimizers.sort(key=lambda x: x.position) # sort by position for displaying

print(f"Collected {len(sequence_minimizers)} minimizers")
print("The first three Minimizers:")
for i in range(3):
    print(sequence_minimizers[i])
Collected 7 minimizers
The first three Minimizers:
'Hello W' @ seq_1: 0
'llo Wor' @ seq_1: 2
' World,' @ seq_1: 5

Now that I can get the Minimizers I can also find the common Minimizers between two sequences.

def visualize_minimizer(seq: str, minimizer: Minimizer) -> str:
    before = seq[: minimizer.position].lower()
    after = seq[minimizer.position + len(minimizer.kmer) :].lower()
    return f"{before}{str(minimizer.kmer.kmer).upper()}{after}"


def compare_sequences(seq1: str, seq2: str, w: int, k: int) -> list[dict]:
    minimizers1 = minimizers("seq_1", seq1, w, k)
    minimizers2 = minimizers("seq_2", seq2, w, k)

    minimizer_set1 = {m.kmer for m in minimizers1}
    minimizer_set2 = {m.kmer for m in minimizers2}

    common_minimizers = minimizer_set1.intersection(minimizer_set2)

    # Prepare results
    comparison_results = []
    for minimizer in common_minimizers:
        # find the substrings that match and show that
        for i in [m for m in minimizers1 if m.kmer == minimizer]:
            for j in [m for m in minimizers2 if m.kmer == minimizer]:
                print(f"Common kmer: {minimizer.kmer}")
                print(visualize_minimizer(seq1, i))
                print(visualize_minimizer(seq2, j))
                print("")

    return comparison_results


sequence1 = "ATGCTAGCATACGCATCACGCATC"
sequence2 = "GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT"

w = 10  # Window size
k = 8  # k-mer size
comparison_results = compare_sequences(sequence1, sequence2, w, k)
Common kmer: ATACGCAT
atgctagcATACGCATcacgcatc
ggatcagctcgagcATACGCATacgcatcgcatcgat

Common kmer: ATACGCAT
atgctagcATACGCATcacgcatc
ggatcagctcgagcatacgcATACGCATcgcatcgat

Common kmer: TACGCATC
atgctagcaTACGCATCacgcatc
ggatcagctcgagcatacgcaTACGCATCgcatcgat

Common kmer: AGCATACG
atgctAGCATACGcatcacgcatc
ggatcagctcgAGCATACGcatacgcatcgcatcgat

This implementation shows how easy it is for this approach to find appropriate seed locations for starting a pairwise alignment.

The paper goes more into detail and also introduces the concept of End-minimizers. I can only recommend checking it out.

Thats all I have today. I hope it was interesting.