Today I am picking up an old but influential paper. Cited over 400 times, the paper “Reducing storage requirements for biological sequence comparison” by Roberts et al. (2004) has had considerable impact on the sequencing community. If you are using modern aligners, you relied on the ideas published in that paper.
One notable paper citing this reference is the publication of Minimap2. It is the first citation in its method section, and as such definitely worth a read.
The Core Idea
The core idea in the paper is that to compare sequences in the age of next generation sequencing (NGS) and large datasets one should use a smart way of reducing data to avoid having to compare each sequence to all other sequences.
To solve that issue the authors present the concept of minimizer.
Today I will implement that concept and see if I can identify meaningful minimizers.
Minimizers, what are they?
Minimizers are a relatively simple idea. The key problem they try to solve is to provide good seeds, meaning locations where two sequences are identical, to kick-start an alignment of those two sequences.
The most naive way of solving this would be to compute all k-mers of both sequences, find the common k-mers and try to align the sequences starting at each k-mer. But those k-mer databases would become huge.
So to reduce that solution space, instead of storing all k-mers, minimizers are the “smallest” k-mers in a window. As long as the sequences share stretches of identical nucleotides large enough, they will also share the odd “smallest” k-mer. In reality one could also store the “biggest” k-mer. It really does not matter as long as the index and the query operation computes the order of the k-mers the same way.
All that is left to do for each minimizer is to store its location in the sequence so that one can use it as a seed for an alignment if it is found in the query.
A Simple Implementation
import hashlib
from pydantic import BaseModel
class Kmer(BaseModel):
kmer: str
def __hash__(self):
# Hashing function I am using to sort k-mers
# Sorting lexicographically will lead to
# uninformative k-mers such as AAAA
return int(hashlib.md5(self.kmer.encode()).hexdigest(), 16)
def __len__(self):
return len(self.kmer)
def __str__(self):
return self.kmer
class Minimizer(BaseModel):
kmer: Kmer
sequence_id: str
position: int
def __hash__(self):
return int(
hashlib.md5(f"{self.kmer}{self.position}".encode()).hexdigest(), 16
)
def __lt__(self, other):
return hash(self.kmer) < hash(other.kmer)
def __eq__(self, other):
if not isinstance(other, Minimizer):
raise ValueError("Can only compare to other Minimizer")
return self.kmer == other.kmer and self.position == other.position
def __str__(self):
return f"'{self.kmer}' @ {self.sequence_id}: {self.position}"
minimizer = Minimizer(kmer=Kmer(kmer="ATCG"), sequence_id="seq_1", position=99)
These classes of Kmer and Minimizer implement the basic functionality that I need next, when I want to find the smallest Minimizer in a sequence. The way I implemented it, I can sort a list of Minimizers based on the hash.
kmers = ["A", "B", "C"]
minimizers = [
Minimizer(kmer=Kmer(kmer=kmer), position=position, sequence_id="1")
for position, kmer in enumerate(kmers)
]
sorted_minimizers = sorted(minimizers)
print(f"The k-mer with the lowest value is: '{sorted_minimizers[0].kmer}'")
The k-mer with the lowest value is: 'B'
That’s working well. Now I can get the Minimizers for a sequence along its windows and store them:
def windows(sequence: str, window_size: int):
for i in range(len(sequence) - window_size + 1):
yield i
def get_minimizer(
sequence_id: str, sequence: str, k: int, offset: int
) -> Minimizer:
all_kmers = []
for i in range(len(sequence) - k + 1):
kmer = sequence[i : i + k]
all_kmers.append(
Minimizer(
kmer=Kmer(kmer=kmer),
sequence_id=sequence_id,
position=offset + i,
)
)
all_kmers.sort()
return all_kmers[0]
def minimizers(
sequence_id: str, sequence: str, w: int, k: int, unique: bool = True
) -> list[dict]:
minimizers = []
for offset, start in enumerate(windows(sequence, w)):
window = sequence[start : start + w]
minimizers.append(get_minimizer(sequence_id, window, k, offset))
if unique:
minimizers = list(set(minimizers))
return minimizers
These basic functions is all I really need to get the Minimizers of a sequence given a window size
and a k
.
example_sequence = "Hello World, this is a sequence."
k = 7
window_size = 12
sequence_minimizers = minimizers(
sequence_id="seq_1",
sequence=example_sequence,
w=window_size,
k=k,
)
sequence_minimizers.sort(key=lambda x: x.position) # sort by position for displaying
print(f"Collected {len(sequence_minimizers)} minimizers")
print("The first three Minimizers:")
for i in range(3):
print(sequence_minimizers[i])
Collected 7 minimizers
The first three Minimizers:
'Hello W' @ seq_1: 0
'llo Wor' @ seq_1: 2
' World,' @ seq_1: 5
Now that I can get the Minimizers I can also find the common Minimizers between two sequences.
def visualize_minimizer(seq: str, minimizer: Minimizer) -> str:
before = seq[: minimizer.position].lower()
after = seq[minimizer.position + len(minimizer.kmer) :].lower()
return f"{before}{str(minimizer.kmer.kmer).upper()}{after}"
def compare_sequences(seq1: str, seq2: str, w: int, k: int) -> list[dict]:
minimizers1 = minimizers("seq_1", seq1, w, k)
minimizers2 = minimizers("seq_2", seq2, w, k)
minimizer_set1 = {m.kmer for m in minimizers1}
minimizer_set2 = {m.kmer for m in minimizers2}
common_minimizers = minimizer_set1.intersection(minimizer_set2)
# Prepare results
comparison_results = []
for minimizer in common_minimizers:
# find the substrings that match and show that
for i in [m for m in minimizers1 if m.kmer == minimizer]:
for j in [m for m in minimizers2 if m.kmer == minimizer]:
print(f"Common kmer: {minimizer.kmer}")
print(visualize_minimizer(seq1, i))
print(visualize_minimizer(seq2, j))
print("")
return comparison_results
sequence1 = "ATGCTAGCATACGCATCACGCATC"
sequence2 = "GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT"
w = 10 # Window size
k = 8 # k-mer size
comparison_results = compare_sequences(sequence1, sequence2, w, k)
Common kmer: ATACGCAT
atgctagcATACGCATcacgcatc
ggatcagctcgagcATACGCATacgcatcgcatcgat
Common kmer: ATACGCAT
atgctagcATACGCATcacgcatc
ggatcagctcgagcatacgcATACGCATcgcatcgat
Common kmer: TACGCATC
atgctagcaTACGCATCacgcatc
ggatcagctcgagcatacgcaTACGCATCgcatcgat
Common kmer: AGCATACG
atgctAGCATACGcatcacgcatc
ggatcagctcgAGCATACGcatacgcatcgcatcgat
This implementation shows how easy it is for this approach to find appropriate seed locations for starting a pairwise alignment.
The paper goes more into detail and also introduces the concept of End-minimizers. I can only recommend checking it out.
Thats all I have today. I hope it was interesting.