Bloom Filter from Scratch

Today I want to implement a very simple bloom filter from scratch. Simply to explore what the challenges are and how far I can get.

Bloom filters are cool probabilistic data structures that can quickly tell you if you’ve seen a datapoint before. It has false positives but no false negatives. Wikipedia phrases it as: ‘in other words, a query returns either “possibly in set” or “definitely not in set”.’, which is a very nice way of phrasing it.

In the era of machine learning bloom filters could for example be used to deduplicate large datasets. In the case of large protein databases for example, while one will lose some unique protein sequences, it is guaranteed that the final set of protein sequences is unique and the memory demand of a bloom filter is constant.This means that even the largest datasets can be processed with consistent memory usage.

As such, bloom filters are a useful tool that should not be missing from any bioinformatician’s toolkit.

import hashlib
import math


class BloomFilter:
    def __init__(
        self,
        maximum_elements: int,
        false_positive: float = 0.001,
    ):
        # Compare https://en.wikipedia.org/wiki/Bloom_filter
        # Calculate the size of the bit array (m)
        n = maximum_elements
        e = false_positive
        m = -(n * math.log(e)) / (math.log(2) ** 2)
        # Calculate the number of hash functions (k)
        k = math.ceil((m / n) * math.log(2))

        self.size = math.ceil(m)
        self.hash_count = math.ceil(k)

        if self.size < 0:
            raise ValueError("Bloom filter too small: {self.size}")

        if self.hash_count < 1:
            raise ValueError(f"Too few hashes: {self.hash_count}")

        # My "bit array" is a list of booleans
        self.bit_array = [False] * self.size

    def _hashes(self, element: str):
        for i in range(self.hash_count):
            hash = (
                int(
                    hashlib.sha256(f"{i}_{element}".encode()).hexdigest(),
                    16,
                )
                % self.size
            )
            yield hash

    def add(self, element: str):
        for index in self._hashes(element):
            self.bit_array[index] = True

    def contains(self, element: str):
        for index in self._hashes(element):
            if self.bit_array[index] is False:
                return False
        return True

This is a very simple implementation. There is a bit of math behind calculating the size of the bit-array (here simply a list of booleans) from the maximum number of expected elements and the tolerated false positive rate. But besides this, the implementation is very simple.

To check if a word has been seen before, one checks if all hashes derived from that word have a positive bit in the bit array. If a single bit is not true, one can be certain that this word has never been seen this before. If all are true, one might have seen the word before, but maybe not.

My implementation can be speed up by using the ‘bitarray’ library of python and by using a faster hashing function. But for understanding how Bloom filters work I prefer this simple approach.

Let’s see this code in action. I am adding a few words to a small bloom filter and show that words I added are known, while new words are not known:

bf = BloomFilter(maximum_elements=1000, false_positive=0.001)

words = [
    "Hello",
    "World",
    "this",
    "is",
    "a",
    "Bloom filter",
]
for word in words:
    bf.add(word)

for word in ["Bloom filter", "volcano"]:
    if bf.contains(word):
        print(f"I might have seen '{word}' before")
    else:
        print(f"I am sure I have never seen '{word}' before")

I might have seen 'Bloom filter' before
I am sure I have never seen 'volcano' before

I want to see how the bit array gets filled up. So here I am adding values and after each value I the new bit array. For simplicity I choose the size in such a way that I have a bit array of length 19 and use 5 hash functions.

def print_bitarray(bf: BloomFilter):
    visualization = "".join(["X" if x else "_" for x in bf.bit_array])
    print(f"{sum(bf.bit_array):>3} {visualization}")


bf = BloomFilter(maximum_elements=5, false_positive=0.05)

print(f"Bit array is {len(bf.bit_array)} positions long")
print(
    f"Each word will flip {bf.hash_count} positions to true if not already true"
)

# Adding the elements
print_bitarray(bf)
for i in range(5):
    bf.add(f"Word_{str(i)}")
    print_bitarray(bf)

Bit array is 32 positions long
Each word will flip 5 positions to true if not already true
  0 ________________________________
  5 __X_____________XX_____XX_______
 10 _XX_____________XX___X_XX__X_XX_
 12 _XX__X__________XX___X_XX__XXXX_
 16 _XX__X___X__X___XX_X_X_XX__XXXXX
 19 _XX_XX___X_XX_X_XX_X_X_XX__XXXXX

In this example it is nicely visible that bits are flipped to true (X) and once flipped, are never flipped back. This also provides an intuitive understanding as to why one has to specify the expected number of elements during the creation of the Bloom filter.

That’s all I have for today. There are several advanced uses of Bloom filters in bioinformatics, it’s fun to read the Bloom filters in Bioinformatics Wikipedia page: https://en.wikipedia.org/wiki/Bloom_filters_in_bioinformatics