Playing With Embeddings
Embeddings are very cool. Today I would like to share how I used embeddings and classical machine learning to bring order to my picture library. Like many other people, I use my phone camera quite liberally. I take pictures I want to keep, and I send pictures of price tags to friends and family for comparison. I document successful recipes and take pictures of documents as a digital copy to file away....
Processing Single-Cell data from Mouse
Single-cell RNA sequencing (scRNA-Seq) is a fascinating way of getting insights into the molecular processes guiding an individual cell. RNA won’t provide the full picture of the inner workings of a cell (proteins, hormones and nutrients have a say in that too), but it certainly is a part of the puzzle. For quite some time, researchers have had the ability to look not only at one cell, but at hundreds or thousands of cells....
Digging into the Human Lung Cell Atlas
The Human Cell Atlas project is growing and I want to know how to use it. The Human Lung Cell Atlas (HLCA) was published in 2023. It has ~2.3 million cells and is available for download (data.humancellatlas.org/hca-bio-networks/lung/atlases/lung-v1-0). In this notebook I will download the atlas and explore the data that is provided. For me this is a way to learn about a new resource I am not yet familiar with, but would like to understand better....
Minimizers are Just Fancy K-mers
Today I am picking up an old but influential paper. Cited over 400 times, the paper “Reducing storage requirements for biological sequence comparison” by Roberts et al. (2004) has had considerable impact on the sequencing community. If you are using modern aligners, you are relying on the ideas published in that paper. One notable paper citing it is the Minimap2 publication, where it is the first citation in the methods section, so it is definitely worth a read....
Monty Hall Problem: Win a Goat or a Car
The Monty Hall Problem has been living in my head rent-free since I saw the movie “21”. The problem is famous and quickly stated: In a game show there is one contestant and three doors. Behind one of the doors is a prize (a car); behind each of the other two there is no prize (a goat). The contestant can choose any door. After the contestant chooses their door, the game show host, knowing where the goats are, opens one of the two doors not picked by the contestant and reveals a goat....
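The setup above can be checked with a quick simulation. This is only a minimal sketch: it assumes the classic continuation of the problem, in which the contestant may either stay with their door or switch to the remaining closed one. The names `play` and `win_rate` are made up for this illustration.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one round and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Assumed rule: switching means taking the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch: bool, trials: int = 20_000, seed: int = 0) -> float:
    """Estimate the fraction of rounds won for a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

With enough rounds, the estimate for staying should settle near 1/3 and the estimate for switching near 2/3.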
Bloom Filter from Scratch
Today I want to implement a very simple Bloom filter from scratch, simply to explore what the challenges are and how far I can get. Bloom filters are cool probabilistic data structures that can quickly tell you if you’ve seen a datapoint before. They can return false positives but never false negatives. Wikipedia phrases it as: ‘in other words, a query returns either “possibly in set” or “definitely not in set”’, which is a very nice way of putting it....
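To make the “possibly in set” versus “definitely not in set” behaviour concrete, here is a minimal sketch of a Bloom filter. It is not necessarily the implementation from the post: the class name, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative sketch, not tuned)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive several hash functions by salting SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly in set"; False means "definitely not in set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("ACGT")
    print("ACGT" in seen)  # an added item is always reported as present
```

An item that was added always sets all of its bits, so a lookup for it can never fail. An item that was never added can still find all of its bits set by other items, which is where the false positives come from.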
Hi-C: Unraveling the 3D Structure of Genomes
Today I wanted to have a look at the first Hi-C paper. Hi-C is a technology that was published while I was studying, and it has always fascinated me. Through the power of linking DNA to proteins, a bit of digestion or fragmentation, ligation and sequencing, we can figure out the loops, twists and interactions on a whole-genome level. If you have not yet heard of Hi-C, you might want to check out the Wikipedia page real quick: https://en....