In the past I explored what I can do with image embeddings and used them to train a very usable set of classifiers that sorts random photos and nature photos out of my camera roll. If you want to read about that, you can find the blog post here: openpaul.github.io/posts/2025-04-06-image-sorting, and here is a small intro to embeddings: openpaul.github.io/posts/2024-09-28-image-embeddings/

Recently I became interested in detecting faces and identifying people in my photos, locally. Apps such as immich support this, and if I just wanted to detect faces and sort my pictures it would be my go-to app. But I want to play around and understand what is going on.

So I had a look: immich uses insightface. It’s open source and a convenient way to get your hands on face detection pipelines and models.

So, as always, I first make a quick Python setup for this project. Use conda or uv as you see fit:

uv venv -p 3.13  # use 3.13; 3.12 led to some install issues for me
source .venv/bin/activate
uv pip install \
   jupyter \
   insightface==0.7.3 \
   hdbscan==0.8.40 \
   scikit-learn==1.7.1 \
   plotnine \
   onnxruntime \
   pillow \
   p9customtheme \
   umap-learn==0.5.9.post2

Now, of course, the next step is to simply test a very basic face detection pipeline. Luckily this is very easy with insightface, but first I download and extract a face dataset so we have some images to work with:

# Face Clustering Demo - Setup and Data Download
import shutil
import subprocess
import tempfile
from pathlib import Path


def get_testdataset(folder: Path = Path("./face_detection/testdata")) -> Path:
    test_dataset_url = "https://www.kaggle.com/api/v1/datasets/download/olgabelitskaya/yale-face-database"
    folder.mkdir(parents=True, exist_ok=True)

    if folder.exists() and any(folder.iterdir()):
        return folder

    if shutil.which("curl") is None:
        raise RuntimeError("curl is not installed")
    if shutil.which("unzip") is None:
        raise RuntimeError("unzip is not installed")

    with tempfile.TemporaryDirectory() as tmpdir:
        zip_path = Path(tmpdir) / "yale_face_database.zip"
        subprocess.run(
            ["curl", "-L", "-o", str(zip_path), test_dataset_url], check=True
        )
        subprocess.run(
            ["unzip", "-o", str(zip_path), "-d", str(folder)], check=True
        )
    return folder


dataset_path = get_testdataset()
image_files = [
    f
    for f in Path(dataset_path).glob("subject*")
    if f.is_file() and f.name.startswith("subject")
]
print(f"Found {len(image_files)} images in dataset at {dataset_path}")
Found 165 images in dataset at face_detection/testdata

The pictures in that dataset show faces with different expressions and under different lighting, but overall they are very easy to work with, as we will see in a moment.

Next we set up the insightface analysis workflow, which takes only a few lines of code.

import numpy as np
from insightface.app import FaceAnalysis
from PIL import Image

face_pipeline = FaceAnalysis(
    name="buffalo_l",
    providers=["CPUExecutionProvider"],
)
face_pipeline.prepare(ctx_id=0, det_size=(640, 640))
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /home/paul/.insightface/models/buffalo_l/1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /home/paul/.insightface/models/buffalo_l/2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /home/paul/.insightface/models/buffalo_l/det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /home/paul/.insightface/models/buffalo_l/genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /home/paul/.insightface/models/buffalo_l/w600k_r50.onnx recognition ['None', 3, 112, 112] 127.5 127.5
set det-size: (640, 640)

This is all we need to set up a state-of-the-art face recognition pipeline. Let’s run it on a photo. The photo we use is called “Malala Yousafzai with Kamala Harris.jpg” and was obtained from “United States Senate - The Office of Kamala Harris, Public domain, via Wikimedia Commons” (commons.wikimedia.org/wiki/File:Malala_Yousafzai_with_Kamala_Harris.jpg)

from io import BytesIO

import matplotlib.pyplot as plt
import requests
from matplotlib import patches

image_url = "https://upload.wikimedia.org/wikipedia/commons/7/7b/Malala_Yousafzai_with_Kamala_Harris.jpg"


def image_from_url(url: str) -> np.ndarray:
    try:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        raise RuntimeError(f"Failed to download image from {url}: {e}")
    img = Image.open(BytesIO(response.content)).convert("RGB")

    return np.array(img)


img = image_from_url(image_url)
faces = face_pipeline.get(img)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Original image
axes[0].imshow(img)
axes[0].set_title("Original")
axes[0].axis("off")

# Annotated image
axes[1].imshow(img)
for face in faces:
    box = face.bbox.astype(int)
    rect = patches.Rectangle(
        (box[0], box[1]),
        box[2] - box[0],
        box[3] - box[1],
        linewidth=2,
        edgecolor="#23890F",
        facecolor="none",
    )
    axes[1].add_patch(rect)
    label = f"Score: {face.det_score:.2f}\nAge: {face.age:.1f}"
    axes[1].text(
        box[0], box[1] - 5, label, color="black", fontsize=6, weight="bold"
    )

axes[1].set_title("Annotated")
axes[1].axis("off")

plt.tight_layout()
plt.show()

[Figure: original photo next to the annotated version with bounding boxes, detection scores and age estimates]

That was easy, and we got a lot of information: we found six faces, and for each we got a confidence score, a bounding box and an age estimate. We also got a face embedding, which is key for comparing faces across photos later. I think the model did a very decent job and found most of the relevant faces.
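To get a feel for what each detected face object carries, here is a minimal sketch (my own addition) printing the attributes used throughout this post; the bounding box, detection score, age and 512-dimensional embedding come from the buffalo_l sub-models loaded above:

# Inspect the first detected face; these attributes are used throughout this post.
face = faces[0]
print("bbox:", face.bbox.astype(int))            # [x1, y1, x2, y2]
print("detection score:", float(face.det_score))
print("age estimate:", face.age)
print("embedding shape:", face.embedding.shape)  # 512-dimensional vector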

Now the key feature is that we can compare face embeddings. Let’s say we find another photo of Kamala Harris. In this example we will use her Vice Presidential Portrait (Lawrence Jackson, Public domain, via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Kamala_Harris_Vice_Presidential_Portrait.jpg)

vice_presidential_image_url = "https://upload.wikimedia.org/wikipedia/commons/4/41/Kamala_Harris_Vice_Presidential_Portrait.jpg"
vice_presidential_img = image_from_url(vice_presidential_image_url)
vice_presidential_faces = face_pipeline.get(vice_presidential_img)
assert len(vice_presidential_faces) == 1, "Expected to find one face in the vice presidential image"

# show the detected face
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.imshow(vice_presidential_img)
box = vice_presidential_faces[0].bbox.astype(int)
rect = patches.Rectangle(
    (box[0], box[1]),
    box[2] - box[0],
    box[3] - box[1],
    linewidth=2,
    edgecolor="#23890F",
    facecolor="none",
)
ax.add_patch(rect)
label = f"Score: {vice_presidential_faces[0].det_score:.2f}\nAge: {vice_presidential_faces[0].age:.1f}"
ax.text(
    box[0], box[1] - 5, label, color="black", fontsize=6, weight="bold"
)
ax.set_title("Vice Presidential Image with Detected Face")
ax.axis("off")
plt.tight_layout()
plt.show()

[Figure: Vice Presidential portrait with the single detected face highlighted]

In that image we only find a single face, so we can use it to find Kamala’s face in the other image by comparing the distance between the known embedding and all the others:

def compute_embedding_distance(
    emb1: np.ndarray, emb2: np.ndarray, distance_metric: str = "euclidean", normalized: bool = False
) -> float:
    """
    Compute the distance between two face embeddings, optionally after L2-normalizing them.
    """
    if normalized:
        emb1 = emb1 / np.linalg.norm(emb1)
        emb2 = emb2 / np.linalg.norm(emb2)
    if distance_metric == "euclidean":
        return float(np.linalg.norm(emb1 - emb2))
    elif distance_metric == "cosine":
        return 1 - np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    else:
        raise ValueError(f"Unsupported distance metric: {distance_metric}")
    
kamala_embedding = vice_presidential_faces[0].embedding
distances = []
for face in faces:
    dist = compute_embedding_distance(kamala_embedding, face.embedding, distance_metric="euclidean", normalized=False)
    distances.append((face, dist))
distances.sort(key=lambda x: x[1])
print("Distances to Kamala Harris:")
for face, dist in distances:
    print(f"Distance: {dist:.4f}")
Distances to Kamala Harris:
Distance: 19.4282
Distance: 27.6562
Distance: 28.5953
Distance: 30.6948
Distance: 31.5702
Distance: 33.2068

Euclidean distance measures the straight-line distance between points, but for high-dimensional embeddings it is sensitive to vector magnitude rather than just direction. Cosine distance, which measures the angle between vectors, is often preferred for face embeddings because modern models are trained to map identities to directions in embedding space, making identity comparisons robust to scale or lighting variations. In this case we use the buffalo_l model, and while I could not find details on how it was trained, I assume cosine similarity was part of its training objective.
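A tiny toy example (my own illustration, not part of the pipeline) makes the difference concrete: scaling a vector changes its Euclidean distance to the original but leaves the cosine distance at zero:

# Scaling a vector changes the Euclidean distance but not the cosine distance.
v = np.array([1.0, 2.0, 2.0])
w = 2 * v  # same direction, twice the magnitude
print(compute_embedding_distance(v, w, distance_metric="euclidean"))  # 3.0
print(compute_embedding_distance(v, w, distance_metric="cosine"))     # 0.0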

So let’s look at the distances using cosine distance (where we do not need to normalize the vectors, since cosine distance is scale invariant):

distances = []
for face in faces:
    dist = compute_embedding_distance(kamala_embedding, face.embedding, distance_metric="cosine", normalized=False)
    distances.append((face, dist))
distances.sort(key=lambda x: x[1])
print("Distances to Kamala Harris:")
for face, dist in distances:
    print(f"Distance: {dist:.4f}")
Distances to Kamala Harris:
Distance: 0.3348
Distance: 0.9678
Distance: 0.9767
Distance: 0.9948
Distance: 0.9955
Distance: 0.9995

Ok, the numbers have gotten smaller. A cosine distance of 1 means the two vectors are perpendicular, which in 512 dimensions is a bit of a head scratcher. But how does it look when we annotate the image with these distances:


# label all faces with their distance to Kamala Harris
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.imshow(img)
for face, dist in distances:
    box = face.bbox.astype(int)
    rect = patches.Rectangle(
        (box[0], box[1]),
        box[2] - box[0],
        box[3] - box[1],
        linewidth=2,
        edgecolor="#FF5733" if dist < 0.6 else "#23890F",
        facecolor="none",
    )
    ax.add_patch(rect)
    label = f"Dist: {dist:.4f}"
    ax.text(
        box[0], box[1] - 5, label, color="black", fontsize=6, weight="bold"
    )
ax.set_title("Faces labeled with distance to Kamala Harris")
ax.axis("off")
plt.tight_layout()
plt.show()

[Figure: group photo with each face annotated by its cosine distance to the Kamala Harris reference embedding; the closest match is highlighted]

And we found Kamala. It’s like finding Waldo but much easier and using computers.

Real photos

My question now was: how does it perform on real photos, i.e. my own photos, and how well can I cluster the detected faces into people?

Are distances enough or do I need a clustering algorithm?

To answer this I extracted all faces from my 2025 pictures and saved each face into a new folder as a jpeg. I then used some scripts and some manual sorting to sort the faces into one folder per person. Finally, I loaded these face crops again and extracted their embeddings.

This way I got a one-to-one relationship between file and face, which made labeling easy for this post. Here is the code for how I got there:

from tqdm import tqdm
import sqlite3

photo_folders = [Path("~/Nextcloud/Bilder/Ordered/2025"), Path("~/Nextcloud/Bilder/Prachiti and Paul/2025")]
photo_folders = [folder.expanduser() for folder in photo_folders]

face_folder = Path("./data/facedetections/faces")
face_folder.mkdir(parents=True, exist_ok=True)

def find_jpg(folder: Path):
    # Find all jpg files in the folder and its subfolders
    return [f for f in folder.glob("**/*.jpg") if f.is_file()]
jpg_files = []
for folder in photo_folders:
    jpg_files.extend(find_jpg(folder))
print(f"Found {len(jpg_files)} jpg files")


class FaceEmbeddingDatabase:
    def __init__(self, path = "./data/facedetections/face_embeddings.db"):
        self.conn = sqlite3.connect(path)
        self._create_table()

    def _create_table(self):
        with self.conn:
            self.conn.execute("""
            CREATE TABLE IF NOT EXISTS face_embeddings (
                id INTEGER PRIMARY KEY,
                photo_name TEXT,
                embedding BLOB
            )
            """)

    def _delete_embedding(self, photo_path: Path):
        # if existing, delete the embedding for the given photo
        photo_name = photo_path.name
        with self.conn:
            self.conn.execute("""
            DELETE FROM face_embeddings WHERE photo_name = ?
            """, (photo_name,))

    def insert_embedding(self, photo_path: Path, embedding: np.ndarray):
        photo_name = photo_path.name
        self._delete_embedding(photo_path)

        with self.conn:
            self.conn.execute("""
            INSERT INTO face_embeddings (photo_name, embedding)
            VALUES (?, ?)
            """, (photo_name, embedding.tobytes()))
    
    def get_embedding(self, photo_path: Path) -> np.ndarray | None:
        photo_name = photo_path.name
        cursor = self.conn.cursor()
        cursor.execute("""
        SELECT embedding FROM face_embeddings WHERE photo_name = ?
        """, (photo_name,))
        row = cursor.fetchone()
        if row:
            return np.frombuffer(row[0], dtype=np.float32)
        return None
    
    def list_photos(self) -> list[Path]:
        cursor = self.conn.cursor()
        cursor.execute("""
        SELECT photo_name FROM face_embeddings
        """)
        rows = cursor.fetchall()
        return [Path(row[0]) for row in rows]
    
    def get_all_embeddings(self) -> dict[Path, np.ndarray]:
        cursor = self.conn.cursor()
        cursor.execute("""
        SELECT photo_name, embedding FROM face_embeddings
        """)
        rows = cursor.fetchall()
        embeddings = {}
        for row in rows:
            photo_name = Path(row[0])
            embedding = np.frombuffer(row[1], dtype=np.float32)
            embeddings[photo_name] = embedding
        return embeddings


# Process each jpg file and save detected faces as jpgs
def process_photo(photo_path: Path, output_folder: Path, db: FaceEmbeddingDatabase):
    try:
        img = np.array(Image.open(photo_path))
        faces = face_pipeline.get(img)
    except Exception as e:
        print(f"Failed to process {photo_path}: {e}")
        return

    for i, face in enumerate(faces):
        try:
            box = face.bbox.astype(int)
            face_img = img[box[1]:box[3], box[0]:box[2]]
            face_pil = Image.fromarray(face_img)
            face_output_path = output_folder / f"{photo_path.stem}_face_{i}.jpg"
            face_pil.save(face_output_path)
            db.insert_embedding(face_output_path, face.embedding)

        except Exception as e:
            print(f"Failed to process face {i} in {photo_path}: {e}")


embedding_db = FaceEmbeddingDatabase()

#for photo_path in tqdm(jpg_files):
#    process_photo(photo_path, face_folder, embedding_db)

face_images = [f for f in face_folder.glob("*.jpg") if f.is_file()]
print(f"Saved {len(face_images)} face images to {face_folder}")
Found 3081 jpg files
Saved 3246 face images to data/facedetections/faces

Now that I have this many face embeddings, I want to see what the average pairwise distance is. This is an easy way to see what threshold would be good for clustering people.

import plotnine as p9
import p9customtheme
import pandas as pd
import itertools

def cosine_distance(emb1: np.ndarray, emb2: np.ndarray) -> float:
    return 1 - np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

db = FaceEmbeddingDatabase()
photo_paths = db.list_photos()

pairs = list(
    itertools.islice(
        itertools.combinations(photo_paths, 2),
        5000
    )
)
distances = []
for photo1, photo2 in tqdm(pairs):
    emb1 = db.get_embedding(photo1)
    emb2 = db.get_embedding(photo2)
    if emb1 is not None and emb2 is not None:
        dist = cosine_distance(emb1, emb2)
        distances.append(dist)
100%|██████████| 5000/5000 [00:56<00:00, 89.16it/s]
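A quick aside on speed: pulling each embedding out of SQLite pair by pair takes almost a minute for just 5,000 pairs. A vectorized sketch, reusing the get_all_embeddings method from above together with scikit-learn, computes all pairwise cosine distances in one go (my own shortcut, not what generated the plot below):

from sklearn.metrics.pairwise import cosine_distances

# Load every embedding once and stack them into an (n_faces, 512) matrix.
all_embs = FaceEmbeddingDatabase().get_all_embeddings()
matrix = np.stack(list(all_embs.values()))

# Full pairwise distance matrix; the upper triangle holds each pair exactly once.
dist_matrix = cosine_distances(matrix)
pairwise = dist_matrix[np.triu_indices_from(dist_matrix, k=1)]
print(f"{len(pairwise)} pairs, mean cosine distance: {pairwise.mean():.3f}")

Back to the sampled pairs and their distribution: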

df = pd.DataFrame({"cosine_distance": distances})
(p9.ggplot(df, p9.aes(x="cosine_distance"))
 + p9.geom_histogram(bins=50, fill="#23890F", color="black", alpha=1)
 + p9.labs(
     title=f"Cosine Distances Between {len(pairs)} Face Embedding Pairs",
     subtitle="Using buffalo_l model embeddings",
     x="Cosine Distance",
     y="Count"
 )
)

[Figure: histogram of cosine distances between 5,000 face embedding pairs (buffalo_l embeddings)]

Clearly, the average cosine distance between embeddings from this model is above 0.8, but the range goes from 0 to 1.6. Similar faces fall roughly in the range of 0.3-0.5, so I can use that information to sort faces into folders. Which is what I did next.

from random import sample

embedding_db = FaceEmbeddingDatabase()

# now we move all faces into folders based on their distance to the anchor faces
output_base = Path("data/facedetections/people")
output_base.mkdir(parents=True, exist_ok=True)


min_distance = 0.4  # threshold for considering a face as matching an anchor

face_images = [f for f in face_folder.glob("*.jpg") if f.is_file()]

def find_fotos(folder: Path):
    return [f for f in folder.glob("**/*.jpg") if f.is_file()]

people = [p.name for p in output_base.glob("*") if p.is_dir()]

for face_image in tqdm(face_images):
    face_embedding = embedding_db.get_embedding(face_image)
    if face_embedding is None:
        continue

    closest_person = None
    closest_distance = float("inf")

    # compare to all images in all people folders
    for person_name in people:
        person_fotos = find_fotos(output_base / person_name)
        for foto in sample(person_fotos, min(6, len(person_fotos))):
            foto_embedding = embedding_db.get_embedding(foto)
            if foto_embedding is None:
                continue
            dist = cosine_distance(face_embedding, foto_embedding)
            if dist < closest_distance:
                closest_distance = dist
                closest_person = person_name

            if closest_distance < min_distance:
                break
    # if the closest distance is below the threshold, move the face image to that person's folder
    if closest_distance < min_distance and closest_person is not None:
        target_folder = output_base / closest_person
        target_folder.mkdir(parents=True, exist_ok=True)
        target_path = target_folder / face_image.name
        #shutil.move(str(face_image), str(target_path))
        #print(f"Moved {face_image} to {target_path} (distance: {closest_distance:.4f})")
100%|██████████| 3246/3246 [19:24<00:00,  2.79it/s]

Now that I have sorted the faces into people, I am curious what the within-person cosine distances look like.



people = [p.name for p in output_base.glob("*") if p.is_dir()]
inner_people_distances = []
n = 20
for person_i, person_name in enumerate(people):
    person_fotos = find_fotos(output_base / person_name)
    if len(person_fotos) < 2:
        continue
    if len(person_fotos) > n:
        # randomly sample n fotos
        person_fotos = sample(person_fotos, n)
    foto_pairs = list(itertools.combinations(person_fotos, 2))
    for foto1, foto2 in foto_pairs:
        emb1 = embedding_db.get_embedding(foto1)
        emb2 = embedding_db.get_embedding(foto2)
        if emb1 is not None and emb2 is not None:
            dist = cosine_distance(emb1, emb2)
            inner_people_distances.append({"person": f"Person {person_i}", "cosine_distance": dist})
df = pd.DataFrame(inner_people_distances)

(p9.ggplot(df, p9.aes(x="cosine_distance", fill = "person"))
 + p9.geom_histogram(bins=50,  color="black", alpha=1)
 + p9.labs(
     title=f"Cosine Distance of {len(df)} In-Person Face Embedding Pairs",
     subtitle="Using buffalo_l model embeddings",
     x="Cosine Distance",
     y="Count",
     fill = "Person"
 )
)

    

[Figure: histogram of within-person cosine distances, colored by person (buffalo_l embeddings)]

It is nice to see that the within-person distances are now on average much smaller. But they are not as small as I would have thought, with some outliers as high as 0.8. So clearly, relying only on the distance between one reference picture and a query picture is not enough to assign faces to people.

One way to go would be to use HDBSCAN, a very powerful clustering algorithm. Let’s try it out.

from sklearn.cluster import HDBSCAN as hdbscan

db = FaceEmbeddingDatabase()
all_embeddings = db.get_all_embeddings()
all_images = [img for img in all_embeddings.keys()]
embedding_matrix = np.array([all_embeddings[img] for img in all_images])
clusterer = hdbscan(min_cluster_size=4, metric='cosine')
clusters = clusterer.fit_predict(embedding_matrix)

# lets see if we can find our people in the clusters
cluster_df = pd.DataFrame({
    "image": all_images,
    "cluster": clusters
})
# keep just the file name for each image
cluster_df["image_name"] = cluster_df["image"].apply(lambda x: x.name)
cluster_df["person"] = None

Ok that was easy, but did it work? Before I get to the analysis I want to look at something I noticed:

There are some faces that come from the same person but have very different embeddings. I think orientation plays a role, as some source images are clearly not rotated correctly.

from scipy.ndimage import rotate
#image1 = "IMG_2656.jpg" 
image1 = "PXL_20250322_103251717.jpg"
# among all images find these and extract faces 
image1_file = [f for f in jpg_files if f.name == image1][0]
assert image1_file is not None, f"Image {image1} not found"

def rotate_image(image: np.ndarray, angle: float) -> np.ndarray:
    return rotate(image, angle, reshape=False)



embeddings = []
for angle in tqdm(range(0, 361, 5), desc="Rotating image and extracting embeddings"):
    img = np.array(Image.open(image1_file))
    rotated_img = rotate_image(img, angle)
    faces = face_pipeline.get(rotated_img)
    if len(faces) == 0:
        print(f"No faces found at angle {angle}")
    elif len(faces) > 1:
        print(f"Multiple faces found at angle {angle}, skipping")
    elif len(faces) == 1:
        for face in faces:
            embeddings.append((angle, face.embedding))
    else:
        print(f"Unexpected number of faces ({len(faces)}) at angle {angle}")

# cosine distance between all pairs of embeddings
distances = []
for (angle1, emb1), (angle2, emb2) in tqdm(itertools.combinations(embeddings, 2), total=len(embeddings)*(len(embeddings)-1)//2, desc="Computing cosine distances"):
    dist = cosine_distance(emb1, emb2)
    distances.append({"angle1": angle1, "angle2": angle2, "cosine_distance": dist})
Multiple faces found at angle 230, skipping
Rotating image and extracting embeddings: 100%|██████████| 73/73 [03:04<00:00,  2.53s/it]
Computing cosine distances: 100%|██████████| 2556/2556 [00:00<00:00, 157429.60it/s]
df = pd.DataFrame(distances)
df["cosine_distance"] = df["cosine_distance"].round(2)
df["angle1"] = df["angle1"].astype(str) + "°"
df["angle2"] = df["angle2"].astype(str) + "°"


# fill the inverse distances as well to make a full matrix
df_inverse = df.rename(columns={"angle1": "angle2", "angle2": "angle1"})


df = pd.concat([df, df_inverse], ignore_index=True)
# add diagonal with zero distances
for angle, _ in embeddings:
    df = pd.concat([df, pd.DataFrame([{"angle1": f"{angle}°", "angle2": f"{angle}°", "cosine_distance": 0.0}])], ignore_index=True)

# sort angles numerically for better heatmap display
angle_order = [f"{angle}°" for angle in range(0, 361, 5)]
# start the ordering at 180° so that 0° ends up in the center of the axis
first_half = angle_order[:len(angle_order)//2]
second_half = angle_order[len(angle_order)//2:]
angle_order = second_half + first_half

df["angle1"] = pd.Categorical(df["angle1"].to_list(), categories=angle_order, ordered=True)
df["angle2"] = pd.Categorical(df["angle2"].to_list(), categories=angle_order, ordered=True)
# plot heatmap of distances
(p9.ggplot(df, p9.aes(x="angle1", y="angle2", fill="cosine_distance"))
 + p9.geom_tile()
 + p9.scale_fill_cmap(name="Cosine Distance")
 + p9.labs(
     title="Cosine Distances Between Rotated Images",
     subtitle="Using buffalo_l model embeddings",
     x="Rotation Angle 1",
     y="Rotation Angle 2",
 )
 
 + p9.scale_x_discrete(breaks=[f"{angle}°" for angle in range(0, 360, 45)]) 
 + p9.scale_y_discrete(breaks=[f"{angle}°" for angle in range(0, 360, 45)])
)

[Figure: heatmap of cosine distances between embeddings of the same face at different rotation angles]

Clear as day, we can see that the cosine distance varies with the rotation of the image, which it really should not. There is a good amount of similarity in the range of -60° to +60° rotation, which I suppose corresponds to normal head movement. But once we flip the image on its head there is much less similarity, even between two upside-down versions of the face. What is weird is that the matrix is not symmetrical, so rotating one way is different from rotating the other way? Maybe that’s an artifact of my code, though.

So it seems to be crucial to get the embedding the right way up.

Knowing this, I think the first thing to do is to respect the EXIF orientation tags when loading images. That should mitigate most issues. It would also be feasible to train a quick rotation classifier for images. While buffalo_l also gives us the roll of each face, that value is highly variable, and in a short test it was not suitable for correcting the rotation of the source image.
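As a sketch of the first idea (my own addition, with a hypothetical path): Pillow can apply the EXIF orientation tag before the image reaches the face pipeline, so the pixels arrive the right way up:

from PIL import Image, ImageOps

# Apply the EXIF orientation tag so the pixel data is upright,
# then hand the corrected image to the face pipeline as before.
img = Image.open("some_photo.jpg")  # hypothetical example path
img = ImageOps.exif_transpose(img)  # no-op if there is no orientation tag
faces = face_pipeline.get(np.array(img.convert("RGB")))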

Cluster evaluation

Now I want to see how well HDBSCAN recapitulates my manually sorted people.

from sklearn.cluster import HDBSCAN as hdbscan

db = FaceEmbeddingDatabase()
all_embeddings = db.get_all_embeddings()
all_images = [img for img in all_embeddings.keys()]
embedding_matrix = np.array([all_embeddings[img] for img in all_images])

# for each image name, find which person it is by first loading all people folders
people = {p.name: find_fotos(output_base / p.name) for p in output_base.glob("*") if p.is_dir()}
people_df = [ ]
for person_name, fotos in people.items():
    for foto in fotos:
        people_df.append({"image_name": str(foto.name), "person": person_name})
people_df = pd.DataFrame(people_df)


def cluster_at(embedding_matrix: np.ndarray, min_cluster_size: int = 4):
    clusterer = hdbscan(min_cluster_size=min_cluster_size, metric='cosine')
    return clusterer.fit_predict(embedding_matrix)


images_df = []
for image in all_images:
    images_df.append({"image_name": str(image.name)})
images_df = pd.DataFrame(images_df)
images_df = images_df.merge(people_df, on="image_name", how="left")
# ensure rows are ordered as all_images
images_df = images_df.set_index("image_name").loc[[str(img.name) for img in all_images]].reset_index()

for min_cluster_size in range(2, 30):
    clusters = cluster_at(embedding_matrix, min_cluster_size=min_cluster_size)
    images_df[f"cluster_{min_cluster_size}"] = clusters
images_df.head(10)

[Table: images_df.head(10) with columns image_name, person, and cluster_2 through cluster_29 (the HDBSCAN label for each min_cluster_size, -1 meaning noise); 10 rows × 30 columns]

# compute cluster evaluation metrics for each person, so we can for each person see how well they are clustered

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
person_metrics = []

for person in people.keys():
    
    person_images = images_df[images_df["person"] == person]
    cluster_sizes = images_df.columns[images_df.columns.str.startswith("cluster_")]
    for cluster_col in cluster_sizes:
        largest_person_cluster = -1
        # find the largest cluster for this person
        clusters = person_images[cluster_col].value_counts()
        if len(clusters) > 0:
            largest_person_cluster = clusters.idxmax()
            tp = clusters.max()
        else:
            tp = 0
        fp = (images_df[cluster_col] == largest_person_cluster).sum() - tp
        fn = len(person_images) - tp
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        person_metrics.append({
            "person": person,
            "min_cluster_size": int(cluster_col.split("_")[1]),
            "precision": precision,
            "recall": recall,
            "person_fotos": len(person_images)
        })

person_metrics_df = pd.DataFrame(person_metrics)
# plot precision vs recall as a function of min_cluster_size
# melt the dataframe
person_metrics_melted = person_metrics_df.melt(id_vars=["person", "min_cluster_size", 'person_fotos'], value_vars=["precision", "recall"], var_name="metric", value_name="value")

person_metrics_melted["many_photos"] = person_metrics_melted["person_fotos"] > 20
# replace with ">20 photos" and "<=20 photos
person_metrics_melted["many_photos"] = person_metrics_melted["many_photos"].map({True: ">20 Photos", False: "≤20 Photos"})

# rename person to Person 1, Person 2, ...
person_mapping = {person: f"Person {i+1}" for i, person in enumerate(people.keys())}
person_metrics_melted["person"] = person_metrics_melted["person"].map(person_mapping)

# plot as lineplots
(
    p9.ggplot(person_metrics_melted, p9.aes(x="min_cluster_size", y="value", color="person"))
    + p9.geom_line(size=1)
    + p9.facet_grid("many_photos~metric")
    + p9.labs(
        title="Clustering Precision and Recall per Person",
        subtitle="Using buffalo_l model embeddings and HDBSCAN clustering",
        x="HDBSCAN min_cluster_size",
        y="Value",
        color="Metric"
    )
)

[Figure: clustering precision and recall per person as a function of HDBSCAN min_cluster_size, faceted by number of photos per person]

Ok, this really shows that HDBSCAN only works well once we have roughly 20 or more photos of a person. Below that, the clusters are just not reliable. A min_cluster_size of around 10 seems to be a good starting point for finding clusters.

My gut tells me that a combination of average cosine similarity and relying on a large cluster size might be the way to go. So a pipeline to find photos with a specific person in them could look like this:

  1. Get a picture of that person with a good face
  2. Rotate that face into 4 directions
  3. Cluster these 4 embeddings with all the embeddings we have for other faces
  4. Use hdbscan to find all clusters that contain our seeds
  5. If we find clusters check that the average cosine distance in these clusters is < 0.6
  6. If no clusters are found or clusters are too noisy, return hits across all with < 0.4 cosine distance

This could be robust: it accounts for the rotation issue and can handle small clusters via the cosine-distance fallback.
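Purely as a sketch of that idea, and not something I have run for this post, the steps above could be wired together with the helpers defined earlier. The seed image path, the thresholds and the choice to measure cluster quality against the seed embedding are all assumptions on my part:

from sklearn.cluster import HDBSCAN

def find_person_photos(seed_image: Path, db: FaceEmbeddingDatabase,
                       cluster_threshold: float = 0.6, fallback_threshold: float = 0.4):
    # Steps 1+2: embed the seed face in four orientations to be robust against rotation.
    seed_embeddings = []
    img = Image.open(seed_image).convert("RGB")
    for angle in (0, 90, 180, 270):
        detected = face_pipeline.get(np.array(img.rotate(angle, expand=True)))
        if len(detected) == 1:
            seed_embeddings.append(detected[0].embedding)
    if not seed_embeddings:
        raise ValueError("No single face found in the seed image")

    # Step 3: cluster the seed embeddings together with all stored embeddings.
    all_embeddings = db.get_all_embeddings()
    names = list(all_embeddings.keys())
    matrix = np.vstack(seed_embeddings + [all_embeddings[n] for n in names])
    labels = HDBSCAN(min_cluster_size=10, metric="cosine").fit_predict(matrix)

    # Step 4: clusters that contain at least one of our seeds (-1 is noise).
    n_seeds = len(seed_embeddings)
    seed_labels = {label for label in labels[:n_seeds] if label != -1}

    # Step 5: keep seed clusters whose members are, on average, close to the seed.
    hits = set()
    for seed_label in seed_labels:
        members = [names[i - n_seeds] for i, label in enumerate(labels)
                   if label == seed_label and i >= n_seeds]
        dists = [cosine_distance(seed_embeddings[0], all_embeddings[m]) for m in members]
        if dists and np.mean(dists) < cluster_threshold:
            hits.update(members)

    # Step 6: fall back to plain distance matching if clustering gave us nothing.
    if not hits:
        for name, emb in all_embeddings.items():
            if min(cosine_distance(s, emb) for s in seed_embeddings) < fallback_threshold:
                hits.add(name)
    return sorted(hits)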

It won’t be perfect, but for the edge use case of identifying photos with me in them, it might just work.

Words of Caution

This is how I approached the problem without diving deep into the documentation, papers and best practices of the field. I am sure there are smarter people who have written better articles about face detection; after all, it is a massive industry. But for me it was nice to dive in head first and try to understand it from the available models.

So take this all as entertainment more than actual advice.