I saw the article “Molecular mimicry as a mechanism of viral immune evasion and autoimmunity” and immediately got interested in reproducing its Figure 1b.

In it, the authors investigate peptide similarity between viruses and the human proteome. They propose that certain viruses may have adjusted their peptide usage to match peptides found in the human proteome, allowing them to evade MHC recognition.

This is a biologically cool mechanism, and the method they used is rather simple. So I thought: let’s try to reproduce it.

To get started, I downloaded the supplementary data for Figure 1b, which the authors very kindly included, and first tried to reproduce their plot from their own numbers.

Reproducing the plot with original data

import polars as pl
import tempfile
import plotnine as p9
import p9customtheme
from tqdm import tqdm
import re


def load_data_1b():
    supplementary_table_1b_url = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-024-53658-8/MediaObjects/41467_2024_53658_MOESM11_ESM.xlsx"
    df = pl.read_excel(supplementary_table_1b_url, sheet_id=2)
    # the first row of the sheet holds the real column names
    headers = list(df.row(0))
    # some headers repeat three times (one per k-mer length); suffix repeats with _2, _3
    header_count = {}
    for i, header in enumerate(headers):
        if header in header_count:
            header_count[header] += 1
            headers[i] = f"{header}_{header_count[header]}"
        else:
            header_count[header] = 1
    df = df[1:]
    df.columns = headers

    # We use a temporary CSV file to ensure consistent data types
    with tempfile.TemporaryDirectory() as tmpdir:
        df.write_csv(f"{tmpdir}/supplementary_table_1b.csv")
        df = pl.read_csv(
            f"{tmpdir}/supplementary_table_1b.csv", infer_schema_length=10000
        )

    return df


original_df = load_data_1b()
original_df.head(4)
shape: (4, 15)

| Proteome ID | Virus | Viral Zone Family | 0 mismatches | ≤ 1 mismatch | ≤ 2 mismatch | ≤ 3 mismatch | 0 mismatches_2 | ≤ 1 mismatch_2 | ≤ 2 mismatch_2 | ≤ 3 mismatch_2 | 0 mismatches_3 | ≤ 1 mismatch_3 | ≤ 2 mismatch_3 | ≤ 3 mismatch_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UP000180764 | Adenoassociated virus 2 | Parvoviridae | 0.217155 | 15.671372 | 96.742671 | 100.0 | 0.0 | 0.0 | 0.073019 | 2.738226 | 0.0 | 0.0 | 0.0 | 0.0 |
| UP000119554 | Aichi virus | Picornaviridae | 0.371134 | 21.030928 | 97.443299 | 100.0 | 0.0 | 0.08261 | 0.495663 | 4.337051 | 0.0 | 0.0 | 0.0 | 0.0 |
| UP000006934 | Australian bat lyssavirus | Rhabdoviridae | 0.167738 | 19.122169 | 97.511881 | 100.0 | 0.0 | 0.0 | 0.0 | 1.518133 | 0.0 | 0.0 | 0.0 | 0.0 |
| UP000000832 | Banna virus | Sedoreoviridae | 0.239726 | 16.780822 | 97.294521 | 100.0 | 0.0 | 0.0 | 0.069061 | 2.244475 | 0.0 | 0.0 | 0.0 | 0.0 |

Ok, the original data is now loaded. We can see columns with repeated names, which correspond to the k-mer lengths 8, 12 and 18. So the next step is to pivot the data into a long format for better usability.

basic_columns = ["Proteome ID", "Virus", "Viral Zone Family"]
mismatch_columns = ["0 mismatches", "≤ 1 mismatch", "≤ 2 mismatch", "≤ 3 mismatch"]

# The suffixed column triplets correspond to the three k-mer lengths
frames = []
for k, suffix in [(8, ""), (12, "_2"), (18, "_3")]:
    frame = original_df.select(
        basic_columns + [f"{col}{suffix}" for col in mismatch_columns]
    )
    frame.columns = basic_columns + mismatch_columns  # drop the _2/_3 suffixes
    frames.append(frame.with_columns(pl.lit(k).alias("k")))

original_df_melted = pl.concat(frames)
original_df_melted.head(4)
shape: (4, 8)

| Proteome ID | Virus | Viral Zone Family | 0 mismatches | ≤ 1 mismatch | ≤ 2 mismatch | ≤ 3 mismatch | k |
|---|---|---|---|---|---|---|---|
| UP000180764 | Adenoassociated virus 2 | Parvoviridae | 0.217155 | 15.671372 | 96.742671 | 100.0 | 8 |
| UP000119554 | Aichi virus | Picornaviridae | 0.371134 | 21.030928 | 97.443299 | 100.0 | 8 |
| UP000006934 | Australian bat lyssavirus | Rhabdoviridae | 0.167738 | 19.122169 | 97.511881 | 100.0 | 8 |
| UP000000832 | Banna virus | Sedoreoviridae | 0.239726 | 16.780822 | 97.294521 | 100.0 | 8 |

Now, with that settled, the authors write that they z-scored each column. Right now, the values are percentages of k-mers matched to the human proteome, and the columns span very different ranges (compare the 0-mismatch and ≤ 3-mismatch columns above). Z-scoring, i.e. z = (x − mean(x)) / sd(x) within each (k, mismatches) group, brings them onto a common scale.

original_df_zscored = (
    original_df_melted.unpivot(
        index=basic_columns + ["k"],
        on=["0 mismatches", "≤ 1 mismatch", "≤ 2 mismatch", "≤ 3 mismatch"],
        variable_name="mismatches",
        value_name="percentage_matched",
    )
    # zscore normalization within each k and mismatches group
    .with_columns(
        [
            (
                (
                    pl.col("percentage_matched")
                    - pl.col("percentage_matched")
                    .mean()
                    .over(["k", "mismatches"])
                )
                / pl.col("percentage_matched").std().over(["k", "mismatches"])
            ).alias("zscore")
        ]
    )
    # strip the " mismatch(es)" suffix from the labels, leaving "0", "≤ 1", "≤ 2", "≤ 3"
    .with_columns(
        pl.col("mismatches")
        .str.replace_all(" mismatches", "")
        .str.replace_all(" mismatch", "")
    )
)
original_df_zscored.head(4)
shape: (4, 7)

| Proteome ID | Virus | Viral Zone Family | k | mismatches | percentage_matched | zscore |
|---|---|---|---|---|---|---|
| UP000180764 | Adenoassociated virus 2 | Parvoviridae | 8 | 0 | 0.217155 | -0.419988 |
| UP000119554 | Aichi virus | Picornaviridae | 8 | 0 | 0.371134 | 0.089245 |
| UP000006934 | Australian bat lyssavirus | Rhabdoviridae | 8 | 0 | 0.167738 | -0.583418 |
| UP000000832 | Banna virus | Sedoreoviridae | 8 | 0 | 0.239726 | -0.345343 |

Now the data is centred around 0 and z-scaled, and we should almost be able to plot Figure 1b. If we plotted it now, though, the row order would not match their heatmap.

To get the rows in the same order, I used Google Gemini to extract the row names from the published figure. We can then match them to proteome IDs via the taxon mnemonics from Supplementary File 1. So let’s add this metadata:

suppl_file_1_url = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-024-53658-8/MediaObjects/41467_2024_53658_MOESM3_ESM.xlsx"
suppl1 = (
    pl.read_excel(suppl_file_1_url)
    .select(["Proteome ID", "Taxon mnemonic"])
    .with_columns(
        # fall back to the proteome ID when no mnemonic exists ("null" string)
        pl.when(pl.col("Taxon mnemonic") == "null")
        .then(pl.col("Proteome ID"))
        .otherwise(pl.col("Taxon mnemonic"))
        .alias("Taxon mnemonic")
    )
)
suppl1.head(4)
shape: (4, 2)

| Proteome ID | Taxon mnemonic |
|---|---|
| UP000180764 | AAV2S |
| UP000119554 | AIVA8 |
| UP000006934 | ABLVB |
| UP000000832 | BAVJK |
mnemonic_order = [
    "UUKS",
    "TOSV",
    "UP000207632",
    "RVFVZ",
    "PTPV",
    "PIARV",
    "MACHU",
    "LYCVA",
    "LASSJ",
    "JUNIN",
    "UP000117676",
    "BUNSH",
    "BUNL8",
    "BUNYW",
    "SEOU8",
    "PUUMS",
    "HANTV",
    "DUGBA",
    "CCHFI",
    "VSIVA",
    "RABVP",
    "MOKV",
    "LBV",
    "ISFV",
    "EBLV1",
    "DUVV",
    "CHAV",
    "ABLVB",
    "PIV5",
    "NIPAV",
    "MUMPM",
    "MEASC",
    "UP000109776",
    "HENDH",
    "MABVM",
    "EBOZM",
    "HRSVB",
    "INCAA",
    "INBLE",
    "I34A1",
    "UP000181622",
    "WEEV",
    "EEVVP",
    "SINDV",
    "SFV",
    "SAGV",
    "RRVN",
    "ONNVG",
    "MAYAB",
    "EEEVF",
    "CHIKS",
    "BFV",
    "RUBVT",
    "HEVCH",
    "ZIKV",
    "YEFV1",
    "WNV",
    "POWVL",
    "UP000171556",
    "MVEV5",
    "LIV",
    "UP000180575",
    "KUNJM",
    "JAEVJ",
    "HCV77",
    "DEN1W",
    "ENMGO",
    "SALVA",
    "ROSV2",
    "POL1M",
    "EMCVR",
    "HRV8A",
    "HE701",
    "HED68",
    "HAVHM",
    "HPE2W",
    "CX16G",
    "COSAA",
    "AIVA8",
    "SOUV3",
    "SVC12",
    "NVN68",
    "LORDV",
    "MERS",
    "SARS2",
    "UP000147828",
    "SARS",
    "CVH22",
    "HUTV",
    "HASV1",
    "ROTHC",
    "ROTGA",
    "ROTSH",
    "BAVJK",
    "POVWU",
    "UP000154903",
    "POVK6",
    "POVJC",
    "POVBK",
    "HPV18",
    "HPV16",
    "HPV2A",
    "HPV1",
    "PAVHV",
    "AAV2S",
    "HHV6Z",
    "HCMVM",
    "VZVD",
    "HHV8P",
    "HHV7J",
    "HHV6U",
    "HHV2H",
    "HHV11",
    "EBVB9",
    "UP000137988",
    "YMTV5",
    "YLDV",
    "VAR67",
    "VACCW",
    "ORFSA",
    "MONPZ",
    "MCV1",
    "HSPV",
    "CWPXB",
    "ADE02",
    "SFVCP",
    "HTL1C",
    "FOAMV",
    "HV2BE",
    "HV1H2",
    "HBVG3",
    "HBVCJ",
    "HDVD3",
    "TTVV1",
][::-1]
# Build a polars Enum so the rows keep the published order
mnemonic_enum = pl.Enum(mnemonic_order)
original_df_zscored = original_df_zscored.join(
    suppl1, on="Proteome ID", how="left"
).with_columns(
    pl.col("Taxon mnemonic").cast(mnemonic_enum).alias("Taxon mnemonic")
)
original_df_zscored.head(4)
shape: (4, 8)

| Proteome ID | Virus | Viral Zone Family | k | mismatches | percentage_matched | zscore | Taxon mnemonic |
|---|---|---|---|---|---|---|---|
| UP000180764 | Adenoassociated virus 2 | Parvoviridae | 8 | 0 | 0.217155 | -0.419988 | AAV2S |
| UP000119554 | Aichi virus | Picornaviridae | 8 | 0 | 0.371134 | 0.089245 | AIVA8 |
| UP000006934 | Australian bat lyssavirus | Rhabdoviridae | 8 | 0 | 0.167738 | -0.583418 | ABLVB |
| UP000000832 | Banna virus | Sedoreoviridae | 8 | 0 | 0.239726 | -0.345343 | BAVJK |
(
    p9.ggplot(
        original_df_zscored,
        p9.aes(x="mismatches", y="Taxon mnemonic", fill="zscore"),
    )
    + p9.geom_tile()
    + p9.facet_grid(". ~ k", scales="free", space="free")
    + p9.scale_fill_gradientn(
        colors=(
            "#045375",
            "#089099",
            "#7CCBA2",
            "#FCDE9C",
            "#FAB27B",
            "#E23333",
        ),
        limits=(-4, 6.7),
    )
    + p9.labs(x="")
    + p9.theme(figure_size=(6, 18))
)

[Figure: the reproduced Figure 1b heatmap, built from the authors’ supplementary data]

Wonderful, that is exactly the heatmap shown in Figure 1b of the manuscript. This is already very cool: the authors went through the trouble of actually providing all their source data. A big thanks to them.

This now sets me up to see if I can also reproduce the k-mer values.

Reproduction of the values

To start easy, I will first reproduce only the 0-mismatch column, as these values are straightforward to obtain: for each virus and each k, they are the percentage of viral k-mers that occur verbatim in the human proteome. After that, I will tackle mismatches.

First, I will need to get the human proteome and the virus proteomes they used. This can all be obtained from UniProt.

The authors write, “a mimic was not counted if both the human and pathogen proteins had the same enzyme commission number or were reported as having the same chemical reaction in UniProt due to concern of homology between similarly functioning proteins”

So I also downloaded the EC number and catalytic activity annotations from UniProt; from their code (https://github.com/melamedlab/MolecularMimicry/blob/main/Analysis/Figure1/src/Figure1.qmd) I can see that this is the data they used.

I used the UniProt webpage to get the API URL and made a small function to get all the data I need:

from pathlib import Path

human_proteome_id = "UP000005640"
virus_proteome_ids = original_df.get_column("Proteome ID").unique().to_list()


data_folder = Path("./data/MHC_binding_predictions/")
data_folder.mkdir(parents=True, exist_ok=True)


def download_uniprot_df(proteome_id: str) -> pl.DataFrame:
    url = f"https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Csequence%2Cec%2Ccc_catalytic_activity&format=tsv&query=%28%28proteome%3A{proteome_id}%29%29"
    filepath = data_folder / f"{proteome_id}.tsv.gz"
    if not filepath.exists():
        # cache the download (note: write_csv stores plain TSV despite the .gz suffix)
        pl.read_csv(url, separator="\t").write_csv(filepath, separator="\t")
    df = pl.read_csv(filepath, separator="\t").unique(["Entry", "Sequence"])
    return df


human_proteome_df = download_uniprot_df(human_proteome_id)
human_proteome_df.head(4)
shape: (4, 10)

| Entry | Reviewed | Entry Name | Protein names | Gene Names | Organism | Length | Sequence | EC number | Catalytic activity |
|---|---|---|---|---|---|---|---|---|---|
| Q13129 | reviewed | RLF_HUMAN | Zinc finger protein Rlf (Rearr… | RLF | Homo sapiens (Human) | 1914 | MADGKGDAAAVAGAGAEAPAVAGAGDGVET… | null | null |
| A0A2R8Y595 | unreviewed | A0A2R8Y595_HUMAN | Transcriptional repressor CTCF… | CTCF | Homo sapiens (Human) | 638 | MEGDAVEAIVEESETFIKGKERKTYQRRRE… | null | null |
| A0A590UJW9 | unreviewed | A0A590UJW9_HUMAN | Myosin VI | MYO6 | Homo sapiens (Human) | 194 | MEDGKPVWAPHPTDGFQMGNIVDIGPDSLT… | null | null |
| Q6P6B1 | reviewed | ERIC5_HUMAN | Glutamate-rich protein 5 | ERICH5 C8orf47 | Homo sapiens (Human) | 374 | MGCSSSALNKAGDSSRFPSVTSNEHFSTAE… | null | null |

With this working, I can compare k-mers between the viruses and the human proteome. But first, I need to set up the EC number and catalytic activity comparisons to match the paper:

def get_kmers_set(sequence: str, k: int) -> set[str]:
    return {sequence[i : i + k] for i in range(len(sequence) - k + 1)}
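# e.g. get_kmers_set("ABCDEF", 4) == {"ABCD", "BCDE", "CDEF"};
# sequences shorter than k yield an empty set (the range is empty)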


def extract_ec_prefixes(ec: str) -> set[str]:
    """Extract EC prefixes for fast matching."""
    if not ec:
        return set()
    prefixes = set()
    for e in str(ec).split(";"):
        e = e.strip()
        if e:
            prefix = re.sub(r"\.-.*$", "", e)
            if prefix:
                prefixes.add(prefix)
    return prefixes


# some tests for extract_ec_prefixes
assert extract_ec_prefixes("2.7.7.48; 3.1.-.-; 3.4.19.12; 3.4.22.-") == {
    "2.7.7.48",
    "3.1",
    "3.4.19.12",
    "3.4.22",
}


def extract_ca_reactions(ca: str) -> set[str]:
    """Extract catalytic activity reactions."""
    if not ca:
        return set()
    return {r.strip() for r in str(ca).split(";") if r.strip()}
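# NB: the UniProt "Catalytic activity" TSV field is a ";"-separated blob that
# (as far as I can tell) mixes Reaction=, Xref= and EC= fragments; splitting on
# ";" keeps them all, and overlap on any fragment counts as shared activity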


def ec_overlap_fast(human_prefixes: set[str], viral_ecs: list[str]) -> bool:
    """Check if any viral EC starts with any human prefix."""
    if not human_prefixes or not viral_ecs:
        return False
    for viral_ec in viral_ecs:
        for prefix in human_prefixes:
            if viral_ec.startswith(prefix):
                return True
    return False


def ca_overlap_fast(
    human_reactions: set[str], viral_reactions: set[str]
) -> bool:
    """Check if any reactions overlap."""
    if not human_reactions or not viral_reactions:
        return False
    return len(human_reactions.intersection(viral_reactions)) > 0


# Pre-process human proteome metadata (only once)
human_ec_prefixes: list[set[str]] = []
human_ca_reactions: list[set[str]] = []
human_sequences: list[str] = []

for row in human_proteome_df.iter_rows(named=True):
    human_sequences.append(row["Sequence"])
    human_ec_prefixes.append(extract_ec_prefixes(row["EC number"]))
    human_ca_reactions.append(extract_ca_reactions(row["Catalytic activity"]))

The computation itself is straightforward; it is essentially just set lookups of k-mers:

# Main computation with pre-computed data
perfect_match_results = []
ks = [8, 12, 18]
for k in ks:
    # Build human kmer -> set of protein indices (once per k)
    human_kmer_to_proteins: dict[str, set[int]] = {}
    for idx, seq in enumerate(
        tqdm(human_sequences, desc=f"Building human {k}-mers")
    ):
        for kmer in get_kmers_set(seq, k):
            if kmer not in human_kmer_to_proteins:
                human_kmer_to_proteins[kmer] = set()
            human_kmer_to_proteins[kmer].add(idx)

    print(f"Built {len(human_kmer_to_proteins)} unique human {k}-mers")

    for proteome_id in tqdm(
        virus_proteome_ids, desc=f"Processing viruses k={k}"
    ):
        virus_df = download_uniprot_df(proteome_id)
        total_kmers = 0
        matched_kmers = 0

        for row in virus_df.iter_rows(named=True):
            seq = row["Sequence"]
            # Pre-extract viral EC and CA for this protein
            viral_ecs = [
                e.strip()
                for e in str(row["EC number"] or "").split(";")
                if e.strip()
            ]
            viral_reactions = extract_ca_reactions(row["Catalytic activity"])

            for kmer in get_kmers_set(seq, k):
                total_kmers += 1

                # Check if kmer exists in human proteome
                if kmer not in human_kmer_to_proteins:
                    continue

                # Check if ANY human protein with this kmer is valid (no EC/CA overlap)
                for human_idx in human_kmer_to_proteins[kmer]:
                    h_ec_prefixes = human_ec_prefixes[human_idx]
                    h_ca_reactions = human_ca_reactions[human_idx]

                    # Skip if EC overlaps
                    if ec_overlap_fast(h_ec_prefixes, viral_ecs):
                        continue
                    # Skip if CA overlaps
                    if ca_overlap_fast(h_ca_reactions, viral_reactions):
                        continue

                    # Found a valid match!
                    matched_kmers += 1
                    break  # Only count once per viral kmer

        percentage_matched = (
            (matched_kmers / total_kmers * 100) if total_kmers > 0 else 0
        )
        perfect_match_results.append(
            {
                "Proteome ID": proteome_id,
                "k": k,
                "total_kmers": total_kmers,
                "matched_kmers": matched_kmers,
                "percentage_matched": percentage_matched,
            }
        )
Building human 8-mers: 100%|██████████| 83607/83607 [00:23<00:00, 3557.97it/s]
Built 11319503 unique human 8-mers
Processing viruses k=8: 100%|██████████| 134/134 [00:00<00:00, 219.95it/s]
Building human 12-mers: 100%|██████████| 83607/83607 [00:22<00:00, 3742.37it/s]
Built 11643591 unique human 12-mers
Processing viruses k=12: 100%|██████████| 134/134 [00:00<00:00, 235.27it/s]
Building human 18-mers: 100%|██████████| 83607/83607 [00:22<00:00, 3731.15it/s]
Built 11803228 unique human 18-mers
Processing viruses k=18: 100%|██████████| 134/134 [00:00<00:00, 237.30it/s]

With these percentages computed, I can directly compare my raw values to the original data. This should show whether the overall signal is similar.

perfect_match_results_df = (
    pl.DataFrame(perfect_match_results)
    .with_columns(pl.lit("Recomputed").alias("Source"))
    .rename({"percentage_matched": "0 mismatches"})
)
combined_df = pl.concat(
    [
        perfect_match_results_df,
        original_df_melted.with_columns(
            pl.lit("Maguire et al").alias("Source")
        ),
    ],
    how="diagonal_relaxed",
).select(["Proteome ID", "k", "Source", "0 mismatches"])

(
    p9.ggplot(
        combined_df.pivot(
            index=["Proteome ID", "k"], values="0 mismatches", on="Source"
        ).fill_null(0)
    )
    + p9.aes(x="Recomputed", y="Maguire et al")
    + p9.geom_point()
    + p9.facet_grid(".~k")
    + p9.scale_x_log10()
    + p9.scale_y_log10()
    + p9.coord_fixed()
    + p9.geom_abline(linetype="dashed", color="red")
)
RuntimeWarning: divide by zero encountered in log10

[Figure: recomputed vs. published 0-mismatch percentages, log-log scatter faceted by k]

Ok, this looks encouraging but not perfect. My code broadly does what it should, but for some viruses, especially for the 12-mers and 18-mers, I get very different results. After a bit of experimenting, it seems my code removes more matches based on EC numbers than the authors’ did. I am not sure why, but once I drop my EC filter, it looks like this:

# Same computation as above, but with the EC-number filter removed
perfect_match_results = []
for k in ks:
    # Build human kmer -> set of protein indices (once per k)
    human_kmer_to_proteins: dict[str, set[int]] = {}
    for idx, seq in enumerate(
        tqdm(human_sequences, desc=f"Building human {k}-mers")
    ):
        for kmer in get_kmers_set(seq, k):
            if kmer not in human_kmer_to_proteins:
                human_kmer_to_proteins[kmer] = set()
            human_kmer_to_proteins[kmer].add(idx)

    print(f"Built {len(human_kmer_to_proteins)} unique human {k}-mers")

    for proteome_id in tqdm(
        virus_proteome_ids, desc=f"Processing viruses k={k}"
    ):
        virus_df = download_uniprot_df(proteome_id)
        total_kmers = 0
        matched_kmers = 0

        for row in virus_df.iter_rows(named=True):
            seq = row["Sequence"]
            # Pre-extract viral CA for this protein (EC numbers no longer used)
            viral_reactions = extract_ca_reactions(row["Catalytic activity"])

            for kmer in get_kmers_set(seq, k):
                total_kmers += 1

                # Check if kmer exists in human proteome
                if kmer not in human_kmer_to_proteins:
                    continue

                # Check if ANY human protein with this kmer is valid (no CA overlap)
                for human_idx in human_kmer_to_proteins[kmer]:
                    h_ca_reactions = human_ca_reactions[human_idx]

                    # Skip if CA overlaps
                    if ca_overlap_fast(h_ca_reactions, viral_reactions):
                        continue

                    # Found a valid match!
                    matched_kmers += 1
                    break  # Only count once per viral kmer

        percentage_matched = (
            (matched_kmers / total_kmers * 100) if total_kmers > 0 else 0
        )
        perfect_match_results.append(
            {
                "Proteome ID": proteome_id,
                "k": k,
                "total_kmers": total_kmers,
                "matched_kmers": matched_kmers,
                "percentage_matched": percentage_matched,
            }
        )
perfect_match_results_df = (
    pl.DataFrame(perfect_match_results)
    .with_columns(pl.lit("Recomputed").alias("Source"))
    .rename({"percentage_matched": "0 mismatches"})
)
combined_df = pl.concat(
    [
        perfect_match_results_df,
        original_df_melted.with_columns(
            pl.lit("Maguire et al").alias("Source")
        ),
    ],
    how="diagonal_relaxed",
).select(["Proteome ID", "k", "Source", "0 mismatches"])

(
    p9.ggplot(
        combined_df.pivot(
            index=["Proteome ID", "k"], values="0 mismatches", on="Source"
        ).fill_null(0)
    )
    + p9.aes(x="Recomputed", y="Maguire et al")
    + p9.geom_point()
    + p9.facet_grid(".~k")
    + p9.scale_x_log10()
    + p9.scale_y_log10()
    + p9.coord_fixed()
    + p9.geom_abline(linetype="dashed", color="red")
)
Building human 8-mers: 100%|██████████| 83607/83607 [00:21<00:00, 3817.98it/s]
Built 11319503 unique human 8-mers
Processing viruses k=8: 100%|██████████| 134/134 [00:00<00:00, 232.65it/s]
Building human 12-mers: 100%|██████████| 83607/83607 [00:22<00:00, 3696.64it/s]
Built 11643591 unique human 12-mers
Processing viruses k=12: 100%|██████████| 134/134 [00:00<00:00, 233.11it/s]
Building human 18-mers: 100%|██████████| 83607/83607 [00:21<00:00, 3955.13it/s]
Built 11803228 unique human 18-mers
Processing viruses k=18: 100%|██████████| 134/134 [00:00<00:00, 226.22it/s]
RuntimeWarning: divide by zero encountered in log10

[Figure: recomputed (without EC filter) vs. published 0-mismatch percentages, log-log scatter faceted by k]

This looks much closer to the original results. But of course, even small differences in the percentages will shift the z-scores. To see this, I plot the old and new data side by side.

combined_df_zscore = (
    combined_df.unpivot(
        index=["Proteome ID", "Source", "k"],
        on=["0 mismatches"],
        variable_name="mismatches",
        value_name="percentage_matched",
    )
    # zscore normalization within each k and mismatches group
    .with_columns(
        [
            (
                (
                    pl.col("percentage_matched")
                    - pl.col("percentage_matched")
                    .mean()
                    .over(["k", "mismatches", "Source"])
                )
                / pl.col("percentage_matched")
                .std()
                .over(["k", "mismatches", "Source"])
            ).alias("zscore")
        ]
    )
    # strip the " mismatch(es)" suffix from the labels, as before
    .with_columns(
        pl.col("mismatches")
        .str.replace_all(" mismatches", "")
        .str.replace_all(" mismatch", "")
    )
)
combined_df_zscore = combined_df_zscore.join(
    suppl1, on="Proteome ID", how="left"
).with_columns(
    pl.col("Taxon mnemonic").cast(mnemonic_enum).alias("Taxon mnemonic")
)

(
    p9.ggplot(
        combined_df_zscore,
        p9.aes(x="Source", y="Taxon mnemonic", fill="zscore"),
    )
    + p9.geom_tile()
    + p9.facet_grid(". ~ k", scales="free", space="free")
    + p9.scale_fill_gradientn(
        colors=(
            "#045375",
            "#089099",
            "#7CCBA2",
            "#FCDE9C",
            "#FAB27B",
            "#E23333",
        ),
        limits=(-4, 6.7),
    )
    + p9.labs(x="")
    + p9.theme(
        figure_size=(6, 18),
        axis_text_x=p9.element_text(angle=90, hjust=0.5),
    )
)

[Figure: z-score heatmaps of the 0-mismatch values, recomputed vs. Maguire et al side by side]

Well, that looks super. While we can of course still see some discrepancies, the dataset I created is very close to the one the authors produced with their pipeline.

But this also shows that even for a rather simple in silico analysis, the person doing the analysis, the date of the analysis (UniProt annotations change over time) and small implementation details all matter for reproducibility. I still do not understand why my EC filtering was too strict, but it is clearly not what the authors did.

Mismatches

The pattern in the original heatmap is largely driven by the number of mismatches allowed for each k-mer.

For this, pure Python is not really the right tool anymore. Instead, I used Claude to help me write a small approximate string-matching tool for such comparisons.

In short, I wrote it in C++ to make use of the seqan3 library (docs.seqan.de/seqan3/3.4.0/). This way, I could use an FM index for the mismatch-tolerant k-mer search, which is the right data structure for this task.
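To get a feeling for why the index matters: the naive alternative would be to compare every viral k-mer against every human k-mer and count Hamming mismatches. A minimal sketch of that baseline (hamming_at_most is my own helper, not part of fuzzsearch or the paper):

def hamming_at_most(a: str, b: str, m: int) -> bool:
    """True if two equal-length strings differ in at most m positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > m:
                return False
    return True

With roughly 10^7 unique human k-mers per k (see the counts above) and thousands of k-mers per virus, that is billions of comparisons per proteome in pure Python. An FM index instead walks the query through a compressed suffix-array structure, branching only where mismatches are still allowed, so it never has to scan the whole reference.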

You can find the source code of the tool here: github.com/openpaul/fuzzsearch

The tool provides this CLI interface:

./build/fuzzsearch 
Usage: ./build/fuzzsearch [OPTIONS]
Options:
  -q, --query FILE        Query proteome FASTA file (required)
  -r, --reference FILE    Reference proteome FASTA file (required)
  -o, --output FILE       Output TSV file (required)
  -k, --kmer-lengths LIST Comma-separated kmer lengths (default: 8,12,18)
  -m, --max-mismatches N  Maximum mismatches allowed (default: 3)
  -b, --batch-size N      Number of query sequences to process in a batch (default: 10000)
  -i, --index             Only build index and exit (optional)
  -t, --threads N         Number of threads to use (default: 1)
  -h, --help              Show this help message

Example:
  ./build/fuzzsearch -q query.fasta -r reference.fasta -o output.tsv -k 8,12,18 -m 3 -t 4

To use it, I create FASTA files and compare all viruses to the human reference.

fasta_folder = data_folder / "fasta_files"
fasta_folder.mkdir(parents=True, exist_ok=True)


def proteome_to_fasta(proteome_id: str, fasta_folder: Path) -> Path:
    fasta_path = fasta_folder / f"{proteome_id}.fasta"
    if fasta_path.exists():
        return fasta_path

    proteome_df = download_uniprot_df(proteome_id)
    with open(fasta_path, "w") as fasta_file:
        for row in proteome_df.iter_rows(named=True):
            accession = row["Entry"]
            sequence = row["Sequence"]
            fasta_file.write(f">{accession}\n")
            # write sequence in lines of 60 characters
            for i in range(0, len(sequence), 60):
                fasta_file.write(sequence[i : i + 60] + "\n")
    return fasta_path


proteome_to_fasta(human_proteome_id, fasta_folder)
for proteome_id in virus_proteome_ids:
    proteome_to_fasta(proteome_id, fasta_folder)

With the fasta files prepared, I can build an index for the human reference, as that index can be reused across all searches.

! fuzzsearch --index \
!    -t 1 \
!    -q data/MHC_binding_predictions/fasta_files/UP000000344.fasta \
!    -r data/MHC_binding_predictions/fasta_files/UP000005640.fasta \
!    -m 3 \
!    -k 8,12,18 \
!    -o /dev/null 2> /dev/null
Loading existing index from: data/MHC_binding_predictions/fasta_files/UP000005640.fasta.fuzz.index
Index loaded successfully
Reference IDs loaded successfully (82997 sequences)
Index and reference IDs loaded successfully.
Indexing complete. Exiting.

With the index built, I can start the search across all viruses.

The results look like this:

query_id	reference_id	k	query_pos	reference_pos
Q89268	Q92547	8	268	577
Q89269	Q92547	8	44	577
P03132	Q92547	8	268	577
D5SGZ8	Q8IV77	8	61	9
D5SGZ8	P47211	8	65	17
Q89270	Q92547	8	44	577

So it preserves all the information we need for the EC and catalytic activity filtering, as we have the protein IDs on both sides of each hit.

import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

result_folder = data_folder / "mhcfuzz_results/"
result_folder.mkdir(parents=True, exist_ok=True)
fasta_folder = data_folder / "fasta_files"


def process_proteome(m: int, uniprot_id: str) -> str | None:
    """Process a single proteome with fuzzsearch."""
    reference_fasta = fasta_folder / f"{human_proteome_id}.fasta"
    query_fasta = fasta_folder / f"{uniprot_id}.fasta"
    outfile = result_folder / f"{uniprot_id}_m{m}.parquet"
    tsv_file = result_folder / f"{uniprot_id}_m{m}.tsv"

    if outfile.exists():
        return None

    try:
        cmd = [
            "fuzzsearch",
            "-t",
            str(2),
            "-m",
            str(m),
            "-k",
            "8,12,18",
            "-r",
            str(reference_fasta),
            "-q",
            str(query_fasta),
            "-o",
            str(tsv_file),
        ]
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.CalledProcessError as e:
        return f"Error processing {uniprot_id} with m={m}: {e}"
    sleep(5)  #  wait a bit to ensure file is written
    (
        pl.read_csv(
            tsv_file,
            separator="\t",
            schema={
                "query_id": pl.Utf8,
                "reference_id": pl.Utf8,
                "k": pl.Int8,
                "query_pos": pl.Int32,
                "reference_pos": pl.Int32,
            },
        )
        .with_columns(
            [
                pl.lit(m).alias("mismatch"),
                pl.lit(uniprot_id).alias("Proteome ID"),
            ]
        )
        .write_parquet(outfile, compression="snappy")
    )
    tsv_file.unlink()
    return None


# Build list of all tasks
tasks = [
    (m, uniprot_id)
    for m in [0, 1, 2, 3]
    for uniprot_id in original_df.get_column("Proteome ID").to_list()
]

# Run as multiple threads
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {
        executor.submit(process_proteome, m, uid): (m, uid) for m, uid in tasks
    }
    for future in tqdm(
        as_completed(futures), total=len(futures), desc="Running fuzzsearch"
    ):
        error = future.result()
        if error:
            print(error)
Running fuzzsearch: 100%|██████████| 536/536 [00:00<00:00, 719176.89it/s]

This computation took a while on my desktop PC (~3 h), but now I have all the data I need to recreate the plot. First, all hits are filtered for catalytic activity overlap, and then the percentage of matched k-mers is computed.

For that, I also need each virus’s total proteome size and number of proteins. But both are quickly computed.
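For a viral proteome with n proteins of summed length S, the number of k-mer positions is Σᵢ (Lᵢ − k + 1) = S − n · (k − 1), which is what the total_kmers column in the code below computes.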

# load in the data and remove hits with catalytic activity overlaps
result_folder = data_folder / "mhcfuzz_results/"
results = []


def get_ca_df(ca_df: pl.DataFrame) -> pl.LazyFrame:
    return pl.DataFrame(
        [
            {
                "Entry": row["Entry"],
                "ca_set": list(extract_ca_reactions(row["Catalytic activity"])),
                "ca": row["Catalytic activity"],
            }
            for row in ca_df.iter_rows(named=True)
        ]
    ).lazy()


human_ca_df = get_ca_df(human_proteome_df).rename({"ca_set": "human_ca_set"})
for m in [0, 1, 2, 3]:
    for proteome_id in tqdm(
        virus_proteome_ids, desc=f"Processing viruses m={m}"
    ):
        parquet_file = result_folder / f"{proteome_id}_m{m}.parquet"

        uniprot_df = download_uniprot_df(proteome_id)
        if not parquet_file.exists():
            continue
        try:
            virus_hits = pl.scan_parquet(parquet_file, low_memory=True)
        except Exception as e:
            print(f"Error reading {parquet_file}: {e}")
            continue

        virus_df = (
            virus_hits.join(
                get_ca_df(uniprot_df).rename({"ca_set": "viral_ca_set"}),
                left_on="query_id",
                right_on="Entry",
                how="left",
            )
            .join(
                human_ca_df,
                left_on="reference_id",
                right_on="Entry",
                how="left",
            )
            .with_columns(
                (
                    pl.col("human_ca_set").list.set_intersection("viral_ca_set")
                ).alias("ca_overlap")
            )
            .filter(pl.col("ca_overlap").list.len().eq(0))
        )

        n_proteins = uniprot_df.height
        proteome_size = sum(uniprot_df.get_column("Length").to_list())
        mismatches = "0" if m == 0 else f"≤ {m}"

        results.append(
            (
                virus_df.select(["k", "query_id", "query_pos"])
                # a match is unique per viral protein, position and k
                .unique(["k", "query_id", "query_pos"])
                .group_by("k")
                .agg(pl.len().alias("matched_kmers"))
                .with_columns(
                    # k-mer positions in the proteome: sum_i (L_i - k + 1) = S - n * (k - 1)
                    (
                        pl.lit(proteome_size)
                        - pl.lit(n_proteins) * (pl.col("k") - 1)
                    ).alias("total_kmers")
                )
                .with_columns(
                    (
                        pl.col("matched_kmers") / pl.col("total_kmers") * 100
                    ).alias("percentage_matched")
                )
                .with_columns(
                    pl.lit(proteome_id).alias("Proteome ID"),
                    pl.lit(mismatches).alias("mismatches"),
                )
                .collect()
            )
        )
results_df = pl.concat(results, how="vertical_relaxed")
Processing viruses m=0: 100%|██████████| 134/134 [00:00<00:00, 245.96it/s]
Processing viruses m=1: 100%|██████████| 134/134 [00:01<00:00, 83.06it/s]
Processing viruses m=2: 100%|██████████| 134/134 [00:24<00:00,  5.49it/s]
Processing viruses m=3: 100%|██████████| 134/134 [13:09<00:00,  5.89s/it]

With this computed, we can again correlate the percentages to the published results.

combined_df = pl.concat(
    [
        results_df.with_columns(pl.lit("Recomputed").alias("Source")),
        original_df_zscored.with_columns(
            pl.lit("Maguire et al").alias("Source")
        ),
    ],
    how="diagonal_relaxed",
).select(["Proteome ID", "k", "Source", "mismatches", "percentage_matched"])

(
    p9.ggplot(
        combined_df.pivot(
            index=["Proteome ID", "k", "mismatches"],
            values="percentage_matched",
            on="Source",
        )
        .fill_null(0)
        .with_columns(
            pl.col("Recomputed") + 0.001, pl.col("Maguire et al") + 0.001
        )
    )
    + p9.aes(x="Recomputed", y="Maguire et al")
    + p9.geom_point()
    + p9.facet_grid("mismatches ~ k")
    + p9.scale_x_log10()
    + p9.scale_y_log10()
    + p9.geom_abline(linetype="dashed", color="red")
)

[Figure: recomputed vs. published percentages, log-log scatter faceted by mismatches and k]

This looks odd. For zero mismatches, the agreement between the original data and my recomputed dataset is very nice, I would say, and for k=12 and k=18 with up to 3 mismatches there is also good agreement. But the data clearly does not match up for the 8-mers once mismatches are allowed.

In other words, the authors find almost all viral 8-mers in the human proteome when 2 or 3 mismatches are allowed, while I do not find them nearly as easily. This might hint at a problem in my software.
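If I wanted to track this down, one option would be a brute-force spot check against the fuzzsearch output, reusing get_kmers_set and the hamming_at_most helper from above. A sketch (spot_check is my own helper; slow, but bearable for a small sample):

import random


def spot_check(
    viral_kmers: list[str], human_kmers: set[str], m: int, n: int = 20
) -> float:
    """Estimate the fraction of viral k-mers with a human match at
    <= m mismatches, by brute force over a random sample of n k-mers."""
    sample = random.sample(viral_kmers, min(n, len(viral_kmers)))
    hits = sum(
        any(hamming_at_most(q, h, m) for h in human_kmers) for q in sample
    )
    return hits / len(sample)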

For this blog post, I decided not to dive much deeper into this, since the data looks good for k=12 and k=18, with k=12 being the crucial k-mer size.

So let’s continue with this dataset by computing the z-scores and plotting the final heatmap.

recomputed_df_zscored = (
    results_df.pivot(
        index=["Proteome ID"],
        on=["mismatches", "k"],
        values="percentage_matched",
    )
    .fill_null(0)
    .unpivot(
        index=["Proteome ID"],
        variable_name="variable",
        value_name="percentage_matched",
    )
    # Yes this is a hack, but this was a fast way
    # to get the zero values into the dataframe
    .with_columns(
        pl.col("variable")
        .str.extract(r"\"(.*)\".*,(\d+)\}", 1)
        .alias("mismatches"),
        pl.col("variable")
        .str.extract(r"\"(.*)\".*,(\d+)\}", 2)
        .alias("k")
        .cast(pl.Int8),
    )
    .with_columns(
        [
            (
                (
                    pl.col("percentage_matched")
                    - pl.col("percentage_matched")
                    .mean()
                    .over(["k", "mismatches"])
                )
                / pl.col("percentage_matched").std().over(["k", "mismatches"])
            ).alias("zscore")
        ]
    )
    .join(suppl1, on="Proteome ID", how="left")
    .with_columns(
        pl.col("Taxon mnemonic").cast(mnemonic_enum).alias("Taxon mnemonic")
    )
)

Now the data is all there, and I can plot panel 1b. And I think it has turned out well.

The key is really the filtering of homologous function, which the authors base on catalytic activity and EC numbers. That matching logic clearly shapes many of the values. Especially when a virus proteome has only a few proteins, and thus few hits, this filtering step is impactful; as always, larger numbers are more forgiving.

(
    p9.ggplot(
        recomputed_df_zscored,
        p9.aes(x="mismatches", y="Taxon mnemonic", fill="zscore"),
    )
    + p9.geom_tile()
    + p9.facet_grid(". ~ k", scales="free", space="free")
    + p9.scale_fill_gradientn(
        colors=(
            "#045375",
            "#089099",
            "#7CCBA2",
            "#FCDE9C",
            "#FAB27B",
            "#E23333",
        ),
        limits=(-4, 6.7),
    )
    + p9.labs(x="")
    + p9.theme(figure_size=(6, 18))
)

[Figure: the recomputed Figure 1b heatmap, using the mismatch-tolerant fuzzsearch results]

This post has gotten longer than usual, but it recaptures the idea of this blog: to recreate panels from papers with my own source code, based only on the descriptions published in the paper.

I think it worked rather well, although the values do deviate a bit between my recreation and the paper. I am fairly confident that if I went on to redo the paper’s statistical tests, I would arrive at similar conclusions.

That’s all for this panel. Thank you for reading. I hope you got something out of it.

So much data

In total, my k-mer matching produced over 7 GB of matches, totalling 1,603,808,123 rows.

That is so much data that I had to use polars’ lazy mode (which is faster anyway) and the low_memory flag to process it at all.
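The pattern is the same one used in the analysis above; a minimal sketch (the file name is just an example):

import polars as pl

# Lazily scan one result file; nothing is loaded into memory at this point
hits = pl.scan_parquet("UP000000344_m3.parquet", low_memory=True)

# Projections and filters are pushed down into the scan, so only the needed
# columns materialize when .collect() finally runs
matched_per_k = (
    hits.select(["k", "query_id", "query_pos"])
    .unique()
    .group_by("k")
    .agg(pl.len().alias("matched_kmers"))
    .collect()
)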