Trying Helixer

Helixer is a eukaryotic gene prediction tool published in 2025. I became interested in this area during my PhD, when I made extensive use of GeneMark-ES and Augustus.

Both Augustus and GeneMark-ES work well, but they are a bit dated. In addition, GeneMark-ES requires a custom license and a key. In contrast, Helixer is licensed under GPL-3.0 and is available on GitHub.

Today I tried running Helixer.

Installation

The authors provide both a Docker image and a manual installation path. Docker would be the easiest route, but since Helixer uses CUDA for GPU acceleration and I have an AMD GPU, I will try the manual route first.

Since it appears to be a Python project, I chose uv to manage the virtual environment. I added a ROCm-specific TensorFlow install step before the regular pip command:

uv venv -p 3.11
source .venv/bin/activate

# Install tensorflow for ROCm
uv pip install tensorflow-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/
# Then install the Helixer package as usual
uv pip install git+https://github.com/usadellab/Helixer

This failed because Helixer pins specific dependency versions, and the versions pulled in by the ROCm packages could not satisfy them.

[Screenshot of some of Helixer’s requirements]

I’m sure those are the versions used to build their local environment, but the version pinning seems stricter than necessary.

Since my ROCm setup appears to be the issue, I’ll try installing Helixer without GPU support first:

uv venv -p 3.11
source .venv/bin/activate

# Install the Helixer package as usual
uv pip install git+https://github.com/usadellab/Helixer

Note: I tried Python 3.10, 3.11, and 3.12; only 3.11 could resolve the dependencies.

Getting a Genome

Next we need a genome. I use the tab-separated RefSeq assembly summary from NCBI to look up the genome of the skin fungus Malassezia restricta. This genome was likely part of the Helixer training dataset.

uv pip install polars jupyter

import polars as pl

# Remote location of the RefSeq assembly summary
url = "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt"
# Local copy of the same file, which is what gets read below
path = "assembly_summary_refseq.txt"

# A long infer_schema_length helps Polars infer the schema
# quote_char=None is needed because some fields contain unescaped quotes
# skip_rows=1 skips the first line, which precedes the column header
df = pl.read_csv(path, separator="\t", skip_rows=1, infer_schema_length=100000, quote_char=None)
df.head()

shape: (5, 38)

#assembly_accession	bioproject	biosample	wgs_master	refseq_category	taxid	species_taxid	organism_name	infraspecific_name	isolate	version_status	assembly_level	release_type	genome_rep	seq_rel_date	asm_name	asm_submitter	gbrs_paired_asm	paired_asm_comp	ftp_path	excluded_from_refseq	relation_to_type_material	asm_not_live_date	assembly_type	group	genome_size	genome_size_ungapped	gc_percent	replicon_count	scaffold_count	contig_count	annotation_provider	annotation_name	annotation_date	total_gene_count	protein_coding_gene_count	non_coding_gene_count	pubmed_id
str	str	str	str	str	i64	i64	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	i64	i64	f64	i64	i64	i64	str	str	str	str	str	str	str
"GCF_000001215.4"	"PRJNA164"	"SAMN02803731"	"na"	"reference genome"	7227	7227	"Drosophila melanogaster"	"na"	"na"	"latest"	"Chromosome"	"Major"	"Full"	"2014-08-01"	"Release 6 plus ISO1 MT"	"The FlyBase Consortium/Berkele…	"GCA_000001215.4"	"identical"	"https://ftp.ncbi.nlm.nih.gov/g…	"na"	"na"	"na"	"haploid"	"invertebrate"	143706478	142553500	42.0	7	1869	2441	"FlyBase"	"FlyBase Release 6.54"	"2023-12-26"	"17872"	"13962"	"3543"	"10731132;12537568;12537572;125…
"GCF_000001405.40"	"PRJNA168"	"na"	"na"	"reference genome"	9606	9606	"Homo sapiens"	"na"	"na"	"latest"	"Chromosome"	"Patch"	"Full"	"2022-02-03"	"GRCh38.p14"	"Genome Reference Consortium"	"GCA_000001405.29"	"different"	"https://ftp.ncbi.nlm.nih.gov/g…	"na"	"na"	"na"	"haploid-with-alt-loci"	"vertebrate_mammalian"	3099441038	2948318359	41.0	24	470	996	"NCBI RefSeq"	"GCF_000001405.40-RS_2025_08"	"2025-08-01"	"59792"	"20076"	"22235"	"7219534;10508508;10830953;1123…
"GCF_000001635.27"	"PRJNA169"	"na"	"na"	"reference genome"	10090	10090	"Mus musculus"	"strain=C57BL/6J"	"na"	"latest"	"Chromosome"	"Major"	"Full"	"2020-06-24"	"GRCm39"	"Genome Reference Consortium"	"GCA_000001635.9"	"identical"	"https://ftp.ncbi.nlm.nih.gov/g…	"na"	"na"	"na"	"haploid"	"vertebrate_mammalian"	2728206152	2654605538	42.0	21	101	305	"NCBI RefSeq"	"GCF_000001635.27-RS_2024_02"	"2024-02-01"	"50763"	"22198"	"17705"	"12954771;19468303;21750661"
"GCF_000001735.4"	"PRJNA116"	"SAMN03081427"	"na"	"reference genome"	3702	3702	"Arabidopsis thaliana"	"ecotype=Columbia"	"na"	"latest"	"Chromosome"	"Minor"	"Full"	"2018-03-15"	"TAIR10.1"	"The Arabidopsis Information Re…	"GCA_000001735.2"	"identical"	"https://ftp.ncbi.nlm.nih.gov/g…	"na"	"na"	"na"	"haploid"	"plant"	119146348	118960704	36.0	5	5	100	"TAIR and Araport"	"Annotation submitted by TAIR a…	"2022-10-20"	"38312"	"27562"	"5884"	"10574454;10617197;10617198;111…
"GCF_000002075.1"	"PRJNA209509"	"SAMN02953658"	"AASC00000000.3"	"reference genome"	6500	6500	"Aplysia californica"	"na"	"F4 #8"	"latest"	"Scaffold"	"Major"	"Full"	"2013-05-15"	"AplCal3.0"	"Broad Institute"	"GCA_000002075.2"	"different"	"https://ftp.ncbi.nlm.nih.gov/g…	"na"	"na"	"na"	"haploid"	"invertebrate"	927296314	737783370	40.5	0	4331	164544	"NCBI RefSeq"	"NCBI Aplysia californica Annot…	"2020-08-18"	"21514"	"19405"	"2038"	"16230032"

This table describes the genome assemblies in NCBI RefSeq, one row per assembly.

df_m_restricta = df.filter(
    pl.col("organism_name") == pl.lit("Malassezia restricta"),
    pl.col("refseq_category") == pl.lit("reference genome"),
)
df_m_restricta.select(["#assembly_accession", "organism_name", "refseq_category", "ftp_path"])

shape: (1, 4)

#assembly_accession	organism_name	refseq_category	ftp_path
str	str	str	str
"GCF_003290485.1"	"Malassezia restricta"	"reference genome"	"https://ftp.ncbi.nlm.nih.gov/g…

import requests
from pathlib import Path

folder = Path("data")
folder.mkdir(exist_ok=True)

for row in df_m_restricta.iter_rows(named=True):
    ftp_path = row["ftp_path"]
    # The genome FASTA is named after the last component of the assembly directory
    filename = ftp_path.split("/")[-1] + "_genomic.fna.gz"
    url = f"{ftp_path}/{filename}"
    print(f"Downloading {url} to {folder/filename}")
    response = requests.get(url, timeout=60)
    # Fail loudly instead of writing an error page to disk
    response.raise_for_status()
    with open(folder/filename, "wb") as f:
        f.write(response.content)

Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/290/485/GCF_003290485.1_ASM329048v1/GCF_003290485.1_ASM329048v1_genomic.fna.gz to data/GCF_003290485.1_ASM329048v1_genomic.fna.gz
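
Before handing the file to Helixer, it can be worth checking that the download is a readable gzipped FASTA. The following is a minimal sketch using only the standard library; the helper name fasta_stats is my own and not part of Helixer or any NCBI tooling.

```python
import gzip
from pathlib import Path


def fasta_stats(path):
    """Return (number of sequences, total bases) for a gzipped FASTA file."""
    n_seqs = 0
    n_bases = 0
    # "rt" decompresses on the fly and yields text lines
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_seqs += 1  # a header line starts a new record
            else:
                n_bases += len(line)  # a sequence line
    return n_seqs, n_bases


# Path taken from the download step; skipped if the file is absent
genome = Path("data/GCF_003290485.1_ASM329048v1_genomic.fna.gz")
if genome.exists():
    n_seqs, n_bases = fasta_stats(genome)
    print(f"{n_seqs} sequences, {n_bases} bp")
```

A truncated or corrupted download would typically raise an error inside gzip while reading, so a clean run is a reasonable sign that the file arrived intact.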

Running Helixer

Now that I have a genome I should be able to run the gene inference. The tutorial says to run:

Helixer.py \
  --lineage fungi \
  --fasta-path data/GCF_003290485.1_ASM329048v1_genomic.fna.gz  \
  --species Malassezia_restricta \
  --gff-output-path data/Malassezia_restricta_helixer.gff3

But this initially produced an error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/paul/.local/share/Helixer/models/fungi'

It seems I needed to download the models first; I had skipped that step.

So I downloaded the models:

python .venv/bin/fetch_helixer_models.py

That placed the models in my home folder (/home/$USER/.local/share/Helixer/models). That’s not ideal because it clutters my machine, so I’ll need to remember to delete them later.
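
To judge how much clutter that actually is, the size of the model folder can be summed up with a few lines of Python. This is only a sketch; dir_size_bytes is a hypothetical helper of mine, and the path is the one reported above.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Location where fetch_helixer_models.py placed the models on my machine
models = Path.home() / ".local/share/Helixer/models"
print(f"{models}: {dir_size_bytes(models) / 1e6:.1f} MB")
```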

When I reran the command I got another error:

Error: helixer_post_bin not found in $PATH, this is required for Helixer.py to complete.

It looked like I was missing a required command. Searching the repository locally I couldn’t find it, but the README shows I had skipped another install step.

I needed to install HelixerPost, their Rust post-processing tool.

Since I have Rust installed, this should be straightforward:

cargo install --git https://github.com/usadellab/HelixerPost.git

On my Debian machine I also needed an additional system dependency: libhdf5-dev

sudo apt install libhdf5-dev

With that installed, I could build the post-processor.
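
Both errors so far came from skipped install steps: a missing model folder and a missing binary in $PATH. A small preflight check could catch these before starting a run. The sketch below is my own and not part of Helixer; the function name preflight is hypothetical, and the default model location is simply the one from the FileNotFoundError on my machine.

```python
import shutil
from pathlib import Path


def preflight(lineage="fungi", model_root=None, tools=("Helixer.py", "helixer_post_bin")):
    """Return a list of human-readable problems; an empty list means ready."""
    if model_root is None:
        # Assumed default, taken from the error message I got
        model_root = Path.home() / ".local/share/Helixer/models"
    problems = []
    for tool in tools:
        # shutil.which looks the command up in PATH, like the shell does
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found in PATH")
    model_dir = Path(model_root) / lineage
    if not model_dir.is_dir():
        problems.append(f"no model directory at {model_dir}")
    return problems


for problem in preflight():
    print("missing:", problem)
```

If nothing is printed, the commands and the model folder for the chosen lineage were found.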

I tried running Helixer again:

Helixer.py \
  --lineage fungi \
  --fasta-path data/GCF_003290485.1_ASM329048v1_genomic.fna.gz  \
  --species Malassezia_restricta \
  --gff-output-path data/Malassezia_restricta_helixer.gff3

This time it ran and I waited for the results. I have 16 CPU threads and Helixer used them fully, which is good since I don’t have a compatible GPU.

After just under three minutes it finished and I saw:

Total: 10585288bp across 1474 windows

Helixer successfully finished the annotation of data/GCF_003290485.1_ASM329048v1_genomic.fna.gz in 0.05 hours. GFF file written to data/Malassezia_restricta_helixer.gff3.

This is very fast, especially without a GPU.
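
For a quick look at what was predicted, the GFF3 output can be summarized by feature type. GFF3 feature lines have nine tab-separated columns, with the feature type in the third. The following is a minimal sketch under that assumption; count_gff3_features is a hypothetical helper of mine, and it simply reports whatever types the file contains.

```python
from collections import Counter
from pathlib import Path


def count_gff3_features(path):
    """Count GFF3 records per feature type (column 3)."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                continue  # not a standard 9-column feature line
            counts[fields[2]] += 1
    return counts


# Output path from the Helixer.py call; skipped if the file is absent
gff = Path("data/Malassezia_restricta_helixer.gff3")
if gff.exists():
    for feature_type, n in count_gff3_features(gff).most_common():
        print(f"{feature_type}\t{n}")
```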

I also tried the Docker image for comparison. The following command was generated by Claude Sonnet 4.6 from the Docker README:

docker run --rm \
  --mount type=bind,source="$(pwd)"/data,target=/home/helixer_user/shared \
  --mount type=bind,source="$HOME/.local/share/Helixer",target=/home/helixer_user/.local/share/Helixer \
  gglyptodon/helixer-docker:latest \
  Helixer.py \
    --lineage fungi \
    --fasta-path /home/helixer_user/shared/GCF_003290485.1_ASM329048v1_genomic.fna.gz \
    --species Malassezia_restricta \
    --gff-output-path /home/helixer_user/shared/Malassezia_restricta_helixer.gff3

The Docker command also launched as expected and used all CPU threads; it finished after a few minutes. So both approaches work well.

Conclusion

Overall, Helixer seems to work very well. It’s worth reading the installation instructions first, but even going in blind, the project provides enough guardrails to make installation relatively easy. I like that the project uses Python and Rust, both sensible choices familiar to many people in bioinformatics.

One downside is that Helixer currently provides models only for fungi, plants, invertebrates, and vertebrates. The eukaryotes I’m interested in from my PhD are unicellular organisms that are often overlooked. They don’t fit neatly into those categories, except sometimes as fungi, and are often called algae or protists, which is a larger topic by itself.

The Helixer authors suggest training custom models for these cases. That is potentially a lot of work, so I won’t attempt it today.

All in all, Helixer seems like a useful addition to the eukaryotic gene-caller toolbox.

Cleaning up

I have now installed the post-processing tool, downloaded models into my home folder, and pulled the Docker image, so it’s worth cleaning up.

# Remove the HelixerPost binary installed via cargo
cargo uninstall helixer_post_bin

# Remove the downloaded Helixer models
rm -rf ~/.local/share/Helixer

# Remove the Docker image
docker rmi gglyptodon/helixer-docker:latest

# Finally remove the local .venv
rm -r .venv
