Helixer is a eukaryotic gene prediction tool published in 2025. I became interested in this area during my PhD, when I made extensive use of GeneMark-ES and Augustus.

Both Augustus and GeneMark-ES work well, but they are a bit dated. In addition, GeneMark-ES requires a custom license and a key. In contrast, Helixer is licensed under GPL-3.0 and is available on GitHub.

Today I tried running Helixer.

Installation

The authors provide both a Docker image and a manual installation path. Docker would be the easiest route, but Helixer’s GPU support relies on CUDA and I have an AMD GPU, so I tried the manual route first to see whether ROCm would work.

Since it appears to be a Python project, I chose uv to manage the virtual environment. I added a ROCm-specific TensorFlow install step before installing Helixer itself:

uv venv -p 3.11
source .venv/bin/activate

# Install tensorflow for ROCm
uv pip install tensorflow-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/
# Then install the Helixer package as usual
uv pip install git+https://github.com/usadellab/Helixer

This failed: Helixer pins specific versions of its dependencies, and the ROCm package versions installed by the commands above could not satisfy them.

[Screenshot: some of Helixer’s pinned requirements]

I’m sure those are the versions used to build their local environment, but the version pinning seems stricter than necessary.

Since my ROCm setup appears to be the issue, I’ll try installing Helixer without GPU support first:

uv venv -p 3.11
source .venv/bin/activate

# Install the Helixer package as usual
uv pip install git+https://github.com/usadellab/Helixer

Note: Only Python 3.11 worked for me. I tried 3.10, 3.11, and 3.12, and 3.11 was the only version for which the dependencies could be resolved.

Getting a Genome

Next we need a genome. I use the RefSeq assembly summary from NCBI, a tab-separated table, to locate the genome of the skin fungus Malassezia restricta. It’s likely this genome was part of the Helixer training dataset.

uv pip install polars jupyter

import polars as pl

# Remote location of the RefSeq assembly summary, kept for reference
url = "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt"
# The table is read from a local copy of that file
path = "assembly_summary_refseq.txt"

# Note: a long infer_schema_length often helps Polars infer the schema correctly
# quote_char=None is needed because some fields contain unescaped quotes
df = pl.read_csv(path, separator="\t", skip_rows=1, infer_schema_length=100000, quote_char=None)
df.head()

shape: (5, 38)
Columns: #assembly_accession, bioproject, biosample, wgs_master, refseq_category, taxid, species_taxid, organism_name, infraspecific_name, isolate, version_status, assembly_level, release_type, genome_rep, seq_rel_date, asm_name, asm_submitter, gbrs_paired_asm, paired_asm_comp, ftp_path, excluded_from_refseq, relation_to_type_material, asm_not_live_date, assembly_type, group, genome_size, genome_size_ungapped, gc_percent, replicon_count, scaffold_count, contig_count, annotation_provider, annotation_name, annotation_date, total_gene_count, protein_coding_gene_count, non_coding_gene_count, pubmed_id

Selected columns (the other 31 are omitted here for width):

#assembly_accession  refseq_category     taxid  organism_name              assembly_level  asm_name                  group
str                  str                 i64    str                        str             str                       str
"GCF_000001215.4"    "reference genome"  7227   "Drosophila melanogaster"  "Chromosome"    "Release 6 plus ISO1 MT"  "invertebrate"
"GCF_000001405.40"   "reference genome"  9606   "Homo sapiens"             "Chromosome"    "GRCh38.p14"              "vertebrate_mammalian"
"GCF_000001635.27"   "reference genome"  10090  "Mus musculus"             "Chromosome"    "GRCm39"                  "vertebrate_mammalian"
"GCF_000001735.4"    "reference genome"  3702   "Arabidopsis thaliana"     "Chromosome"    "TAIR10.1"                "plant"
"GCF_000002075.1"    "reference genome"  6500   "Aplysia californica"      "Scaffold"      "AplCal3.0"               "invertebrate"

This tab-separated table lists the genome assemblies in NCBI RefSeq, one row per assembly, including the download location in the ftp_path column.

# Keep only the reference genome of Malassezia restricta
df_m_restricta = df.filter(
    pl.col("organism_name") == "Malassezia restricta",
    pl.col("refseq_category") == "reference genome",
)
df_m_restricta.select(["#assembly_accession", "organism_name", "refseq_category", "ftp_path"])

shape: (1, 4)
#assembly_accession  organism_name           refseq_category     ftp_path
str                  str                     str                 str
"GCF_003290485.1"    "Malassezia restricta"  "reference genome"  "https://ftp.ncbi.nlm.nih.gov/g…"

import requests
from pathlib import Path

folder = Path("data")
folder.mkdir(exist_ok=True)

for row in df_m_restricta.iter_rows(named=True):
    ftp_path = row["ftp_path"]
    # The genomic FASTA is named after the last component of the assembly folder
    filename = ftp_path.split("/")[-1] + "_genomic.fna.gz"
    url = f"{ftp_path}/{filename}"
    print(f"Downloading {url} to {folder/filename}")
    response = requests.get(url, timeout=60)
    # Stop here on an HTTP error instead of writing an error page to disk
    response.raise_for_status()
    with open(folder/filename, "wb") as f:
        f.write(response.content)

Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/290/485/GCF_003290485.1_ASM329048v1/GCF_003290485.1_ASM329048v1_genomic.fna.gz to data/GCF_003290485.1_ASM329048v1_genomic.fna.gz
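
Before running anything on the download, a quick sanity check can be useful. The following is a minimal sketch and not part of the Helixer workflow: fasta_stats is a hypothetical helper that counts the sequences and bases in a gzipped FASTA file, assuming a plain nucleotide FASTA in which header lines start with ">".

```python
import gzip


def fasta_stats(path):
    """Return (number of sequences, total bases) for a gzipped FASTA file."""
    n_sequences = 0
    total_bp = 0
    # Mode "rt" decompresses on the fly and yields text lines
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Every header line starts a new sequence
                n_sequences += 1
            else:
                # Sequence line: count residues, ignoring the line ending
                total_bp += len(line.strip())
    return n_sequences, total_bp
```

Calling fasta_stats("data/GCF_003290485.1_ASM329048v1_genomic.fna.gz") returns a sequence count and a base total, which can be compared with the total that Helixer prints in its log.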

Running Helixer

Now that I have a genome I should be able to run the gene inference. The tutorial says to run:

Helixer.py \
  --lineage fungi \
  --fasta-path data/GCF_003290485.1_ASM329048v1_genomic.fna.gz  \
  --species Malassezia_restricta \
  --gff-output-path data/Malassezia_restricta_helixer.gff3

But this initially produced an error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/paul/.local/share/Helixer/models/fungi'

It seems I needed to download the models first; I had skipped that step.

So I downloaded the models:

python .venv/bin/fetch_helixer_models.py

That placed the models in my home folder (/home/$USER/.local/share/Helixer/models). That’s not ideal because it clutters my machine, so I’ll need to remember to delete them later.

When I reran the command I got another error:

Error: helixer_post_bin not found in $PATH, this is required for Helixer.py to complete.

It looked like I was missing a required command. Searching the repository locally I couldn’t find it, but the README shows I had skipped another install step.

I needed to install HelixerPost, their Rust post-processing tool.

Since I have Rust installed, this should be straightforward:

cargo install --git https://github.com/usadellab/HelixerPost.git

On my Debian machine I also needed an additional system dependency: libhdf5-dev

sudo apt install libhdf5-dev

With that installed, I could build the post-processor.
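
Both errors above came from install steps I had skipped, so a small pre-flight check would have caught them before the first run. This is a minimal sketch, assuming the models live under ~/.local/share/Helixer/models/<lineage>, as the error message on my machine suggests. preflight is a hypothetical helper of my own, not something Helixer ships.

```python
import shutil
from pathlib import Path


def preflight(lineage, models_dir=None, which=shutil.which):
    """Return a list of problems; an empty list means the run can start."""
    if models_dir is None:
        # Assumption: default location seen in the error message above
        models_dir = Path.home() / ".local/share/Helixer/models"
    problems = []
    # Both executables must be discoverable on PATH
    for tool in ("Helixer.py", "helixer_post_bin"):
        if which(tool) is None:
            problems.append(f"{tool} not found in PATH")
    # The models for the chosen lineage must have been downloaded
    if not (Path(models_dir) / lineage).exists():
        problems.append(f"no models for '{lineage}' in {models_dir}")
    return problems
```

Running preflight("fungi") inside the activated virtual environment returns an empty list once the models and HelixerPost are in place. The which parameter exists only so the check can be exercised without touching the real PATH.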

I tried running Helixer again:

Helixer.py \
  --lineage fungi \
  --fasta-path data/GCF_003290485.1_ASM329048v1_genomic.fna.gz  \
  --species Malassezia_restricta \
  --gff-output-path data/Malassezia_restricta_helixer.gff3

This time it ran and I waited for the results. I have 16 CPU threads and Helixer used them fully, which is good since I don’t have a compatible GPU.

After just under three minutes it finished and I saw:

Total: 10585288bp across 1474 windows

Helixer successfully finished the annotation of data/GCF_003290485.1_ASM329048v1_genomic.fna.gz in 0.05 hours. GFF file written to data/Malassezia_restricta_helixer.gff3.

This is very fast, especially without a GPU.
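
To get a first impression of the output, the feature types in the GFF3 file can be tallied. This is a minimal sketch that relies only on the GFF3 convention that the feature type sits in the third tab-separated column. count_feature_types is a hypothetical helper, and it makes no assumption about which feature types Helixer writes.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 records per feature type (third tab-separated column)."""
    counts = Counter()
    for line in gff3_lines:
        # Skip blank lines, directives ("##...") and comments ("#...")
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for data/Malassezia_restricta_helixer.gff3 can be passed directly. The count stored under "gene", if that type is present, gives the number of predicted genes.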

I also tried the Docker image for comparison. The following command was generated by Claude Sonnet 4.6 from the Docker README:

docker run --rm \
  --mount type=bind,source="$(pwd)"/data,target=/home/helixer_user/shared \
  --mount type=bind,source="$HOME/.local/share/Helixer",target=/home/helixer_user/.local/share/Helixer \
  gglyptodon/helixer-docker:latest \
  Helixer.py \
    --lineage fungi \
    --fasta-path /home/helixer_user/shared/GCF_003290485.1_ASM329048v1_genomic.fna.gz \
    --species Malassezia_restricta \
    --gff-output-path /home/helixer_user/shared/Malassezia_restricta_helixer.gff3

The Docker command also launched as expected and used all CPU threads; it finished after a few minutes. So both approaches work well.

Conclusion

Overall, Helixer seems to work very well. It’s worth reading the installation instructions first, but even going in blind, the project provides enough guardrails, such as the explicit error about the missing helixer_post_bin, to make installation relatively easy. I like that the project uses Python and Rust, sensible choices familiar to many people in bioinformatics.

One downside is that Helixer currently provides models only for fungi, plants, invertebrates, and vertebrates. The eukaryotes I’m interested in (from my PhD) are unicellular organisms that are often overlooked. They don’t fit neatly into those categories, except sometimes as fungi, and are often called algae or protists, which is a larger topic by itself.

The Helixer authors suggest training custom models for these cases. That is potentially a lot of work, so I won’t attempt it today.

All in all, Helixer seems like a useful addition to the eukaryotic gene-caller toolbox.

Cleaning up

I have now installed the post-processing tool, downloaded models into my home folder, and pulled the Docker image, so it’s worth cleaning up.

# Remove the HelixerPost binary installed via cargo
cargo uninstall helixer_post_bin

# Remove the downloaded Helixer models
rm -rf ~/.local/share/Helixer

# Remove the Docker image
docker rmi gglyptodon/helixer-docker:latest

# Finally remove the local .venv
rm -r .venv