Visual Summary | GENEB: Why Genomic Models Are Hard to Compare

Summary (Overview)

GENEB is a large-scale diagnostic benchmark that evaluates 40 genomic foundation models on 100 tasks across 13 functional categories using a unified linear-probing protocol, enabling controlled cross-model comparison.
Model rankings vary sharply across task categories; aggregate leaderboards are unstable and mask important task-level trade-offs.
Scale shows a statistically significant but imperfect correlation with performance (Spearman $\rho=0.565$, $p<0.001$; rising to $\rho=0.685$ after excluding a prokaryotic outlier), and architecture and pretraining alignment frequently outweigh parameter count.
Transformer-based models generally outperform the evaluated state-space alternative (Mamba-SSM), with the largest gaps on cross-species regulatory tasks; tokenization effects interact with architecture and task.
Few-shot robustness follows an inverse pattern: models with low full-shot ceilings show smaller absolute drops. In 8 of 13 categories, the model that leads under full supervision no longer leads under 10-shot evaluation.

Introduction and Theoretical Foundation

The rapid expansion of genomic foundation models has produced a heterogeneous landscape of architectures, tokenization strategies, and pretraining corpora. However, the field lacks a unified evaluation infrastructure: models are evaluated on disjoint benchmarks, under incompatible protocols, and often reported as state-of-the-art within narrowly defined settings. This fragmentation makes it impossible to answer basic questions about relative model quality and has led to a widening gap between claims of superiority and reproducible evidence.

The authors introduce GENEB to address this gap. Inspired by MTEB in NLP (Muennighoff et al., 2023), GENEB provides a shared reference point for principled cross-model comparison. The benchmark evaluates frozen representations via lightweight probing, isolating representation quality and enabling controlled comparison across architecture, tokenization, and pretraining data.

Methodology

Models and Tasks: 40 genomic foundation models are evaluated on 100 DNA classification tasks drawn from nine widely used benchmarks, spanning 13 functional categories: Histone Modifications (30 tasks), Promoters (22), Enhancers (8), DNA Methylation (8), Splice Sites (7), lncRNA (6), Mouse Enhancers (5), TF Binding (5), Species Classification (3), Regulatory (2), Virus/Phage (2), Coding/Non-coding (1), and Chromatin Accessibility (1). Models cover architectures including Transformer-encoder, Transformer-decoder, Mamba-SSM, Hybrid-Mamba-MoE, Hyena, CNN-Transformer hybrids, Graph-Transformer, and StripedHyena; parameter counts range from 2M to 7B.

Probing Protocol: Frozen sequence embeddings serve as features for logistic regression (max_iter=1000), evaluated in 1-shot, 10-shot, and full-data regimes. Results are averaged over five fixed random seeds. The metric is the Matthews Correlation Coefficient (MCC), which is robust to class imbalance. Tasks exceeding $10^5$ sequences are subsampled.
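The two non-classifier pieces of this protocol, the metric and the seeded k-shot sampling, can be sketched in plain Python. This is a minimal illustration, not the benchmark's code: `mcc` handles only the binary case, `k_shot_indices` is a hypothetical helper, and reading "k-shot" as k examples per class is an assumption the summary does not confirm.

```python
import math
import random
from collections import defaultdict


def mcc(y_true, y_pred):
    """Binary Matthews Correlation Coefficient for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal is empty and MCC is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def k_shot_indices(labels, k, seed):
    """Pick up to k training indices per class (assumed meaning of k-shot)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    chosen = []
    for y in sorted(by_class):
        idx = by_class[y]
        chosen.extend(rng.sample(idx, min(k, len(idx))))
    return sorted(chosen)


# Imbalanced toy labels: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
always_negative = [0] * 10
print(mcc(y_true, always_negative))  # majority-class guessing scores 0.0
print(mcc(y_true, y_true))           # perfect prediction scores 1.0
print(k_shot_indices(y_true, k=1, seed=0))
```

The toy case shows why MCC suits imbalanced tasks: always predicting the majority class is 90% accurate here yet scores 0.0.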

Controlled Comparisons: To isolate the effects of architecture, tokenization, and pretraining data, 30 matched model pairs are constructed that differ in exactly one factor while holding others constant (see Appendix E.3 for full enumeration).
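One way to picture the pairing rule is a filter over model metadata that keeps only pairs differing in exactly one of the three factors. The sketch below is an assumption-laden illustration: the model records, factor names, and the `matched_pairs` helper are hypothetical, and the actual benchmark may also control for factors such as parameter count.

```python
from itertools import combinations

# The three factors named in the text; field names are illustrative.
FACTORS = ("architecture", "tokenizer", "corpus")

# Hypothetical model records, not the benchmark's actual models.
models = [
    {"name": "enc-bpe-human", "architecture": "encoder", "tokenizer": "bpe", "corpus": "human"},
    {"name": "enc-kmer-human", "architecture": "encoder", "tokenizer": "kmer", "corpus": "human"},
    {"name": "enc-bpe-multi", "architecture": "encoder", "tokenizer": "bpe", "corpus": "multi"},
    {"name": "dec-kmer-multi", "architecture": "decoder", "tokenizer": "kmer", "corpus": "multi"},
]


def matched_pairs(records):
    """Return (name_a, name_b, factor) for pairs that differ in exactly one factor."""
    pairs = []
    for a, b in combinations(records, 2):
        differing = [f for f in FACTORS if a[f] != b[f]]
        if len(differing) == 1:
            pairs.append((a["name"], b["name"], differing[0]))
    return pairs


for pair in matched_pairs(models):
    print(pair)
```

In this toy set only two pairs qualify (one tokenization pair and one corpus pair); every other combination differs in two or more factors and would confound the comparison.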

Empirical Validation / Results

Scale–Performance Relationship: Model size shows a significant aggregate association with macro-MCC ($\rho=0.565$, $p<0.001$), strengthening to $\rho=0.685$ ($p<0.001$) after excluding the prokaryotic-only outlier Evo-1-131K. However, among the 36 in-domain models there are 31 instances in which a model at least 5× smaller outperforms a larger counterpart. For example, MutBERT (86M) exceeds eccDNAMamba (1B) by +0.110 macro-MCC despite being 11.6× smaller.
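The "smaller model wins" count can be read as a pairwise scan over models. The sketch below assumes that reading; the `scale_inversions` helper, the model names, and all macro-MCC values are placeholders for illustration, and only the 86M and 1B parameter counts come from the text.

```python
def scale_inversions(models, min_ratio=5.0):
    """Return (small, large) pairs where the model at least min_ratio times smaller scores higher."""
    found = []
    for small in models:
        for large in models:
            size_gap = large["params"] >= min_ratio * small["params"]
            if size_gap and small["mcc"] > large["mcc"]:
                found.append((small["name"], large["name"]))
    return found


# Placeholder scores; parameter counts of the first two mirror the 86M vs 1B example.
toy_models = [
    {"name": "small-86M", "params": 86e6, "mcc": 0.55},
    {"name": "large-1B", "params": 1e9, "mcc": 0.44},
    {"name": "huge-7B", "params": 7e9, "mcc": 0.60},
    {"name": "small-100M", "params": 100e6, "mcc": 0.40},
]

print(scale_inversions(toy_models))
print(round(1e9 / 86e6, 1))  # size ratio behind the 11.6x figure
```

Only one pair in the toy list qualifies, because the other small model scores below every model that is five or more times larger.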

Per-Category Scaling Correlations (Table 1):

Category	$\rho$	$p$
Histone modifications	0.579	<0.001
lncRNA	0.568	<0.001
Splice sites	0.537	<0.001
Enhancers	0.490	0.001
Promoters	0.487	0.001
Coding/non-coding	0.482	0.002
Mouse enhancers	0.474	0.002
Virus/phage	0.434	0.005
Regulatory	0.377	0.017
TF binding	0.356	0.024
DNA methylation	0.345	0.030
Species classification	0.304	0.057
Chromatin accessibility	0.238	0.140

Scaling is significant in 11 of 13 categories, with substantial variation in strength. Chromatin accessibility and species classification show no significant scaling.
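The "11 of 13" count can be reproduced from Table 1 under the assumption that significance means $p<0.05$ with no multiple-comparison correction; the summary does not state the threshold, so treat `ALPHA` as an assumption. Entries reported as `<0.001` are stored as their upper bound.

```python
ALPHA = 0.05  # assumed threshold; not stated in the summary

# (Spearman rho, p-value) per category from Table 1; "<0.001" stored as 0.001.
table1 = {
    "Histone modifications": (0.579, 0.001),
    "lncRNA": (0.568, 0.001),
    "Splice sites": (0.537, 0.001),
    "Enhancers": (0.490, 0.001),
    "Promoters": (0.487, 0.001),
    "Coding/non-coding": (0.482, 0.002),
    "Mouse enhancers": (0.474, 0.002),
    "Virus/phage": (0.434, 0.005),
    "Regulatory": (0.377, 0.017),
    "TF binding": (0.356, 0.024),
    "DNA methylation": (0.345, 0.030),
    "Species classification": (0.304, 0.057),
    "Chromatin accessibility": (0.238, 0.140),
}

significant = [c for c, (_, p) in table1.items() if p < ALPHA]
not_significant = [c for c, (_, p) in table1.items() if p >= ALPHA]

print(len(significant), "of", len(table1))
print(not_significant)
```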

Architecture Comparison: Under controlled conditions (matched pretraining corpus and tokenization), Transformer models show substantial advantages over the evaluated state-space model. Omni-DNA-1B (Transformer-decoder) exceeds eccDNAMamba (Mamba-SSM) by +0.149 macro-MCC; GenomeOcean-500M shows a +0.131 gap over the same baseline. Within Transformers, the encoder–decoder comparison is task-dependent: GENA-LM-Large-T2T (encoder) beats OmniNA-220M (decoder) by +0.127 under matched conditions. Architecture gaps are largest on cross-species regulatory tasks (e.g., +0.355 on virus/phage, +0.305 on mouse enhancers). Chromatin accessibility is a notable exception where the SSM model (eccDNAMamba) beats GenomeOcean-500M by +0.124.

Tokenization Effects: No global ordering emerges. Under matched conditions, BPE and k-mer perform comparably in Transformer-encoders; in Transformer-decoders, BPE exceeds k-mer on average but with high variance. Single-nucleotide tokenization (MutBERT) outperforms BPE in human-pretrained encoder comparisons. Non-standard vocabularies show variable results.

Pretraining Data Effects (Transfer Learning): Controlled comparisons (Table 2) show that multi-species pretraining yields an average +0.012 macro-MCC improvement over human-only, with structured advantages on chromatin accessibility (+0.062), splice sites (+0.038), species classification (+0.031), and lncRNA (+0.022). Virus/phage tasks favor human-only ($\Delta=-0.034$). Multi-species vs. microbial pretraining shows the largest corpus effect ($\Delta=+0.084$), with microbial-focused pretraining transferring poorly to eukaryotic tasks (e.g., splice sites $\Delta=+0.222$). Eukaryotic-genes vs. multi-species (single pair) shows a +0.063 advantage for gene-focused pretraining.

Table 2: Human vs. Multi-species Pretraining (Per-Category $\Delta$ MCC, 6 Controlled Pairs)

Task Category	$\Delta$ MCC	Wins (Multi/6)
Overall (macro)	+0.012	4/6
Chromatin Acc.	+0.062	6/6
Splice Sites	+0.038	4/6
Species Class.	+0.031	3/6
Mouse Enh.	+0.023	4/6
lncRNA	+0.022	5/6
Histone Mod.	+0.009	4/6
Regulatory	+0.008	3/6
DNA Methylation	+0.005	2/6
Coding/Non-cod.	+0.000	3/6
Enhancers	−0.001	3/6
Promoters	−0.001	2/6
TF Binding	−0.005	2/6
Virus/Phage	−0.034	2/6

Color key from the original table: green marks a multi-species advantage ($\Delta>+0.02$); gray marks parity ($|\Delta|\leq 0.02$); red marks a human-only advantage ($\Delta<-0.02$).
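The per-category rows of Table 2 can be checked against the overall figure and sorted into the three bands of the key. The sketch assumes "macro" means an unweighted mean over the 13 categories, which the summary does not spell out; the `band` helper is illustrative.

```python
# Per-category delta MCC (multi-species minus human-only) from Table 2.
deltas = {
    "Chromatin Acc.": 0.062,
    "Splice Sites": 0.038,
    "Species Class.": 0.031,
    "Mouse Enh.": 0.023,
    "lncRNA": 0.022,
    "Histone Mod.": 0.009,
    "Regulatory": 0.008,
    "DNA Methylation": 0.005,
    "Coding/Non-cod.": 0.000,
    "Enhancers": -0.001,
    "Promoters": -0.001,
    "TF Binding": -0.005,
    "Virus/Phage": -0.034,
}


def band(delta, threshold=0.02):
    """Map a delta to the table's three bands."""
    if delta > threshold:
        return "multi-species"
    if delta < -threshold:
        return "human"
    return "parity"


macro = sum(deltas.values()) / len(deltas)  # assumed: unweighted category mean
counts = {}
for d in deltas.values():
    counts[band(d)] = counts.get(band(d), 0) + 1

print(round(macro, 3))
print(counts)
```

Under that assumption the category mean rounds to the reported +0.012, and most categories fall in the parity band, which is consistent with the small overall effect.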

Few-Shot Robustness: Mean macro-MCC degrades from 0.488 (full-shot) to 0.253 (10-shot) to 0.106 (1-shot). Degradation is structured by category: promoter prediction retains 38.8% of full-shot MCC at 1-shot, while virus/phage (93.5% drop), DNA methylation (93.2%), and lncRNA (91.3%) collapse. Performance follows an inverse pattern: models with low full-shot ceilings show the smallest absolute drops, but this reflects their lower starting point rather than greater robustness.
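The difference between absolute drop and relative retention is easy to see with a little arithmetic. In the sketch below the three mean macro-MCC values come from the text, while the "weak" and "strong" models are hypothetical numbers chosen only to illustrate the inverse pattern.

```python
def retention(full_shot, few_shot):
    """Fraction of full-shot MCC that survives in the low-data regime."""
    return few_shot / full_shot


# Mean macro-MCC values reported in the text.
mean_full, mean_10, mean_1 = 0.488, 0.253, 0.106
print(round(100 * retention(mean_full, mean_10), 1))  # percent retained at 10-shot
print(round(100 * retention(mean_full, mean_1), 1))   # percent retained at 1-shot

# Hypothetical models: (full-shot MCC, 10-shot MCC).
weak = (0.20, 0.10)
strong = (0.60, 0.40)

weak_drop = weak[0] - weak[1]
strong_drop = strong[0] - strong[1]

# The weak model loses less in absolute terms but keeps a smaller share of its score.
print(weak_drop < strong_drop)
print(retention(*weak) < retention(*strong))
```

On average, roughly half of full-shot MCC survives at 10-shot and about a fifth at 1-shot. The hypothetical pair shows why a small absolute drop alone says little: the weaker model drops less yet retains a smaller fraction of its score.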

Hard Frontier: 28 tasks have mean MCC below 0.35, dominated by 4mC methylation (e.g., G. subterraneus 0.061) and plant lncRNA. Even the strongest models barely exceed 0.4 on 4mC tasks, indicating that scaling alone does not close the gap.

High-Variance Tasks: The 13 tasks with cross-model standard deviation above 0.12 show which design choices separate models most. Pretraining scope and architectural family predict top/bottom placement: multi-species and eukaryotic-gene pretraining capture 32 of the 39 top-3 placements (13 tasks × 3 positions), while human-only pretraining is concentrated in the bottom positions (29 of 39).
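Selecting high-variance tasks amounts to thresholding the spread of scores across models. The sketch below uses the population standard deviation from Python's `statistics` module; whether the study used the population or sample form is not stated, and the task names and scores are hypothetical.

```python
import statistics

THRESHOLD = 0.12  # cutoff quoted in the text

# Hypothetical per-task MCC scores, one value per model.
task_scores = {
    "task_a": [0.10, 0.50, 0.90],  # models disagree strongly
    "task_b": [0.50, 0.52, 0.54],  # models are nearly tied
}

high_variance = [
    task
    for task, scores in task_scores.items()
    if statistics.pstdev(scores) > THRESHOLD  # population std is an assumption
]

print(high_variance)
```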

Domain Mismatch: Prokaryotic-only pretraining (Evo-1-131K) ranks among the weakest despite 7B parameters, due to the eukaryotic skew of GENEB tasks. Recomputing scale correlation without it raises $\rho$ from 0.565 to 0.685.
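A single large but poorly matched model can pull a rank correlation down sharply, which is the mechanism behind the change from 0.565 to 0.685. The sketch below implements Spearman's $\rho$ as the Pearson correlation of average ranks in plain Python; the sizes and scores are toy values, not benchmark results.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho; assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data: five models where score rises with size, plus one
# largest-but-weakest model standing in for a domain-mismatched outlier.
sizes = [1, 2, 3, 4, 5, 6]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.0]

rho_with_outlier = spearman(sizes, scores)
rho_without_outlier = spearman(sizes[:5], scores[:5])
print(rho_with_outlier, rho_without_outlier)
```

The toy effect is far more extreme than the reported one because the outlier is one of only six points; with 40 models, one mismatched entry moves $\rho$ less.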

Theoretical and Practical Implications

Leaderboard Instability: Aggregate rankings are unreliable for model selection; per-category evaluation is essential. In 8 of 13 categories, the model that leads under full supervision no longer leads under 10-shot evaluation.
Architecture and Pretraining Matter More than Scale: In many tasks, architecture (Transformer vs. SSM) and pretraining alignment (multi-species vs. human-only) produce performance gaps larger than differences between models 5–10× apart in parameter count.
Few-Shot Sensitivity: Absolute few-shot drops conflate task tractability with model quality; category-resolved evaluation is the appropriate diagnostic.
Practical Recommendations:
- For compact deployment: MutBERT (86M) is the strongest sub-100M model in 8 of 13 categories.
- For epigenomic-profile tasks (TF binding, regulatory, enhancers): Enformer and SPACE (CNN-Transformer hybrids) lead.
- For hard regimes (DNA methylation, plant lncRNA): no model achieves high performance; progress requires advances beyond scaling.
- For few-shot scenarios: the best full-shot model is not necessarily the best low-data model; practitioners should consult 10-shot results.
Reproducibility: About 25% of surveyed models (13 of 53) could not be evaluated due to unavailable weights, broken code, or hardware constraints, underscoring the need for better release practices.
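The leaderboard-instability point in the implications above comes down to comparing per-category winners across data regimes. A minimal sketch of that comparison follows; the `winners` and `changed_winners` helpers, the model names, and all scores are hypothetical.

```python
def winners(scores_by_category):
    """Map each category to its top-scoring model."""
    return {
        category: max(by_model, key=by_model.get)
        for category, by_model in scores_by_category.items()
    }


def changed_winners(full_shot, few_shot):
    """Categories whose top model differs between the two regimes."""
    top_full, top_few = winners(full_shot), winners(few_shot)
    return sorted(c for c in top_full if top_full[c] != top_few[c])


# Hypothetical MCC scores per category and model.
full_shot = {
    "promoters": {"model_1": 0.8, "model_2": 0.7},
    "splice sites": {"model_1": 0.9, "model_2": 0.6},
    "enhancers": {"model_1": 0.5, "model_2": 0.6},
}
ten_shot = {
    "promoters": {"model_1": 0.4, "model_2": 0.5},
    "splice sites": {"model_1": 0.5, "model_2": 0.3},
    "enhancers": {"model_1": 0.2, "model_2": 0.3},
}

print(changed_winners(full_shot, ten_shot))
```

Applied to real per-category results, the length of the returned list is the kind of count the "8 of 13" figure reports.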

Conclusion

GENEB enables controlled, systematic comparison of 40 genomic foundation models across 100 tasks. The key findings are:

Scale shows a substantial but imperfect association with performance ($\rho=0.565$).
Architecture and pretraining alignment frequently offset scale differences.
Category-level rankings are unstable under few-shot regimes.
Transformer-based models generally outperform the evaluated state-space alternative, with domain-specific exceptions.
Tokenization effects are context-dependent.
Microbial-only pretraining transfers poorly to eukaryotic tasks.

These results argue for category-aware, controlled evaluation over aggregate leaderboards. GENEB provides a reference framework for principled model selection and highlights limitations of current evaluation practices. Future work should extend coverage to long-range regulatory tasks, prokaryotic and viral genomics, and full fine-tuning comparisons.