# Beyond IID: How General Are Tabular Foundation Models, Really?

> BeyondArena reveals tabular foundation models excel only on small IID data, while traditional models dominate non-IID, large, and high-dimensional tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2606.30410)
- **Published:** 2026-07-01
- **Permalink:** https://picx.dev/p/EhGwFo
- **Whiteboard:** https://picx.dev/p/EhGwFo/image

## Summary (Overview)

- **Holistic Benchmark for Tabular Data**: The paper introduces **BeyondArena**, the first unified benchmark covering both IID and non‑IID task types (temporal, grouped) across diverse dataset scales (tiny to large), feature dimensionality, and feature types (text, high‑cardinality categories). It evaluates 11 models on 142 curated datasets.
- **Tabular Foundation Models (TFMs) Excel on IID, Small Data**: TFMs (TabICLv2, TabPFN‑2.6, TabDPT) dominate on tiny‑ to medium‑sized IID data (e.g., up to 10k rows) but are outperformed by traditional tree‑based and deep learning models on non‑IID, large‑scale, high‑dimensional, and high‑cardinality categorical data.
- **DataFoundry Framework**: A Python framework and metadata schema for reproducible curation and preprocessing of tabular datasets, providing notebooks for each dataset and unified export to benchmarks/data repositories.
- **Critical Role of Appropriate Splits**: Using random splits for grouped or temporal data can severely distort model rankings (Kendall’s τ = 0.49–0.60 in ablation), confirming the need for task‑appropriate validation.
- **TFMs Not Yet Fully General**: While TFMs achieve peak performance on 70% of datasets (including ties), they fail to reach the best performance on 42 datasets – mainly large, high‑dimensional, non‑IID, or high‑cardinality tasks, highlighting key gaps for future research.

## Introduction and Theoretical Foundation

Predictive machine learning on tabular data has recently seen a surge of interest in **tabular foundation models (TFMs)** – models pretrained on large corpora of tables that can be used via in‑context learning (ICL). However, evaluations of TFMs have been fragmented across application domains (e.g., cybersecurity, soil mapping, clinical predictions) while the core tabular research community mostly focuses on **IID (independent and identically distributed)** tasks. This narrow focus ignores more challenging scenarios – non‑IID data (temporal, grouped), large scale, high dimensionality, and special feature types (text, high‑cardinality categories) – which are common in real‑world practice. As a result, progress may stagnate by optimizing marginal improvements on IID benchmarks rather than addressing genuine generalization challenges.

The paper defines IID vs. non‑IID based on the **application‑dependent appropriate train‑test split**:

> **IID**: Test samples do not follow a particular structure → random split.
> **Non‑IID**: Application requires temporal split (time index, test after train) or grouped split (group index, all samples from an unseen group kept together). Grouped tasks can be **label‑per‑group** (group‑level label) or **label‑per‑sample** (individual labels for unseen groups).

This definition is critical: the same dataset can be IID or non‑IID depending on the deployment scenario (e.g., fraud detection for past investigation vs. real‑time prevention). BeyondArena systematically handles all three task types.
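
The three split types defined above can be sketched with scikit-learn and NumPy. This is a minimal illustration, not the benchmark's own code: the helper names (`random_split`, `grouped_split`, `temporal_split`), the toy data, and the single-cutoff temporal rule are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold


def random_split(n_samples, n_splits=5, seed=0):
    """IID case: shuffle rows and split them at random."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(kf.split(np.arange(n_samples)))


def grouped_split(groups, n_splits=5):
    """Grouped case: each group lands entirely in train or entirely in test."""
    gkf = GroupKFold(n_splits=n_splits)
    X = np.zeros((len(groups), 1))  # placeholder features; only `groups` matters
    return list(gkf.split(X, groups=groups))


def temporal_split(timestamps, cutoff):
    """Temporal case: train strictly before `cutoff`, test at or after it."""
    timestamps = np.asarray(timestamps)
    train_idx = np.flatnonzero(timestamps < cutoff)
    test_idx = np.flatnonzero(timestamps >= cutoff)
    return train_idx, test_idx


# Toy data (hypothetical): 12 rows, 4 groups of 3 rows, integer time index.
groups = np.repeat([0, 1, 2, 3], 3)
timestamps = np.arange(12)

iid_folds = random_split(len(groups), n_splits=4)
group_folds = grouped_split(groups, n_splits=4)
train_t, test_t = temporal_split(timestamps, cutoff=8)
```

With a random split, rows from the same group or from later time points can leak into training, which is exactly what the grouped and temporal variants prevent.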

**Motivation**: Unify fragmented evaluation efforts, create a holistic benchmark, and understand how well TFMs would perform in real‑world predictive applications – thereby guiding research toward the most demanding challenges.

## Methodology

### Dataset Curation (DataFoundry)
- **Process**: Manual curation of 1128 datasets from 21 prior benchmark studies and public repositories (UCI, OpenML, Hugging Face, Kaggle, etc.). Each dataset was verified by humans.
- **Selection Criteria**: (1) Unique within benchmark; (2) at least 100 training samples (no few‑shot); (3) appropriate task type (IID/temporal/grouped); (4) published for predictive classification/regression; (5) representative real‑world problem (not synthetic or vectorized images); (6) no obvious ethical concerns.
- **Outcome**: 142 high‑quality datasets (12.6% of investigated). Characteristics: sizes from tiny (<1k rows) to large (>100k rows), 73% binary classification, 31% regression, 18% grouped, 15% temporal, 44% from UCI, 34% from Kaggle; diverse domains (medical, business, finance, biology, etc.).
- **Framework**: **DataFoundry** – Python package and metadata schema; each dataset has a reproducible notebook for preprocessing, checks, splits, and export.

### Experimental Design
- **Models** (11 total):
  - **Baseline**: Linear/Logistic Regression, RandomForest, ExtraTrees.
  - **GBDT**: CatBoost, LightGBM, XGBoost.
  - **MLP**: RealMLP, TabM.
  - **TFM (in‑context learning only)**: TabDPT, TabPFN‑2.6, TabICLv2.
  - For TFMs: no fine‑tuning; default preprocessing; fewer ensemble members on the largest datasets; results missing on large datasets were imputed with those of default RandomForest.
- **Outer Splits**:
  - IID: repeated $n$-fold cross‑validation ($n=3$ for 2.5k–250k samples, $n=10$ for 500–2.5k, $n=20$ for <500).
  - Grouped: same but group‑based CV.
  - Temporal: manually created application‑specific temporal splits (multiple time points, minimum 50% training data).
- **Inner Validation & Tuning**: 8‑fold CV (or $5 \times 5$‑fold CV for <500 samples) with random search (25 configs) using search spaces from Erickson et al. (2024). Time limits: 4h CV + 1h test (12h for TFMs).
- **Preprocessing**:
  - Dates: converted to 10 numerical features via skrub (weekday, spline periodic encoding).
  - Text: encoded using **Qwen3‑Embedding‑8B** (best zero‑shot multilingual encoder per MMTEB) into 32‑dimensional vector via Matryoshka representation.
  - Grouped data: label‑per‑sample – drop group index; label‑per‑group – replace with 50‑dimensional group‑encoding vector.
- **Metrics**:
  - Binary classification: ROC AUC.
  - Multiclass classification: log‑loss.
  - Regression: RMSE.
  - Aggregated via **Elo** (calibrated to default XGBoost = 1000) and **Improvability** (error relative to best method, 0%–100%).
- **Compute**: ~$50k cost, ~16.25 wall‑clock years on GCP CPU/GPU VMs.
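
The size-dependent outer-split rule listed above can be written as a small lookup. This is a sketch under assumptions: the summary gives only the ranges, so which side the 500 and 2.5k boundaries fall on is a guess, the number of repeats is a placeholder, and `outer_fold_count` and `make_outer_cv` are hypothetical helper names.

```python
from sklearn.model_selection import RepeatedKFold


def outer_fold_count(n_samples):
    """Map dataset size to the number of outer CV folds.

    Boundary handling at 500 and 2,500 samples is an assumption.
    """
    if n_samples < 500:
        return 20
    if n_samples < 2_500:
        return 10
    if n_samples <= 250_000:
        return 3
    raise ValueError("fold count above 250k samples is not stated in the summary")


def make_outer_cv(n_samples, n_repeats=1, seed=0):
    """Build a repeated n-fold splitter; `n_repeats` is a placeholder value."""
    return RepeatedKFold(
        n_splits=outer_fold_count(n_samples),
        n_repeats=n_repeats,
        random_state=seed,
    )


# Example: a 1,200-row dataset gets 10 folds, repeated twice here.
cv = make_outer_cv(1_200, n_repeats=2)
```

Smaller datasets get more folds so that each training split keeps most of the scarce data and the test estimate averages over more partitions.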

## Empirical Validation / Results

### Main Findings

- **Figure 4 (Leaderboard)**: TFMs (TabICLv2, TabPFN‑2.6) rank highest by Elo and Improvability when considering default performance, but **tuned + ensembled** traditional models (RealMLP, CatBoost) catch up or surpass.
- **Figure 5 (Per Sub‑benchmark Elo)**:
  - **IID**, **Tiny**, **Small**: TFMs dominate.
  - **Grouped**, **Temporal**, **Large**, **High Dimensionality**, **High‑Cardinality**: Traditional models (especially RealMLP, CatBoost) significantly outperform TFMs.
  - **Text**: Mixed; Qwen3 encoding helps short‑text datasets.
- **Figure 6 (Rank‑1 Share)**:
  - TFMs (TabICLv2: 19% rank‑1; TabPFN‑2.6: 10.5%) are best on many datasets, but on 42 datasets (30%) TFMs are clearly outperformed – mostly non‑IID, large, high‑dimensional, or high‑cardinality tasks.
  - TFMs are a viable peak‑performance solution on 70% of all datasets (including ties with non‑TFMs).
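
To make the Improvability and rank‑1 figures above concrete, here is a minimal sketch. The formula used, `(error - best_error) / error`, is an assumption consistent with the description "error relative to best method, 0%–100%"; the paper's exact definition and tie handling may differ, and the error matrix is invented for illustration.

```python
import numpy as np


def improvability(errors):
    """Per-dataset gap to the best method, in percent.

    `errors` has shape (n_datasets, n_methods); lower is better and all
    values are assumed to be strictly positive.
    """
    errors = np.asarray(errors, dtype=float)
    best = errors.min(axis=1, keepdims=True)
    return 100.0 * (errors - best) / errors


def rank1_share(errors, tol=0.0):
    """Fraction of datasets on which each method is best or tied within `tol`."""
    errors = np.asarray(errors, dtype=float)
    best = errors.min(axis=1, keepdims=True)
    return (errors <= best + tol).mean(axis=0)


# Hypothetical error matrix: 3 datasets x 2 methods (e.g. 1 - ROC AUC, or RMSE).
errors = np.array([
    [0.10, 0.20],
    [0.30, 0.15],
    [0.50, 0.50],
])
imp = improvability(errors)   # 0 for the best method on each dataset
share = rank1_share(errors)   # ties count for every tied method
```

Because ties count for each tied method, rank‑1 shares across methods can sum to more than 100%.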

### Ablation Studies (Table 1)

| Part | Ablation Setting (BeyondArena → new) | Kendall's τ | Win Rate |
|------|--------------------------------------|-------------|----------|
| Inner Splits (B.1) | $5 \times 5$‑fold CV → 8‑fold CV | 0.93 | 100% |
| Inner Splits (B.2) | Non‑IID → IID inner splits | 1.00 | 100% |
| Grouped Pre‑processing (C.1a) | Label‑per‑group: Agg. Index → N/A | 0.81 | 71% |
| Grouped Pre‑processing (C.1b) | Label‑per‑sample: Drop Index → N/A | 0.43 | 71% |
| Text Pre‑processing (C.2a) | Short text: Qwen3 → TF‑IDF | 1.00 | 78% |
| Text Pre‑processing (C.2b) | Long text: Qwen3 → TF‑IDF | 0.89 | 6% |
| Calibration (D) | N/A → Probability Calibration | 0.85 | 18% |
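
Kendall's τ in the table measures how well the model ranking under an ablated setting agrees with the ranking under the benchmark default (1.0 means identical order). A minimal sketch with SciPy, using invented ranks for four models:

```python
from scipy.stats import kendalltau

# Hypothetical ranks of four models under two benchmark configurations.
ranks_default = [1, 2, 3, 4]
ranks_ablation = [1, 2, 4, 3]  # the last two models swap places

# tau = (concordant pairs - discordant pairs) / total pairs = (5 - 1) / 6
tau, p_value = kendalltau(ranks_default, ranks_ablation)
```

A single swap among four models already lowers τ to about 0.67, which gives a sense of how much reordering values such as 0.43 imply.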

**Key Ablation Insights**:
- **5‑repeated 5‑fold CV** for tiny datasets outperforms 8‑fold CV (consistent with overtuning avoidance).
- **Non‑IID inner splits** are beneficial; using random splits for temporal/grouped data harms model selection.
- **Grouped preprocessing** (encoding/dropping group indices) improves performance overall, but TabPFN‑2.6 improves when preprocessing is omitted (negative bias).
- **Qwen3** beats TF‑IDF on short‑text data; TF‑IDF better on long‑text, but rankings stable.
- **Probability calibration** significantly improves log‑loss for all models except TabPFN‑2.6 and RealMLP; recommended as default.
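
As an illustration of what post-hoc probability calibration looks like in practice, the sketch below wraps a classifier in scikit-learn's `CalibratedClassifierCV` and compares multiclass log-loss before and after. The calibration method used in the paper is not stated in this summary; sigmoid calibration, the RandomForest base model, and the synthetic data are all assumptions for the example.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (hypothetical stand-in for a benchmark dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Uncalibrated baseline.
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Same model family, with sigmoid calibration fitted via internal 3-fold CV.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="sigmoid",
    cv=3,
).fit(X_tr, y_tr)

proba_cal = calibrated.predict_proba(X_te)
loss_raw = log_loss(y_te, raw.predict_proba(X_te))
loss_cal = log_loss(y_te, proba_cal)
```

Whether `loss_cal` ends up below `loss_raw` depends on the model and data, which matches the ablation's observation that not every model benefits.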

Statistical tests (Appendix E) confirm that global rankings and pairwise comparisons across models and sub‑benchmarks are significant.

## Theoretical and Practical Implications

- **For Practitioners**: TFMs are excellent default choices for small‑ to medium‑sized IID tabular tasks with low‑dimensional features and no complex modalities. For large‑scale, non‑IID, high‑dimensional, or high‑cardinality categorical data, however, traditional tree‑based models (especially gradient‑boosted decision trees) or tuned deep learning models (RealMLP) should be preferred. The benchmark provides concrete guidance per data regime.

- **For Researchers**: BeyondArena exposes critical limitations of current TFMs:
  - They struggle with temporal dependencies, grouped structures (unseen entities), large sample sizes, high dimensionality, and many categories.
  - Fine‑tuning TFMs may bridge the gap (not evaluated), but pure in‑context learning is insufficient for many real‑world applications.
  - Future TFM development should focus on non‑IID generalization, scalability, and handling of high‑cardinality or long‑text features.
  - The DataFoundry framework and modular benchmark enable reproducible cross‑study comparisons.

- **Benchmarking Standard**: The paper demonstrates that inappropriate (IID) splits for non‑IID data can mislead model rankings, reinforcing the need for task‑appropriate validation. The ablation results highlight the sensitivity of benchmarks to preprocessing and validation choices.

## Conclusion

**Main Takeaways**:
- BeyondArena unifies IID and non‑IID evaluation for tabular data with 142 datasets, 11 models, and a reproducible curation framework (DataFoundry).
- Current TFMs are strong on tiny‑ to medium‑sized IID data but are not yet “fully general”; traditional models still dominate in non‑IID, large, and high‑dimensional settings.
- The benchmark and ablations provide clear recommendations for practitioners and outline research directions for next‑generation tabular foundation models.

**Limitations**:
- Tuning limited to 25 random configurations.
- TFMs evaluated only in in‑context learning mode (no fine‑tuning).
- Only three open‑source TFMs included; closed‑source or other architectures unexplored.
- Uneven representation of sub‑benchmarks in the global leaderboard.
- Best practices for non‑IID validation and preprocessing (grouped, text) remain an open problem.

**Future Work**:
- Extend to few‑shot prediction (<100 samples), multimodal tabular data (images, images + text), relational learning, and survival analysis.
- Investigate fine‑tuning of TFMs for non‑IID scenarios.
- Explore automated preprocessing search for grouped and text features.
- Expand coverage of closed‑source models and other architectures (e.g., TabTransformer, FT‑Transformer).
