Summary of "MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image"

Summary (Overview)

Introduces MulTaBench, a new benchmark of 40 datasets (20 image-tabular, 20 text-tabular) designed for challenging Multimodal Tabular Learning (MMTL) tasks that require Target-Aware Representations (TAR).
Establishes and implements a curation pipeline based on two core desiderata: Joint Signal (each modality provides complementary predictive information) and Task-awareness (task-agnostic, frozen embeddings lose critical fine-grained information).
Demonstrates empirically that finetuning embeddings (TAR) as a preprocessing step consistently outperforms using frozen, off-the-shelf embeddings across a diverse suite of tabular learners, encoder scales, and embedding dimensions.
Constitutes the largest image-tabular benchmarking effort to date, spanning domains like healthcare and e-commerce, and is designed to facilitate the development of novel Multimodal Tabular Foundation Models.

Introduction and Theoretical Foundation

Tabular Foundation Models (TFMs) have become state-of-the-art for supervised tabular learning but are fundamentally unimodal, lacking native support for unstructured modalities like text and image. They rely on frozen, pretrained embeddings to process these inputs. In many real-world domains (e.g., e-commerce, healthcare, social media), tabular data is inherently multimodal, combining structured features with text and images.

Existing MMTL benchmarks often focus on the mere co-occurrence of modalities, which masks the benefits of task-specific tuning and groups together problems requiring different modeling solutions. The paper argues that to study MMTL effectively, a dataset should satisfy two properties:

Joint Signal: Each modality provides complementary information.
Task-awareness: Generic, pretrained embeddings fail to capture the fine-grained, task-specific details required for optimal prediction, necessitating Target-Aware Representations (TAR).

The core limitation is that embeddings act as lossy summaries. Pretrained models are optimized for broad semantic content (e.g., distinguishing an X-ray from a mammogram) at the expense of fine-grained details (e.g., precise size estimations or localized anomalies) crucial for specific downstream tasks.

Methodology

Curation Pipeline and Desiderata

The paper translates the theoretical desiderata into a measurable curation pipeline using four experimental conditions (summarized in Table 1 and Figure 1):

Table 1: Experimental Conditions. Breakdown by feature composition and representation strategy.

Condition	Structured	Unstructured	Target-Aware (TAR)
Unimodal Structured	✓	×	–
Unimodal Unstructured	×	✓	×
Joint Frozen	✓	✓	×
Joint TAR	✓	✓	✓

Joint Signal Criterion: Performance in the Joint Frozen condition must be higher than in both the Unimodal Structured and the Unimodal Unstructured conditions.
Task-awareness Criterion: Performance in the Joint TAR condition must improve over the Joint Frozen condition.

Acceptance Criteria: A dataset is included in MulTaBench if it satisfies both criteria across at least 3 out of 5 curation tabular learners.
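
The four conditions in Table 1 differ only in which feature groups reach the tabular learner. The following is a minimal sketch of that composition, not the paper's code: the names `CONDITIONS` and `build_row` are hypothetical, and embeddings are assumed to be plain lists of floats that have already been computed.

```python
from typing import Dict, List, Optional

# Feature composition per condition, mirroring Table 1.
# "tar" is None where no unstructured features are used at all.
CONDITIONS: Dict[str, Dict[str, Optional[bool]]] = {
    "Unimodal Structured":   {"structured": True,  "unstructured": False, "tar": None},
    "Unimodal Unstructured": {"structured": False, "unstructured": True,  "tar": False},
    "Joint Frozen":          {"structured": True,  "unstructured": True,  "tar": False},
    "Joint TAR":             {"structured": True,  "unstructured": True,  "tar": True},
}


def build_row(
    condition: str,
    structured: List[float],
    frozen_emb: List[float],
    tar_emb: List[float],
) -> List[float]:
    """Assemble one input row for the given condition (hypothetical helper)."""
    spec = CONDITIONS[condition]
    row: List[float] = []
    if spec["structured"]:
        row.extend(structured)
    if spec["unstructured"]:
        # Use the finetuned (target-aware) embedding only in the TAR condition.
        row.extend(tar_emb if spec["tar"] else frozen_emb)
    return row
```

Calling `build_row("Joint TAR", structured, frozen_emb, tar_emb)` returns the structured features followed by the target-aware embedding, while `"Joint Frozen"` swaps in the frozen embedding and leaves everything else unchanged, which is what makes the two conditions directly comparable.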

Implementation Details

Embeddings: e5-v2-small for text, DINO-v3-small for images (selected for efficiency).
Target-Aware Representations (TAR): The last 3 transformer layers of the encoder are finetuned using LoRA ($r = 16$, $\alpha = 32$, dropout $0.1$) on the prediction target as a separate preprocessing step. For regression, the target is discretized into 20 bins.
Dimensionality Reduction: All embeddings are down-projected to 30 dimensions using PCA for computational efficiency.
Curation Tabular Learners: LightGBM, CatBoost, TabM, TabPFNv2, and TabPFN-2.5.
Evaluation: 5 random seeds, up to 10,000 examples per run. AUC for classification, $R^2$ for regression.
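
The settings above can be collected in one place, and the regression binning step can be illustrated with a short sketch. The dictionary keys and the function `discretize_target` are hypothetical names. The summary only states that 20 bins are used, so the equal-frequency (quantile) binning below is an assumption; equal-width bins would be an equally plausible reading.

```python
from bisect import bisect_right
from typing import List

# Values taken from the implementation details; key names are illustrative only.
TAR_SETTINGS = {
    "finetuned_layers": 3,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "regression_bins": 20,
    "pca_dim": 30,
    "seeds": 5,
    "max_examples": 10_000,
}


def discretize_target(y: List[float], n_bins: int = 20) -> List[int]:
    """Map each regression target to a bin index in [0, n_bins - 1].

    Assumption: bin edges are empirical quantiles of y, so bins hold roughly
    equal numbers of examples. Tied values always share a bin.
    """
    ordered = sorted(y)
    n = len(ordered)
    # n_bins - 1 interior edges taken at evenly spaced ranks.
    edges = [ordered[k * n // n_bins] for k in range(1, n_bins)]
    return [bisect_right(edges, value) for value in y]
```

The resulting bin indices can then serve as class labels when finetuning the encoder on a regression dataset, while the tabular learner itself is still trained and scored on the original continuous target.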

Formal Acceptance Criteria

Let $D$ be a candidate dataset and $M$ be the pool of 5 curation learners. For learner $m \in M$, let $S_m(\text{Condition})$ denote its average performance.

Joint Gain: $\Delta_{\text{Joint}}(m) = S_m(\text{Joint Frozen}) - \max\{S_m(\text{Unimodal Structured}), S_m(\text{Unimodal Unstructured})\}$
Awareness Gain: $\Delta_{\text{Awareness}}(m) = S_m(\text{Joint TAR}) - S_m(\text{Joint Frozen})$

A dataset $D$ is accepted if: $|\{ m \in M : \Delta_{\text{Joint}}(m) > \delta \wedge \Delta_{\text{Awareness}}(m) > \delta \}| \geq \rho \cdot |M|$, with $\delta = 0.001$ and $\rho = 3/5$.
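
The acceptance rule above translates almost directly into code. The sketch below assumes per-seed scores are stored in a nested dictionary keyed by learner and condition name; the function names `gains` and `is_accepted` and the data layout are illustrative, not taken from the paper.

```python
from fractions import Fraction
from statistics import mean
from typing import Dict, List, Tuple

DELTA = 0.001          # minimum gain, as stated in the text
RHO = Fraction(3, 5)   # required fraction of passing learners, as stated in the text

# learner name -> condition name -> per-seed scores (AUC or R^2)
Scores = Dict[str, Dict[str, List[float]]]


def gains(per_condition: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (joint gain, awareness gain) for one learner."""
    s = {cond: mean(vals) for cond, vals in per_condition.items()}
    joint = s["Joint Frozen"] - max(
        s["Unimodal Structured"], s["Unimodal Unstructured"]
    )
    awareness = s["Joint TAR"] - s["Joint Frozen"]
    return joint, awareness


def is_accepted(scores: Scores, delta: float = DELTA, rho: Fraction = RHO) -> bool:
    """True if at least a fraction rho of learners clear both margins."""
    passing = sum(
        1
        for per_condition in scores.values()
        if all(g > delta for g in gains(per_condition))
    )
    return passing >= rho * len(scores)
```

Using `Fraction` for $\rho$ keeps the comparison `passing >= rho * len(scores)` exact, so a dataset on which exactly 3 of 5 learners pass is accepted without any floating-point edge cases.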

Empirical Validation / Results

MulTaBench Composition

The final benchmark contains 40 datasets, balanced between image/text modalities and classification/regression tasks. Dataset sizes range from 400 to 114,000 rows. Key statistics are provided in Table 3 from the paper.

Table 3: Properties of All 40 MulTaBench Datasets (Abbreviated). Task: Classification (CLS) or Regression (REG). Classes: number of target classes (for CLS). N: total number of examples. Struct.: number of numerical and categorical features. Text: number of free-text features. Img.: number of image features.

Dataset	Task	Classes	N	Struct.	Text	Img.
Image-Tabular (20 datasets)
CBIS-DDSM	CLS	4	1,696	8	0	1
CheXpert	CLS	3	46,437	17	0	1
PetFinder	CLS	8	14,652	17	4	1
...	...	...	...	...	...	...
Text-Tabular (20 datasets)
Jigsaw Toxicity	CLS	2	100,000	29	2	0
Wine Review	CLS	30	84,123	?	2	0
...	...	...	...	...	...	...

Curation Results

Text-Tabular: From 56 unique datasets aggregated from existing benchmarks, only 23 (41%) passed both curation criteria. 20 were selected for MulTaBench. TAR consistently outperformed frozen embeddings across all learners (Figure 3).
Image-Tabular: From 16 unique datasets found in literature, only 5 (31%) passed. An additional 15 were manually curated from public sources (e.g., Kaggle) to reach 20, making MulTaBench the largest image-tabular benchmark.

Robustness of Task-Awareness

The paper extensively validates that the gains from TAR are robust and generalize.

New Tabular Learners: TAR outperforms frozen embeddings across 12 additional models (including XGBoost, RandomForest, TabSTAR, ConTextTab, AutoGluon-Multimodal) for both modalities (Figure 4). Notably, ConTextTab (SOTA on the CARTE benchmark) performs poorly on MulTaBench, which suggests that the two benchmarks target different classes of problems.

Key Finding: "Target-aware embeddings consistently outperform frozen embeddings across all new models and modalities."
Embedding Model Scale: Using larger encoders (e5-large, DINO-v3-large) improves performance, but TAR still significantly outperforms frozen embeddings even at the larger scale. In fact, TAR Small is often better than Frozen Large (Figure 5).
Embedding Dimension: The advantage of TAR holds across PCA dimensions of 15, 30, and 60, and even when no PCA is applied (Figures 6 and 14), indicating that the gain is not an artifact of compression.
Qualitative Analysis (Image): Visualization of DINO-v3 attention maps shows that TAR shifts the encoder's focus from generic regions to task-relevant areas (e.g., from general anatomy to the lung in CheXpert, from background clutter to animal ears in PetFinder) (Figure 7).

Table 2: The PetFinder Analysis. S=Structured, I=Image, T=Text. A modality marked (TAR) uses target-aware embeddings; unmarked modalities use frozen embeddings. For all models, combining Joint Modeling with Target-Aware Representations for both modalities maximizes AUC (shown in %).

Model	I	T	S+I	S+T	S+I+T	S+I(TAR)+T	S+I+T(TAR)	S+I(TAR)+T(TAR)
LightGBM	77.2	72.1	79.9	77.7	81.1	82.8	84.2	85.7
CatBoost	78.9	73.5	81.7	79.3	83.2	83.9	85.2	86.4
TabM	80.2	74.9	83.0	80.7	84.2	84.8	86.3	87.0
TabPFNv2	80.7	73.5	83.2	79.3	83.9	84.5	86.3	87.1
TabPFN-2.5	81.1	76.0	83.7	81.0	84.9	85.3	87.3	88.0
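
To make the size of the effect concrete, the short script below computes the AUC gain (in percentage points) of each TAR variant over the fully frozen S+I+T baseline, using the last four columns of Table 2. The numbers are copied from the table; the dictionary layout and the key names (`frozen`, `image_tar`, `text_tar`, `both_tar`) are illustrative labels for those four columns, in order.

```python
# AUC (%) from Table 2: frozen S+I+T, TAR on image only, TAR on text only, TAR on both.
TABLE2 = {
    "LightGBM":   {"frozen": 81.1, "image_tar": 82.8, "text_tar": 84.2, "both_tar": 85.7},
    "CatBoost":   {"frozen": 83.2, "image_tar": 83.9, "text_tar": 85.2, "both_tar": 86.4},
    "TabM":       {"frozen": 84.2, "image_tar": 84.8, "text_tar": 86.3, "both_tar": 87.0},
    "TabPFNv2":   {"frozen": 83.9, "image_tar": 84.5, "text_tar": 86.3, "both_tar": 87.1},
    "TabPFN-2.5": {"frozen": 84.9, "image_tar": 85.3, "text_tar": 87.3, "both_tar": 88.0},
}

# Gain of each TAR variant over the frozen baseline, rounded to one decimal.
TAR_GAINS = {
    model: {
        variant: round(auc - row["frozen"], 1)
        for variant, auc in row.items()
        if variant != "frozen"
    }
    for model, row in TABLE2.items()
}

for model, gain in TAR_GAINS.items():
    print(f"{model:<11} image: +{gain['image_tar']:.1f}  "
          f"text: +{gain['text_tar']:.1f}  both: +{gain['both_tar']:.1f}")
```

For every model, applying TAR to both modalities yields the largest gain (between 2.8 and 4.6 points), and on this dataset TAR on text alone helps more than TAR on image alone.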

Theoretical and Practical Implications

Identifies a Gap in MMTL: Current architectures (TFMs with frozen embeddings or jointly finetuned models like TabSTAR/AutoGluon) are suboptimal. TFMs lack TAR support, while joint models compromise tabular performance or incur high computational costs from repeated finetuning.
Defines a Path for Multimodal TFMs: Extends the desiderata for TFMs by Van Breugel and Van Der Schaar [93] with a fifth core property: (D5) Target-Aware Multimodal Tabular Learning. The ideal future model should combine the robustness and in-context learning efficiency of TFMs with the contextualization benefits of TAR.
Provides a Critical Benchmarking Tool: MulTaBench is explicitly designed to evaluate architectures that incorporate joint modeling and TAR, filling a void in the research landscape. Its automated curation pipeline allows for continuous expansion and refinement as the field progresses.
Highlights Computational Challenges: The TAR preprocessing step, while effective, introduces significant computational overhead (Appendix F.3), especially for text with multiple columns. This underscores the need for more efficient architectures.

Conclusion

MulTaBench is introduced as a benchmark to advance research on Multimodal Tabular Learning, specifically focusing on tasks that benefit from joint modeling and Target-Aware Representations. The paper demonstrates that frozen embeddings are insufficient for many MMTL tasks and that TAR gains generalize robustly. While the curation pipeline entangles problem properties with algorithmic solutions (a noted limitation), it provides a mechanism to isolate challenging datasets and refresh the benchmark.

Future directions include expanding to dedicated text-image-tabular benchmarks, exploring other modalities (audio, video), and most importantly, using MulTaBench to develop true Multimodal Tabular Foundation Models that efficiently achieve target-awareness while maintaining strong tabular performance.