Summary of "Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling"

Summary (Overview)

Problem Formulation: The paper frames the problem of predicting an unfamiliar AI agent's next decision (e.g., in negotiation) as a target-adaptive text-tabular prediction task, where each decision point is a row combining structured game state, offer history, and dialogue text.
Key Method: Introduces LLM-as-Observer, a novel feature block where a small, frozen LLM reads the public interaction state; its hidden state (not its answer) is used as a decision-oriented feature for a downstream tabular foundation model (TabPFN).
Core Finding: The full text-tabular model, combining game features, text embeddings, and Observer hidden states, outperforms direct prompting of a large frontier LLM (LLM-as-Predictor) and a strong game+text baseline. Observer hidden states provide complementary predictive signal not reliably surfaced by direct prompting.
Empirical Validation: Demonstrates cross-population transfer: training on a source population of 13 agents that vary by underlying LLM and testing on a held-out target population of 91 scaffolded agents (varying by prompts and control logic). At K=16 adaptation games, Observer features improve response-prediction AUC by roughly 4 to 5 percentage points over the Game+Text baseline and reduce bargaining offer-prediction error by 14%.
Architectural Insight: The predictive signal resides in the LLM's hidden state representations, not its direct output logits. Using the LLM as an encoder and a tabular model as the adapter is more effective than using the LLM as the final few-shot predictor.

Introduction and Theoretical Foundation

AI agents increasingly engage in language-mediated commerce (e.g., buyer bots negotiating with unknown sellers). In these interactions, the counterpart's internal logic (LLM, prompts, control rules) is hidden, yet each decision has consequences. The paper asks: Can an agent predict an unfamiliar counterpart's next decision from only a few prior interactions?

To study this systematically, the authors use controlled bargaining and negotiation games from the GLEE framework, which preserve key elements such as private valuations, monetary payoffs, multi-turn offers, and free-text dialogue. Prediction is per agent: the predictor receives K complete prior games from a single target agent as labeled adaptation examples and must predict that agent's next move in a new game. Two complementary tasks are studied (see Figure 1):

Response Prediction (Classification): Will the target accept the current offer?
Proposal Prediction (Regression): If the target rejects, what offer will it propose next?
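
To make the two tasks concrete, the sketch below models a decision point as a single row with both labels attached. It is only an illustration: the class name, field names, and helper functions are hypothetical and do not come from the paper or from GLEE.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DecisionPoint:
    """One row: a single decision faced by the target agent (illustrative fields)."""
    agent_id: str
    game_id: str
    round_number: int
    current_offer: float                      # offer on the table, normalized to [0, 1]
    offer_history: List[float] = field(default_factory=list)
    dialogue: str = ""
    accepted: Optional[bool] = None           # label for response prediction
    next_offer: Optional[float] = None        # label for proposal prediction


def response_label(row: DecisionPoint) -> int:
    """Classification target: 1 if the target accepted the current offer, else 0."""
    if row.accepted is None:
        raise ValueError("row has no response label")
    return int(row.accepted)


def proposal_rows(rows: List[DecisionPoint]) -> List[DecisionPoint]:
    """Regression rows: only decisions where the target rejected and countered."""
    return [r for r in rows if r.accepted is False and r.next_offer is not None]


rows = [
    DecisionPoint("agent_x", "g1", 1, 0.40, accepted=False, next_offer=0.55),
    DecisionPoint("agent_x", "g1", 2, 0.52, [0.40, 0.55], "Deal.", accepted=True),
]
print([response_label(r) for r in rows])             # classification labels
print([r.next_offer for r in proposal_rows(rows)])   # regression labels
```

Every decision point yields a response label, while only rejections followed by a counteroffer contribute to the proposal task.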

The theoretical foundation bridges several areas:

Opponent Modeling & Ad-hoc Teamwork: Classical work on predicting behavior from limited histories, but adapted for open-ended LLM-based agents.
LLMs as Strategic Agents: Prior work characterizes population-level behavior; this work focuses on per-agent prediction.
Multi-modal Text–Tabular Learning: The task naturally combines structured game variables with free-form dialogue text.
Frozen LM Representations as Features: Leverages the insight that intermediate hidden states of frozen language models can encode task-relevant signals not captured in their final outputs.

Methodology

The core methodology is target-adaptive text-tabular prediction. The predictor is a tabular foundation model (TabPFN) that conditions on a large source population of labeled decisions and the K labeled games from the current target agent.

Feature Modalities

Each decision point is represented as a multimodal row with three feature blocks (see Figure 4):

Game-State Features: Structured variables like public configuration, round number, current offer, previous offers/decisions, discount factors, and valuations.
Dialogue Representation: The dialogue history is encoded using a sentence encoder (all-MiniLM-L6-v2) and reduced via PCA to 5 dimensions.
LLM-as-Observer Representation: A small frozen LLM (e.g., Gemma-2-2B, ~2B params) reads the public decision-time state and dialogue. It is prompted toward the target's decision (e.g., suffix {"decision": "), but its direct answer is discarded. Instead, a hidden state from its mid-to-late layers (relative depth 0.6–0.9) is extracted and used as an additional feature vector. This treats the LLM as a decision-oriented encoder.
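
The three feature blocks can be sketched as a simple concatenation, together with one possible way to turn a relative depth into a layer index. This is a minimal sketch under assumptions: the function names are hypothetical, the rounding convention is a guess rather than the paper's rule, the layer count of 26 is used purely for illustration, and the Observer vector is a stand-in for a hidden state that would actually be extracted from the frozen LLM.

```python
from typing import List, Sequence


def layer_from_relative_depth(depth: float, num_layers: int) -> int:
    """Map a relative depth in [0, 1] to a 0-based layer index (assumed convention)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    return round(depth * (num_layers - 1))


def build_row(game: Sequence[float],
              text: Sequence[float],
              observer: Sequence[float]) -> List[float]:
    """Concatenate game-state, dialogue, and Observer blocks into one tabular row."""
    return [float(v) for v in game] + [float(v) for v in text] + [float(v) for v in observer]


# Illustrative values only: 2 game features, a 5-dim PCA-reduced text
# embedding, and a tiny stand-in for the Observer hidden state.
game_block = [3, 0.45]
text_block = [0.1, -0.2, 0.0, 0.3, 0.05]
observer_block = [0.7, -1.1, 0.4]

row = build_row(game_block, text_block, observer_block)
print(len(row))
print(layer_from_relative_depth(0.6, 26), layer_from_relative_depth(0.8, 26))
```

Keeping the blocks separate until the final concatenation makes it easy to drop or swap a single block later.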

Prediction Model and Baselines

Tabular Predictor: TabPFN v2.6 is used in classification (response) or regression (proposal) mode. It is conditioned on source population rows and the target's K rows. An agent-identity indicator (one-hot) helps the model distinguish between source and target data.
Baselines for Comparison:
- Game+Text Features: The tabular model using only game-state features and dialogue representation (no Observer).
- LLM-as-Predictor: Directly prompts a large frontier LLM (Gemini 2.5 Flash) with the current game state, dialogue, and the target's K past games, asking it to predict the decision. This is the natural few-shot prompting alternative.
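
The conditioning set for the tabular predictor can be pictured as source rows stacked with the target's K rows, each tagged with an agent-identity indicator. The sketch below is an assumption about how such an indicator might be laid out (one slot per source agent plus one slot for the target); the summary does not specify the encoding, and all function names are hypothetical. The call to the tabular model itself is omitted.

```python
from typing import List, Sequence, Tuple

# (agent_id, feature vector, label); the layout is illustrative.
Row = Tuple[str, Sequence[float], float]


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of the given size with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_context(source_rows: Sequence[Row],
                  target_rows: Sequence[Row],
                  source_agents: Sequence[str]) -> Tuple[List[List[float]], List[float]]:
    """Stack source rows and the target's K rows, appending an agent indicator."""
    slot = {agent: i for i, agent in enumerate(source_agents)}
    target_slot = len(source_agents)      # assumed: last slot marks the target
    size = target_slot + 1
    X: List[List[float]] = []
    y: List[float] = []
    for agent_id, feats, label in source_rows:
        X.append(list(feats) + one_hot(slot[agent_id], size))
        y.append(label)
    for _agent_id, feats, label in target_rows:
        X.append(list(feats) + one_hot(target_slot, size))
        y.append(label)
    return X, y


def build_query(feats: Sequence[float], num_source_agents: int) -> List[float]:
    """Tag an unlabeled target decision with the target's indicator slot."""
    return list(feats) + one_hot(num_source_agents, num_source_agents + 1)


source = [("a", [0.1], 1.0), ("b", [0.2], 0.0)]
target = [("t", [0.3], 1.0)]
X, y = build_context(source, target, ["a", "b"])
print(X, y, build_query([0.9], 2))
```

With K=0 the target contributes no labeled rows, so the context reduces to the source population alone.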

Data and Evaluation Protocol

Two agent populations are used (see Table 1):

Source Population (Training): The 13-agent GLEE frontier-LLM tournament, where agents vary only in the underlying LLM (Claude, GPT, Gemini, etc.).
Target Population (Held-out Test): A new 91-agent university hackathon dataset, where agents share the same underlying LLM (Gemini 2.5 Flash) but vary in scaffolding (prompts, control logic, rule-based fallbacks).

Evaluation: Cross-population transfer. Train on the source population, test on each held-out hackathon agent. For each target, sample K ∈ {0, 2, 4, 8, 16} games as adaptation examples. Metrics: AUC for response prediction, R² for proposal prediction (on normalized offers).
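
The protocol and both metrics can be sketched in plain Python. This is an illustration, not the authors' evaluation code: the function names are hypothetical, and scoring on the target's remaining games after holding out K of them is an assumption about how the split is done.

```python
import random
from typing import List, Sequence, Tuple


def split_adaptation(game_ids: Sequence[str], k: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Sample k games as labeled adaptation examples; the rest are scored."""
    ids = sorted(set(game_ids))
    if k > len(ids):
        raise ValueError("k exceeds the number of available games")
    adapt = set(random.Random(seed).sample(ids, k))
    return sorted(adapt), [g for g in ids if g not in adapt]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


games = [f"g{i}" for i in range(20)]
for k in (0, 2, 4, 8, 16):
    adapt, test = split_adaptation(games, k)
    print(k, len(adapt), len(test))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(r2([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The last call shows how R² can go negative: predictions that are worse than always guessing the mean produce a residual sum of squares larger than the total.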

Empirical Validation / Results

The main results are from the cross-population transfer experiment (Table 2).

Response Prediction (AUC)

The LLM-as-Observer model consistently outperforms both baselines across all K and both game families.
At K=16, the best Observer model improves AUC by:
- +4.0 percentage points (pp) over the Game+Text baseline in Bargaining.
- +6.1 pp over the LLM-as-Predictor in Bargaining.
- +4.9 pp over the Game+Text baseline in Negotiation.
- +6.7 pp over the LLM-as-Predictor in Negotiation.
LLM-as-Predictor, despite using a much larger model, is consistently weaker.

Proposal Prediction (R²)

In Bargaining, Observer features provide a clear improvement over the Game+Text baseline across all K. Using the Gemma-2-2B Observer at K=16, the median R² is 0.676 with Observer features vs. 0.622 for Game+Text. This corresponds to reducing the typical one-offer prediction error on a $10,000 split from $552 to $473, a 14% reduction.
In Negotiation, the Game+Text baseline is already very strong at high K (R² = 0.857), and Observer features do not provide a clear additional gain.
LLM-as-Predictor performs poorly at numerical regression, often yielding negative R² values, showing autoregressive token decoding is poorly suited for calibrated regression.
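
The 14% figure for bargaining follows directly from the two dollar errors quoted above. A short check, with a hypothetical helper name:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Typical one-offer error on a $10,000 split, as reported in the summary.
reduction = relative_reduction(552.0, 473.0)
print(f"{reduction:.1%}")
```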

Ablation and Robustness Analysis

Feature Hierarchy (Table 3): An ablation study at K=16 shows:

Game features are essential. Removing them causes the largest performance drop.
Observer features are highly valuable. Removing them hurts performance, especially in bargaining.
Text embeddings become redundant once Observer features are added, suggesting the Observer captures the decision-relevant linguistic signal more effectively.
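
An ablation of this kind can be organized as a set of feature-block configurations, each dropping one block from the full model. The sketch below assumes a leave-one-block-out design, which the summary implies but does not spell out; the block names and helper functions are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

BLOCKS: Tuple[str, ...] = ("game", "text", "observer")


def ablation_configs(blocks: Sequence[str] = BLOCKS) -> Dict[str, Tuple[str, ...]]:
    """Full model plus one configuration per removed block."""
    configs = {"full": tuple(blocks)}
    for removed in blocks:
        configs[f"no_{removed}"] = tuple(b for b in blocks if b != removed)
    return configs


def select_columns(row_blocks: Dict[str, Sequence[float]], keep: Sequence[str]) -> List[float]:
    """Concatenate only the kept blocks of one row, in the given order."""
    out: List[float] = []
    for name in keep:
        out.extend(row_blocks[name])
    return out


# Illustrative row with tiny blocks.
example = {"game": [1.0, 0.4], "text": [0.1, 0.2], "observer": [0.9]}
for name, keep in ablation_configs().items():
    print(name, select_columns(example, keep))
```

Each configuration would then be trained and scored with the same K and the same target agents, so that differences in AUC or R² can be attributed to the removed block.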

Hidden States vs. Direct Output (Appendix E): A key finding is that the Observer's hidden states are far more predictive than its direct output logits (p(accept)). Adding logits to the Game+Text baseline provides minimal gain, while adding hidden states provides a substantial boost. This holds across different Observer model families (Gemma, Qwen3, Llama).

Layer Stability (Figure 3): The performance gain from Observer features is stable across mid-to-late layers (relative depth 0.6–0.9), not dependent on a single tuned layer.

Theoretical and Practical Implications

Theoretical Implication: The paper demonstrates that frozen LLM hidden states encode decision-relevant strategic signals that are not reliably extracted via direct prompting. This supports and extends probing literature into a dynamic, strategic domain.
Methodological Implication: The results favor target-adaptive text-tabular learning over direct few-shot LLM prediction for this problem. Separating representation (LLM-as-Observer) from adaptation (tabular learner) is more effective than asking an LLM to perform both roles.
Practical Implication: The approach enables effective adaptation to newly encountered, engineered AI agents (varying in scaffolding) based on a model trained on a different axis of variation (underlying LLM). This is relevant for real-world deployment where agents are black-box.
Efficiency: The method is computationally cheaper at inference than repeatedly calling a large frontier LLM, as it uses a small frozen LLM for feature extraction and a lightweight tabular model.

Conclusion

The paper presents a framework for predicting decisions of unfamiliar language-based AI agents. The core recipe is: separate representation from adaptation.

Use structured game features for the strategic backbone.
Use an LLM-as-Observer to extract decision-oriented representations from the public state and dialogue.
Let a tabular foundation model perform the actual prediction, adapting by conditioning on a source population and the target's few observed games.

This formulation outperforms direct LLM prompting. The Observer's hidden states provide complementary predictive signal, especially for response prediction and in bargaining where language interpretation is key. The cross-population evaluation shows the method can transfer from agents varying in LLM to agents varying in scaffolding.

Future Directions & Limitations:

Extending the approach to more complex, real-world market interactions beyond controlled games.
Exploring online adaptation where the predictor updates continuously.
The method assumes access to a relevant source population for training.
The Observer's contribution is task-dependent (stronger for response prediction and bargaining).