# Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

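> RNG-Bench shows that top multimodal models struggle with non-Markov memory-for-action: on the hardest configurations the best models reach only 62.3% Score% (Matching Pairs) and 49.7% Game Score (3D Maze), while supervised fine-tuning yields gains that transfer to external benchmarks.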

- **Source:** [arXiv](https://arxiv.org/abs/2606.19338)
- **Published:** 2026-06-19
- **Permalink:** https://picx.dev/p/66IC4G
- **Whiteboard:** https://picx.dev/p/66IC4G/image

## Summary (Overview)

- **Non-Markov evaluation**: RNG-Bench isolates a model's ability to **reconstruct past observations and act on them** during multi-step interaction, separating memory-for-action from post-hoc question answering.
- **Two complementary environments**: *Matching Pairs* (static, discrete hidden state: card identities at locations) and *3D Maze* (dynamic, spatial hidden state: map from egocentric views), both with controllable difficulty axes.
- **Diagnostic metrics**: A **Memory Gap** metric (Eq. 3) that disentangles forgetting from suboptimal action selection, and a **head-to-head duel protocol** to control for instance-level variance.
- **State-of-the-art results leave headroom**: On the hardest configurations (10×10 Matching Pairs, 13×13 3D Maze), the best models reach only 62.3% Score% (GPT-5.4) and 49.7% Game Score (Gemini-3.1-Pro), respectively, far from optimal.
- **Fine-tuning transfers**: Supervised fine-tuning of Qwen3.5-9B on optimal and filtered model rollouts improves RNG-Bench performance and transfers to external memory/spatial benchmarks without degrading general multimodal ability.

---

## Introduction and Theoretical Foundation

The paper addresses a critical gap in evaluating multimodal large language models (MLLMs) deployed as closed-loop policies: many real-world tasks require **actions conditioned on observations that are no longer visible** (the non-Markov regime). Because each action determines what the agent observes next, a single recall error can compound over the rest of the episode.

**Non-Markov condition** (Definition from Section 3.1):  
A game is non-Markov when the current observation alone is insufficient for optimal action:

$$
\exists \mathbf{h}_t, \tilde{\mathbf{h}}_t \text{ s.t. } Z(s_t) = Z(\tilde{s}_t), \quad \mathcal{A}^*(\mathbf{h}_t) \neq \mathcal{A}^*(\tilde{\mathbf{h}}_t)
$$

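where $\mathcal{A}^*(\mathbf{h}_t)$ is the set of optimal actions under history $\mathbf{h}_t$, $s_t$ and $\tilde{s}_t$ are the states reached under the two histories, and $Z$ is the observation function.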

The agent must maintain an **internal belief state** $b_t = f(\mathbf{h}_t)$ from its interaction history, summarizing hidden task-relevant information. RNG-Bench is designed to test this **in-context state tracking for action**.
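As a minimal illustration of the condition above, the sketch below builds a belief state for Matching Pairs from a turn history. The `Turn` layout and the helper names `belief_from_history` and `known_pair` are hypothetical and not taken from the paper. Both example histories end with the same all-face-down board, yet only one of them admits a guaranteed match, so the current observation alone cannot determine the optimal action.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]
# Hypothetical history entry: the two cells flipped on a turn and the symbols shown.
Turn = Tuple[Cell, str, Cell, str]


def belief_from_history(history: List[Turn]) -> Dict[Cell, str]:
    """b_t = f(h_t): remembered symbol for every seen, still-unmatched cell."""
    belief: Dict[Cell, str] = {}
    for cell_a, sym_a, cell_b, sym_b in history:
        if sym_a == sym_b:
            # Matched cards are removed from the board, so drop them.
            belief.pop(cell_a, None)
            belief.pop(cell_b, None)
        else:
            # Unmatched cards flip back; only memory retains their identity.
            belief[cell_a] = sym_a
            belief[cell_b] = sym_b
    return belief


def known_pair(belief: Dict[Cell, str]) -> Optional[Tuple[Cell, Cell]]:
    """Return two remembered cells that hold the same symbol, if any."""
    first_seen: Dict[str, Cell] = {}
    for cell, sym in belief.items():
        if sym in first_seen:
            return first_seen[sym], cell
        first_seen[sym] = cell
    return None


# Two histories that end in the same visible board (everything face-down).
h = [((0, 0), "A", (0, 1), "B"), ((1, 0), "A", (1, 1), "C")]
h_tilde = [((0, 0), "A", (0, 1), "B"), ((1, 0), "D", (1, 1), "C")]

print(known_pair(belief_from_history(h)))        # a guaranteed match is known
print(known_pair(belief_from_history(h_tilde)))  # no pair is known yet
```

Under `h` the optimal action is to flip the two remembered "A" cells, while under `h_tilde` the agent still has to explore.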

**Limitations of existing benchmarks** (Table 1):
- Fully-observed games (chess, Go) do not require recall.
- Agent suites (AgentBench, BALROG) bundle hidden-state reconstruction with exploration, rule discovery, and free-form actions.
- Long-context/memory benchmarks (EMemBench) probe **remember-to-answer** (post-hoc QA) rather than **remember-to-act** (where recall errors affect subsequent observations).

RNG-Bench instantiates two complementary games that isolate the latter regime.

---

## Methodology

### Problem Formulation

Each instance is a Partially Observable Markov Decision Process (POMDP) $(\mathcal{S}, \mathcal{O}, \mathcal{A}, T, Z, R)$, where $\mathcal{S}$ is the state space, $\mathcal{O}$ the observation space, $\mathcal{A}$ the action space, $T$ the transition function, $Z$ the observation function, and $R$ the reward function. Models act as history-based policies $\pi(a_t | \mathbf{h}_t)$ that use the raw in-context history, with no external belief module by default.
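The history-based policy interface can be sketched as a plain interaction loop. Everything here is illustrative: the `Env` protocol, `RecallEnv`, and both toy policies are hypothetical stand-ins rather than the benchmark's harness. The toy task shows a cue once and rewards repeating it later, so a policy that reads the history succeeds where one that looks only at the current observation fails.

```python
from typing import Callable, List, Protocol, Tuple

Observation = str
Action = str
History = List[Tuple[Observation, Action]]
Policy = Callable[[History, Observation], Action]


class Env(Protocol):
    def reset(self) -> Observation: ...
    def step(self, action: Action) -> Tuple[Observation, float, bool]: ...


def rollout(env: Env, policy: Policy, max_steps: int = 10) -> float:
    """Run pi(a_t | h_t): the policy sees the raw history plus the new observation."""
    history: History = []
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(history, obs)
        history.append((obs, action))
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


class RecallEnv:
    """Toy non-Markov task: a cue is shown once and must be repeated one step later."""

    def __init__(self, cue: str) -> None:
        self.cue = cue
        self.t = 0

    def reset(self) -> Observation:
        self.t = 0
        return self.cue

    def step(self, action: Action) -> Tuple[Observation, float, bool]:
        self.t += 1
        if self.t == 1:
            return "blank", 0.0, False  # the cue is no longer visible
        return "end", float(action == self.cue), True


def history_policy(history: History, obs: Observation) -> Action:
    # Recall the first observation once it exists in the history.
    return history[0][0] if history else "wait"


def markov_policy(history: History, obs: Observation) -> Action:
    # Ignore the history and echo the current observation.
    return obs


print(rollout(RecallEnv("7"), history_policy))  # recalls the cue
print(rollout(RecallEnv("7"), markov_policy))   # only sees "blank" when it matters
```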

### Environments

- **Matching Pairs**: An $R \times C$ grid of card pairs, initially face-down. On each turn the agent flips two cards; matched cards are removed and unmatched cards are flipped back. The hidden state is the set of identity-location bindings that have been revealed but are currently face-down. Difficulty axes: board size, visual pattern (ASCII, noise, poker suits, etc.), observation modality (text, image), and action feedback.
- **3D Maze**: A procedurally generated grid maze in which the agent navigates from the top-left to the bottom-right cell using egocentric views (no top-down map). The hidden state comprises maze topology, visited cells, position, and orientation. Difficulty axes: maze size ($5\times5$ to $15\times15$), minimap availability, and observation modality (text, 2D patch, 3D scene).

### Duel Protocol (Matching Pairs)

Two models alternate turns on the **same board**, each observing the cards the other reveals but not the other's reasoning. A successful match grants an extra turn. This setup controls for board randomness and tests whether a model uses information revealed by its opponent. Each pair plays twice with swapped turn order to control for first-mover bias.
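The swapped-order pairing can be sketched as a simple schedule generator. The function name `duel_schedule`, the `n_boards` parameter, the placeholder model names, and the choice to reuse one board for both orderings of a pair are all assumptions made for illustration; the summary does not specify how boards are assigned.

```python
from itertools import combinations
from typing import List, Tuple

Game = Tuple[int, str, str]  # (board_id, first_player, second_player)


def duel_schedule(models: List[str], n_boards: int = 1) -> List[Game]:
    """Every unordered pair plays twice per board, once with each player moving first."""
    games: List[Game] = []
    for board_id in range(n_boards):
        for a, b in combinations(models, 2):
            games.append((board_id, a, b))  # a moves first
            games.append((board_id, b, a))  # swapped order on the same board
    return games


models = ["m1", "m2", "m3", "m4", "m5"]  # placeholder names
schedule = duel_schedule(models)
games_for_m1 = [g for g in schedule if "m1" in g[1:]]
print(len(schedule), len(games_for_m1))
```

With five models, one such round gives each model eight games, four of them as first mover.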

### Evaluation Metrics

**Matching Pairs**:  
- **Score%** = fraction of matched pairs  
- **Resp./Score** = average responses per matched pair (lower better)  
- Parse failure (PF%) and invalid action (IA%) as diagnostics  
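A minimal sketch of how these Matching Pairs metrics could be computed from episode logs. The `EpisodeLog` fields, the pooling of counts across episodes, and the use of total responses as the denominator for PF% and IA% are assumptions; the summary does not state the exact aggregation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EpisodeLog:
    """Hypothetical per-episode counters."""
    total_pairs: int
    matched_pairs: int
    responses: int
    parse_failures: int
    invalid_actions: int


def matching_pairs_metrics(logs: List[EpisodeLog]) -> Dict[str, Optional[float]]:
    total_pairs = sum(log.total_pairs for log in logs)
    matched = sum(log.matched_pairs for log in logs)
    responses = sum(log.responses for log in logs)
    parse_failures = sum(log.parse_failures for log in logs)
    invalid_actions = sum(log.invalid_actions for log in logs)
    return {
        # Score%: fraction of pairs that were matched.
        "score_pct": 100.0 * matched / total_pairs,
        # Resp./Score: undefined when nothing was matched.
        "resp_per_score": responses / matched if matched else None,
        # Assumed denominators: all model responses.
        "pf_pct": 100.0 * parse_failures / responses if responses else None,
        "ia_pct": 100.0 * invalid_actions / responses if responses else None,
    }


# Made-up counts for a single 10x10 episode (50 pairs).
example = matching_pairs_metrics([EpisodeLog(50, 25, 250, 1, 5)])
print(example)
```

Returning `None` for Resp./Score when no pair is matched mirrors the "–" entry that appears for a zero-score row in the fine-tuning table.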

**3D Maze**:  
- **Success Rate (SR)** = fraction of episodes reaching goal  
- **Efficiency (Eff)** = $L^*/L_\text{actual}$ (shortest path / actual path)  
- **Explore** = ratio of visited cells, **Walls** = wall collisions  
- **Game Score (GS)**:

$$
\text{GS} = \frac{\text{SR} + \text{SR} \times \text{Eff} + (1 - \text{SR}) \times \text{Explore}}{2}
$$
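A quick arithmetic check of Game Score against the 3D Maze columns of Table 2, assuming the division by 2 applies to the whole sum (the reading that reproduces the reported GS values). The function name `game_score` is illustrative, and the inputs are the rounded table entries, so small rounding differences are expected.

```python
def game_score(sr: float, eff: float, explore: float) -> float:
    """GS with all inputs as fractions in [0, 1]; the full sum is halved."""
    return (sr + sr * eff + (1.0 - sr) * explore) / 2.0


# (SR%, Eff.%, Explore%, reported GS%) as listed in Table 2.
table2 = {
    "GPT-5.4": (20.0, 75.7, 32.3, 30.5),
    "Gemini-3.1-Pro": (50.0, 62.5, 36.4, 49.7),
    "Seed-2.0-Lite": (20.0, 38.9, 19.4, 21.7),
    "Kimi-K2.5": (10.0, 61.1, 17.9, 16.1),
    "Qwen3.5-397B": (0.0, 0.0, 21.0, 10.5),
}

recomputed = {
    name: 100.0 * game_score(sr / 100.0, eff / 100.0, explore / 100.0)
    for name, (sr, eff, explore, _) in table2.items()
}

for name, (_, _, _, reported) in table2.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported}")
```

A model that never reaches the goal can score at most 50% through exploration alone, while a model that always succeeds at optimal efficiency scores 100%.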

**Memory Gap**:  
To separate belief-state tracking from action selection, an oracle condition provides the true hidden state $s_t$ at each step. Let $S(m)$ and $S^*(m)$ be scores under normal and oracle conditions. Then:

$$
\text{MemoryGap}(m) = \left(1 - \frac{S(m)}{S^*(m)}\right) \times 100\%
$$

A large Memory Gap indicates that forgetting, rather than action selection, is the main bottleneck.
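The metric is a one-line computation. The sketch below uses made-up scores to show how to read it; the function name `memory_gap` is illustrative.

```python
def memory_gap(score: float, oracle_score: float) -> float:
    """MemoryGap(m) = (1 - S(m) / S*(m)) * 100, in percent."""
    if oracle_score <= 0:
        raise ValueError("oracle score must be positive")
    return (1.0 - score / oracle_score) * 100.0


# Hypothetical scores: the oracle condition doubles the score.
print(memory_gap(25.0, 50.0))  # half of the oracle score is lost without the hidden state
# Hypothetical scores: the oracle condition does not help.
print(memory_gap(40.0, 40.0))  # no gap, so action selection is the limiting factor
```

A gap of 50% therefore corresponds to a score that doubles when the true hidden state is supplied.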

---

## Empirical Validation / Results

All models are evaluated with a unified harness (VLMEvalKit). Key results:

### Main Single-Player Results (Table 2, 10×10 Matching Pairs, 13×13 3D Maze)

MP = Matching Pairs (10×10); Maze = 3D Maze (13×13).

| Model Name | MP PF%↓ | MP IA%↓ | MP Resp./Score↓ | MP Score%↑ | Maze SR%↑ | Maze Explore%↑ | Maze Walls↓ | Maze Eff.%↑ | Maze GS%↑ |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 0.0 | 4.3 | 8.01 | **62.3** | 20.0 | 32.3 | 3.2 | 75.7 | 30.5 |
| Gemini-3.1-Pro | 0.4 | 2.5 | 10.00 | 50.0 | **50.0** | 36.4 | 0.1 | 62.5 | **49.7** |
| Seed-2.0-Lite | 1.2 | 4.3 | 11.57 | 43.2 | 20.0 | 19.4 | 16.6 | 38.9 | 21.7 |
| Kimi-K2.5 | 1.8 | 2.8 | 13.16 | 38.0 | 10.0 | 17.9 | 7.1 | 61.1 | 16.1 |
| Qwen3.5-397B | 0.0 | 3.0 | 19.74 | 25.3 | 0.0 | 21.0 | 9.9 | 0.0 | 10.5 |

GPT-5.4 leads on Matching Pairs; Gemini-3.1-Pro leads on 3D Maze, indicating distinct hidden-state demands.

### Duel Results (Table 3, 16 games per model)

| Model Name | Win% | W | T | L | Score% | ELO |
|---|---|---|---|---|---|---|
| Gemini-3.1-Pro | **100.0** | 16 | 0 | 0 | 36.5 | 1803 |
| GPT-5.4 | 50.0 | 7 | 2 | 7 | 25.3 | 1492 |
| Qwen3.5-397B | 46.7 | 7 | 1 | 8 | 18.0 | 1476 |
| Kimi-K2.5 | 37.5 | 5 | 2 | 9 | 18.0 | 1423 |
| Seed-2.0-Lite | 15.6 | 2 | 1 | 13 | 12.3 | 1306 |

Gemini-3.1-Pro wins all games, suggesting better use of opponent-revealed information.

### Diagnostic Analyses

- **Scale sweep** (Fig. 3): Performance drops sharply as hidden state grows. Qwen3.5-397B: Matching Pairs Score% from 90.6% ($4\times4$) to 0.7% ($12\times12$); 3D Maze GS peaks at $7\times7$ then declines.
- **External memory** (Fig. 4): An external memory map or minimap roughly doubles Matching Pairs Score% (Memory Gap ≈ 46-51%) but recovers a smaller share on 3D Maze (Memory Gap ≈ 31-41%), since spatial navigation couples state tracking with planning.
- **Modality ablation** (Table 4): Text-only dominates; visual recognition is the main bottleneck.

| Model | Matching Pairs: Text | Matching Pairs: ASCII | Matching Pairs: Noise | 3D Maze: Text-Sym (GS) | 3D Maze: 2D Patch (GS) | 3D Maze: 3D Scene (GS) |
|---|---|---|---|---|---|---|
| Qwen3.5-397B | 100.0 | 75.8 | 38.3 | 70.9 | 20.1 | 23.7 |
| Kimi-K2.5 | 100.0 | 72.5 | 43.3 | 60.0 | 25.7 | 39.7 |

- **Removing action feedback** (Table 5): Stripping the model's own action trace cuts Matching Pairs Score% by about 75% in relative terms (GPT-5.4: 62.3→15.3; Qwen3.5-397B: 25.3→6.3), showing that the explicit textual record is a load-bearing channel for belief-state tracking.
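The size of the drop in the action-feedback ablation can be checked as a relative change, using the Score% values quoted above. The helper name `relative_drop_pct` is illustrative.

```python
def relative_drop_pct(before: float, after: float) -> float:
    """Percentage of the original score lost when going from `before` to `after`."""
    return 100.0 * (before - after) / before


gpt_drop = relative_drop_pct(62.3, 15.3)   # GPT-5.4
qwen_drop = relative_drop_pct(25.3, 6.3)   # Qwen3.5-397B
print(f"GPT-5.4: {gpt_drop:.1f}%  Qwen3.5-397B: {qwen_drop:.1f}%")
```

Both models lose roughly three quarters of their score, even though their absolute drops (47.0 and 19.0 points) differ substantially.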

### Fine-Tuning Results (Qwen3.5-9B, Table 6)

| Model | Matching Pairs Score%↑ | Matching Pairs Resp./Score↓ | 3D Maze SR%↑ | 3D Maze GS%↑ |
|---|---|---|---|---|
| Baseline | 0.0 | – | 0.0 | 1.5 |
| +opt32k | 14.6 | 14.7 | 0.0 | 5.0 |
| +rmix32k | **29.5** | 6.8 | **10.0** | **16.3** |

Supervised fine-tuning on optimal trajectories (32K) and filtered model rollouts (6K) improves performance on unseen larger sizes; the mixed data (rmix32k) is more effective than optimal-only data (opt32k).

### Transfer to External Benchmarks (Table 7, selected benchmarks)

| Benchmark | Metric | Baseline | SFT | Δ |
|---|---|---|---|---|
| EMemBench | Visual DIF_50 | 49.5 | 54.7 | +5.2 |
| VGRPBench | Macro Perception Acc | 24.9 | 29.5 | +4.6 |
| **Memory/spatial group mean** | | 30.9 | 34.3 | +3.4 |
| **General multimodal group mean** | | 77.4 | 77.9 | +0.5 |

Fine-tuning improves targeted memory/spatial benchmarks without degrading general ability.

---

## Theoretical and Practical Implications

**Theoretical**: RNG-Bench formalizes a distinct capability, **in-context belief-state tracking for action**, that is not captured by existing memory (post-hoc QA) or fully observed reasoning benchmarks. The Memory Gap analysis indicates that forgetting earlier observations, rather than poor decision-making given the correct state, is the primary failure mode. This suggests that improvements in **long-context retention** and **visual recognition from history** would directly benefit closed-loop MLLMs.

**Practical**: The benchmark provides a controlled testbed for developers to isolate memory-for-action from other agent skills. The fine-tuning results demonstrate that synthetic trajectories (optimal + model rollouts) can teach smaller MLLMs to track hidden state, with gains that transfer to related benchmarks (EMemBench, VGRPBench). This opens a path to improving deployed models without sacrificing general performance.

Key takeaways for practitioners:
- Visual observation modality is a stronger bottleneck than history length alone.
- Explicit action trace is critical for belief-state updating.
- External memory (minimap, memory map) helps but does not fully close the gap, especially in spatial navigation.
- Duel protocol reveals strategic differences not visible in single-agent scores.

---

## Conclusion

RNG-Bench introduces a **controllable, non-Markov benchmark** that isolates a model's ability to reconstruct hidden state from interaction history and use it for action. Across leading MLLMs:
- Performance collapses as latent state grows.
- Visual observations drive the bottleneck more than history length.
- Stripping the action trace reduces Matching Pairs Score% to near-chance levels.
- Fine-tuning on simulator rollouts improves performance and transfers to external benchmarks without regression.

**Limitations**: Only two environments studied; broader coverage of game genres, model families, and visual styles is left to future work. Memory Gap is a practical diagnostic, not a standalone causal decomposition. Fine-tuning is a feasibility demonstration on a single base model. The benchmark provides a foundation for future work on interactive MLLMs that must **remember to act** rather than just **remember to answer**.
