Summary of "Can Vision-Language Models Solve the Shell Game?"

Summary (Overview)

  • Exposes a Fundamental Weakness: Current state-of-the-art Vision-Language Models (VLMs) perform at or near chance level on the visual entity tracking task (e.g., the shell game) when forced to rely solely on spatiotemporal motion cues, as shown by a new diagnostic benchmark, VET-Bench.
  • Identifies a Shortcut in Existing Benchmarks: An audit of the Perception Test reveals that models exploit static appearance cues (e.g., distinct or transparent cups) to bypass true temporal tracking, inflating performance. When these cues are filtered out, performance collapses to chance.
  • Provides Theoretical Justification: The paper proves the visual entity tracking problem is NC¹-complete for k ≥ 5 objects, suggesting fixed-depth transformers have inherent expressivity limitations for this task without intermediate computation/state.
  • Proposes an Effective Solution: Introduces Spatiotemporal Grounded Chain-of-Thought (SGCoT), a method that aligns a VLM (Molmo2) to generate explicit, fine-grained object trajectories as intermediate reasoning steps. This approach achieves over 90% accuracy on VET-Bench.

Introduction and Theoretical Foundation

Visual entity tracking—maintaining the identity of objects over time—is a core human cognitive ability but a critical bottleneck for VLMs. This limitation is often masked in existing video benchmarks by visual shortcuts. The authors investigate this via the "shell game" paradigm, where visually identical objects are shuffled, forcing tracking through motion continuity alone.

The theoretical foundation connects visual entity tracking to the computational complexity of state tracking. The core problem, TRACK_k (tracking k indistinguishable objects in a video), is formalized. Drawing on prior work in circuit complexity and transformer expressivity (Merrill & Sabharwal, 2023, 2024), the authors analyze whether transformer-based VLMs can inherently solve this task.

Key Mathematical Definition:

Definition 1 (Visual Entity Tracking, TRACK_k). TRACK_k is the problem of tracking k visually indistinguishable objects in a video V = (F_0, ..., F_T) of T+1 frames on an H×W grid, where k, H, and W are constants. The input is assumed to satisfy localization and continuity conditions. Let π be the global permutation that maps the k objects from their initial lexicographic ordering of positions in frame 0 to their final ordering in frame T. The problem asks whether π is the identity permutation.
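The definition above can be sketched in code: under the continuity condition, nearest-neighbor matching between adjacent frames recovers the true correspondence, and composing the per-frame matchings yields π. This is a minimal illustration with helper names of our own choosing, not the paper's implementation.

```python
# Sketch of Definition 1 (TRACK_k): decide whether the global permutation pi
# is the identity. Assumes the continuity condition, so nearest-neighbor
# matching between adjacent frames is exact.
# `frames` is a list of T+1 frames; each frame is a list of k (x, y) positions.

def frame_permutation(prev, curr):
    """Match each position in `prev` to its nearest position in `curr`."""
    return [
        min(range(len(curr)),
            key=lambda i: (curr[i][0] - x) ** 2 + (curr[i][1] - y) ** 2)
        for x, y in prev
    ]

def track_is_identity(frames):
    """True iff each object ends at the same lexicographic rank it started at."""
    k = len(frames[0])
    # loc[i] = index (in the current frame's list) of object i, where object i
    # is the i-th position of frame 0 in lexicographic order.
    loc = sorted(range(k), key=lambda j: frames[0][j])
    for prev, curr in zip(frames, frames[1:]):
        match = frame_permutation(prev, curr)
        loc = [match[p] for p in loc]  # compose per-frame matchings
    # Lexicographic rank of each position index in the final frame.
    rank = {j: r for r, j in enumerate(sorted(range(k), key=lambda j: frames[-1][j]))}
    pi = [rank[loc[i]] for i in range(k)]
    return pi == list(range(k))
```

Note that the composition loop is inherently sequential, which is exactly the property the complexity analysis below formalizes.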

The analysis links this to the word problem for the symmetric group S_5, a canonical NC¹-complete problem.

Key Theorem:

Theorem 1. For any fixed k ≥ 5, TRACK_k is NC¹-complete.

  • Proof Sketch: Membership in NC¹ is shown by computing the permutation between each adjacent frame pair and composing them. Hardness is proven via a reduction from the word problem for S_5 (WORD_{S_5}) by constructing a video that physically realizes adjacent transposition generators.
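The word problem underlying the hardness direction can be made concrete: given a word over the adjacent-transposition generators of S_5, decide whether the product is the identity. The paper's reduction additionally renders each generator as a physical on-screen swap; the sketch below only shows the algebraic core.

```python
# WORD_{S_5}: given a word over adjacent-transposition generators
# s_i = (i, i+1), decide whether their product is the identity permutation
# of {0, ..., 4}. Each generator corresponds to one cup swap in the video.

def apply_transposition(state, i):
    state = state[:]
    state[i], state[i + 1] = state[i + 1], state[i]
    return state

def word_is_identity(word, k=5):
    state = list(range(k))  # cup j starts at position j
    for i in word:          # each letter i applies generator s_i
        state = apply_transposition(state, i)
    return state == list(range(k))
```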

This result implies that, under the conjecture that TC⁰ ⊊ NC¹, fixed-depth transformers cannot solve general visual entity tracking on arbitrary-length sequences without intermediate computation (e.g., Chain-of-Thought). The problem's difficulty stems from the non-solvable algebraic structure of S_k for k ≥ 5.

Methodology

  1. VET-Bench (Visual Entity Tracking Benchmark): A synthetic diagnostic dataset is introduced to isolate spatiotemporal perception.

    • Tasks: Cups Game (shell game) and Cards Game (Three-Card Monte).
    • Key Design: All objects (cups/cards) are visually identical and opaque, eliminating static appearance-based shortcuts. The continuity constraint (2d < Δ, where d is the maximum displacement between frames and Δ is the minimum object separation) prevents identity aliasing.
    • Control: Allows systematic variation of object count (N) and swap count.
    • Data Generation: Uses a three.js-based pipeline for unlimited, varied episodes.
  2. Model Evaluation: A comprehensive suite of proprietary and open-source VLMs is evaluated on VET-Bench using Top-1 Accuracy in a multiple-choice QA format. Models include Gemini-3/2.5, Qwen-3.5/VL, GLM-4.6V, Ernie-4.5, Doubao-Seed, Kimi-K2.5, PerceptionLM, and Molmo2.

  3. Spatiotemporal Grounded Chain-of-Thought (SGCoT): A novel method to enable reliable tracking.

    • Core Idea: Elicit the VLM (Molmo2) to generate an explicit, fine-grained object trajectory as a CoT before answering.
    • Format: The model outputs a <tracks> tag containing timestamped coordinates: <tracks coords="timestamp object_idx x y;...">Object</tracks>.
    • Training: The model is fine-tuned on synthetic text-only data (300 samples) where the SGCoT trajectory and final answer are provided. The loss is masked on the synthetic trajectory tokens, supervising only the final answer. This aligns the model to use its inherent tracking capability for QA without expensive video retraining.
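A consumer of the SGCoT output format described above can be sketched as follows. The exact tag grammar is an assumption based on the format string in this summary, not the paper's parser, and the helper names are hypothetical.

```python
# Sketch of parsing an SGCoT trace of the form
#   <tracks coords="timestamp object_idx x y;...">Object</tracks>
# and reading off the tracked object's final position (the answer grounding).
import re

def parse_tracks(cot):
    """Return {object_idx: [(timestamp, x, y), ...]} from a <tracks> tag."""
    m = re.search(r'<tracks coords="([^"]*)"', cot)
    tracks = {}
    for entry in m.group(1).strip(';').split(';'):
        t, idx, x, y = entry.split()
        tracks.setdefault(int(idx), []).append((float(t), float(x), float(y)))
    return tracks

def final_position(cot, object_idx):
    """Last reported (x, y) of the given object along its trajectory."""
    trajectory = sorted(parse_tracks(cot)[object_idx])
    return trajectory[-1][1:]
```

Because the trajectory is emitted token by token before the answer, it serves exactly as the intermediate state that the NC¹ argument says fixed-depth forward passes lack.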

Empirical Validation / Results

Main Result on VET-Bench: All evaluated VLMs perform near the random guessing baseline (33% for 3 objects) on the core VET-Bench task (3 objects, 5 swaps).

Table: Performance on VET-Bench (3 objects, 5 swaps)

| Model Category | Example Models | Accuracy (Cups) | Accuracy (Cards) | ~Avg. |
| --- | --- | --- | --- | --- |
| Closed-source (Reasoning) | Gemini-3-Pro, Gemini-3-Flash | 0.34, 0.30 | 0.40, 0.30 | ~0.33 |
| Closed-source (Non-reasoning) | Gemini-2.5-Pro/Flash | 0.25, 0.35 | 0.34, 0.28 | ~0.30 |
| Open-source (Reasoning) | Qwen3-VL-30B-Thinking | 0.32 | 0.30 | ~0.31 |
| Open-source (Non-reasoning) | Qwen3-VL-8B-Instruct, Molmo2 | 0.30, 0.37 | 0.33, 0.34 | ~0.33 |
| Proposed Method | Molmo2-SGCoT (Ours) | 0.93 | 0.89 | 0.91 |

Failure Mode Analysis:

  1. Direct Answer: Non-reasoning models often output a final answer without CoT, akin to random guessing.
  2. Coarse Description: Some models describe the video at a high level (e.g., "cups are shuffled") but fail to perceive individual swaps.
  3. Inaccurate Perception & Hallucination: Reasoning models (e.g., Gemini-3) generate logically valid CoTs but are grounded in incorrect perceptions (misidentifying swaps), leading to cascading errors.

Ablation Studies:

  • Swap Count: Performance drops sharply from near-perfect at 0 swaps (object permanence) to chance level with just 1 swap.
  • Object Count: Even for the simple case of N = 2 (a parity problem), models fail to outperform the random baseline (50%).
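The N = 2 case is striking because the "reasoning" it requires is a single parity bit, as the minimal sketch below shows (our illustration, not the paper's code). That models still fail at chance level indicates the bottleneck is perceiving the swaps, not computing with them.

```python
# With two identical cups, tracking collapses to parity: the ball is under
# its original cup iff the number of observed swaps is even.

def ball_under_original(num_swaps: int) -> bool:
    return num_swaps % 2 == 0
```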

Comparison with Existing Benchmarks:

  • Perception Test Audit: After filtering out videos with visual shortcuts (distinct/transparent cups), model performance on the remaining 65 hard clips collapses to chance (~31-42%), aligning with VET-Bench results.
  • VideoReasonBench: Contains explicit visual swap cues (arrows), allowing models to achieve higher accuracy (56% for Gemini-2.5-Pro) without genuine motion tracking, unlike VET-Bench.

Empirical Verification of Theoretical Limit:

  • Training a VLM (Qwen2.5-VL-3B) on 500 VET-Bench videos with only direct-answer supervision failed; the loss stagnated at the random chance level, confirming the difficulty of learning the task end-to-end.

Success of SGCoT:

  • Molmo2-SGCoT achieves 91% accuracy on VET-Bench after lightweight fine-tuning on text-only trajectory data.
  • Errors primarily occur due to misidentification during the SGCoT perception stage (trajectory "jumps").

Theoretical and Practical Implications

Theoretical Implications:

  • Establishes visual entity tracking as an NC¹-complete problem for k ≥ 5, providing a rigorous complexity-theoretic explanation for why fixed-depth transformers struggle without intermediate state representations.
  • Explains the ease of tracking distinct objects: The task reduces to a parallelizable visual search problem (in AC⁰) when unique identifiers are present, unlike the sequential state tracking required for identical objects.
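The contrast with distinct objects can be made concrete: when each object carries a unique identifier, the final location is a single lookup in the last frame, with no sequential state at all. A hypothetical sketch:

```python
# With unique appearance cues, tracking needs no history: scan only the
# final frame for the target's identifier. Each position can be checked
# independently (hence in parallel), which is why this variant is easy,
# unlike the sequential composition needed for identical objects.

def locate_distinct(final_frame, target_id):
    """final_frame: list of (object_id, x, y); returns the target's (x, y)."""
    for obj_id, x, y in final_frame:
        if obj_id == target_id:
            return (x, y)
    return None
```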

Practical Implications:

  • Benchmark Design: Highlights the need for diagnostics like VET-Bench that eliminate shortcuts to properly evaluate core spatiotemporal reasoning.
  • Model Architecture & Training: Suggests that for complex temporal reasoning, VLMs need mechanisms akin to Spatiotemporal Grounded Chain-of-Thought (SGCoT)—explicit, fine-grained intermediate state updates—to overcome expressivity limits.
  • Efficient Alignment: Demonstrates that strong performance on a challenging perceptual task can be unlocked via minimal, targeted alignment (text-only fine-tuning) that repurposes a model's existing capabilities (Molmo2's tracking).

Conclusion

The paper identifies and rigorously analyzes a fundamental weakness in current VLMs: the inability to perform robust visual entity tracking without appearance shortcuts. The introduction of VET-Bench provides a clear diagnostic tool, while the NC¹-completeness proof offers a theoretical foundation for the observed failures. The proposed SGCoT method effectively addresses this limitation by grounding reasoning in explicit spatiotemporal trajectories, enabling a VLM to solve the shell game with high reliability. This work underscores the importance of intermediate, grounded state representations for complex video reasoning and points toward future VLMs that integrate such mechanisms for more human-like visual perception.