Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model - Summary

Summary (Overview)

  • Extreme Compression for Planning: Proposes CompACT, a discrete tokenizer that compresses each visual observation into as few as 8 tokens (~128 bits), drastically reducing computational cost compared to conventional tokenizers that use hundreds of tokens.
  • Semantic over Photorealistic: Prioritizes preserving planning-critical high-level semantic information (object identities, spatial layouts) while discarding high-frequency perceptual details (textures, lighting), enabled by leveraging frozen pretrained vision foundation models (DINOv3).
  • Generative Decoding: Employs a generative decoding strategy that synthesizes fine-grained perceptual details from the compact tokens by unmasking tokens from a high-capacity target tokenizer (e.g., VQGAN), transforming an ill-posed decompression problem into a tractable conditional generation task.
  • Efficient Planning: An action-conditioned world model trained in the CompACT latent space achieves competitive planning accuracy with orders-of-magnitude faster planning (e.g., ~40× speedup on navigation tasks) compared to models using continuous latent spaces with hundreds of tokens.
  • Modular Representation: Learns a modular latent structure where each token attends to semantically coherent scene components (e.g., objects, end-effectors), which inherently benefits action-relevant modeling and planning.

Introduction and Theoretical Foundation

World models are neural networks that learn environment dynamics, enabling agents to simulate future states for planning and policy learning. While recent generative models can produce photorealistic future observations, their application to real-time decision-time planning is computationally prohibitive. The bottleneck lies in latent representations: conventional tokenizers encode each image into hundreds of tokens (e.g., 784 for SD-VAE), and in attention-based architectures the computational cost grows quadratically with that token count.

This work challenges the design philosophy of prioritizing reconstruction fidelity for planning. Instead, it hypothesizes that aggressive compression, forcing the model to retain only action-relevant information, can be beneficial. The theoretical foundation builds on the idea that human planning relies on compact, abstract mental models rather than pixel-perfect recall. Therefore, the authors propose CompACT, which pushes compression to an extreme limit to investigate if such representations can still support effective planning.

Methodology

The method follows a three-stage pipeline (Fig. 1): train a tokenizer, train a latent world model, and perform decision-time planning.

1. Latent World Model Formulation: A world model $f_\theta$ predicts the distribution over the next observation conditioned on the current observation and action: $f_\theta: (o_t, a_t) \mapsto p_\theta(o_{t+1} \mid o_t, a_t)$. To avoid operating in high-dimensional pixel space, the model works on latent tokens $z_t = E(o_t)$ obtained via a tokenizer $(E, D)$. The latent world model is:

$$f_\phi: (z_t, a_t) \mapsto p_\phi(z_{t+1} \mid z_t, a_t) \quad \text{(Eq. 2)}$$

Planning involves rolling out this model over a horizon $H$ to find the action sequence $\mathbf{a}^* = \arg\min_{\mathbf{a}} d(\hat{o}_H, o_{\text{goal}})$ that minimizes the distance between the predicted terminal state and the goal.
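The rollout-and-score loop above can be sketched as a simple random-shooting search. This is a minimal illustration, not the paper's planner: the sampling distribution, candidate count, and cost are assumptions, and a latent-space L2 distance stands in for $d(\hat{o}_H, o_{\text{goal}})$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(world_model, z0, actions):
    """Roll a latent world model forward for H steps.

    world_model(z, a) -> next latent state (a point estimate of
    p_phi(z_{t+1} | z_t, a_t)); shapes here are illustrative.
    """
    z = z0
    for a in actions:
        z = world_model(z, a)
    return z

def plan_random_shooting(world_model, z0, z_goal, horizon=5,
                         n_candidates=256, action_dim=2):
    """Sample candidate action sequences, score each terminal latent
    against the goal latent, and return the lowest-cost sequence."""
    candidates = rng.normal(size=(n_candidates, horizon, action_dim))
    costs = []
    for seq in candidates:
        z_final = rollout(world_model, z0, seq)
        # Latent-space stand-in for the goal distance d(., .)
        costs.append(np.linalg.norm(z_final - z_goal))
    return candidates[int(np.argmin(costs))]
```

With a toy dynamics function such as `lambda z, a: z + a`, the returned sequence moves the latent state toward the goal; real world models would replace this with a learned transition.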

2. CompACT Tokenizer Design (Fig. 2): The core innovation is the CompACT tokenizer $D_{\text{compact}} \circ E_{\text{compact}}$.

  • Semantic Encoding ($E_{\text{compact}}$): Uses a frozen DINOv3 encoder to extract semantic patch features. A latent resampler with learnable query tokens attends to these features via cross-attention, distilling high-level semantics. The output is discretized via Finite Scalar Quantization (FSQ) into $N$ tokens ($N \le 16$), $z \in \{1, \dots, K\}^N$.
  • Generative Decoding ($D_{\text{compact}}$): Instead of direct pixel reconstruction, the decoder learns to unmask target tokens $z_\psi$ from a pretrained, high-capacity tokenizer (e.g., MaskGIT's VQGAN with $N_\psi = 196$), conditioned on the compact tokens $z$. The training objective is the negative log-likelihood of masked tokens: $\mathcal{L}_{\text{tok}} = -\mathbb{E}_{z_\psi}[\log p(z_\psi \mid z, \mathcal{M}(z_\psi))]$ (Eq. 4). The final pixel reconstruction is $\hat{o} = (D_\psi \circ D_{\text{compact}} \circ E_{\text{compact}})(o)$.
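The FSQ step used for discretization can be sketched as follows. This is a generic FSQ illustration, not CompACT's implementation: each channel is bounded, rounded to a fixed grid of levels, and the per-channel indices combine into one integer code, so the codebook is implicit (size `prod(levels)`) with no learned embedding table. The level counts are assumptions.

```python
import numpy as np

def fsq_quantize(x, levels):
    """Finite Scalar Quantization (FSQ) sketch.

    Each channel of x is squashed to a bounded range via tanh, scaled
    to `levels[i]` evenly spaced values, and rounded; the per-channel
    indices are combined mixed-radix into a single discrete code.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    # Bound each channel, scale onto the level grid, and round.
    bounded = np.tanh(x) * half
    idx = np.round(bounded) + half          # per-channel index in [0, L_i)
    # Mixed-radix combination of per-channel indices -> one integer code.
    radices = np.cumprod(np.concatenate(([1], levels[:-1])))
    return (idx * radices).sum(axis=-1).astype(int)

# Example: 4 channels with 5 levels each -> 5**4 = 625 possible codes.
code = fsq_quantize(np.array([0.3, -1.2, 0.0, 2.0]), levels=[5, 5, 5, 5])
```

Because rounding is the only non-differentiable step, FSQ is typically trained with a straight-through gradient estimator, which is omitted here for brevity.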

3. World Model Training: The world model $f_\phi$ is trained in the discrete CompACT latent space using masked generative modeling:

$$\mathcal{L}_{\text{world}} = -\mathbb{E}_{z_t, a_t, z_{t+1}}[\log p(z_{t+1} \mid z_t, a_t, \mathcal{M}(z_{t+1}))] \quad \text{(Eq. 5)}$$

Two architectures are used: an autoregressive DiT-based model for navigation and a block-causal transformer for multi-frame robotic manipulation prediction.
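The masked-token objective in Eq. 5 reduces to a cross-entropy over the next-state tokens that were masked out. The sketch below illustrates that loss computation in isolation, assuming the model has already produced per-slot logits; the token counts and mask pattern are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_world_loss(logits, z_next, mask):
    """Masked-token NLL (Eq. 5 sketch): cross-entropy on the
    next-state tokens that were masked, given model logits.

    logits: (N, K) unnormalized scores for each of N token slots
    z_next: (N,) ground-truth token indices in [0, K)
    mask:   (N,) boolean, True where the token was masked (predicted)
    """
    # Numerically stable log-softmax over the K-way token vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(z_next)), z_next]
    # Average the NLL over masked positions only.
    return nll[mask].mean()

# Toy usage: N = 8 compact tokens, K = 64 codes, half the slots masked.
N, K = 8, 64
logits = rng.normal(size=(N, K))
z_next = rng.integers(0, K, size=N)
mask = np.arange(N) % 2 == 0
loss = masked_world_loss(logits, z_next, mask)
```

In practice the logits would come from the action-conditioned transformer (conditioned on $z_t$, $a_t$, and the partially masked $z_{t+1}$), and the mask would be resampled each training step.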

Empirical Validation / Results

Tokenizer Reconstruction Quality (Table 1): CompACT achieves competitive reconstruction performance (rFID, IS) with extreme compression.

| Model | Type | #tok | rFID ↓ | IS ↑ |
|---|---|---|---|---|
| SD-VAE [57] | cont. | 1024 | 0.64 | 223.8 |
| MaskGIT-VQGAN [7] | disc. | 256 | 1.83 | 186.7 |
| TA-TiTok-VQ [39] | disc. | 32 | 3.95 | 219.6 |
| CompACT | disc. | 16 | 2.40 | 209.0 |
| CompACT | disc. | 8 | 3.21 | 207.5 |

Ablation Studies (Table 2): Key findings:

  • Frozen encoder is crucial: Fine-tuning DINOv3 degrades rFID (5.22 vs. 2.40), as it shifts features away from semantics toward reconstruction.
  • Generative decoding is essential: Replacing DcompactD_{\text{compact}} with a feedforward decoder severely degrades quality (rFID 28.80).

Characterizing Latent Tokens:

  • Modular Attention (Fig. 4): Visualization shows each compact token attends to semantically coherent regions (objects, building structures, robot end-effectors).
  • Action Relevancy (Table 3): An Inverse Dynamics Model (IDM) trained on CompACT latents (16 tokens) outperforms one trained on 256-token VQGAN latents, indicating better preservation of action-critical information.

| Tokenizer | #tok | L1 err ↓ | $R^2$ |
|---|---|---|---|
| Target tokenizer [7] | 256 | 0.093 | 0.684 |
| CompACT | 16 | 0.091 | 0.716 |

Planning Performance (Table 4): On RECON and SCAND navigation datasets, a world model (NWM) using CompACT tokens achieves comparable accuracy to the SD-VAE baseline (784 tokens) with ~40× faster planning.

| Tokenizer | #tok | RECON ATE ↓ | SCAND ATE ↓ | Latency (s) ↓ |
|---|---|---|---|---|
| SD-VAE [57] | 784 | 1.262 | 1.065 | 178.78 |
| FlexTok [2] | 64 | 1.484 | 1.578 | 16.68 |
| CompACT | 16 | 1.330 | 1.358 | 5.78 |
| CompACT | 8 | 1.373 | 1.391 | 4.83 |

Design Choice Analysis (Table 5):

  • History masking during world model training improves planning accuracy.
  • Using a latent-space cost function (vs. pixel-space LPIPS) offers a significant speedup (2.15s vs. 5.78s) with a minor accuracy drop.
  • Freezing the vision encoder during tokenizer training is critical for planning performance.
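The speedup from a latent-space cost comes from skipping the decode-to-pixels step entirely. The sketch below is a hypothetical illustration of that idea: it compares predicted and goal states through token embeddings looked up in a codebook, whereas a pixel-space cost would first decode both states and then run a perceptual metric such as LPIPS. The embedding-distance metric itself is an assumption, not the paper's exact cost.

```python
import numpy as np

def latent_cost(z_pred, z_goal, codebook):
    """Hypothetical latent-space planning cost: mean L2 distance
    between the embeddings of predicted and goal discrete tokens.
    Avoids decoding to pixels, so it is much cheaper per candidate.

    z_pred, z_goal: (N,) integer token indices
    codebook:       (K, d) embedding vectors for the K codes
    """
    e_pred = codebook[z_pred]   # (N, d) embeddings of predicted tokens
    e_goal = codebook[z_goal]
    return float(np.linalg.norm(e_pred - e_goal, axis=-1).mean())
```

Identical token sequences give cost 0; any token mismatch contributes the embedding distance of the differing codes, so the cost degrades gracefully rather than being all-or-nothing.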

Action-Conditioned Video Prediction (Table 6 & Fig. 6): On RoboNet, CompACT enables video prediction with significantly lower Action Prediction Error (APE) and faster generation than the 256-token baseline, confirming its superiority in modeling action-driven dynamics.

| Model | #tok | APE ↓ | Latency (s) ↓ |
|---|---|---|---|
| Target tokenizer [7] | 256 | 0.338 | 33.826 |
| CompACT | 16 | 0.112 | 20.740 |

Theoretical and Practical Implications

  • Theoretical: Challenges the prevailing notion that planning requires photorealistic world models. Demonstrates that extreme compression guided by semantic priors can yield representations more suitable for planning by filtering out perceptually rich but decision-irrelevant noise.
  • Practical: Provides a practical path toward real-time deployment of world model-based planners in robotics and autonomous systems. The orders-of-magnitude reduction in planning latency (from minutes to seconds) bridges a critical gap between research and real-world application. The method's modular latent structure may also facilitate interpretability and robustness.

Conclusion

CompACT introduces a paradigm shift in designing tokenizers for world models by prioritizing extreme compression and semantic preservation over reconstruction fidelity. By leveraging frozen vision foundation models and generative decoding, it successfully compresses images into as few as 8 discrete tokens while retaining information crucial for planning. World models operating in this compact space achieve competitive performance with dramatic speedups, validating that effective planning is more about semantic abstraction than pixel-perfect simulation. Future work may explore extending this principle to other modalities and more complex, long-horizon reasoning tasks.