WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
Summary (Overview)
- Proposes WildWorld: A large-scale, action-conditioned video dataset for world modeling, automatically collected from the AAA game Monster Hunter: Wilds. It contains over 108 million frames with synchronized, per-frame ground-truth annotations including RGB, depth, camera pose, character skeletons, world states (e.g., health, location), and a rich action space of over 450 semantically meaningful actions.
- Introduces WildBench: A benchmark derived from WildWorld for evaluating interactive world models. It features two novel metrics: Action Following (agreement between generated videos and input actions) and State Alignment (accuracy of state transitions, measured via skeletal keypoint tracking).
- Highlights Key Findings: Extensive experiments on WildBench reveal that current models struggle with semantically rich actions and maintaining long-horizon state consistency. The results demonstrate the utility of the dataset and the need for state-aware video generation to improve interactive world modeling.
Introduction and Theoretical Foundation
Understanding and predicting world evolution from observations is a central goal in AI, often framed through dynamical systems theory and reinforcement learning. In these frameworks, the world evolves through latent-state dynamics driven by actions, with visual observations being partial projections of the true state. Learning predictive world models therefore requires inferring these latent states and modeling their action-conditioned transitions, which is crucial for planning and long-horizon reasoning.
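The latent-state view above can be written compactly as a partially observed dynamical system (generic notation for illustration, not taken from the paper):

```latex
s_{t+1} = f(s_t, a_t), \qquad o_t = g(s_t)
```

Here $s_t$ is the hidden world state, $a_t$ the action, and $o_t$ the visual observation; learning a predictive world model amounts to inferring $s_t$ from $o_{\le t}$ and fitting the action-conditioned transition $f$.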
Recent progress in video generation has led to models that attempt to learn environment dynamics from large-scale video data by predicting future frames conditioned on past observations and actions. However, existing datasets are insufficient for learning structured, action-conditioned dynamics because they typically:
- Provide only simple action annotations (e.g., basic movement) with limited semantic meaning.
- Tie actions directly to observable pixel-level changes (e.g., "move left" changes the viewpoint), rather than having them mediated by underlying implicit state transitions.
For example, the action "shoot" affects an internal state variable like "remaining ammunition," which cannot be reliably inferred from pixels alone but critically determines future visual outcomes (e.g., no projectile when ammo is zero). This entanglement makes it difficult for models to disentangle state transitions from observation variations, hindering the learning of stable, interpretable dynamics and leading to poor long-horizon prediction where errors accumulate.
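The ammunition example can be made concrete with a toy transition function (hypothetical names; a minimal sketch, not the paper's formalism): the same action yields different observations depending on a hidden state variable, which is exactly what pixel-only models fail to capture.

```python
# Minimal sketch: the same "shoot" action produces different observations
# depending on a hidden state variable (ammo), so the dynamics cannot be
# learned from pixels alone.
def step(state, action):
    """Transition the hidden state and emit an observation."""
    state = dict(state)
    if action == "shoot":
        if state["ammo"] > 0:
            state["ammo"] -= 1
            obs = "projectile_fired"   # visible effect
        else:
            obs = "no_effect"          # same action, different outcome
    else:
        obs = "idle"
    return state, obs

s = {"ammo": 1}
s, o1 = step(s, "shoot")   # ammo 1 -> 0, projectile fired
s, o2 = step(s, "shoot")   # ammo now 0, nothing happens
```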
WildWorld addresses this gap by providing a large-scale dataset with explicit state annotations, enabling the study of action-conditioned dynamics where actions manifest through state transitions, not just pixel changes.
Methodology
The curation of the WildWorld dataset involves a four-part pipeline: Data Acquisition Platform, Automated Gameplay Pipeline, Data Processing, and Caption Annotation.
1. Data Acquisition Platform
A dedicated platform was built to record three categories of interaction data from Monster Hunter: Wilds:
- Actions: Player control inputs.
- States: Underlying world evolution (player/monster location, rotation, velocity, animation ID, health, resources).
- Observations: Visual manifestations (RGB frames, depth maps, camera intrinsic/extrinsic parameters). HUD elements were removed via shader disabling.
2. Automated Gameplay Pipeline
- Automated Gameplay System: Uses the game's UI components and built-in rule-based companion AI (behavior trees) to programmatically navigate menus, select quests, and execute combat without human input, ensuring diverse coverage.
- Recording System: A custom system based on OBS Studio and Reshade partitions the screen to simultaneously record RGB (lossy HEVC, ~16 Mbps) and depth (lossless) streams. All data sources (text-based actions/states and video streams) have embedded timestamps for synchronization.
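Because the text logs and video streams only share embedded timestamps, synchronization reduces to matching each frame with the most recent log entry. A sketch of that alignment (assumed logic, not the paper's exact implementation):

```python
import bisect

# Sketch of timestamp-based synchronization: each video frame is matched
# to the most recent action/state log entry at or before its timestamp.
def align(frame_ts, log):
    """frame_ts: sorted frame timestamps (ms); log: sorted (ts, payload) pairs."""
    log_ts = [t for t, _ in log]
    aligned = []
    for ts in frame_ts:
        i = bisect.bisect_right(log_ts, ts) - 1
        aligned.append(log[i][1] if i >= 0 else None)
    return aligned

states = align([0, 33, 66], [(0, "idle"), (40, "attack")])
# frames at 0 and 33 ms map to "idle"; the frame at 66 ms maps to "attack"
```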
3. Data Processing and Annotation Pipeline
Raw recordings were filtered to remove low-quality samples:
- Duration: Discard samples < 81 frames.
- Temporal Continuity: Discard samples with frame gaps > 1.5x target interval (~50 ms at 30 FPS).
- Luminance: Remove samples with >15 consecutive frames of extreme brightness.
- Camera Occlusion: Remove samples where camera-character distance is abnormally small due to foreground blockage.
- Character Occlusion: Remove samples where projected skeletal overlap between characters exceeds 30%.
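The five filters above can be sketched as a single predicate (thresholds come from the text; per-frame field names are assumptions for illustration):

```python
# Sketch of the filtering rules; each sample is a dict of per-frame records.
FPS, MIN_FRAMES = 30, 81
MAX_GAP_MS = 1.5 * (1000 / FPS)          # ~50 ms at 30 FPS
MAX_BRIGHT_RUN, MAX_OVERLAP = 15, 0.30

def keep(sample):
    ts = sample["timestamps_ms"]
    if len(ts) < MIN_FRAMES:                                  # duration
        return False
    if any(b - a > MAX_GAP_MS for a, b in zip(ts, ts[1:])):   # continuity
        return False
    run = 0
    for extreme in sample["extreme_luminance"]:               # luminance
        run = run + 1 if extreme else 0
        if run > MAX_BRIGHT_RUN:
            return False
    if min(sample["cam_char_dist"]) < sample["occlusion_dist_floor"]:
        return False                                          # camera occlusion
    if max(sample["skeleton_overlap"]) > MAX_OVERLAP:         # character occlusion
        return False
    return True

good = {"timestamps_ms": [i * 33.0 for i in range(100)],
        "extreme_luminance": [False] * 100,
        "cam_char_dist": [5.0] * 100,
        "occlusion_dist_floor": 1.0,
        "skeleton_overlap": [0.1] * 100}
too_short = dict(good, timestamps_ms=good["timestamps_ms"][:10])
```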
Hierarchical Caption Annotations: Each sample was segmented into action sequences (where action ID is constant). For each sequence, frames were sampled at 1 FPS and captioned using Qwen3-VL-235B, with action/state ground-truth provided as context. Sample-level captions were then generated by summarizing these sequence captions with Gemini 3 Flash.
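The segmentation step can be sketched as splitting the per-frame action-ID stream into maximal constant runs, then sampling caption frames at 1 FPS within each run (assumed representation, not the paper's exact code):

```python
# Split a per-frame action-ID stream into maximal runs of constant ID,
# then sample frame indices at ~1 FPS within each run for captioning.
def segment(action_ids, fps=30):
    runs, start = [], 0
    for i in range(1, len(action_ids) + 1):
        if i == len(action_ids) or action_ids[i] != action_ids[start]:
            caption_frames = list(range(start, i, fps))  # ~1 FPS sampling
            runs.append((action_ids[start], start, i, caption_frames))
            start = i
    return runs

segs = segment([7] * 45 + [9] * 60, fps=30)
# -> two runs: action 7 over frames [0, 45), action 9 over frames [45, 105)
```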
4. Dataset Statistics
| Statistic Category | Details |
|---|---|
| Scale | 108 million frames, 119 annotation columns per frame. |
| Entity Diversity | 29 monster species, 4 player characters, 4 weapon types (near-uniform distribution for characters/weapons, long-tail for monsters). |
| Scene Complexity | 5 distinct stages (deserts, mountains, etc.), varying weather/time. 66% combat clips, 34% traversal clips. |
| Temporal Dynamics | Majority of clips span 4k-28k frames; some exceed 40k frames (>30 mins). |
| Action Richness | Character: 5,960 unique action triplets (weapon, bank, motion) across 455 motion IDs. Monster: 2,132 unique action pairs across 527 motion IDs. Distribution is long-tailed. |
5. WildBench Benchmark Construction
A manually curated set of 200 representative samples was selected, covering diverse difficulty, scenarios, character/monster types, and events (skill usage, knockdowns, etc.). 100 samples involve player+NPC cooperation, 100 are 1v1 combat.
Evaluation Metrics:
- Video Quality: Uses VBench metrics (Motion Smoothness, Dynamic Degree, Aesthetic Quality, Image Quality).
- Camera Control: Measures the discrepancy between ground-truth and estimated (via ViPE) camera trajectories using Absolute Trajectory Error (ATE) and Relative Pose Error (RPE).
- Action Following: At the action-sequence level, Gemini 3 Flash judges whether the generated and ground-truth clips express the same action. The score is 1 if consistent and 0 otherwise, averaged over segments. Actions are categorized (movement, fast displacement, attack) so that each category gets a tailored prompt.
- State Alignment: Uses character skeleton poses as a state proxy. Ground-truth 2D keypoint trajectories are obtained by projecting the 3D skeletons; for generated videos (where the first frame is ground truth in the image-to-video setting), keypoints are tracked with TAPNext. The score is the mean coordinate accuracy: for each keypoint, the fraction of frames where the predicted location falls within a pixel threshold of the ground truth, averaged over thresholds and keypoints.
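The coordinate-accuracy score can be sketched as follows (the specific threshold values are an assumption, chosen in the style of point-tracking metrics; the paper's exact thresholds may differ):

```python
import numpy as np

# Sketch of the coordinate-accuracy score used for State Alignment:
# the fraction of (frame, keypoint) predictions within each pixel
# threshold of the ground truth, averaged over thresholds.
def coord_accuracy(pred, gt, thresholds=(1, 2, 4, 8, 16)):
    """pred, gt: (frames, keypoints, 2) arrays of 2D keypoint positions."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (frames, keypoints)
    accs = [(dist < t).mean() for t in thresholds]   # fraction within t px
    return float(np.mean(accs))

gt = np.zeros((4, 3, 2))
pred = gt + 3.0                      # every point off by ~4.24 px
score = coord_accuracy(pred, gt)     # within the 8 and 16 px thresholds only
```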
6. Experimental Approaches
Models were trained on WildWorld at 544×960 resolution, 81 frames/sample, 16 FPS.
- Baseline: Wan2.2-TI2V-5B (text-to-video).
- CamCtrl: Fine-tuned Wan2.2-Fun-5B-Control-Camera conditioned on ground-truth camera poses.
- SkelCtrl: Fine-tuned Wan2.2-Fun-5B-Control conditioned on skeleton videos (rendered from ground-truth 3D keypoints).
- StateCtrl: State-aware model based on CamCtrl. States are divided into discrete (embedding) and continuous (MLP), modeled hierarchically (entity-level & global-level) via a transformer to produce a unified state embedding. This embedding is injected into DiT layers. Includes a state decoder (reconstruction loss) and a state predictor (next-state prediction loss).
- StateCtrl-AR: Autoregressive variant of StateCtrl that uses only the first-frame ground-truth state and autoregressively predicts subsequent states for conditioning.
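The discrete/continuous split in StateCtrl can be sketched at the entity level (dimensions, the embedding-table size, and the single linear projection standing in for the MLP are all illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of StateCtrl's state encoding: discrete states go through an
# embedding table, continuous states through a learned projection, and
# the results combine into one per-entity state embedding.
D = 32
embed_table = rng.normal(size=(600, D))      # e.g. one row per animation ID
W_cont = rng.normal(size=(4, D))             # projects (x, y, z, health)

def entity_state_embedding(animation_id, continuous):
    discrete_emb = embed_table[animation_id]          # (D,)
    continuous_emb = np.asarray(continuous) @ W_cont  # (D,)
    return discrete_emb + continuous_emb              # unified embedding

e = entity_state_embedding(455, [1.0, 0.5, -2.0, 0.8])
# entity-level embeddings would then pass through a transformer for the
# global-level state and be injected into the DiT layers
```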
Empirical Validation / Results
Metric Validation: The proposed Action Following metric achieved 85% agreement with human judgments. The State Alignment metric achieved 43.23% coordinate accuracy when evaluated on ground-truth videos (using tracked keypoints), demonstrating its reliability.
Overall Evaluation on WildBench:
Table 1: Comparison of interactive video generation approaches on WildBench. Lower is better for ATE/RPE; higher for others.
| Method | Video Quality (MS/DD/AQ/IQ) | Camera Control (ATE↓ / RPE↓) | Action Following | State Alignment |
|---|---|---|---|---|
| Baseline | 96.38 / 99.00 / 50.81 / 65.62 | 4.63 / 0.18 | 53.77 | 11.29 |
| CamCtrl | 97.85 / 97.00 / 48.29 / 62.88 | 2.02 / 0.13 | 83.46 | 15.18 |
| SkelCtrl | 97.85 / 95.00 / 47.92 / 62.43 | 2.55 / 0.10 | 92.81 | 22.03 |
| StateCtrl | 97.45 / 99.00 / 50.86 / 67.78 | 0.94 / 0.07 | 85.66 | 16.06 |
| StateCtrl-AR | 97.43 / 99.00 / 50.90 / 67.76 | 1.01 / 0.08 | 74.66 | 16.13 |
Key Findings:
- All WildWorld-trained approaches improve over baseline on interaction-related metrics (Camera Control, Action Following, State Alignment), demonstrating the dataset's utility.
- VBench metrics appear saturated (MS, DD >95% for all), yet actual motion/dynamics capability differs greatly (per Action Following/State Alignment), highlighting the need for more nuanced benchmarks like WildBench.
- Trade-off with visual signals: SkelCtrl (visual skeleton input) yields larger gains on interaction metrics, but at the cost of lower AQ/IQ scores compared to StateCtrl (learned soft embeddings). Qualitatively, StateCtrl generates clearer subjects, while SkelCtrl better reproduces occlusion and effect details.
- Promise and challenge of autoregression: StateCtrl-AR achieves performance comparable to StateCtrl but shows a noticeable drop in Action Following, attributed to error accumulation in iterative state prediction, a known challenge in autoregressive generation.
Theoretical and Practical Implications
- Theoretical: The work formalizes the need to bridge the gap between latent-state dynamical systems theory and practical video generation by providing data with explicit state annotations. This enables research into disentangling action effects from pixel changes and modeling state transitions directly.
- Practical: WildWorld serves as a valuable foundation for building, training, and evaluating state-aware interactive world models, which are crucial for applications like AI-native games, embodied AI, and robotic planning. The WildBench benchmark provides concrete, fine-grained metrics (Action Following, State Alignment) beyond perceptual quality, guiding the development of more controllable and consistent models.
Conclusion
WildWorld is a large-scale dataset with explicit state annotations, automatically curated from a photorealistic game, designed to advance action-conditioned world modeling. It provides a rich action space and diverse ground-truth annotations. The derived WildBench benchmark reveals that current models face significant challenges in handling semantically rich actions and maintaining long-horizon state consistency. These findings underscore the importance of incorporating explicit state information to improve action-conditioned video generation and develop more robust world models. The dataset and benchmark are released to facilitate future research.