Summary (Overview)
- Introduces ArcANE (Arc-Aware Narrative Evaluation), a novel benchmark for role-playing language agents (RPLAs) that measures whether responses align with a character’s evolving psychology across narrative timepoints, rather than with a static persona.
- Constructs 544 Character Arcs and 4,601 probes across 17 novels and 80 principal characters, covering both in-scenario and out-of-narrative situations.
- Evaluates six models (Qwen3-8B/32B, DeepSeek-V4-Flash/Pro, ArcANE-8B/32B) under six context modes; Arc-grounded context consistently outperforms all baselines, with the largest gains on Out-of-World scenarios where retrieval has nothing to find.
- Fine-tunes ArcANE-8B and ArcANE-32B on the training split, further widening the Arc advantage especially on out-of-source scenarios (e.g., +12.5 points for ArcANE-32B-DPO on Out-of-World).
- Includes human validation (87.1% plausibility) and ablation studies confirming the lift comes from per-phase content, not structural leakage or register artifacts.
Introduction and Theoretical Foundation
Role-playing language agents (RPLAs) are expected to portray characters whose values and behavior evolve as the story progresses, rather than maintaining a fixed persona. Existing benchmarks (e.g., TimeCHARA, LifeChoice) primarily test factual recall or static trait consistency, and so fail to assess whether responses shift appropriately with the character’s psychological trajectory.
The authors ground their approach in McAdams’ Layer 2 of personality (contextualized traits expressed at the right moment), moving beyond Layer 1 (stable trait inventories). They focus on novels as testbeds because they provide extended narratives with rich internal description and explicit temporal structure. The core intuition: posing the same question at different points in the story should elicit different responses.
The paper introduces two key constructs:
- Character Arc: aligns key events with evolving psychological states, organized into phase-segmented trajectories along a psychological axis (e.g., Harry Potter’s moral axis from Punitive Justice to Empathic Forgiveness).
- Probe: a scenario-question pair with phase-specific reference responses, testing whether the model expresses the right phase at the right moment. Scenarios fall into three categories of increasing difficulty:
  - In-Scenario: lifted from a verbatim passage.
  - In-World: plausible unwritten situation within the source’s setting.
  - Out-of-World: transposed to a non-source era, requiring pure arc understanding.
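The two constructs can be pictured as plain data records. The sketch below is only an illustration: the class and field names are assumptions, not the paper's schema, and the example instance uses just the two endpoint phases named above with placeholder text in place of real reference responses.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    IN_SCENARIO = "in-scenario"    # lifted from a verbatim passage
    IN_WORLD = "in-world"          # unwritten but plausible within the source setting
    OUT_OF_WORLD = "out-of-world"  # transposed to a non-source era


@dataclass
class Phase:
    index: int         # position along the trajectory, 0 = earliest
    label: str         # e.g. "Punitive Justice"
    description: str   # per-phase prose describing the psychological state
    key_events: list[str] = field(default_factory=list)


@dataclass
class CharacterArc:
    character: str
    axis: str          # psychological axis the phases move along
    phases: list[Phase]


@dataclass
class Probe:
    arc: CharacterArc
    category: Category
    scenario: str
    question: str
    target_phase: int
    references: dict[int, str]   # phase index -> reference response

    def reference_for(self, phase_index: int) -> str:
        """Return the reference response expected at the given phase."""
        return self.references[phase_index]


# Hypothetical instance; all angle-bracket strings are placeholders.
arc = CharacterArc(
    character="Harry Potter",
    axis="moral",
    phases=[
        Phase(0, "Punitive Justice", "<per-phase prose>"),
        Phase(1, "Empathic Forgiveness", "<per-phase prose>"),
    ],
)
probe = Probe(
    arc=arc,
    category=Category.OUT_OF_WORLD,
    scenario="<scenario text>",
    question="<question text>",
    target_phase=1,
    references={0: "<phase 0 reference>", 1: "<phase 1 reference>"},
)
print(probe.reference_for(probe.target_phase))
```

Keeping one reference per phase on the probe is what lets the same question be scored differently depending on where in the story it is asked.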
Methodology
3.1 Character Arc Construction (Figure 2, top)
Three stages:
- Candidate Generation: The novel is processed through two independent chapter-level streams:
  - Event stream: extracts psychologically impactful events.
  - State stream: emits cross-sectional psychological profiles.
  Both streams induce candidate axes: intrapersonal (beliefs, motives, coping) and relational (trust, esteem, intimacy, antagonism), grounded in literary/psychological scholarship.
- Reconciliation: An analyst LLM matches candidates from both streams. Matched pairs are merged; unmatched ones are classified as missed axes (kept), ambiguous (flagged), or artifacts (discarded).
- Validation and Dataset Split: For evaluation novels, three human annotators independently assess each axis; only those deemed valid by ≥2 annotators enter the evaluation set. A three-perspective LLM critic ensemble (structuralist, depth-psychological, historical-cultural) provides reference ratings.
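The control flow of reconciliation and the ≥2-of-3 annotator rule can be sketched as follows. This is an assumption-laden outline: in the paper the matching is done by an analyst LLM, whereas `same_axis` here is a hypothetical stand-in (case-insensitive string equality), and the later classification of unmatched candidates into missed, ambiguous, or artifact is left out because it is a model judgement.

```python
def reconcile(event_axes, state_axes, same_axis):
    """Pair candidate axes from the two streams.

    Returns (merged, unmatched): merged holds (event, state) pairs,
    unmatched holds candidates that found no partner in the other stream.
    """
    merged, unmatched = [], []
    remaining = list(state_axes)
    for ev in event_axes:
        match = next((st for st in remaining if same_axis(ev, st)), None)
        if match is None:
            unmatched.append(ev)
        else:
            remaining.remove(match)      # each state candidate is used once
            merged.append((ev, match))
    unmatched.extend(remaining)
    return merged, unmatched


def is_valid(votes, min_agree=2):
    """An axis enters the evaluation set if at least min_agree annotators accept it."""
    return sum(votes) >= min_agree


# Toy candidates, invented for illustration only.
merged, unmatched = reconcile(
    ["trust in mentor", "fear of failure"],
    ["Trust in Mentor", "grief"],
    same_axis=lambda a, b: a.lower() == b.lower(),
)
print(merged, unmatched)
print(is_valid([True, True, False]), is_valid([True, False, False]))
```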
3.2 Probe Generation (Figure 2, bottom)
Three stages:
- Per-arc preparation: Extract a behavioral contrast (a yes/no decision that differs across phases), life-stage tags (child, adolescent, etc.), and an era-agnostic axis for Out-of-World probes.
- Probe drafting: For each (target phase, category) pair, an LLM drafts one probe with phase responses. The target phase’s response reflects actual behavior; the others are counterfactual projections grounded in the arc.
- Validation and filtering:
  - Per-response checks: Q-Voice (stays in character, avoids anachronism) and Q-PhaseFit (blind judge matches phase).
  - Per-probe checks: Q-Anchor/Q-World (setting rules respected) and Q-Discrim (adjacent phase responses sufficiently separated).
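The two-level filter (per-response checks, then per-probe checks) could be wired roughly as below. The checks in the paper are judge-based; the two stub functions here are hypothetical placeholders that only mimic their shape, and the draft probes are invented toy text.

```python
def q_voice_stub(phase: int, text: str) -> bool:
    # Stand-in for Q-Voice: this stub only rejects empty responses.
    return bool(text.strip())


def q_discrim_stub(probe: dict) -> bool:
    # Stand-in for Q-Discrim: adjacent phases must not share identical text.
    ordered = [probe["responses"][k] for k in sorted(probe["responses"])]
    return all(a != b for a, b in zip(ordered, ordered[1:]))


def filter_probes(probes, response_checks, probe_checks):
    """Keep a probe only if every response passes every per-response check
    and the probe as a whole passes every per-probe check."""
    kept = []
    for probe in probes:
        responses_ok = all(
            check(phase, text)
            for phase, text in probe["responses"].items()
            for check in response_checks
        )
        if responses_ok and all(check(probe) for check in probe_checks):
            kept.append(probe)
    return kept


drafts = [
    {"id": "p1", "responses": {0: "I would report him.", 1: "I would hear him out."}},
    {"id": "p2", "responses": {0: "I would leave.", 1: "I would leave."}},  # not discriminative
    {"id": "p3", "responses": {0: "I would stay.", 1: "  "}},               # empty response
]
kept = filter_probes(drafts, [q_voice_stub], [q_discrim_stub])
print([p["id"] for p in kept])
```

Passing the checks in as lists keeps the pipeline independent of how each check is implemented, so a judge-model call could replace a stub without changing `filter_probes`.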
3.3 Dataset
| Slice | Novels | Characters | Arcs | Probes |
|---|---|---|---|---|
| Training | 10 | – | – | 2,545 |
| Validated Evaluation | 5 | 25 | 205 | 1,754 |
| Low-popularity (unvalidated) | 2 | – | – | 302 |
| Total | 17 | 80 | 544 | 4,601 |
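As a quick consistency check, the per-slice novel and probe counts in the table can be summed and compared with the Total row. The dictionary below simply transcribes the table; the key names are illustrative.

```python
# Counts transcribed from the dataset table above.
slices = {
    "Training": {"novels": 10, "probes": 2545},
    "Validated Evaluation": {"novels": 5, "probes": 1754},
    "Low-popularity (unvalidated)": {"novels": 2, "probes": 302},
}

total_novels = sum(s["novels"] for s in slices.values())
total_probes = sum(s["probes"] for s in slices.values())

print(total_novels, total_probes)
```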
3.4 Training Arc-aware RPLAs
Two-stage pipeline on the training set:
- SFT: learn response format.
- DPO: distinguish ground-truth behavior from temporally displaced alternatives (same character, different phase).
Models: ArcANE-8B and ArcANE-32B, fine-tuned from Qwen3-8B and Qwen3-32B.
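The DPO stage contrasts the ground-truth response with responses from other phases of the same character. A minimal sketch of how such preference pairs might be assembled is shown below; the `prompt`/`chosen`/`rejected` record layout is a common convention for preference data and is assumed here, since the summary does not specify the training format. All strings are placeholders.

```python
def build_dpo_pairs(prompt, references, target_phase):
    """Pair the target-phase response (chosen) with every temporally
    displaced response from another phase (rejected)."""
    chosen = references[target_phase]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": text}
        for phase, text in sorted(references.items())
        if phase != target_phase
    ]


# Hypothetical three-phase probe with placeholder responses.
references = {
    0: "<phase 0 response>",
    1: "<phase 1 response>",
    2: "<phase 2 response>",
}
pairs = build_dpo_pairs("<scenario + question>", references, target_phase=1)
for pair in pairs:
    print(pair["chosen"], "preferred over", pair["rejected"])
```

Because the rejected responses are in character but at the wrong point in the story, the preference signal targets timing rather than general persona quality.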
Empirical Validation / Results
4.3 Main Results (Table 2)
Six models × six context modes on the validated slice. Context modes: Vanilla, Summary, RAG, LifeChoice, TimeCHARA, Arc (ours). Key metrics:
- APF: Action Phase-Fidelity (overt action alignment)
- RPF: Reasoning Phase-Fidelity (reasoning mechanism alignment)
- RAE: Reasoning–Action Entailment
- PTF: Phase Trajectory Fidelity (judges all phase responses as a sequence)
Key finding: Arc leads Overall on every model. Example (DeepSeek-V4-Pro):
- Arc Overall: 62.4
- Best non-Arc (LifeChoice): 57.7 → gap +4.7
- On Out-of-World: Arc = 64.1; LifeChoice = 56.7 → gap +7.4
The gap widens from In-Scenario → In-World → Out-of-World, because retrieval methods have nothing to find in Out-of-World scenarios, while Arc supplies the character’s state via the arc.
Table 2 (condensed, showing Overall and Out-of-World for key models):
| Model | Mode | Overall | Out-of-World |
|---|---|---|---|
| DeepSeek-V4-Flash | Arc | 59.7 | 62.2 |
| DeepSeek-V4-Flash | Best non-Arc | 56.1 | 57.2 |
| DeepSeek-V4-Pro | Arc | 62.4 | 64.1 |
| DeepSeek-V4-Pro | Best non-Arc | 57.7 | 56.7 |
| Qwen3-8B | Arc | 43.1 | 47.4 |
| Qwen3-8B | Best non-Arc | 40.9 | 42.2 |
| Qwen3-32B | Arc | 50.1 | 54.8 |
| Qwen3-32B | Best non-Arc | 47.4 | 48.6 |
| ArcANE-8B (DPO) | Arc | 56.9 | 62.5 |
| ArcANE-8B (DPO) | Best non-Arc | 48.5 | 49.6 |
| ArcANE-32B (DPO) | Arc | 60.4 | 68.0 |
| ArcANE-32B (DPO) | Best non-Arc | 52.0 | 54.4 |
Full Table 2 in paper includes all per-category metrics.
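The Arc advantage per model can be recomputed directly from the condensed table. The snippet below transcribes the (Overall, Out-of-World) scores and takes the difference between Arc and the best non-Arc row; the variable names are illustrative, and the best non-Arc mode may differ from model to model.

```python
# (Overall, Out-of-World) scores transcribed from the condensed table.
rows = {
    "DeepSeek-V4-Flash": {"arc": (59.7, 62.2), "best_non_arc": (56.1, 57.2)},
    "DeepSeek-V4-Pro":   {"arc": (62.4, 64.1), "best_non_arc": (57.7, 56.7)},
    "Qwen3-8B":          {"arc": (43.1, 47.4), "best_non_arc": (40.9, 42.2)},
    "Qwen3-32B":         {"arc": (50.1, 54.8), "best_non_arc": (47.4, 48.6)},
    "ArcANE-8B (DPO)":   {"arc": (56.9, 62.5), "best_non_arc": (48.5, 49.6)},
    "ArcANE-32B (DPO)":  {"arc": (60.4, 68.0), "best_non_arc": (52.0, 54.4)},
}


def gaps(entry):
    """Return (overall_gap, out_of_world_gap), rounded to one decimal."""
    (arc_overall, arc_oow) = entry["arc"]
    (base_overall, base_oow) = entry["best_non_arc"]
    return round(arc_overall - base_overall, 1), round(arc_oow - base_oow, 1)


table_gaps = {model: gaps(entry) for model, entry in rows.items()}
for model, (overall_gap, oow_gap) in table_gaps.items():
    print(f"{model:18s} overall +{overall_gap:.1f}  out-of-world +{oow_gap:.1f}")
```

For every model in the table the Out-of-World gap exceeds the Overall gap, and the fine-tuned ArcANE models show the largest gaps.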
4.4 Additional Results
- Low-popularity novels: Arc lift carries over to two low-download titles (The Underdogs, East Lynne), with gaps of +4.1 to +15.3 over best non-Arc modes.
- Other RPLA models: Similar Arc lift for HER-32B, CoSER-8B, CoSER-70B (Figure 3). In-World and Out-of-World gains are consistently positive; In-Scenario mixed.
- SFT vs DPO: SFT raises Arc Overall on the 32B model from 50.1 to 58.4; DPO further improves Out-of-World (+12.5 on 32B) but slightly reduces In-Scenario, a tradeoff attributed to its stronger trajectory focus.
5.1 Source-of-effect ablation
- MixedArc (swap with unrelated character’s arc): performance drops below Vanilla in trained models, ruling out structured-context bonus.
- ArcHint (strip per-phase prose to axis+phase label): tracks full Arc within ±2.6 on general models, indicating that the axis and phase label alone suffice for prompting; on ArcANE-32B-DPO, however, ArcHint recovers only half the gain, showing that fine-tuning leverages the full per-phase prose.
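To make the ablation conditions concrete, the sketch below builds the three context variants from a toy arc store. The prompt layout, function names, and arc records are all assumptions for illustration; the paper's actual templates are not given in this summary.

```python
def arc_context(arc, phase):
    """Full Arc condition: axis, phase label, and per-phase prose."""
    p = arc["phases"][phase]
    return f"Axis: {arc['axis']}\nPhase {phase}: {p['label']}\n{p['prose']}"


def arc_hint_context(arc, phase):
    """ArcHint condition: axis and phase label only, prose stripped."""
    return f"Axis: {arc['axis']}\nPhase {phase}: {arc['phases'][phase]['label']}"


def mixed_arc_context(arcs, character, phase):
    """MixedArc condition: same format, but filled from another character's arc."""
    other = next(a for a in arcs if a["character"] != character)
    # Clamp the phase index in case the other arc has fewer phases.
    return arc_context(other, min(phase, len(other["phases"]) - 1))


# Hypothetical arcs with placeholder prose.
arcs = [
    {"character": "A", "axis": "trust", "phases": [
        {"label": "Guarded", "prose": "<prose A0>"},
        {"label": "Open", "prose": "<prose A1>"},
    ]},
    {"character": "B", "axis": "ambition", "phases": [
        {"label": "Driven", "prose": "<prose B0>"},
    ]},
]

print(arc_context(arcs[0], 1))
print(arc_hint_context(arcs[0], 1))
print(mixed_arc_context(arcs, "A", 1))
```

Because MixedArc keeps the structure of the full context and changes only whose content fills it, a drop under MixedArc points to the content rather than the format as the source of the lift.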
5.2 Evaluation validation
- Human annotators judge 87.1% of LLM judge verdicts as plausible (95.7% for Out-of-World).
- Pearson correlation between human and judge re-scores is reported on a subset.
- Cross-judge replication with Claude Sonnet 4.5, Opus 4.5, GPT-5.5 ranks ArcANE-32B-DPO under Arc first across all judges.
5.3 Training-effect analysis
Register control experiment: forcing Qwen3-32B to use first-person present tense lowers its Overall from 53.8 to 50.0, while ArcANE holds at 56.7, which rules out the gain being merely a register artifact. The gain instead comes from category-conditional canon discipline.
Theoretical and Practical Implications
- Theoretical: Advances evaluation of RPLAs from static persona matching to temporal behavioral consistency, operationalizing McAdams’ personality layers in a narrative context. Shows that character evolution can be captured via structured arcs and that models benefit from explicit arc conditioning, especially when source text is absent.
- Practical: Provides an automated pipeline to construct character arcs and probes from any novel, enabling scalable evaluation. ArcANE-8B/32B demonstrate that fine-tuning on arc-grounded data improves trajectory fidelity, making them more suitable for immersive role-playing applications (entertainment, education, interactive storytelling).
- Benchmark design: Out-of-World probes are a critical stress test for RPLAs, revealing that retrieval-based context modes fail when the scenario has no source passage; only arc-aware context supplies the necessary psychological state.
- Limitations: English-only, novel domain, single-character focus, period-bound social attitudes in source texts, potential for misuse (impersonation). Released artifacts intended for research only.
Conclusion
The paper asks whether RPLAs track the right character state at the right narrative timepoint. ArcANE provides a benchmark of 544 Character Arcs and 4,601 probes across 17 novels. Across six models and six context modes, Arc-grounded context achieves the best performance on every model, with the largest gains on out-of-source scenarios. The lift extends to low-popularity novels and to other role-playing model families. Fine-tuned ArcANE-8B and ArcANE-32B further widen the advantage.
Future directions: Multi-turn dialogue where the arc advances through interaction, extension to other media (films, games), and handling character-character interactions.