Summary of "Lost in Stories: Consistency Bugs in Long Story Generation by LLMs"

Summary (Overview)

  • Benchmark Introduction: The paper introduces ConStory-Bench, a benchmark with 2,000 prompts across four task scenarios (Generation, Continuation, Expansion, Completion) designed to evaluate narrative consistency in long-form (8,000-10,000 word) story generation by LLMs.
  • Automated Evaluation Pipeline: The authors develop ConStory-Checker, an automated LLM-as-judge pipeline that detects contradictions and grounds each judgment in explicit textual evidence, using a taxonomy of 5 error categories and 19 fine-grained subtypes.
  • Key Findings on Errors: Evaluation of a wide range of LLMs reveals that consistency errors are most common in the Factual & Detail Consistency and Timeline & Plot Logic categories. Errors tend to appear around the middle of narratives, occur in text segments with higher token-level entropy (indicating model uncertainty), and certain error types (especially factual errors) tend to co-occur.
  • Performance Ranking: GPT-5-Reasoning achieves the best consistency performance (lowest error density), followed by other top proprietary models (Gemini-2.5-Pro, Claude-Sonnet-4.5). Some open-source models (GLM-4.6, Qwen3-32B) show competitive results.
  • Error-Length Relationship: Consistency errors accumulate approximately linearly with output length, but models exhibit highly diverse preferences for how long they generate.

Introduction and Theoretical Foundation

Large Language Models (LLMs) have advanced in generating long-form narratives spanning tens of thousands of words, a capability crucial for applications like content creation and storytelling. However, a critical challenge remains: maintaining narrative consistency—avoiding contradictions in established facts, character traits, world rules, and plot logic across the entire text.

Existing research and benchmarks for story generation primarily focus on plot coherence and fluency, leaving systematic evaluation of global consistency largely unexplored. Furthermore, while LLM-as-a-judge evaluation methods are promising, they often lack explicit textual evidence and interpretable rationales for their judgments.

This paper addresses this gap by framing the problem around five core Research Questions (RQs) concerning the extent, scaling, predictive signals, co-occurrence, and positional distribution of consistency errors in LLM-generated long stories.

Methodology

1. ConStory-Bench Dataset Construction

  • Sources: Seed stories were collected from seven public corpora including LongBench, WritingPrompts, and WikiPlots.
  • Prompt Construction: Using o4-mini, collected stories were rewritten into prompts for four distinct task types:
    • Generation: Produce a free-form narrative from a minimal plot setup.
    • Continuation: Extend an initial story fragment.
    • Expansion: Develop a long story from a concise plot outline.
    • Completion: Write a full story with predefined beginning and ending.
  • Final Dataset: After deduplication and quality filtering, the benchmark contains 2,000 high-quality prompts targeting 8,000–10,000 word outputs.

2. Consistency Error Taxonomy

A hierarchical taxonomy of five top-level categories with 19 fine-grained subtypes was developed:

| Error Category | Sub Error Types (Examples) |
| --- | --- |
| Timeline & Plot Logic | Absolute Time Contradictions, Duration Contradictions, Causeless Effects |
| Characterization | Memory Contradictions, Knowledge Contradictions, Skill Fluctuations |
| World-building & Setting | Core Rules Violations, Social Norms Violations, Geographical Contradictions |
| Factual & Detail Consistency | Appearance Mismatches, Nomenclature Confusions, Quantitative Mismatches |
| Narrative & Style | Perspective Confusions, Tone Inconsistencies, Style Shifts |

3. ConStory-Checker Pipeline

An automated, four-stage LLM-as-judge pipeline, with o4-mini serving as the judge model:

  1. Category-Guided Extraction: Scan the narrative to extract contradiction-prone text spans for each of the five error categories.
  2. Contradiction Pairing: Compare extracted spans pairwise, classifying them as Consistent or Contradictory.
  3. Evidence Chains: For each contradiction, produce a structured record including:
    • Reasoning: Explanation of the contradiction.
    • Evidence: Quoted conflicting text spans with character-level positions.
    • Conclusion: Assigned error type.
  4. JSON Reports: Output standardized JSON for scalable analysis.
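The evidence-chain records from stage 3 can be sketched as a small data model that serializes to the stage-4 JSON report. The field names below are illustrative assumptions, not the paper's exact schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvidenceSpan:
    text: str          # quoted conflicting text span
    start_char: int    # character-level start position in the story
    end_char: int      # character-level end position

@dataclass
class ContradictionRecord:
    reasoning: str                 # explanation of the contradiction
    evidence: list[EvidenceSpan]   # the conflicting spans
    conclusion: str                # assigned error type from the taxonomy

def to_json_report(records: list[ContradictionRecord]) -> str:
    """Serialize evidence chains into a standardized JSON report."""
    return json.dumps([asdict(r) for r in records], indent=2)

# Hypothetical example record:
record = ContradictionRecord(
    reasoning="The hero's eyes are described as blue at first, later as green.",
    evidence=[
        EvidenceSpan("his piercing blue eyes", 1204, 1226),
        EvidenceSpan("her gaze met his green eyes", 48310, 48337),
    ],
    conclusion="Appearance Mismatches",
)
report = to_json_report([record])
```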

4. Evaluation Metrics

Two complementary metrics were introduced to control for output length and prompt difficulty:

  • Consistency Error Density (CED): Normalizes error count by story length (errors per 10k words). For model $m$ on story $i$: $CED_{m,i} = \frac{e_{m,i}}{w_{m,i} / 10000}$. The model-level score is the mean over all $N$ stories: $CED_{m} = \frac{1}{N} \sum_{i=1}^{N} CED_{m,i}$ (lower is better).
  • Group Relative Rank (GRR): Ranks models within each prompt group to account for inherent prompt difficulty. A length-aware quality score is computed per story: $Q_{m,i} = \frac{w_{m,i}}{1 + e_{m,i}}$. Models are ranked by $Q$ within each story $i$, and GRR is the average rank over the stories $I_m$ attempted by model $m$: $GRR_{m} = \frac{1}{N_m} \sum_{i \in I_m} rank_i(Q_{m,i})$ (lower is better).
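Both metrics are straightforward to compute. A minimal sketch with hypothetical model names and numbers (all values below are illustrative, not from the paper):

```python
def ced(errors: int, words: int) -> float:
    """Consistency Error Density: errors per 10k words (lower is better)."""
    return errors / (words / 10000)

def quality(words: int, errors: int) -> float:
    """Length-aware quality score Q = w / (1 + e)."""
    return words / (1 + errors)

def grr_ranks(quality_by_model: dict[str, float]) -> dict[str, int]:
    """Rank models on one story by quality score; rank 1 = best."""
    ordered = sorted(quality_by_model, key=quality_by_model.get, reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}

# One story, three hypothetical models:
q = {
    "model-a": quality(9000, 1),   # long but one error
    "model-b": quality(6000, 0),   # shorter, error-free
    "model-c": quality(8000, 3),   # long, several errors
}
ranks = grr_ranks(q)
```

Averaging these per-story ranks over a model's stories gives its GRR; averaging per-story CED gives the model-level CED.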

Empirical Validation / Results

The study evaluated models across four categories: Proprietary, Open-source, Capability-enhanced (fine-tuned for long-form), and Agent-enhanced systems.

RQ1: Extent and Distribution of Errors

  • GPT-5-Reasoning performed best (CED: 0.113, GRR: 3.05).
  • Factual & Detail Consistency and Timeline & Plot Logic were the dominant error categories across most models.
  • Generation tasks (most open-ended) consistently yielded higher error densities than other task types.
  • Key Result Table (Performance Overview):
| Model | CED ↓ | GRR ↓ | Avg Words | Avg Errors | Total Stories |
| --- | --- | --- | --- | --- | --- |
| GPT-5-Reasoning | 0.113 | 3.05 | 9050 | 0.09 | 1990 |
| Gemini-2.5-Pro | 0.305 | 7.79 | 5584 | 0.16 | 1996 |
| Claude-Sonnet-4.5 | 0.520 | 4.90 | 8929 | 0.37 | 1998 |
| GLM-4.6 (Open-source) | 0.528 | 8.45 | 4949 | 0.18 | 2000 |
| Qwen3-32B (Open-source) | 0.537 | 6.39 | 6237 | 0.27 | 2000 |

RQ2: Error Scaling with Output Length

  • Models exhibited highly diverse length preferences (e.g., GPT-5-Reasoning generated mostly >6K words, while GPT-4o-1120 generated mostly <3K words).
  • Error counts increased approximately linearly with output length across models, with correlation strength varying (e.g., r=0.973 for DeepSeek-V3.2-Exp).

RQ3: Predictive Signals for Errors

  • Token-level entropy was significantly higher in text segments containing errors compared to the whole-text baseline.
  • For Qwen3-4B, error content entropy was 19.24% higher; for Qwen3-30B, it was 12.03% higher.
  • This indicates models are more uncertain when they produce inconsistent content, making entropy a potential early-warning signal.
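Token-level entropy of this kind is computed from the model's per-token predictive distributions. A toy sketch (the distributions below are illustrative; a real analysis would use the model's probabilities over the full vocabulary at each generated token):

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of one token's predictive distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mean_entropy(distributions: list[list[float]]) -> float:
    """Average token-level entropy over a text segment."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# A peaked distribution (model confident) vs. a flatter one (uncertain):
confident = [0.9, 0.05, 0.05]
uncertain = [0.4, 0.3, 0.3]
```

Comparing `mean_entropy` over error-containing segments against a whole-text baseline reproduces the kind of gap the paper reports.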

RQ4: Co-occurrence of Error Types

  • Factual & Detail Consistency errors acted as a central hub, correlating with Characterization (r=0.304), World-building (r=0.255), and Timeline (r=0.176) errors, suggesting shared failure mechanisms.
  • Narrative & Style errors showed near-zero correlation with other categories, indicating they arise from distinct mechanisms.

RQ5: Positional Distribution of Errors

  • Facts are typically established early in the narrative (15-30% position), while contradictions appear later, clustering in the 40-60% range.
  • Geographical contradictions had the largest average gap (31.0%) between fact and contradiction, indicating long-range memory failures.
  • Perspective confusions had the smallest gap (4.7%), suggesting local context failures.
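The positional analysis reduces to comparing character offsets normalized by story length. A small sketch with hypothetical offsets:

```python
def position_pct(char_index: int, story_length: int) -> float:
    """Relative position of a span in the story, as a percentage."""
    return 100 * char_index / story_length

def fact_contradiction_gap(fact_pos: int, contra_pos: int, length: int) -> float:
    """Gap (in % of story length) between where a fact is established
    and where it is later contradicted."""
    return position_pct(contra_pos, length) - position_pct(fact_pos, length)

# Hypothetical 50,000-character story: fact set up at the 20% mark,
# contradicted at the 55% mark -> a 35-point gap, the kind of long-range
# failure the paper reports for geographical contradictions (avg 31.0%).
gap = fact_contradiction_gap(10_000, 27_500, 50_000)
```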

Theoretical and Practical Implications

  • Theoretical: The work provides a formal framework and taxonomy for studying a previously underexplored aspect of long-form generation: systematic consistency. It demonstrates that errors are not random but follow predictable patterns related to model uncertainty, narrative position, and error type interdependencies.
  • Practical:
    • For Model Developers: Highlights specific weaknesses (factual/temporal tracking) to target for improvement. Suggests that enhancing long-range coherence mechanisms is crucial.
    • For Evaluation: Provides a reproducible, evidence-grounded benchmark and automated tool (ConStory-Checker) for assessing narrative consistency at scale.
    • For System Design: The finding that high entropy predicts errors suggests practical mitigation strategies, such as triggering verification routines when local uncertainty exceeds a threshold.
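One way such an entropy-triggered check might look, as a hypothetical sketch rather than the paper's method: monitor a sliding window of token entropies during generation and flag spans whose mean exceeds a threshold, so a verification routine can re-check them.

```python
def flag_uncertain_spans(entropies: list[float], window: int = 5,
                         threshold: float = 2.0) -> list[int]:
    """Return start indices of windows whose mean entropy exceeds threshold.
    The window size and threshold here are illustrative defaults."""
    flagged = []
    for i in range(len(entropies) - window + 1):
        if sum(entropies[i:i + window]) / window > threshold:
            flagged.append(i)
    return flagged

# Synthetic per-token entropies with an uncertain stretch in the middle:
entropies = [0.5, 0.6, 0.4, 3.1, 3.4, 2.9, 3.0, 2.8, 0.5, 0.4]
spans = flag_uncertain_spans(entropies, window=3, threshold=2.0)
```

Flagged windows could then be handed to a checker (e.g., a ConStory-Checker-style pass over just those spans) before generation continues.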

Conclusion

The paper establishes that maintaining narrative consistency in long-form story generation remains a significant challenge for current LLMs. The introduced ConStory-Bench and ConStory-Checker provide the community with tools for systematic evaluation. Key findings include the linear accumulation of errors with length, the predictive power of token entropy, and the positional clustering of contradictions. These insights can guide future efforts to improve coherence in long-context LLM generation. Future work may extend the benchmark to multilingual, cross-cultural, and non-fiction domains.