# Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

> The paper introduces ConStory-Bench, a benchmark revealing that LLMs make the most narrative consistency errors in factual details and plot logic, with errors accumulating approximately linearly as story length increases.

- **Source:** [arXiv](https://arxiv.org/abs/2603.05890)
- **Published:** 2026-03-11
- **Permalink:** https://picx.dev/p/zjiwxd
- **Whiteboard:** https://picx.dev/p/zjiwxd/image

## Summary

## Summary (Overview)
*   **Benchmark Introduction:** The paper introduces **ConStory-Bench**, a benchmark with 2,000 prompts across four task scenarios (Generation, Continuation, Expansion, Completion) designed to evaluate narrative consistency in long-form (8,000-10,000 word) story generation by LLMs.
*   **Automated Evaluation Pipeline:** The authors develop **ConStory-Checker**, an automated LLM-as-judge pipeline that detects contradictions and grounds each judgment in explicit textual evidence, using a taxonomy of 5 error categories and 19 fine-grained subtypes.
*   **Key Findings on Errors:** Across a wide range of LLMs, consistency errors are most common in the **Factual & Detail Consistency** and **Timeline & Plot Logic** categories. Contradictions cluster around the middle of narratives (roughly the 40-60% position) and occur in text segments with higher token-level entropy, a sign of model uncertainty. Certain error types, especially factual errors, tend to co-occur with others.
*   **Performance Ranking:** **GPT-5-Reasoning** achieves the best consistency performance (lowest error density), followed by other top proprietary models (Gemini-2.5-Pro, Claude-Sonnet-4.5). Some open-source models (GLM-4.6, Qwen3-32B) show competitive results.
*   **Error-Length Relationship:** Consistency errors accumulate approximately linearly with output length, while models vary widely in how long their outputs are.

## Introduction and Theoretical Foundation
Large Language Models (LLMs) have advanced in generating long-form narratives spanning tens of thousands of words, a capability crucial for applications like content creation and storytelling. However, a critical challenge remains: maintaining **narrative consistency**—avoiding contradictions in established facts, character traits, world rules, and plot logic across the entire text.

Existing research and benchmarks for story generation primarily focus on **plot coherence and fluency**, leaving systematic evaluation of **global consistency** largely unexplored. Furthermore, while LLM-as-a-judge evaluation methods are promising, they often lack explicit textual evidence and interpretable rationales for their judgments.

This paper addresses this gap by framing the problem around five core Research Questions (RQs) concerning the extent, scaling, predictive signals, co-occurrence, and positional distribution of consistency errors in LLM-generated long stories.

## Methodology

### 1. ConStory-Bench Dataset Construction
*   **Sources:** Seed stories were collected from seven public corpora including LongBench, WritingPrompts, and WikiPlots.
*   **Prompt Construction:** Using `o4-mini`, collected stories were rewritten into prompts for four distinct task types:
    *   **Generation:** Produce a free-form narrative from a minimal plot setup.
    *   **Continuation:** Extend an initial story fragment.
    *   **Expansion:** Develop a long story from a concise plot outline.
    *   **Completion:** Write a full story with predefined beginning and ending.
*   **Final Dataset:** After deduplication and quality filtering, the benchmark contains **2,000 high-quality prompts** targeting 8,000–10,000 word outputs.

### 2. Consistency Error Taxonomy
A hierarchical taxonomy of five top-level categories with 19 fine-grained subtypes was developed. The table lists example subtypes for each category:

| Error Category | Sub Error Types (Examples) |
| :--- | :--- |
| **Timeline & Plot Logic** | Absolute Time Contradictions, Duration Contradictions, Causeless Effects |
| **Characterization** | Memory Contradictions, Knowledge Contradictions, Skill Fluctuations |
| **World-building & Setting** | Core Rules Violations, Social Norms Violations, Geographical Contradictions |
| **Factual & Detail Consistency** | Appearance Mismatches, Nomenclature Confusions, Quantitative Mismatches |
| **Narrative & Style** | Perspective Confusions, Tone Inconsistencies, Style Shifts |

### 3. ConStory-Checker Pipeline
An automated, four-stage LLM-as-judge pipeline (`o4-mini` as judge):
1.  **Category-Guided Extraction:** Scan the narrative to extract contradiction-prone text spans for each of the five error categories.
2.  **Contradiction Pairing:** Compare extracted spans pairwise, classifying them as `Consistent` or `Contradictory`.
3.  **Evidence Chains:** For each contradiction, produce a structured record including:
    *   `Reasoning`: Explanation of the contradiction.
    *   `Evidence`: Quoted conflicting text spans with character-level positions.
    *   `Conclusion`: Assigned error type.
4.  **JSON Reports:** Output standardized JSON for scalable analysis.
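
The pairing, evidence-chain, and JSON stages above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names `Span` and `EvidenceChain`, the helpers `locate` and `pair_spans`, and the rule-based `toy_judge` (a stand-in for the `o4-mini` judge) are all hypothetical, and the story text is invented.

```python
import json
from dataclasses import asdict, dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass
class Span:
    text: str
    start: int  # character offset where the quote begins in the story
    end: int    # character offset just past the quote


@dataclass
class EvidenceChain:
    reasoning: str        # explanation of the contradiction
    evidence: List[Span]  # the two conflicting quotes
    conclusion: str       # assigned error type


def locate(story: str, quote: str) -> Span:
    """Build a Span from the first occurrence of `quote` in `story`."""
    start = story.index(quote)
    return Span(quote, start, start + len(quote))


def pair_spans(
    spans: List[Span],
    judge: Callable[[Span, Span], Tuple[str, str]],
    error_type: str,
) -> List[EvidenceChain]:
    """Compare spans pairwise and keep pairs judged Contradictory."""
    records = []
    for a, b in combinations(spans, 2):
        label, reasoning = judge(a, b)
        if label == "Contradictory":
            records.append(EvidenceChain(reasoning, [a, b], error_type))
    return records


def toy_judge(a: Span, b: Span) -> Tuple[str, str]:
    """Stand-in for the LLM judge: only notices a change of eye colour."""
    colours = {"blue", "green"}
    ca = colours & set(a.text.lower().split())
    cb = colours & set(b.text.lower().split())
    if ca and cb and ca != cb:
        return "Contradictory", "Eye colour differs between the two spans."
    return "Consistent", ""


story = "Mara had blue eyes. She walked home. Her green eyes narrowed."
spans = [locate(story, s) for s in (
    "Mara had blue eyes.", "She walked home.", "Her green eyes narrowed.")]
records = pair_spans(spans, toy_judge, "Appearance Mismatches")
report = json.dumps([asdict(r) for r in records], indent=2)
print(report)
```

Keeping character offsets on every quoted span is what makes each judgment checkable against the story text, which is the property the pipeline description emphasizes.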

### 4. Evaluation Metrics
Two complementary metrics were introduced to control for output length and prompt difficulty:
*   **Consistency Error Density (CED):** Normalizes error count by story length (errors per 10k words). For model `m` on story `i`:
    $$ CED_{m,i} = \frac{e_{m,i}}{w_{m,i} / 10000} $$
    The model-level score averages over all `N` stories (**lower is better**):
    $$ CED_{m} = \frac{1}{N} \sum_{i=1}^{N} CED_{m,i} $$
*   **Group Relative Rank (GRR):** Ranks models within each prompt group to account for inherent difficulty. A length-aware quality score is computed per story:
    $$ Q_{m,i} = \frac{w_{m,i}}{1 + e_{m,i}} $$
    Models are ranked by `Q` within each story `i`, and GRR is the average rank over the `N_m` stories in the set `I_m` for model `m` (**lower is better**):
    $$ GRR_{m} = \frac{1}{N_m} \sum_{i \in I_m} rank_i(Q_{m,i}) $$
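
A small worked example may make the two metrics concrete. The sketch below follows the formulas above; the function names and the toy `(errors, words)` data are invented for illustration. It also assumes that a higher `Q` earns rank 1 and that ties are broken by input order, neither of which is stated in the summary.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Story = Tuple[int, int]  # (error count e, word count w)


def ced(errors: int, words: int) -> float:
    """Errors per 10,000 words for a single story."""
    return errors / (words / 10_000)


def model_ced(stories: List[Story]) -> float:
    """Mean of the per-story CED values."""
    return sum(ced(e, w) for e, w in stories) / len(stories)


def quality(errors: int, words: int) -> float:
    """Length-aware quality score Q = w / (1 + e)."""
    return words / (1 + errors)


def grr(results: Dict[str, Dict[str, Story]]) -> Dict[str, float]:
    """Average within-prompt rank per model.

    Assumption: higher Q is better, so the highest Q gets rank 1.
    """
    ranks = defaultdict(list)
    for per_model in results.values():
        ordered = sorted(
            per_model, key=lambda m: quality(*per_model[m]), reverse=True)
        for rank, model in enumerate(ordered, start=1):
            ranks[model].append(rank)
    return {m: sum(r) / len(r) for m, r in ranks.items()}


# Invented data: two prompts, three hypothetical models.
results = {
    "p1": {"A": (0, 9000), "B": (2, 9000), "C": (1, 4000)},
    "p2": {"A": (1, 8000), "B": (0, 5000), "C": (3, 8000)},
}
scores = grr(results)
ced_a = model_ced([results["p1"]["A"], results["p2"]["A"]])
print(scores, ced_a)
```

In this toy data, model A has zero errors on `p1` (CED 0) and one error in 8,000 words on `p2` (CED 1.25), so its model-level CED is 0.625. Models A and B each rank first on one prompt and second on the other, giving both a GRR of 1.5.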

## Empirical Validation / Results
The study evaluated models across four categories: Proprietary, Open-source, Capability-enhanced (fine-tuned for long-form), and Agent-enhanced systems.

### RQ1: Extent and Distribution of Errors
*   **GPT-5-Reasoning** performed best (CED: 0.113, GRR: 3.05).
*   **Factual & Detail Consistency** and **Timeline & Plot Logic** were the dominant error categories across most models.
*   **Generation** tasks (most open-ended) consistently yielded higher error densities than other task types.
*   **Key Result Table (Performance Overview):**

| Model | CED ↓ | GRR ↓ | Avg Words | Avg Errors | Total Stories |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **GPT-5-Reasoning** | **0.113** | 3.05 | 9050 | 0.09 | 1990 |
| Gemini-2.5-Pro | 0.305 | 7.79 | 5584 | 0.16 | 1996 |
| Claude-Sonnet-4.5 | 0.520 | 4.90 | 8929 | 0.37 | 1998 |
| GLM-4.6 (Open-source) | 0.528 | 8.45 | 4949 | 0.18 | 2000 |
| Qwen3-32B (Open-source) | 0.537 | 6.39 | 6237 | 0.27 | 2000 |

### RQ2: Error Scaling with Output Length
*   Models exhibited highly diverse length preferences (e.g., GPT-5-Reasoning generated mostly >6K words, while GPT-4o-1120 generated mostly <3K words).
*   **Error counts increased approximately linearly with output length** across models, with correlation strength varying (e.g., `r=0.973` for DeepSeek-V3.2-Exp).
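
The linear relationship can be checked with an ordinary least-squares fit and a Pearson correlation over per-story `(words, errors)` pairs. The sketch below uses invented data and a hypothetical helper `fit_line`; it shows the kind of calculation behind a reported `r` value, not the authors' analysis code.

```python
import math
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, Pearson r) for a least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Invented per-story measurements for one hypothetical model.
words = [2000, 4000, 6000, 8000]
errors = [1, 2, 2, 4]

slope, intercept, r = fit_line(words, errors)
print(f"errors per 10k words: {slope * 10_000:.2f}, r = {r:.3f}")
```

The slope, rescaled to errors per 10,000 words, is directly comparable to CED, while `r` indicates how well a straight line describes the growth.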

### RQ3: Predictive Signals for Errors
*   **Token-level entropy** was significantly higher in text segments containing errors compared to the whole-text baseline.
*   For Qwen3-4B, the entropy of error-containing content was **19.24% higher** than the baseline; for Qwen3-30B, it was **12.03% higher**.
*   This indicates models are more uncertain when they produce inconsistent content, making entropy a potential early-warning signal.
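
The comparison above can be sketched as a mean Shannon entropy over next-token distributions, computed once for the error segment and once for the whole text. The function names and toy distributions below are invented. How the paper obtains the distributions (full vocabulary or top-k, natural log or base 2) is not stated in this summary, so natural-log entropy over whatever probabilities are supplied is an assumption here.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions: List[Sequence[float]]) -> float:
    """Average entropy over the token positions of a text segment."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def relative_increase(
    error_segment: List[Sequence[float]],
    whole_text: List[Sequence[float]],
) -> float:
    """Fractional increase of error-segment entropy over the baseline."""
    base = mean_entropy(whole_text)
    return (mean_entropy(error_segment) - base) / base


# Invented distributions: one confident token, two coin flips,
# and one position that is uniform over four candidates.
whole_text = [[1.0], [0.5, 0.5], [0.5, 0.5], [0.25] * 4]
error_segment = [[0.25] * 4]

increase = relative_increase(error_segment, whole_text)
print(f"{increase:.0%} higher entropy in the error segment")
```

In this toy case the baseline mean entropy is ln 2 and the error segment sits at 2 ln 2, so the relative increase is 100%. The figures reported for Qwen3-4B and Qwen3-30B are the same kind of ratio.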

### RQ4: Co-occurrence of Error Types
*   **Factual & Detail Consistency** errors acted as a central hub, correlating with Characterization (`r=0.304`), World-building (`r=0.255`), and Timeline (`r=0.176`) errors, suggesting shared failure mechanisms.
*   **Narrative & Style** errors showed near-zero correlation with other categories, indicating they arise from distinct mechanisms.

### RQ5: Positional Distribution of Errors
*   Facts are typically established early in the narrative (15-30% position), while contradictions appear later, clustering in the **40-60%** range.
*   **Geographical contradictions** had the largest average gap (31.0%) between fact and contradiction, indicating long-range memory failures.
*   **Perspective confusions** had the smallest gap (4.7%), suggesting local context failures.
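
The gap statistic can be illustrated by converting character offsets to relative positions and averaging the fact-to-contradiction distance per error type. The record layout and helper names below are hypothetical and the offsets are invented. The summary does not say whether the paper measures position in characters, tokens, or words, so characters are an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (error type, fact offset, contradiction offset, story length in characters)
Record = Tuple[str, int, int, int]


def relative_position(offset: int, story_length: int) -> float:
    """Position of an offset as a percentage of the story length."""
    return 100.0 * offset / story_length


def mean_gap_by_type(records: List[Record]) -> Dict[str, float]:
    """Average percentage-point gap between fact and contradiction."""
    gaps = defaultdict(list)
    for error_type, fact, contradiction, length in records:
        gap = (relative_position(contradiction, length)
               - relative_position(fact, length))
        gaps[error_type].append(gap)
    return {t: sum(g) / len(g) for t, g in gaps.items()}


# Invented records for illustration.
records = [
    ("Geographical Contradictions", 10_000, 30_000, 50_000),  # 20% -> 60%
    ("Geographical Contradictions", 5_000, 10_000, 25_000),   # 20% -> 40%
    ("Perspective Confusions", 20_000, 22_000, 40_000),       # 50% -> 55%
]
gaps = mean_gap_by_type(records)
print(gaps)
```

Normalizing by story length lets gaps from stories of very different lengths be averaged together, which is needed given the wide spread of output lengths across models.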

## Theoretical and Practical Implications
*   **Theoretical:** The work provides a formal framework and taxonomy for studying a previously underexplored aspect of long-form generation: systematic consistency. It demonstrates that errors are not random but follow predictable patterns related to model uncertainty, narrative position, and error type interdependencies.
*   **Practical:**
    *   **For Model Developers:** Highlights specific weaknesses (factual/temporal tracking) to target for improvement. Suggests that enhancing long-range coherence mechanisms is crucial.
    *   **For Evaluation:** Provides a reproducible, evidence-grounded benchmark and automated tool (ConStory-Checker) for assessing narrative consistency at scale.
    *   **For System Design:** The finding that high entropy predicts errors suggests practical mitigation strategies, such as triggering verification routines when local uncertainty exceeds a threshold.
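
One way to picture the entropy-triggered mitigation is a sliding window over per-token entropies that flags regions whose mean exceeds a threshold. The paper only suggests this direction, so everything below is an assumption: the function name, the window size, and the threshold are illustrative and would need tuning per model.

```python
from typing import List


def flag_uncertain_windows(
    entropies: List[float], window: int, threshold: float
) -> List[int]:
    """Return start indices of windows whose mean entropy exceeds threshold."""
    flagged = []
    for start in range(len(entropies) - window + 1):
        mean = sum(entropies[start:start + window]) / window
        if mean > threshold:
            flagged.append(start)
    return flagged


# Invented per-token entropies with a brief spike in the middle.
entropies = [0.2, 0.3, 1.5, 1.7, 0.4, 0.1]
flagged = flag_uncertain_windows(entropies, window=2, threshold=1.0)
print(flagged)  # windows starting at tokens 2 and 3 exceed the threshold
```

A generation loop could pause at a flagged window and run a consistency check against previously established facts before continuing.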

## Conclusion
The paper establishes that maintaining narrative consistency in long-form story generation remains a significant challenge for current LLMs. The introduced **ConStory-Bench** and **ConStory-Checker** provide the community with tools for systematic evaluation. Key findings include the approximately linear accumulation of errors with length, the elevated token entropy of error-containing text (a potential early-warning signal), and the positional clustering of contradictions. These insights can guide future efforts to improve coherence in long-context LLM generation. Future work may extend the benchmark to multilingual, cross-cultural, and non-fiction domains.

---

_Markdown view of https://picx.dev/p/zjiwxd, served by PicX — AI-generated visual whiteboard summaries of research papers._
