Summary (Overview)
- Hierarchical Memory Framework: MemSlides separates long-term memory (user profile memory + tool memory) from working memory, enabling persistent personalization across jobs while tracking session-specific constraints.
- Scoped Localized Revision: Instead of full-deck regeneration, MemSlides uses a Plan–Act–Guard pipeline to apply patch-level edits to the smallest affected slide region, reducing context pressure and drift.
- Improved Persona Alignment: In controlled persona-alignment evaluations (0–10 scale), MemSlides achieves all-column wins over baselines (DeepPresenter, SlideTailor) on GLM-5 and Gemini 3.1 Pro, with average gains of +1.37 (Content), +0.53 (Structure), +1.66 (Visual), and +1.19 (Specificity) over DeepPresenter across model families.
- Reliable Localized Modification: Tool-memory injection in diagnostic matched-pair tests raises closed-loop completion from 0.815 to 0.963, strict verification from 0.310 to 0.534, and reduces first-correct-edit time from 609.5 s to 242.5 s.
- Cross-Job Profile Consolidation: Qualitative evidence shows that local revision feedback becomes reusable organization preferences (e.g., evidence-boundary tables, IO-responsibility schemas) in later jobs.
Introduction and Theoretical Foundation
Automatic presentation generation has progressed from document compression [40, 26] to LLM-based systems that produce complete decks via multi-modal workflows [6, 58, 51, 49, 59, 30]. However, existing systems lack persistent personalization: users must restate their preferences (domain, style, layout) in every interaction. Prior work such as PPTAgent [58] and DeepPresenter [59] improves general generation and agentic refinement but does not model user-specific profiles. SlideTailor [55] conditions generation on reference slides, which ties personalization to the provided examples rather than to an accumulated user profile.
The central gap is twofold:
- Personalization is often revealed through revision, yet existing agents handle edits by re-contextualizing or re-generating large deck portions, making multi-turn local modification fragile.
- Current systems treat personalization as an implicit byproduct of prompting rather than a direct service enabled by memory design, in contrast to agent-memory work [61, 31, 47].
MemSlides addresses these by introducing a hierarchical memory framework that separates long-term memory (persistent user preferences and execution experience) from working memory (session constraints), paired with scoped slide-local revision that operates on the smallest affected region.
Methodology
Problem Formulation
The system models personalized presentation generation as a stateful, multi-turn authoring problem. Given source material, user profile memory, and an optional task-time template, the system first generates an initial deck.
At each revision round, user feedback updates the session state, and the deck is edited under that updated state.
Three personalization signals have different lifetimes: user profile memory (cross-job), task-time template (job-local), and session state (turn-specific). Conflicts are resolved by precedence: explicit session feedback > task-time template > user profile memory.
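The precedence rule can be illustrated with a minimal Python sketch. The layer names, preference keys, and the `resolve` helper are hypothetical and chosen only for illustration; the source specifies the ordering (session feedback, then template, then profile) but not a data model.

```python
from __future__ import annotations

# Hypothetical layer names; only the precedence order comes from the text.
PRECEDENCE = ("session", "template", "profile")  # highest priority first


def resolve(key: str, layers: dict[str, dict[str, str]]) -> str | None:
    """Return the value for `key` from the highest-priority layer that sets it."""
    for name in PRECEDENCE:
        value = layers.get(name, {}).get(key)
        if value is not None:
            return value
    return None


# Illustrative preferences, not taken from the paper.
layers = {
    "profile": {"font": "serif", "density": "sparse", "palette": "blue"},
    "template": {"font": "sans", "palette": "corporate"},
    "session": {"palette": "dark"},
}

resolved = {key: resolve(key, layers) for key in ("font", "density", "palette")}
print(resolved)  # {'font': 'sans', 'density': 'sparse', 'palette': 'dark'}
```

Each key falls through to the next layer only when a higher-priority layer is silent, so the profile still supplies `density` even though the template and the session override the other two settings.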
Multi-Turn Localized Modify Execution
The Plan–Act–Guard pipeline ensures targeted editing:
- Plan: Converts each revision request into an execution contract recording scope (local / global / hybrid), target slide paths, active rule IDs, and coverage requirements.
- Act: Applies minimal edits via batch CSS updates, semantic batch styling, or snapshot-bound local patches. Page insertion/deletion remains explicit; whole-slide rewriting is reserved for new slides or corrupt states.
- Guard: Checks completion against snapshot content hashes, blocks premature finalization until coverage is satisfied, and triggers rebinding hints on stale snapshots.
Working memory carries active preferences, carryover instructions, and edit-state records across rounds, enabling multi-turn operation.
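One way to picture the Plan–Act–Guard loop is the Python sketch below. It is an assumption-laden illustration, not the authors' implementation: the `Contract` fields, the status strings, the slide paths, and the use of SHA-256 as the content hash are all hypothetical, and the deck is reduced to a dictionary of markup strings.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    """Fingerprint a slide's markup so stale snapshots can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Contract:
    scope: str        # "local", "global", or "hybrid"
    snapshots: dict   # target slide path -> hash taken at plan time
    rule_ids: list    # active rule IDs for this request
    covered: set = field(default_factory=set)


def plan(deck: dict, scope: str, paths: list, rule_ids: list) -> Contract:
    """Plan: record which slides must change and what they looked like."""
    return Contract(scope, {p: content_hash(deck[p]) for p in paths}, rule_ids)


def act(deck: dict, contract: Contract, path: str, old: str, new: str) -> str:
    """Act: patch one target slide, but only if its snapshot is still current."""
    if path not in contract.snapshots:
        return "out-of-scope"
    if content_hash(deck[path]) != contract.snapshots[path]:
        return "stale-snapshot"  # caller should re-read the slide and rebind
    if old not in deck[path]:
        return "no-match"
    deck[path] = deck[path].replace(old, new)
    contract.snapshots[path] = content_hash(deck[path])  # rebind to new state
    contract.covered.add(path)
    return "patched"


def guard(contract: Contract) -> bool:
    """Guard: allow finalization only once every target slide is covered."""
    return contract.covered == set(contract.snapshots)


deck = {
    "slides/03.html": "<h2>Attention</h2><p>4 groups</p>",
    "slides/04.html": "<h2>Results</h2>",
}
contract = plan(deck, "local", ["slides/03.html"], ["R1"])
before_finalize = guard(contract)
status = act(deck, contract, "slides/03.html", "4 groups", "8 heads")
print(before_finalize, status, guard(contract))  # False patched True
```

The non-target slide is never touched, and a patch against a slide that changed after planning is refused rather than applied blindly, which is the behavior the Guard step is described as enforcing.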
User Profile Memory
User profile memory organizes stored items by intent and presentation dimensions (theme, content, visual, layout, template, general). At job start, the intent-matched bucket is selected, request constraints are extracted, and the two are reconciled into an active temporary memory.
During revision, this active memory evolves with feedback. At job end, stable signals are consolidated back into long-term memory, which prevents transient requests from becoming persistent.
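The reconcile-then-consolidate cycle can be sketched in Python as follows. The stability criterion used here (a preference counts as stable if it appears in at least `min_rounds` feedback rounds) is purely an assumption for illustration; the source does not state how stable signals are identified, and the function names and preference keys are hypothetical.

```python
from collections import Counter


def reconcile(bucket: dict, request: dict) -> dict:
    """Job start: request constraints override stored preferences for this job."""
    return {**bucket, **request}


def consolidate(long_term: dict, feedback_rounds: list, min_rounds: int = 2) -> dict:
    """Job end: persist only preferences seen in at least `min_rounds` rounds.

    The repetition threshold is an assumed stand-in for "stable signals".
    """
    counts = Counter(
        (key, value) for round_ in feedback_rounds for key, value in round_.items()
    )
    stable = {key: value for (key, value), n in counts.items() if n >= min_rounds}
    return {**long_term, **stable}


bucket = {"layout": "two-column", "tone": "formal"}
active = reconcile(bucket, {"tone": "concise"})

rounds = [
    {"tables": "evidence-boundary"},
    {"tables": "evidence-boundary", "title_color": "red"},
]
updated = consolidate(bucket, rounds)
print(active)
print(updated)
```

In this toy run the one-off `title_color` request stays out of long-term memory, while the repeated table preference is carried into later jobs.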
Tool Memory
Tool memory is organized at two granularities:
- Round-scope task experience: Available at job start, buffered in working memory, updated via agent lessons and tool-error summaries.
- Operation-scope tool-chain experience: Raw reasoning–tool–observation chains segmented into reusable fragments indexed by operation context.
This separation helps the agent execute edits with fewer repeated errors and more reliable verification.
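A rough Python sketch of the two granularities is given below. Everything in it is hypothetical: the `ToolMemory` class, the context labels, the tool-call strings, and the segmentation rule (consecutive steps that share an operation context form one fragment) are assumptions made to show the shape of the data, not the paper's actual mechanism.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class ToolMemory:
    # Round scope: short lessons and tool-error summaries.
    round_lessons: list = field(default_factory=list)
    # Operation scope: fragments of tool chains, indexed by operation context.
    fragments: dict = field(default_factory=lambda: defaultdict(list))

    def add_lesson(self, lesson: str) -> None:
        self.round_lessons.append(lesson)

    def add_chain(self, chain: list) -> None:
        """Split a raw chain into runs of consecutive steps sharing a context."""
        for context, steps in groupby(chain, key=lambda step: step[0]):
            self.fragments[context].append([step[1:] for step in steps])

    def recall(self, context: str) -> list:
        """Return stored fragments for an operation context (empty if none)."""
        return list(self.fragments.get(context, []))


memory = ToolMemory()
memory.add_lesson("Re-read the slide before patching after a failed edit.")
# Each step: (context, reasoning, tool call, observation); all values illustrative.
memory.add_chain([
    ("css-update", "title too small", "set_css(h2, font-size)", "ok"),
    ("css-update", "check contrast", "inspect(h2)", "ok"),
    ("text-patch", "fix label", "patch(p, '4 groups')", "ok"),
])
print(len(memory.recall("css-update")), len(memory.recall("css-update")[0]))  # 1 2
```

Keeping lessons and chain fragments in separate stores lets coarse advice be injected once per round while fine-grained fragments are looked up only for the operation at hand.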
Empirical Validation / Results
Personalization Alignment
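Table 1 reports persona-alignment judgments (0–10 scale) averaged over three personas. MemSlides achieves all-column wins over both baselines on GLM-5 and Gemini 3.1 Pro. On GPT-5 it leads on Content and Specificity, beats DeepPresenter but not SlideTailor on Visual (6.00 vs. 5.76 and 6.39), and trails DeepPresenter on Structure (7.33 vs. 7.56).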
| Framework | Model | Content ↑ | Structure ↑ | Visual ↑ | Specificity ↑ |
|---|---|---|---|---|---|
| DeepPresenter | GPT-5 | 6.22 | 7.56 | 5.76 | 5.89 |
| DeepPresenter | GLM-5 | 6.67 | 7.61 | 5.28 | 7.22 |
| DeepPresenter | Gemini 3.1 Pro | 6.89 | 8.00 | 6.78 | 7.44 |
| SlideTailor | GPT-5 | 6.78 | 6.00 | 6.39 | 6.33 |
| SlideTailor | GLM-5 | 4.44 | 4.89 | 4.00 | 3.89 |
| SlideTailor | Gemini 3.1 Pro | 4.48 | 5.00 | 4.03 | 4.67 |
| MemSlides (Ours) | GPT-5 | 7.11 | 7.33 | 6.00 | 6.67 |
| MemSlides (Ours) | GLM-5 | 9.00 | 8.78 | 8.56 | 8.89 |
| MemSlides (Ours) | Gemini 3.1 Pro | 7.77 | 8.64 | 8.24 | 8.56 |
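The average gains over DeepPresenter quoted in the summary (+1.37, +0.53, +1.66, +1.19) can be recomputed from Table 1 with a few lines of Python. The sketch assumes the gains are unweighted means across the three model families, which the table values support; the variable names are illustrative.

```python
# Rows are GPT-5, GLM-5, Gemini 3.1 Pro; columns are
# Content, Structure, Visual, Specificity (values copied from Table 1).
deep_presenter = [
    [6.22, 7.56, 5.76, 5.89],
    [6.67, 7.61, 5.28, 7.22],
    [6.89, 8.00, 6.78, 7.44],
]
mem_slides = [
    [7.11, 7.33, 6.00, 6.67],
    [9.00, 8.78, 8.56, 8.89],
    [7.77, 8.64, 8.24, 8.56],
]


def column_means(rows):
    """Unweighted mean of each column across model families."""
    return [sum(col) / len(col) for col in zip(*rows)]


gains = [
    round(ours - base, 2)
    for ours, base in zip(column_means(mem_slides), column_means(deep_presenter))
]
print(gains)  # [1.37, 0.53, 1.66, 1.19]
```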
General Quality
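Table 2 shows DeepPresenter-style quality metrics (1–5 scale, Diversity via DINOv2-Vendi). MemSlides achieves the best Avg. on GPT-5 (4.17) and stays within about 0.25 of the best baseline on the other models (3.60 vs. 3.83 on Gemini 3.1 Pro; 3.74 vs. 3.86 on GLM-5), while posting the highest Diversity on both. This suggests that persona gains do not come at a large cost to ordinary presentation quality.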
| Framework | Model | Constraint ↑ | Content ↑ | Style ↑ | Avg. ↑ | Diversity ↑ |
|---|---|---|---|---|---|---|
| DeepPresenter | GPT-5 | 4.83 | 3.50 | 3.63 | 3.99 | 0.387 |
| DeepPresenter | Gemini 3.1 Pro | 4.17 | 3.33 | 4.00 | 3.83 | 0.370 |
| DeepPresenter | GLM-5 | 4.00 | 3.57 | 4.00 | 3.86 | 0.366 |
| SlideTailor | GPT-5 | 3.83 | 2.93 | 4.03 | 3.60 | 0.399 |
| SlideTailor | Gemini 3.1 Pro | 3.83 | 3.20 | 4.00 | 3.68 | 0.364 |
| SlideTailor | GLM-5 | 3.83 | 2.97 | 4.00 | 3.60 | 0.348 |
| MemSlides (Ours) | GPT-5 | 5.00 | 3.60 | 3.90 | 4.17 | 0.380 |
| MemSlides (Ours) | Gemini 3.1 Pro | 3.33 | 3.37 | 4.10 | 3.60 | 0.463 |
| MemSlides (Ours) | GLM-5 | 3.83 | 3.34 | 4.03 | 3.74 | 0.391 |
Localized Revision (Tool Memory Ablation)
Table 3 reports results on nine diagnostic modify pairs. With tool-memory injection, Closed-Loop Completion rises from 0.815 to 0.963 and Strict Verify from 0.310 to 0.534, Time to First Correct Edit falls from 609.5 s to 242.5 s, and the Core Tool Time Ratio drops to 0.327× of the no-memory baseline (geometric mean across models). Pair-level counts (W-L-T-NA) are 3-1-5-0 for completion and 8-1-0-0 for verification.
| Model | Memory Injected | Closed-Loop Completion ↑ | Strict Verify ↑ | First Correct Edit (s) ↓ | Core Tool Time Ratio ↓ |
|---|---|---|---|---|---|
| GPT-5 | ✓ | 1.000 | 0.646 | 211.3 | 0.740× |
| GPT-5 | ✗ | 0.667 | 0.294 | 234.2 | 1.000× |
| GLM-5 | ✓ | 1.000 | 0.488 | 195.9 | 0.344× |
| GLM-5 | ✗ | 0.889 | 0.434 | 500.9 | 1.000× |
| Gemini 3.1 Pro | ✓ | 0.889 | 0.469 | 309.9 | 0.137× |
| Gemini 3.1 Pro | ✗ | 0.889 | 0.201 | 968.2 | 1.000× |
| Overall | ✓ | 0.963 | 0.534 | 242.5 | 0.327× |
| Overall | ✗ | 0.815 | 0.310 | 609.5 | 1.000× |
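Two of the aggregate figures can be checked directly against the per-model rows of Table 3. The short Python sketch below recomputes the geometric mean of the per-model Core Tool Time Ratios and the relative reduction in first-correct-edit time; the variable names are illustrative.

```python
import math

# Per-model Core Tool Time Ratios with memory injected
# (GPT-5, GLM-5, Gemini 3.1 Pro), copied from Table 3.
ratios = [0.740, 0.344, 0.137]
geo_mean = math.prod(ratios) ** (1 / len(ratios))

# Overall time to first correct edit, in seconds, copied from Table 3.
with_memory, without_memory = 242.5, 609.5
reduction = 1 - with_memory / without_memory

print(round(geo_mean, 3))       # 0.327
print(round(reduction * 100))   # 60
```

A geometric mean suits ratios like these because it treats a halving and a doubling as symmetric, whereas an arithmetic mean of the same three values would give about 0.407.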
Qualitative Evidence
- Figure 5 shows that given a local edit request (“change ‘4 groups’ to ‘8 heads’”), DeepPresenter alters non-target regions (formula block removed, layout rewritten) while MemSlides applies a targeted patch preserving aligned content.
- Figure 6 illustrates cross-job profile consolidation: local feedback cues (e.g., concept-clarification cues, next-step questions) become reusable organization patterns (evidence-boundary tables, owner/timeline tables) in later jobs.
Theoretical and Practical Implications
- Theoretical: The results suggest that effective personalization in presentation generation benefits from separating signals by lifetime: persistent user profile memory (cross-job), session-level working memory (within-job), and reusable execution experience (tool memory). This separation reduces context pressure and limits drift during multi-turn revision.
- Practical: MemSlides provides an interaction substrate that learns user preferences from revision feedback without requiring users to fully specify preferences upfront. The Plan–Act–Guard pipeline with localized scope can be adapted to other document editing domains (e.g., reports, web pages). The profile consolidation mechanism (Eq. 6) generalizes local feedback into reusable templates, reducing future manual input.
- Evaluation: The persona-alignment judgment protocol and diagnostic matched-pair modify setting offer reusable evaluation methods for personalized generation systems, including multi-persona, multi-intent profile banks and locality-sensitive metrics (closed-loop completion, core-tool-time ratio).
Conclusion
MemSlides introduces a hierarchical memory framework for personalized presentation generation that separates user profile memory, active temporary memory, and tool memory. Controlled experiments show improved round-0 persona alignment (up to +3 points on Specificity) and diagnostic gains in localized modify reliability (tool memory injection reduces first-correct-edit time by 60% and increases strict verification from 0.31 to 0.53). Qualitative results confirm that scoped patch-level edits preserve aligned content while carrying session preferences across rounds.
Limitations and Future Work: The evidence is scoped to controlled persona-alignment judgments, diagnostic matched-pair settings, and qualitative cross-job consolidation. Future work should include broader human studies, randomized edit sets, and stronger safeguards for memory consent, deletion, and sensitive-preference handling. The framework’s hierarchical memory design may generalize to other personalized document authoring tasks.