Visual Summary | MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Summary (Overview)

Hierarchical Memory Framework: MemSlides separates long-term memory (user profile memory + tool memory) from working memory, enabling persistent personalization across jobs while tracking session-specific constraints.
Scoped Localized Revision: Instead of full-deck regeneration, MemSlides uses a Plan–Act–Guard pipeline to apply patch-level edits to the smallest affected slide region, reducing context pressure and drift.
Improved Persona Alignment: In controlled persona-alignment evaluations (0–10 scale), MemSlides achieves all-column wins over baselines (DeepPresenter, SlideTailor) on GLM-5 and Gemini 3.1 Pro, with average gains of +1.37 (Content), +0.53 (Structure), +1.66 (Visual), and +1.19 (Specificity) over DeepPresenter across model families.
Reliable Localized Modification: Tool-memory injection in diagnostic matched-pair tests raises closed-loop completion from 0.815 to 0.963, strict verification from 0.310 to 0.534, and reduces first-correct-edit time from 609.5 s to 242.5 s.
Cross-Job Profile Consolidation: Qualitative evidence shows that local revision feedback becomes reusable organization preferences (e.g., evidence-boundary tables, IO-responsibility schemas) in later jobs.

Introduction and Theoretical Foundation

Automatic presentation generation has progressed from document compression [40, 26] to LLM-based systems that produce complete decks via multi-modal workflows [6, 58, 51, 49, 59, 30]. However, existing systems lack persistent personalization: users must restate their preferences (domain, style, layout) in every interaction. Prior systems such as PPTAgent [58] and DeepPresenter [59] improve general generation and agentic refinement but do not model user-specific profiles. SlideTailor [55] conditions generation on reference slides, which ties personalization to the provided examples rather than to an accumulated user profile.

The central gap is twofold:

Personalization is often revealed through revision, yet existing agents handle edits by re-contextualizing or re-generating large deck portions, making multi-turn local modification fragile.
Current systems treat personalization as an implicit byproduct of prompting rather than a direct service enabled by memory design, in contrast to agent-memory work [61, 31, 47].

MemSlides addresses both gaps by introducing a hierarchical memory framework that separates long-term memory (persistent user preferences and execution experience) from working memory (session constraints), paired with scoped slide-local revision that operates on the smallest affected region.

Methodology

Problem Formulation

The system models personalized presentation generation as a stateful, multi-turn authoring problem. Given source material $x$, user profile memory $P_u$, and an optional task-time template $\tau$, the initial deck is:

$$S_0 = G_{\text{init}}(x, P_u, \tau) \tag{1}$$

At revision round $t$, user feedback $f_t$ updates the session state $z_t$ and edits the deck:

$$z_t = U(z_{t-1}, f_t; S_{t-1}), \quad S_t = G_{\text{edit}}(S_{t-1}, x, P_u, \tau, z_t), \quad t \ge 1 \tag{2}$$

Three personalization signals have different lifetimes: user profile memory $P_u$ (cross-job), task-time template $\tau$ (job-local), and session state $z_t$ (turn-specific). Conflicts are resolved by precedence: explicit session feedback > task-time template > user profile memory.
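
The precedence rule above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each signal is a flat dictionary of preference keys; the function name `resolve_preferences` and the example keys are hypothetical and do not come from the paper.

```python
from typing import Optional


def resolve_preferences(profile: dict, template: Optional[dict], session: dict) -> dict:
    """Merge three preference sources, lowest precedence first."""
    resolved = dict(profile)         # cross-job user profile memory (lowest precedence)
    resolved.update(template or {})  # job-local task-time template overrides the profile
    resolved.update(session)         # explicit session feedback overrides both
    return resolved


# Hypothetical preference keys and values, chosen only for the example.
profile = {"font": "serif", "density": "low", "palette": "blue"}
template = {"palette": "corporate-green", "font": "sans"}
session = {"font": "monospace"}

resolved = resolve_preferences(profile, template, session)
print(resolved)
```

In this sketch a key that only the profile defines (`density`) survives untouched, while a key defined at several levels takes the value from the highest-precedence source.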

Multi-Turn Localized Modify Execution

The Plan–Act–Guard pipeline ensures targeted editing:

Plan: Converts each revision request into an execution contract recording scope (local / global / hybrid), target slide paths, active rule IDs, and coverage requirements.
Act: Applies minimal edits via batch CSS updates, semantic batch styling, or snapshot-bound local patches. Page insertion/deletion remains explicit; whole-slide rewriting is reserved for new slides or corrupt states.
Guard: Checks completion against snapshot content hashes, blocks premature finalization until coverage is satisfied, and triggers rebinding hints on stale snapshots.
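
The three stages listed above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a deck is a dictionary from slide path to slide markup, models only a string-replacement patch, and omits hybrid scope and page insertion/deletion. The names `Contract`, `plan`, `act`, and `guard` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Contract:
    scope: str      # "local" or "global" here; "hybrid" is omitted for brevity
    targets: list   # slide paths the request must cover
    snapshot: dict  # slide path -> content hash taken at plan time


def plan(deck: dict, targets: list) -> Contract:
    scope = "global" if set(targets) == set(deck) else "local"
    snapshot = {path: content_hash(body) for path, body in deck.items()}
    return Contract(scope, list(targets), snapshot)


def act(deck: dict, contract: Contract, path: str, old: str, new: str) -> dict:
    # Refuse edits outside the planned scope.
    if path not in contract.targets:
        raise ValueError(f"{path} is outside the planned scope")
    # Snapshot-bound patch: the slide must still match its plan-time hash.
    if content_hash(deck[path]) != contract.snapshot[path]:
        raise ValueError(f"stale snapshot for {path}; re-plan before patching")
    patched = dict(deck)
    patched[path] = deck[path].replace(old, new)
    return patched


def guard(deck: dict, contract: Contract) -> bool:
    for path, body in deck.items():
        changed = content_hash(body) != contract.snapshot.get(path)
        if path in contract.targets and not changed:
            return False  # coverage not satisfied: a target slide is untouched
        if path not in contract.targets and changed:
            return False  # drift: a non-target slide was modified
    return True


deck = {"slides/01.html": "<h1>Attention</h1>", "slides/02.html": "<p>4 groups</p>"}
contract = plan(deck, ["slides/02.html"])
blocked = guard(deck, contract)  # nothing edited yet, so finalization is blocked
edited = act(deck, contract, "slides/02.html", "4 groups", "8 heads")
print(blocked, guard(edited, contract), edited["slides/02.html"])
```

The guard here fails both when a target slide is left unedited and when a non-target slide changes, which mirrors the two failure modes the pipeline is meant to prevent: premature finalization and drift.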

Working memory carries active preferences, carryover instructions, and edit-state records across rounds, enabling multi-turn operation.

User Profile Memory

User profile memory organizes stored items by intent and presentation dimension (theme, content, visual, layout, template, general). At job start, the intent-matched bucket $\tilde{P}_u$ is selected, the request constraints $C_0$ are extracted, and the two are reconciled into active temporary memory $A_0$:

$$\tilde{P}_u = S(P_u, i_0), \quad C_0 = E(q_0), \quad A_0 = R(\tilde{P}_u, C_0) \tag{4}$$

During revision, $A_t$ evolves with feedback. At job end, stable signals are consolidated back into long-term memory via $P_u^+ = C(P_u, H)$, which prevents transient requests from becoming persistent.
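
A rough sketch of the select, extract, reconcile, and consolidate steps follows. Several parts are assumptions made only for illustration: the request is parsed with a trivial `key=value` tokenizer standing in for whatever extractor the system uses, the history is taken to be a list of per-round feedback dictionaries, and a signal counts as "stable" when it appears in at least `min_rounds` rounds. The paper's actual stability criterion is not given in this summary.

```python
from collections import Counter


def select_bucket(profile_memory: dict, intent: str) -> dict:
    """S(P_u, i_0): pick the bucket matching the job intent."""
    return dict(profile_memory.get(intent, {}))


def extract_constraints(request: str) -> dict:
    """E(q_0): stand-in extractor that reads key=value tokens from the request."""
    pairs = (token.split("=", 1) for token in request.split() if "=" in token)
    return {key: value for key, value in pairs}


def reconcile(bucket: dict, constraints: dict) -> dict:
    """R: request constraints override stored profile items."""
    return {**bucket, **constraints}


def consolidate(profile_memory: dict, intent: str, history: list, min_rounds: int = 2) -> dict:
    """C(P_u, H): persist only signals seen in at least min_rounds rounds (assumed rule)."""
    counts = Counter((k, v) for feedback in history for k, v in feedback.items())
    updated = {name: dict(bucket) for name, bucket in profile_memory.items()}
    bucket = updated.setdefault(intent, {})
    for (key, value), seen in counts.items():
        if seen >= min_rounds:
            bucket[key] = value
    return updated


# Hypothetical intent name and preference values.
P_u = {"research_talk": {"visual": "minimal", "layout": "two-column"}}
A_0 = reconcile(
    select_bucket(P_u, "research_talk"),
    extract_constraints("summarize the paper layout=single-column theme=dark"),
)
H = [
    {"content": "evidence-boundary-table"},
    {"content": "evidence-boundary-table", "theme": "dark"},
]
P_plus = consolidate(P_u, "research_talk", H)
print(A_0)
print(P_plus)
```

In this example the one-off `theme` request shapes the current job through `A_0` but is not written back, while the repeated `content` feedback is promoted into the long-term bucket.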

Tool Memory

Tool memory is organized at two granularities:

Round-scope task experience: Available at job start, buffered in working memory, updated via agent lessons and tool-error summaries.
Operation-scope tool-chain experience: Raw reasoning–tool–observation chains segmented into reusable fragments indexed by operation context.

This separation helps the agent execute edits with fewer repeated errors and more reliable verification.
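
One way to picture the two granularities is a small container with a flat list of round-scope lessons and a dictionary of operation-scope fragments keyed by operation context. The class, its method names, and the tool name in the example are hypothetical; the summary does not specify the storage format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToolMemory:
    # Round-scope: lessons and tool-error summaries available at job start.
    round_lessons: list = field(default_factory=list)
    # Operation-scope: chain fragments indexed by operation context.
    fragments: dict = field(default_factory=lambda: defaultdict(list))

    def add_lesson(self, lesson: str) -> None:
        self.round_lessons.append(lesson)

    def add_chain(self, chain: list) -> None:
        # Each step is (operation_context, reasoning, tool, observation).
        for context, reasoning, tool, observation in chain:
            self.fragments[context].append((reasoning, tool, observation))

    def recall(self, context: str) -> list:
        return list(self.fragments.get(context, []))


memory = ToolMemory()
memory.add_lesson("Re-read the slide snapshot before patching.")
memory.add_chain([
    ("restyle-title", "title color is set inline", "batch_css_update", "ok"),
    ("edit-text", "target string appears once", "local_patch", "ok"),
    ("restyle-title", "verify contrast after change", "render_check", "ok"),
])
print(memory.recall("restyle-title"))
```

Indexing fragments by context lets an agent retrieve only the steps relevant to the operation at hand instead of replaying a whole raw chain.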

Empirical Validation / Results

Personalization Alignment

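Table 1 reports persona-alignment judgments (0–10 scale) averaged over three personas. MemSlides achieves all-column wins over both baselines on GLM-5 and Gemini 3.1 Pro. On GPT-5 it leads on Content and Specificity and beats DeepPresenter on Visual (6.00 vs. 5.76), but SlideTailor scores higher on Visual (6.39) and DeepPresenter scores higher on Structure (7.56 vs. 7.33).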

Framework	Model	Content ↑	Structure ↑	Visual ↑	Specificity ↑
DeepPresenter	GPT-5	6.22	7.56	5.76	5.89
DeepPresenter	GLM-5	6.67	7.61	5.28	7.22
DeepPresenter	Gemini 3.1 Pro	6.89	8.00	6.78	7.44
SlideTailor	GPT-5	6.78	6.00	6.39	6.33
SlideTailor	GLM-5	4.44	4.89	4.00	3.89
SlideTailor	Gemini 3.1 Pro	4.48	5.00	4.03	4.67
MemSlides (Ours)	GPT-5	7.11	7.33	6.00	6.67
MemSlides (Ours)	GLM-5	9.00	8.78	8.56	8.89
MemSlides (Ours)	Gemini 3.1 Pro	7.77	8.64	8.24	8.56
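
The average gains over DeepPresenter quoted in the overview can be reproduced from the rows of Table 1. The short check below assumes the gain is the unweighted mean of the per-model differences, which is consistent with the reported figures.

```python
columns = ("Content", "Structure", "Visual", "Specificity")

# Rows copied from Table 1.
deep_presenter = {
    "GPT-5": (6.22, 7.56, 5.76, 5.89),
    "GLM-5": (6.67, 7.61, 5.28, 7.22),
    "Gemini 3.1 Pro": (6.89, 8.00, 6.78, 7.44),
}
mem_slides = {
    "GPT-5": (7.11, 7.33, 6.00, 6.67),
    "GLM-5": (9.00, 8.78, 8.56, 8.89),
    "Gemini 3.1 Pro": (7.77, 8.64, 8.24, 8.56),
}

gains = {}
for i, column in enumerate(columns):
    diffs = [mem_slides[m][i] - deep_presenter[m][i] for m in deep_presenter]
    gains[column] = round(sum(diffs) / len(diffs), 2)

print(gains)
```

The result matches the +1.37, +0.53, +1.66, and +1.19 figures for Content, Structure, Visual, and Specificity.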

General Quality

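Table 2 shows DeepPresenter-style quality metrics (1–5 scale; Diversity via DINOv2-Vendi). MemSlides achieves the best Avg. on GPT-5 (4.17) and trails the best Avg. by 0.12 on GLM-5 (3.74 vs. 3.86) and by 0.23 on Gemini 3.1 Pro (3.60 vs. 3.83), while posting the highest Diversity on both of those models. This suggests that the persona gains do not come at a large cost to ordinary presentation quality.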

Framework	Model	Constraint ↑	Content ↑	Style ↑	Avg. ↑	Diversity ↑
DeepPresenter	GPT-5	4.83	3.50	3.63	3.99	0.387
DeepPresenter	Gemini 3.1 Pro	4.17	3.33	4.00	3.83	0.370
DeepPresenter	GLM-5	4.00	3.57	4.00	3.86	0.366
SlideTailor	GPT-5	3.83	2.93	4.03	3.60	0.399
SlideTailor	Gemini 3.1 Pro	3.83	3.20	4.00	3.68	0.364
SlideTailor	GLM-5	3.83	2.97	4.00	3.60	0.348
MemSlides (Ours)	GPT-5	5.00	3.60	3.90	4.17	0.380
MemSlides (Ours)	Gemini 3.1 Pro	3.33	3.37	4.10	3.60	0.463
MemSlides (Ours)	GLM-5	3.83	3.34	4.03	3.74	0.391

Localized Revision (Tool Memory Ablation)

Table 3 reports results on nine diagnostic modify pairs. Tool-memory injection improves Closed-Loop Completion (0.963 vs. 0.815) and Strict Verify (0.534 vs. 0.310), reduces Time to First Correct Edit (242.5 s vs. 609.5 s), and lowers the Core Tool Time Ratio to a geometric mean of 0.327× of the no-memory run. Pair-level counts (W-L-T-NA) are 3-1-5-0 for completion and 8-1-0-0 for verification.

Model	Memory Injected	Closed-Loop Completion ↑	Strict Verify ↑	First Correct Edit (s) ↓	Core Tool Time Ratio ↓
GPT-5	✓	1.000	0.646	211.3	0.740×
GPT-5	✗	0.667	0.294	234.2	1.000×
GLM-5	✓	1.000	0.488	195.9	0.344×
GLM-5	✗	0.889	0.434	500.9	1.000×
Gemini 3.1 Pro	✓	0.889	0.469	309.9	0.137×
Gemini 3.1 Pro	✗	0.889	0.201	968.2	1.000×
Overall	✓	0.963	0.534	242.5	0.327×
Overall	✗	0.815	0.310	609.5	1.000×
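
The Overall row can be partly re-derived from the per-model rows. The check below assumes the overall Closed-Loop Completion and Strict Verify values are unweighted means across the three models and that the overall Core Tool Time Ratio is a geometric mean, as stated in the text. The overall time values are taken as reported, not recomputed.

```python
import math

# Per-model values with memory injected, copied from Table 3.
completion = [1.000, 1.000, 0.889]
strict_verify = [0.646, 0.488, 0.469]
tool_time_ratio = [0.740, 0.344, 0.137]

mean_completion = round(sum(completion) / len(completion), 3)
mean_verify = round(sum(strict_verify) / len(strict_verify), 3)
geo_mean_ratio = round(math.prod(tool_time_ratio) ** (1 / len(tool_time_ratio)), 3)

# Relative reduction in time to first correct edit, from the Overall row.
time_reduction = round(1 - 242.5 / 609.5, 3)

print(mean_completion, mean_verify, geo_mean_ratio, time_reduction)
```

The computed values agree with the reported 0.963, 0.534, and 0.327×, and the time reduction comes out at roughly 60%.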

Qualitative Evidence

Figure 5 shows that given a local edit request (“change ‘4 groups’ to ‘8 heads’”), DeepPresenter alters non-target regions (formula block removed, layout rewritten) while MemSlides applies a targeted patch preserving aligned content.
Figure 6 illustrates cross-job profile consolidation: local feedback cues (e.g., concept-clarification cues, next-step questions) become reusable organization patterns (evidence-boundary tables, owner/timeline tables) in later jobs.

Theoretical and Practical Implications

Theoretical: The results suggest that effective personalization in presentation generation benefits from separating signals by lifetime: persistent user profile memory (cross-job), session-level working memory (within-job), and reusable execution experience (tool memory). This separation helps reduce context pressure and drift during multi-turn revision.
Practical: MemSlides provides an interaction substrate that learns user preferences from revision feedback without requiring users to fully specify preferences upfront. The Plan–Act–Guard pipeline with localized scope can be adapted to other document editing domains (e.g., reports, web pages). The profile consolidation mechanism ($P_u^+ = C(P_u, H)$) generalizes local feedback into reusable templates, reducing future manual input.
Evaluation: The persona-alignment judgment protocol and diagnostic matched-pair modify setting offer reusable evaluation methods for personalized generation systems, including multi-persona, multi-intent profile banks and locality-sensitive metrics (closed-loop completion, core-tool-time ratio).

Conclusion

MemSlides introduces a hierarchical memory framework for personalized presentation generation that separates user profile memory, active temporary memory, and tool memory. Controlled experiments show improved round-0 persona alignment (on GLM-5, up to +1.67 points on Specificity and +3.28 on Visual over DeepPresenter) and diagnostic gains in localized modify reliability (tool-memory injection reduces first-correct-edit time by about 60% and increases strict verification from 0.31 to 0.53). Qualitative results indicate that scoped patch-level edits preserve aligned content while carrying session preferences across rounds.

Limitations and Future Work: The evidence is scoped to controlled persona-alignment judgments, diagnostic matched-pair settings, and qualitative cross-job consolidation. Future work should include broader human studies, randomized edit sets, and stronger safeguards for memory consent, deletion, and sensitive-preference handling. The framework’s hierarchical memory design may generalize to other personalized document authoring tasks.