GEMS: Agent-Native Multimodal Generation with Memory and Skills - Summary

Summary (Overview)

  • Proposes GEMS, an agent-native framework for multimodal (text-to-image) generation that integrates Agent Loop, Agent Memory, and Agent Skill to address complex instructions and specialized downstream tasks.
  • Achieves significant performance gains across diverse benchmarks, enabling a lightweight 6B model (Z-Image-Turbo) to surpass the state-of-the-art closed-source model Nano Banana 2 on the challenging GenEval2 benchmark.
  • Introduces Hierarchical Compression within Agent Memory to efficiently manage historical context, storing raw factual artifacts and distilling verbose reasoning into concise experiences, reducing redundancy.
  • Develops an extensible Agent Skill module with on-demand loading, allowing the system to incorporate domain-specific expertise (e.g., Creative Drawing, Text Rendering) without overwhelming the core reasoning process.
  • Demonstrates robust generalizability by showing consistent improvements across multiple generative backends (Z-Image-Turbo and Qwen-Image-2512) on both mainstream and downstream tasks.

Introduction and Theoretical Foundation

Recent multimodal generation models excel at general tasks but struggle with complex, multi-faceted instructions and specialized applications—the "long-tail" challenge. Inference-time scaling strategies like iterative refinement or multi-agent systems have been explored to bridge this gap. However, existing approaches face limitations: some rely on simple successive updates or context accumulation leading to insufficient guidance or redundancy, while others are highly specialized and difficult to integrate into mainstream pipelines.

Inspired by advanced agent frameworks like Claude Code, this paper proposes GEMS (Agent-Native Multimodal Generation with Memory and Skills), redesigned from an agentic perspective. The core hypothesis is that a structured framework combining iterative closed-loop optimization, persistent trajectory-level memory, and on-demand domain expertise can effectively push beyond the inherent limitations of foundational models, enabling them to handle both complex and specialized tasks with higher fidelity.

Methodology

GEMS consists of three core collaborative components, as shown in Figure 2 of the paper.

3.1 Agent Loop

The Agent Loop is the backbone, comprising several specialized modules that work in sequence:

  • Planner ($F_{plan}$): The strategic entry point. It interacts with the Skill Manager to retrieve relevant domain expertise $S_{trig} \subseteq S$ based on the user prompt $U$, synthesizing an enhanced initial prompt $P_1$. It also dispatches $U$ to the Decomposer.
    $(P_1, U) = F_{plan}(U, S)$ (1)
  • Decomposer ($F_{dec}$): Partitions the original user prompt $U$ into a set of atomic, binary (yes/no) visual criteria $C = \{c_1, c_2, ..., c_n\}$ for fine-grained evaluation.
    $C = F_{dec}(U)$ (2)
  • Generator ($F_{gen}$): A model-agnostic module that produces an image $I_i$ at each iteration $i$ based on the current prompt $P_i$.
    $I_i = F_{gen}(P_i)$ (3)
  • Verifier ($F_{ver}$): A Multimodal Large Language Model (MLLM) that assesses the generated image $I_i$ against the criteria set $C$, producing a binary feedback vector $V_i = \{v_{i,1}, ..., v_{i,n}\}$.
    $V_i = F_{ver}(I_i, C), \quad v_{i,j} \in \{0, 1\}$ (4)
    The loop terminates once all criteria are met. Otherwise, if the iteration limit $N_{max}$ has not been reached, $V_i$ is sent to the Refiner; if it has, the best historical image is returned:
    $I_{best} = \arg\max_{I_k} \sum_{j=1}^{n} v_{k,j}, \quad k \in \{1, ..., N_{max}\}$ (5)
  • Refiner ($F_{ref}$): Closes the feedback loop by synthesizing the next prompt $P_{i+1}$ from the current state (prompt $P_i$, image $I_i$, feedback $V_i$, and reasoning trace $T_i$) together with the historical memory state $M_{i-1}$.
    $P_{i+1} = F_{ref}(P_i, I_i, V_i, T_i, M_{i-1})$ (6)
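The control flow above (Eqs. 1-6, plus the fallback selection of Eq. 5) can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: the five module names are passed in as hypothetical callables, the memory is reduced to a plain list, and the reasoning trace $T_i$ is omitted for brevity.

```python
from typing import Callable, List, Tuple, Any

def agent_loop(
    user_prompt: str,
    plan: Callable[[str], str],                   # F_plan: enhanced prompt P_1
    decompose: Callable[[str], List[str]],        # F_dec: atomic yes/no criteria C
    generate: Callable[[str], Any],               # F_gen: prompt -> image
    verify: Callable[[Any, List[str]], List[int]],  # F_ver: binary vector V_i
    refine: Callable[[str, Any, List[int], list], str],  # F_ref: next prompt
    n_max: int = 3,
) -> Tuple[Any, List[int]]:
    """Generate-verify-refine loop; falls back to the best image (Eq. 5)."""
    criteria = decompose(user_prompt)             # C = F_dec(U)
    prompt = plan(user_prompt)                    # P_1 = F_plan(U, S)
    memory: list = []                             # simplified trajectory memory
    history = []
    for _ in range(n_max):
        image = generate(prompt)                  # I_i = F_gen(P_i)
        feedback = verify(image, criteria)        # V_i = F_ver(I_i, C)
        history.append((image, feedback))
        if all(feedback):                         # every criterion met: stop
            return image, feedback
        memory.append((prompt, image, feedback))
        prompt = refine(prompt, image, feedback, memory)  # P_{i+1} = F_ref(...)
    # budget exhausted: return argmax over passed-criteria counts (Eq. 5)
    return max(history, key=lambda h: sum(h[1]))
```

With toy stubs (the "image" is just the prompt string and the verifier checks for required substrings), the loop refines "cat" into "cat hat" and terminates early once both criteria pass.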

3.2 Agent Memory

To overcome the limitations of simple successive updates, GEMS implements a persistent memory mechanism that maintains a global record of the optimization trajectory. It uses a Hierarchical Compression strategy:

  • Factual artifacts (prompt $P_i$, image $I_i$, feedback $V_i$) with minimal token footprints are stored in raw form.
  • Verbose reasoning traces $T_i$ are processed by a Compressor $F_{comp}$ to distill them into concise, high-level experiences $E_i$:
    $E_i = F_{comp}(P_i, I_i, V_i, T_i, M_{i-1})$ (7)

The memory state MiM_i is updated as a sequence of these hybrid tuples:

$M_i = \{(P_1, I_1, V_1, E_1), ..., (P_i, I_i, V_i, E_i)\}$ (8)
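The hybrid-tuple memory of Eqs. 7-8 can be sketched as a small container: facts are appended verbatim, while the trace is passed through a compressor before storage. The class name, method names, and the string-truncating compressor below are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class AgentMemory:
    """Hierarchical compression: raw facts (P_i, I_i, V_i) plus a distilled
    experience E_i per iteration, forming the tuples of Eq. 8."""
    compress: Callable[[str], str]                      # stand-in for F_comp
    records: List[Tuple[Any, Any, Any, str]] = field(default_factory=list)

    def update(self, prompt: Any, image: Any, feedback: Any, trace: str) -> None:
        # facts stored raw; the verbose trace T_i is distilled into E_i (Eq. 7)
        self.records.append((prompt, image, feedback, self.compress(trace)))

    def state(self) -> List[Tuple[Any, Any, Any, str]]:
        return list(self.records)                       # memory state M_i
```

A toy compressor might keep only the first sentence of the trace, preserving the high-level lesson while dropping the verbose reasoning.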

3.3 Agent Skill

This module is an extensible repository of domain-specific expertise (e.g., Creative Drawing, Text Rendering). It features an on-demand loading and progressive exposure mechanism:

  • Only skill names and descriptions are "always loaded" as a lightweight manifest.
  • Comprehensive instructions containing dense domain knowledge are fetched only when a skill is triggered by the Planner.
  • This design ensures scalability and allows easy contribution via simple markdown files (SKILL.md).
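The manifest-plus-lazy-loading pattern described above can be sketched as follows. The `SkillManager` name and its `register`/`manifest`/`trigger` methods are hypothetical; in practice the full instructions would be read from a skill's SKILL.md file, stood in for here by a loader callable.

```python
from typing import Callable, Dict

class SkillManager:
    """Always-loaded lightweight manifest; full instructions fetched lazily."""

    def __init__(self) -> None:
        self._manifest: Dict[str, str] = {}          # name -> short description
        self._loaders: Dict[str, Callable[[], str]] = {}  # name -> SKILL.md loader
        self._cache: Dict[str, str] = {}

    def register(self, name: str, description: str,
                 loader: Callable[[], str]) -> None:
        self._manifest[name] = description
        self._loaders[name] = loader

    def manifest(self) -> Dict[str, str]:
        # the only part exposed to the Planner by default
        return dict(self._manifest)

    def trigger(self, name: str) -> str:
        # dense domain knowledge is loaded once, on first trigger only
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]
```

This keeps the Planner's context small: it scans only names and descriptions, and pays the token cost of full instructions solely for the skills it actually triggers.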

Empirical Validation / Results

Extensive experiments were conducted across five mainstream benchmarks (GenEval, GenEval2, DPG-Bench, OneIG, WISE) and four downstream tasks (LongText-Bench, SpatialGenEval, CREA, ArtiMuse) using two generative backends: the lightweight Z-Image-Turbo (6B) and Qwen-Image-2512 (20B).

Main Results on Mainstream Tasks

Table 1 shows that GEMS consistently outperforms baselines, including other inference-time scaling methods (Rewrite, Promptist, Search, Maestro, CRAFT).

| Model | Method | GenEval | GenEval2 | DPG-Bench | OneIG-EN | OneIG-ZH | WISE | Avg. |
|---|---|---|---|---|---|---|---|---|
| Z-Image-Turbo | GEMS (Ours) | 0.86 ↑0.09 | 63.5 ↑32.5 | 86.01 ↑0.93 | 0.569 ↑0.043 | 0.552 ↑0.051 | 0.81 ↑0.24 | 74.51 |
| | Original | 0.77 | 31.0 | 85.08 | 0.526 | 0.501 | 0.57 | 60.29 |
| | CRAFT [30] | 0.80 | 62.4 | 85.29 | 0.582 | 0.542 | 0.78 | 72.38 |
| | Maestro [59] | 0.82 | 44.6 | 85.29 | 0.548 | 0.519 | 0.85 | 70.05 |
| Qwen-Image-2512 | GEMS (Ours) | 0.79 ↑0.13 | 70.4 ↑41.4 | 85.59 ↑0.90 | 0.542 ↑0.055 | 0.532 ↑0.043 | 0.80 ↑0.21 | 73.74 |
| | Original | 0.66 | 29.0 | 84.69 | 0.487 | 0.489 | 0.59 | 57.50 |
| | CRAFT [30] | 0.79 | 66.3 | 85.87 | 0.533 | 0.518 | 0.79 | 72.54 |

Key Finding: With Z-Image-Turbo, GEMS achieves an average normalized score gain of 14.22, and its score of 63.5 on GenEval2 surpasses Nano Banana 2 (44.6).

Main Results on Downstream Tasks

Table 2 shows even more pronounced advantages on specialized tasks.

| Model | Method | LongText-EN | LongText-ZH | SpatialGenEval | CREA | ArtiMuse | Avg. |
|---|---|---|---|---|---|---|---|
| Z-Image-Turbo | GEMS (Ours) | 0.952 ↑0.040 | 0.940 ↑0.008 | 61.4 ↑2.7 | 22.55 ↑10.71 | 58.58 ↑15.31 | 72.44 |
| | Original | 0.912 | 0.932 | 58.7 | 11.84 | 43.27 | 58.41 |
| | CRAFT [30] | 0.951 | 0.760 | 60.6 | 13.63 | 54.95 | 61.63 |
| | Maestro [59] | 0.877 | 0.807 | 60.3 | 15.81 | 56.86 | 63.52 |

Key Finding: GEMS yields an average improvement of 14.03 on downstream tasks with Z-Image-Turbo.

Ablation Studies and Analysis

  • Component Ablation (Fig. 4-left): Sequentially adding Agent Loop, Memory, and Skill to the baseline (score 31.0) improves scores to 52.4, 61.4, and finally 63.5 on GenEval2.
  • Agent Memory Analysis (Fig. 4-right): Including compressed "Experiences" (+2.5 pts) is more effective than including raw "Thoughts" (+0.3 pts), confirming the value of hierarchical compression.
  • Agent Loop Efficacy (Fig. 5): GEMS shows a consistent upward trajectory in passed criteria over iterations, indicating active directed optimization, not just random variation.
  • Efficiency-Performance Trade-off (Fig. 6): GEMS achieves superior performance with fewer average images generated (~3) compared to other methods.
  • Agent Skill Analysis (Fig. 9): Skills are correctly triggered for relevant tasks (e.g., Spatial Intelligence for SpatialGenEval, Creative Drawing for CREA), providing significant performance gains (e.g., +51.6% on CREA).

Theoretical and Practical Implications

Theoretical Implications:

  • Agentic Paradigm for Generation: GEMS successfully reframes text-to-image generation as an iterative, multi-agent optimization problem, demonstrating that agentic reasoning can effectively extend the capabilities of foundational models beyond their inherent limits.
  • Memory Design: The hierarchical compression strategy for Agent Memory provides a blueprint for managing long-context, multi-turn trajectories in agent systems, balancing information preservation with token efficiency.
  • Skill Modularity: The on-demand skill loading mechanism offers a scalable and user-friendly approach to integrating domain expertise, reducing fragmentation in specialized system design.

Practical Implications:

  • Performance Enhancement: GEMS provides a practical framework for significantly boosting the performance of existing open-source and lightweight generative models on complex and specialized tasks, potentially reducing reliance on massive closed-source models.
  • Accessibility and Extensibility: The modular skill system allows users and developers to easily contribute new domain expertise, making high-fidelity generation for niche applications more accessible.
  • Efficiency Gains: By shifting the iteration distribution towards earlier termination (average iterations reduced from 3.26 to 2.80 with Memory and Skill), GEMS improves quality while managing computational cost.

Conclusion

GEMS is an agent-native multimodal generation framework that integrates iterative refinement (Agent Loop), persistent trajectory memory (Agent Memory), and extensible domain skills (Agent Skill) to address the challenges of complex instruction following and specialized downstream tasks. Extensive experiments validate that GEMS delivers substantial performance gains across diverse benchmarks and generative backends. Most notably, it enables a lightweight 6B model to surpass a state-of-the-art closed-source model on a challenging benchmark, demonstrating the transformative potential of agentic systems in unlocking model capabilities. Future work will focus on optimizing inference latency, exploring greater agent autonomy, and extending the framework to other modalities like video generation.