Macaron-A2UI: A Model for Generative UI in Personal Agents

Summary (Overview)

  • Generative UI as a New Interface Layer: The paper argues that static plain-text chat is a bottleneck for personal agents handling complex tasks. Generative UI, which dynamically synthesizes appropriate controls and interfaces in real-time, is presented as the necessary evolution.
  • The Macaron-A2UI Model: The authors introduce a model that enables agents to generate natural language responses alongside lightweight, executable UI actions (using the A2UI declarative protocol) for tasks like information collection and preference refinement.
  • Scalable Training Pipeline: The work presents a comprehensive pipeline for the problem: 1) constructing a large-scale Generative UI corpus (14,245 samples) from heterogeneous dialogue sources, 2) introducing a dedicated benchmark (A2UI-Bench) for evaluation, and 3) training models (30B, 235B, 754B) via parameter-efficient LoRA-based SFT followed by reward-driven reinforcement learning (GRPO).
  • Strong Performance with Internalized Knowledge: The best model (Macaron-A2UI-Venti, 754B) achieves an overall score of 75.6 on A2UI-Bench without explicit schema hints at inference, surpassing the strongest frontier baseline (GPT-5.4) that is provided with the full schema. This demonstrates that Generative UI competence can be successfully internalized through training.

Introduction and Theoretical Foundation

The core assumption is that human-computer interaction is shifting from fixed, population-designed interfaces to flexible, personalized interfaces generated on-demand to match a user's immediate goal and context. This establishes Generative UI as an essential direction. When plain text is insufficient for tasks requiring structured interaction (e.g., providing information, comparing options, confirming decisions), lightweight generative interfaces can reduce cognitive load and make interactions more efficient.

Current research lacks a unified formulation for agent-side UI generation. The field is missing: 1) large-scale UI-grounded dialogue supervision, 2) evaluation benchmarks separating protocol validity from interaction quality, and 3) evidence that models can internalize this capability without long, explicit schema prompts.

This paper formulates Generative UI for personal agents as a learning problem: given a system instruction, dialogue history, and current user message, the model must produce a unified response containing both natural language and an executable UI action sequence. This is instantiated using A2UI, a declarative UI protocol that provides a renderable, automatically checkable foundation.

Methodology

1. A2UI Corpus Construction

A hybrid rule-and-LLM pipeline converts four heterogeneous dialogue sources into a Generative UI corpus.

  • Sources: MultiWOZ 2.2 & Schema-Guided Dialogue (SGD) for task-oriented assistance; ESConv for emotional support; AnnoMI for motivational interviewing.
  • Process: Source dialogues are normalized (merged speaker utterances). Dataset-specific annotations (e.g., dialogue acts, support strategies) are mapped to a unified intermediate interaction representation, which is then mapped to A2UI component families (e.g., selections, sliders).
  • Annotation Strategy:
    • Task-oriented data (MultiWOZ, SGD): Primarily rule-driven using a state-machine tracker.
    • Open-domain data (ESConv, AnnoMI): Uses a two-stage LLM process (Editor for global UI planning, Author for local component generation).
  • Post-Processing & Augmentation: Rule-based fixes ensure structural validity. Component-targeted augmentation (4,165 samples) expands coverage of under-represented components like sliders, date inputs, and modals.
  • Validation: A four-level linting pipeline (format, structure, data-binding, semantic) with error-feedback retry achieves a final renderability rate of 99.2%.

Corpus Statistics (Table 1):

Source DomainBase dlg.Orig. / Aug. SamplesTotal SamplesUI / TextUI ratio
MultiWOZ9973,673 / 1,7515,4244,361 / 1,06380.4%
SGD3,1093,692 / 1,0654,7573,761 / 99679.1%
ESConv100760 / 3381,098604 / 49455.0%
AnnoMI1001,955 / 1,0112,9661,484 / 1,48250.0%
Total4,30610,080 / 4,16514,24510,210 / 4,03571.7%

2. A2UI-Bench Benchmark

A dedicated benchmark of 300 tasks for controlled evaluation, organized by task structure:

  • Atomic Tasks: Single-turn, single-intent. Measures core turn-level UI decision and generation.
  • Depth Tasks: Multi-turn episodes. Evaluates cross-turn consistency and state management.
  • Width Tasks: Single-turn, compositionally broad (multiple intents). Tests structural organization.

Evaluation Metrics: A two-perspective framework.

  • Language-side (Three Levels):
    • L1 - Protocol Correctness: Automated checks for JSON parsing, schema compliance, reference integrity, etc.
    • L2 - Task Construction Quality: LLM-judged dimensions like trigger appropriateness, component-intent alignment, and text-UI grounding.
    • L3 - User Experience Quality: LLM-judged dimensions like value-addition over text, conversational naturalness, and cognitive load.
  • Visual-side (Three Dimensions): Rendered UI screenshots are scored by a VLM judge for V1 visual integrity, V2 task alignment, and V3 action clarity.

3. Training Pipeline

A parameter-efficient two-stage pipeline using LoRA adaptation.

  1. Supervised Fine-Tuning (SFT): Teaches the basic response format. The objective is the standard autoregressive negative log-likelihood: LSFT=t=1Tlogpθ(ytx,y<t)L_{\text{SFT}} = -\sum_{t=1}^{T} \log p_{\theta}(y_t | x, y_{<t}) where xx is the context and yy is the target response containing both text_response and a2ui actions.
  2. Group-Relative Policy Optimization (GRPO): Refines behavior with an interaction-oriented reward. For a prompt xix_i, a group of GG candidate responses {yi,1,...,yi,G}\{y_{i,1}, ..., y_{i,G}\} is sampled. The group-relative advantage for candidate jj is: Ai,j=Ri,j1Gk=1GRi,kA_{i,j} = R_{i,j} - \frac{1}{G}\sum_{k=1}^{G} R_{i,k} The GRPO objective is: LGRPO=ij=1Gt=1yi,jAi,jlogpθ(yi,j,txi,yi,j,<t)L_{\text{GRPO}} = -\sum_{i} \sum_{j=1}^{G} \sum_{t=1}^{|y_{i,j}|} A_{i,j} \log p_{\theta}(y_{i,j,t} | x_i, y_{i,j,<t})

Reward Design: The GRPO reward RR is a weighted combination of scores mirroring the evaluation metrics, gated by hard structural checks (malformed JSON leads to zero reward):

R=1[pass](λ1SL1+λ2SL2+λ3SL3)R = \mathbf{1}[\text{pass}] \cdot (\lambda_1 S_{L1} + \lambda_2 S_{L2} + \lambda_3 S_{L3})

Empirical Validation / Results

Main Results

Models are evaluated under two regimes: w/o schema (lightweight instructions, testing internalized knowledge) and w/ schema (full protocol provided, an upper bound).

Key Result Table (Table 2 - Excerpt):

ModelPromptL1L2L3V1V2V3Avg.
GPT-5.4w/ schema4.023.593.273.463.733.173.54
Gemini-3.1-Prow/ schema4.253.202.963.533.553.043.42
Macaron-A2UI-Grandew/o schema4.673.222.913.953.743.473.66
Macaron-A2UI-Ventiw/o schema4.473.363.283.953.763.523.72
  • Effectiveness of Training: SFT provides massive gains (e.g., Qwen-235B overall score from 21.6 to 63.6). RL further improves scores (to 74.2 for Qwen-235B).
  • Internalization Achieved: The best model (Macaron-A2UI-Venti, GLM-5.1 backbone) achieves an overall score of 75.6 without schema hints, surpassing the strongest full-schema frontier baseline (GPT-5.4 at 74.1).
  • Untuned Models Struggle: Frontier models (GPT-5.4, DeepSeek-V3.1) perform poorly (scores ~20-26) without schema hints, confirming that lightweight instructions are insufficient.

Analysis

  • Per-Dataset/Task Performance: The trained models show strong, robust performance across all four source datasets and three task types (Atomic, Depth, Width).
  • RL Dynamics (Figure 6): During GRPO training, the L1 reward (protocol correctness) increases first and most rapidly. Improvements in L2 (task construction) and L3 (user experience) are more gradual, with L3 being the hardest to optimize, especially for the 30B model.

Theoretical and Practical Implications

  • Establishes a Learning Formulation: The work provides a complete framework (data, benchmark, training) for studying Generative UI as a learnable capability for agents, moving beyond text-only or code-generation paradigms.
  • Demonstrates Internalization is Possible: A key finding is that models can internalize UI generation knowledge, reducing or eliminating the need for lengthy, explicit schema prompts at inference time. This is crucial for practical deployment where latency and prompt efficiency matter.
  • Enables More Efficient Human-Agent Interaction: By generating appropriate lightweight interfaces on-demand, personal agents can make complex interactions (multi-goal organization, preference refinement) shorter, clearer, and less cognitively demanding for users.
  • Provides Open Resources: The release of models, benchmark, and evaluation protocol supports future research and development in Generative UI for personal agents.

Conclusion

The paper presents Macaron-A2UI, a comprehensive study demonstrating that Generative UI capability for personal agents can be formulated as a learning problem and successfully internalized by models through a scalable training pipeline. The best model, trained with schema-light SFT and reward-driven RL, performs competitively with or surpasses frontier models that require full schema prompting.

Limitations & Future Work: The approach is tied to the evolving A2UI v0.8 protocol. Model capability, especially for complex multi-turn interaction and user experience, remains a bottleneck. Latency in real-time generation, validation, and rendering is also a practical concern. Future work will explore more general, flexible, and token-efficient methods for building Generative UI systems.