Summary of "TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas"

Summary (Overview)

Problem Definition: The paper identifies a critical limitation in existing Text-to-SQL research: the reliance on the Full Schema Assumption, where the complete database schema is pre-loaded into the model's context. This is impractical in real-world enterprise environments with massive, evolving schemas. The paper formalizes the Unknown Schema setting, where an agent must actively explore an unobservable database to retrieve relevant metadata.
Core Solution: The authors propose TRUST-SQL, an autonomous agent framework that interacts with databases via a structured four-phase interaction protocol (Explore, Propose, Generate, Confirm) to ground reasoning in verified metadata, preventing hallucinations.
Key Innovation: To train this agent, they introduce Dual-Track GRPO, a novel reinforcement learning strategy. It leverages the structural boundary of the Propose phase to isolate optimization signals for schema exploration and SQL generation, effectively resolving the credit assignment problem in multi-turn trajectories.
Main Results: Experiments across five benchmarks (BIRD, Spider and its variants) show that TRUST-SQL improves substantially over its base models in the Unknown Schema setting (avg. +30.6% for 4B, +16.6% for 8B). It matches or surpasses strong baselines that rely on schema prefilling, despite having no pre-loaded metadata.

Introduction and Theoretical Foundation

Text-to-SQL parsing has advanced significantly, but predominantly under the Full Schema Assumption, where models act as passive translators given a complete, static schema. This assumption fails in real enterprise settings where databases contain hundreds of tables with noisy, evolving metadata. Injecting the full schema is impractical because of context limits, and it can hurt accuracy because irrelevant tables and columns distract the model.

The paper formalizes the necessary shift to the Unknown Schema setting (illustrated in Figure 1 of the original text). Here, the agent cannot see the database schema upfront and must actively explore to retrieve necessary metadata. This transforms the task from static translation into a multi-turn, tool-integrated decision-making process.

However, this introduces new challenges:

Architectural: LLMs struggle with coherent reasoning over long interaction horizons and tend to hallucinate schema elements.
Algorithmic: Credit assignment is difficult; it is hard to attribute the success or failure of the final SQL execution to specific exploration or generation actions within a long trajectory.

To address these, the paper frames the problem as a Partially Observable Markov Decision Process (POMDP), where the true database state is hidden, and the agent must act based on partial observations (tool feedback).

Methodology

TRUST-SQL consists of two core components: a four-phase interaction protocol and the Dual-Track GRPO training strategy.

1. Four-Phase Interaction Protocol: A pilot study (Figure 3) justified this design by showing that a Propose phase drastically reduces hallucination errors. The protocol defines a strict action space:

Explore: Query database metadata (e.g., list tables, describe table).
Propose (Mandatory Checkpoint): Commit to a verified schema subset $K_t$ based on exploration. This prevents subsequent generation from using unverified metadata.
Generate: Produce a candidate SQL query grounded in $K_t$.
Confirm: Submit the final SQL answer.

The agent maintains an internal context state $c_t = (q, h_t, K_t)$, where $q$ is the user question, $h_t$ is the interaction history, and $K_t$ is the verified schema knowledge (initially empty).
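The ordering constraints of the protocol can be sketched as a small state machine. This is a minimal illustration, not the authors' implementation: the `Context` fields beyond $(q, h_t, K_t)$, the `step` function, and the rule that a proposed item must have been seen during Explore are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set, Tuple


class Phase(Enum):
    EXPLORE = "explore"
    PROPOSE = "propose"
    GENERATE = "generate"
    CONFIRM = "confirm"


@dataclass
class Context:
    """Agent context c_t = (q, h_t, K_t), plus bookkeeping fields (hypothetical)."""
    question: str                                                   # q
    history: List[Tuple[Phase, object]] = field(default_factory=list)  # h_t
    verified_schema: Set[str] = field(default_factory=set)          # K_t, starts empty
    observed: Set[str] = field(default_factory=set)   # items returned by tool calls
    candidate_sql: Optional[str] = None
    final_sql: Optional[str] = None


def step(ctx: Context, phase: Phase, payload) -> Context:
    """Apply one action and enforce the assumed ordering rules."""
    if ctx.final_sql is not None:
        raise ValueError("episode already confirmed")
    if phase is Phase.EXPLORE:
        # payload: schema items (table or column names) seen in tool feedback
        ctx.observed |= set(payload)
    elif phase is Phase.PROPOSE:
        # Only metadata that was actually observed may be committed to K_t.
        unverified = set(payload) - ctx.observed
        if unverified:
            raise ValueError(f"unverified schema items: {sorted(unverified)}")
        ctx.verified_schema = set(payload)
    elif phase is Phase.GENERATE:
        if not ctx.verified_schema:
            raise ValueError("Generate requires a prior Propose")
        ctx.candidate_sql = payload
    elif phase is Phase.CONFIRM:
        if ctx.candidate_sql is None:
            raise ValueError("Confirm requires a prior Generate")
        ctx.final_sql = ctx.candidate_sql
    ctx.history.append((phase, payload))
    return ctx
```

A valid episode calls `step` with Explore, Propose, Generate, and Confirm in that order; calling Generate on a fresh `Context` raises `ValueError`, which mirrors the role of Propose as a mandatory checkpoint.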

2. Reward Design: Three reward signals are defined:

Execution Reward ($R_{exec}$): Evaluates the final SQL $y$ against ground truth $y^*$ via database execution: $R_{exec} = \begin{cases} 1.0 & \text{if } Exec(y) = Exec(y^*) \\ 0.2 & \text{if } Exec(y) \neq \emptyset \\ 0.0 & \text{if } Exec(y) = \emptyset \end{cases}$
Format Reward ($R_{fmt}$): A trajectory-level bonus (0.1) for fully adhering to the interaction protocol.
Schema Reward ($R_{schema}$): Evaluates the quality of the proposed schema $\hat{K}$ against the minimal ground truth schema $K^*$: $R_{schema}(\hat{K}, K^*) = f_{match}(\hat{K}, K^*)$.
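The three signals can be sketched against an SQLite database as follows. Several choices here are assumptions made for illustration only: the cases of $R_{exec}$ are checked in order, $Exec(y) = \emptyset$ is read as "the query fails or returns no rows", result sets are compared without regard to row order, and $f_{match}$ is taken to be a binary coverage test. The helper names are hypothetical.

```python
import sqlite3


def run_sql(conn, sql):
    """Return the rows in a canonical order, or None if the query fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_reward(conn, pred_sql, gold_sql):
    """R_exec with the cases checked top to bottom (assumed reading)."""
    pred, gold = run_sql(conn, pred_sql), run_sql(conn, gold_sql)
    if pred is not None and pred == gold:
        return 1.0          # same execution result as the ground truth
    if pred:                # executed and returned at least one row
        return 0.2
    return 0.0              # failed to execute or returned nothing


def format_reward(followed_protocol):
    """R_fmt: trajectory-level bonus for a protocol-compliant episode."""
    return 0.1 if followed_protocol else 0.0


def schema_reward(proposed, gold):
    """R_schema with a binary f_match: 1.0 iff the proposal covers K* (assumed)."""
    return 1.0 if set(gold) <= set(proposed) else 0.0
```

Sorting rows before comparison is a simplification; queries whose semantics depend on `ORDER BY` would need an order-sensitive comparison.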

3. Dual-Track GRPO: This is the key innovation for credit assignment. It leverages the Propose phase as a structural boundary to split the optimization into two tracks (see Figure 2 bottom):

Schema Track ($\tau_{schema}$): Spans from the start to the Propose step ($T_{schema}=t_{propose}$). It is optimized using only the schema reward $R_{schema}$.
Full Track ($\tau_{full}$): Spans the entire interaction to the final Confirm step ($T_{full}=T$). It is optimized using the combined execution and format rewards $R_{exec} + R_{fmt}$.

For a batch of $G$ trajectories, advantages are computed per track using group-relative normalization:

$$A_k^i = \frac{R_k^i - \mu_k}{\sigma_k + \epsilon}, \quad k \in \{\text{schema}, \text{full}\}$$

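where $R_k^i$ is the track-$k$ reward of trajectory $i$, $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the track-$k$ rewards over the $G$ trajectories, and $\epsilon$ is a small constant that avoids division by zero. Token-level masking ensures advantages are broadcast only to tokens generated within each track's active steps.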
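A minimal sketch of the per-track advantage and the token mask, using only the standard library. The function names, the use of the population standard deviation, and the representation of a trajectory as a list of step indices (one per generated token) are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """A^i = (R^i - mu) / (sigma + eps) over one group of G trajectories."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def track_mask(step_ids, last_step):
    """1 for generated tokens whose step index is <= last_step, else 0.

    Schema Track: last_step = t_propose. Full Track: last_step = T.
    """
    return [1 if s <= last_step else 0 for s in step_ids]


def broadcast(advantage, mask):
    """Spread one trajectory-level advantage over the track's active tokens."""
    return [advantage * m for m in mask]


# Example: six generated tokens spread over steps 0..3, Propose at step 1.
step_ids = [0, 0, 1, 1, 2, 3]
schema_mask = track_mask(step_ids, last_step=1)   # [1, 1, 1, 1, 0, 0]
full_mask = track_mask(step_ids, last_step=3)     # [1, 1, 1, 1, 1, 1]
```

Because each track is normalized separately, a trajectory can receive a positive schema advantage and a negative full-track advantage at the same time, which is what lets the two signals be attributed to different token spans.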

The total loss combines the GRPO losses from both tracks:

$$L(\theta) = L_{full}(\theta) + \lambda \cdot L_{schema}(\theta)$$

where $\lambda$ controls the relative weight of the Schema Track.
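The combination can be sketched with a simplified per-track loss. Only the weighted sum comes from the text; the PPO-style clipped surrogate, the clip range of 0.2, the token-mean normalization, and the omission of any KL term are assumptions made to keep the example self-contained.

```python
def grpo_track_loss(ratios, advantage, mask, clip=0.2):
    """Masked clipped surrogate for one trajectory on one track (simplified).

    ratios:    per-token probability ratios pi_theta / pi_old
    advantage: the trajectory's group-relative advantage on this track
    mask:      1 for tokens inside the track's active steps, else 0
    """
    total, count = 0.0, 0
    for r, m in zip(ratios, mask):
        if not m:
            continue
        clipped = min(max(r, 1.0 - clip), 1.0 + clip)
        total += min(r * advantage, clipped * advantage)
        count += 1
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -total / count if count else 0.0


def dual_track_loss(loss_full, loss_schema, lam=0.25):
    """L = L_full + lambda * L_schema."""
    return loss_full + lam * loss_schema
```

With `lam=0` this reduces to optimizing the Full Track alone, so $\lambda$ can be read as how strongly the schema signal is allowed to shape the tokens before the Propose step.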

Empirical Validation / Results

Experimental Setup: Models are built on Qwen3-4B and Qwen3-8B. They are compared against strong single-turn (OmniSQL, SQL-R1) and multi-turn RL baselines (MTIR-SQL, SQL-Trail) on five benchmarks: BIRD-Dev, Spider-Test, Spider-DK, Spider-Syn, and Spider-Realistic. Execution Accuracy (EX%) is the primary metric.

Main Results (Table 1):

TRUST-SQL-4B achieves 64.9% (greedy) and 67.2% (majority voting) on BIRD-Dev, outperforming the strong MTIR-SQL-4B baseline (63.1%).
TRUST-SQL-8B achieves the highest scores on BIRD-Dev (65.8% greedy, 67.7% majority) and shows strong generalization on robustness benchmarks (Spider-Syn, Spider-Realistic).
Crucially, TRUST-SQL achieves these results without schema prefilling, matching or surpassing baselines that have full schema access.
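The summary does not describe how majority voting is carried out. A common execution-based variant, shown here purely as an assumed illustration, samples several candidate queries, groups them by execution result, and returns a query from the largest group. The function name and input format are hypothetical.

```python
from collections import Counter


def majority_vote(candidates):
    """Pick the SQL whose execution result is most frequent among the samples.

    candidates: list of (sql, result) pairs, where result is a hashable
    value such as a tuple of row tuples, or None if execution failed.
    """
    valid = [(sql, res) for sql, res in candidates if res is not None]
    if not valid:
        return None
    counts = Counter(res for _, res in valid)
    top_result, _ = counts.most_common(1)[0]
    # Return the first sampled query that produced the winning result.
    return next(sql for sql, res in valid if res == top_result)
```

Greedy decoding corresponds to a single candidate, in which case the vote simply returns that query if it executes.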

Key Ablation and Analysis Results:

Value of Autonomous Exploration (Table 2): Base Qwen3 models collapse without schema prefilling (e.g., Qwen3-4B drops 17.0% on BIRD). TRUST-SQL provides large gains over base models (avg. +30.6% for 4B, +16.6% for 8B). Furthermore, injecting the full schema into TRUST-SQL yields negligible or even negative benefit, suggesting that its exploration already retrieves the metadata it needs and that it is robust to noisy metadata.
Dual-Track GRPO Effectiveness (Figure 4): The optimal setting ($\lambda=0.25$) yields 64.5% on BIRD-Dev, a +3.6% gain over a pure execution baseline and a +5.8% gain over naively mixing schema and execution rewards. This confirms the method resolves credit assignment.
Schema Reward Design (Figure 5): The best formulation is Sparse + Coupled (binary $f_{match}$, reward given only if $R_{exec}=1.0$), achieving 64.5%. Decoupling the reward or using a dense reward leads to worse performance.
Training Dynamics (Table 3): A two-stage pipeline (SFT warm-up + RL) is necessary. RL-only training leads to a degenerate policy that "hacks" the reward by exhaustively querying all metadata upfront, bypassing genuine exploration.
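As a quick consistency check on the Dual-Track figures, and assuming the reported gains are absolute differences in EX, the implied baseline scores on BIRD-Dev are:

$$64.5 - 3.6 = 60.9 \ \text{(execution-only)}, \qquad 64.5 - 5.8 = 58.7 \ \text{(naively mixed rewards)}$$

$$\frac{64.5 - 58.7}{58.7} = \frac{5.8}{58.7} \approx 0.099$$

That is roughly a 9.9% relative gain over the naively mixed setting, which would agree with the relative improvement over standard GRPO quoted in the Conclusion if that setting is the baseline meant there. This mapping is inferred from the numbers in this summary rather than stated in it.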

Theoretical and Practical Implications

Theoretical: The work provides a formal POMDP formulation for the Unknown Schema Text-to-SQL task. It demonstrates the necessity of structural boundaries in agent protocols for effective credit assignment in hierarchical RL for language agents. The Dual-Track mechanism offers a generalizable principle for co-optimizing disparate sub-tasks (exploration vs. generation) within a single trajectory.
Practical: TRUST-SQL enables the deployment of Text-to-SQL systems in real-world enterprise environments where full schema injection is impossible. It reduces context window waste and improves robustness by filtering out irrelevant or noisy metadata. The framework demonstrates that active exploration can be more effective than passive consumption of full schemas, establishing a new paradigm for reliable database interaction.

Conclusion

TRUST-SQL addresses the limitations of the Full Schema Assumption by introducing an autonomous agent framework for the Unknown Schema setting. Its structured four-phase protocol grounds reasoning and prevents hallucinations, while the Dual-Track GRPO training strategy resolves credit assignment, leading to a 9.9% relative improvement over standard GRPO. The framework achieves state-of-the-art or competitive performance across diverse benchmarks without relying on pre-loaded metadata, demonstrating the feasibility and effectiveness of active database exploration. Future work may address inference overhead, support for other SQL dialects, and dynamic turn budgets.