# TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

> TRUST-SQL introduces a multi-turn agent trained with dual-track reinforcement learning to solve Text-to-SQL over unknown schemas, matching full-schema performance without pre-loaded metadata.

- **Source:** [arXiv](https://arxiv.org/abs/2603.16448)
- **Published:** 2026-03-19
- **Permalink:** https://picx.dev/p/EWB5ah
- **Whiteboard:** https://picx.dev/p/EWB5ah/image

## Summary

# Summary of "TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas"

## Summary (Overview)
*   **Problem Definition:** The paper identifies a critical limitation in existing Text-to-SQL research: the reliance on the **Full Schema Assumption**, where the complete database schema is pre-loaded into the model's context. This is impractical in real-world enterprise environments with massive, evolving schemas. The paper formalizes the **Unknown Schema** setting, where an agent must actively explore an unobservable database to retrieve relevant metadata.
*   **Core Solution:** The authors propose **TRUST-SQL**, an autonomous agent framework that interacts with databases via a structured **four-phase interaction protocol** (Explore, Propose, Generate, Confirm) to ground reasoning in verified metadata, preventing hallucinations.
*   **Key Innovation:** To train this agent, they introduce **Dual-Track GRPO**, a novel reinforcement learning strategy. It leverages the structural boundary of the *Propose* phase to isolate optimization signals for schema exploration and SQL generation, effectively resolving the **credit assignment** problem in multi-turn trajectories.
*   **Main Results:** Extensive experiments across five benchmarks (BIRD, Spider and its variants) show that TRUST-SQL achieves massive performance gains over base models (avg. +30.6% for 4B, +16.6% for 8B) in the Unknown Schema setting. Remarkably, it matches or surpasses strong baselines that rely on schema prefilling, despite having no pre-loaded metadata.

## Introduction and Theoretical Foundation
Text-to-SQL parsing has advanced significantly but predominantly under the **Full Schema Assumption**, where models act as passive translators given a complete, static schema. This assumption fails in real enterprise settings where databases contain hundreds of tables with noisy, evolving metadata. Injecting full schemas is impractical due to context limits and can be harmful due to distractions from irrelevant information.

The paper formalizes the necessary shift to the **Unknown Schema** setting (illustrated in Figure 1 of the original text). Here, the agent cannot see the database schema upfront and must actively explore to retrieve necessary metadata. This transforms the task from static translation into a **multi-turn, tool-integrated decision-making process**.

However, this introduces new challenges:
1.  **Architectural:** LLMs struggle with coherent reasoning over long interaction horizons and tend to hallucinate schema elements.
2.  **Algorithmical:** **Credit assignment** is difficult; it's hard to attribute a final SQL execution success/failure to specific exploration or generation actions within a long trajectory.

To address these, the paper frames the problem as a **Partially Observable Markov Decision Process (POMDP)**, where the true database state is hidden, and the agent must act based on partial observations (tool feedback).

## Methodology
TRUST-SQL consists of two core components: a **four-phase interaction protocol** and the **Dual-Track GRPO** training strategy.

**1. Four-Phase Interaction Protocol:**
A pilot study (Figure 3) justified this design by showing that a *Propose* phase drastically reduces hallucination errors. The protocol defines a strict action space:
*   **Explore:** Query database metadata (e.g., list tables, describe table).
*   **Propose (Mandatory Checkpoint):** Commit to a verified schema subset $K_t$ based on exploration. This prevents subsequent generation from using unverified metadata.
*   **Generate:** Produce a candidate SQL query grounded in $K_t$.
*   **Confirm:** Submit the final SQL answer.

The agent maintains an internal context state $c_t = (q, h_t, K_t)$, where $q$ is the user question, $h_t$ is the interaction history, and $K_t$ is the verified schema knowledge (initially empty).

**2. Reward Design:**
Three reward signals are defined:
*   **Execution Reward ($R_{exec}$):** Evaluates the final SQL $y$ against ground truth $y^*$ via database execution.
    $$R_{exec}(y, y^*) =
    \begin{cases}
    1.0 & \text{if } Exec(y) = Exec(y^*) \\
    0.2 & \text{if } Exec(y) \neq \emptyset \\
    0.0 & \text{if } Exec(y) = \emptyset
    \end{cases}$$
*   **Format Reward ($R_{fmt}$):** A trajectory-level bonus (0.1) for fully adhering to the interaction protocol.
*   **Schema Reward ($R_{schema}$):** Evaluates the quality of the proposed schema $\hat{K}$ against the minimal ground truth schema $K^*$: $R_{schema}(\hat{K}, K^*) = f_{match}(\hat{K}, K^*)$.

**3. Dual-Track GRPO:**
This is the key innovation for credit assignment. It leverages the *Propose* phase as a structural boundary to split the optimization into two tracks (see Figure 2 bottom):
*   **Schema Track ($\tau_{schema}$):** Spans from the start to the *Propose* step ($T_{schema}=t_{propose}$). It is optimized using only the schema reward $R_{schema}$.
*   **Full Track ($\tau_{full}$):** Spans the entire interaction to the final *Confirm* step ($T_{full}=T$). It is optimized using the combined execution and format rewards $R_{exec} + R_{fmt}$.

For a batch of $G$ trajectories, advantages are computed per track using group-relative normalization:
$$A_k^i = \frac{R_k^i - \mu_k}{\sigma_k + \epsilon}, \quad k \in \{\text{schema}, \text{full}\}$$
where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of rewards for track $k$. **Token-level masking** ensures advantages are broadcast only to tokens generated within each track's active steps.

The total loss combines the GRPO losses from both tracks:
$$L(\theta) = L_{full}(\theta) + \lambda \cdot L_{schema}(\theta)$$
where $\lambda$ controls the relative weight of the Schema Track.

## Empirical Validation / Results
**Experimental Setup:** Models are built on Qwen3-4B and Qwen3-8B. They are compared against strong single-turn (OmniSQL, SQL-R1) and multi-turn RL baselines (MTIR-SQL, SQL-Trail) on five benchmarks: BIRD-Dev, Spider-Test, Spider-DK, Spider-Syn, and Spider-Realistic. **Execution Accuracy (EX%)** is the primary metric.

**Main Results (Table 1):**
*   **TRUST-SQL-4B** achieves **64.9%** (greedy) and **67.2%** (majority voting) on BIRD-Dev, outperforming the strong MTIR-SQL-4B baseline (63.1%).
*   **TRUST-SQL-8B** achieves the highest scores on BIRD-Dev (**65.8%** greedy, **67.7%** majority) and shows strong generalization on robustness benchmarks (Spider-Syn, Spider-Realistic).
*   Crucially, TRUST-SQL achieves these results **without schema prefilling**, matching or surpassing baselines that have full schema access.

**Key Ablation and Analysis Results:**
1.  **Value of Autonomous Exploration (Table 2):** Base Qwen3 models collapse without schema prefilling (e.g., Qwen3-4B drops 17.0% on BIRD). TRUST-SQL provides massive gains over base models (avg. +30.6% for 4B, +16.6% for 8B). Furthermore, injecting full schema into TRUST-SQL provides negligible or even negative benefits, proving its exploration is sufficient and robust to noisy metadata.
2.  **Dual-Track GRPO Effectiveness (Figure 4):** The optimal setting ($\lambda=0.25$) yields **64.5%** on BIRD-Dev, a +3.6% gain over a pure execution baseline and a +5.8% gain over naively mixing schema and execution rewards. This confirms the method resolves credit assignment.
3.  **Schema Reward Design (Figure 5):** The best formulation is **Sparse + Coupled** (binary $f_{match}$, reward given only if $R_{exec}=1.0$), achieving 64.5%. Decoupling the reward or using a dense reward leads to worse performance.
4.  **Training Dynamics (Table 3):** A two-stage pipeline (SFT warm-up + RL) is necessary. RL-only training leads to a degenerate policy that "hacks" the reward by exhaustively querying all metadata upfront, bypassing genuine exploration.

## Theoretical and Practical Implications
*   **Theoretical:** The work provides a formal POMDP formulation for the Unknown Schema Text-to-SQL task. It demonstrates the necessity of **structural boundaries** in agent protocols for effective credit assignment in hierarchical RL for language agents. The Dual-Track mechanism offers a generalizable principle for co-optimizing disparate sub-tasks (exploration vs. generation) within a single trajectory.
*   **Practical:** TRUST-SQL enables the deployment of Text-to-SQL systems in real-world enterprise environments where full schema injection is impossible. It reduces context window waste and improves robustness by filtering out irrelevant or noisy metadata. The framework demonstrates that **active exploration can be more effective than passive consumption of full schemas**, establishing a new paradigm for reliable database interaction.

## Conclusion
TRUST-SQL successfully addresses the limitations of the Full Schema Assumption by introducing an autonomous agent framework for the Unknown Schema setting. Its structured four-phase protocol grounds reasoning and prevents hallucinations, while the novel Dual-Track GRPO training strategy effectively resolves credit assignment, leading to a **9.9% relative improvement** over standard GRPO. The framework achieves state-of-the-art or competitive performance across diverse benchmarks without relying on pre-loaded metadata, proving the feasibility and effectiveness of active database exploration. Future work may address inference overhead, support for other SQL dialects, and dynamic turn budgets.

---

_Markdown view of https://picx.dev/p/EWB5ah, served by PicX — AI-generated visual whiteboard summaries of research papers._