# TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

> TIDE combines iterative discovery and reusable thought templates to proactively uncover multiple hidden problems from context, significantly outperforming baselines.

- **Source:** [arXiv](https://arxiv.org/abs/2606.04743)
- **Published:** 2026-06-06
- **Permalink:** https://picx.dev/p/tZMaHK
- **Whiteboard:** https://picx.dev/p/tZMaHK/image

## Summary (Overview)

- **Proactive multi-problem discovery task**: The paper formalizes the problem of discovering multiple hidden problems from a document context \(D\) without explicit user requests, producing predictions as triples (description, evidence subset, action).
- **TIDE framework**: Introduces two complementary mechanisms – *iterative discovery* (multi-round conditioning on cumulative state to broaden coverage) and *thought templates* (reusable schemas distilled from solved cases that specify what contextual signals to attend to and how to connect them).
- **Consistent gains across settings**: Evaluated on personal workspace and software repository settings with four LLM backbones (GPT-5 mini, Gemini 3.5 Flash, Claude Sonnet 4.5, Qwen 3.6 Flash), TIDE substantially outperforms single-shot and parallel multi-agent baselines on retrieval, identification, and resolution coverage and F1.
- **Complementary contributions**: Iteration drives coverage by redirecting capacity toward undiscovered problems; templates sharpen prediction precision by anchoring each discovery in a recognizable problem class.
- **Template transferability**: Thought templates built by one backbone remain effective when used with another backbone, indicating domain-level abstraction.

## Introduction and Theoretical Foundation

Large language model (LLM) agents are widely deployed as digital assistants that read documents, invoke tools, and operate over complex environments. However, they remain **reactive**: they act only after a user issues an explicit request. This interaction model presumes the user already knows what is wrong and what to ask. In practice, many consequential issues go unnoticed: a budget approval given verbally but not recorded, conflicting numbers in duplicate reports, a stale meeting still blocking a slot, etc. These hidden problems coexist in the same digital context, and none is articulated as a request.

Existing work on proactive agents has studied *when* to intervene (Liu et al., 2025; Zhang et al., 2025) or *how to anticipate a single localized need* (Lu et al., 2025; Yang et al., ; Pasternak et al., 2025), but largely sidesteps the multi-problem, context-wide setting that real workflows demand. The authors argue that proactive assistance is best framed as **discovering multiple hidden problems from context**, requiring both:
- **Broad coverage** over coexisting problems that compete for attention with more salient ones.
- **Per-prediction fidelity** – each candidate must be actionable, grounded in evidence, and paired with a concrete resolution.

Single-shot discovery fails on two fronts: (1) salient problems overshadow subtler ones, capping coverage; (2) without a prior on what evidence patterns constitute a problem, predictions drift to generic or speculative claims.

## Methodology

### Task Formulation (Section 2.1)

Consider an agent operating over a collection of documents \(D\) (artifacts in a workspace or functions in a codebase). Within \(D\) there exists a latent set of hidden problems:

$$
P^\star = \{ p^\star_1, p^\star_2, \ldots, p^\star_n \}
$$

where none is articulated as an explicit user request and cardinality \(n\) is unknown. The objective is to produce a predicted set \(\hat{P}\) approximating \(P^\star\), where each prediction is a triple:

$$
\hat{p} = (b, \hat{D}, a)
$$

- \(b\): natural-language description of the candidate problem.
- \(\hat{D} \subseteq D\): supporting subset that grounds the prediction in evidence.
- \(a\): concrete action proposing a resolution.

Solution quality is measured along two axes: **coverage** (how many gold problems are recovered) and **fidelity** (correctness of description, evidence, action).
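
The prediction triple can be pictured as a small record type. The following is a minimal sketch, not the authors' implementation: the class name, field names, document IDs, and the `is_grounded` check are all hypothetical, and requiring a non-empty evidence set is an added assumption on top of \(\hat{D} \subseteq D\).

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Prediction:
    """One predicted problem, mirroring the triple (b, D_hat, a)."""
    description: str              # b: natural-language description
    evidence_ids: FrozenSet[str]  # D_hat: IDs of supporting documents
    action: str                   # a: proposed resolution


def is_grounded(pred: Prediction, document_ids: Set[str]) -> bool:
    """Check that D_hat is a subset of D (and, by assumption, non-empty)."""
    return bool(pred.evidence_ids) and pred.evidence_ids <= document_ids


# Hypothetical document IDs, for illustration only.
docs = {"email-17", "sheet-03", "calendar-42"}
pred = Prediction(
    description="Budget approval was given verbally but never recorded.",
    evidence_ids=frozenset({"email-17", "sheet-03"}),
    action="Ask the approver to confirm the budget in writing.",
)
print(is_grounded(pred, docs))
```

A check like this only verifies that the cited evidence exists in the context; whether the evidence actually supports the description is what the fidelity metrics are meant to capture.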

### TIDE: Template-guided Iterative Discovery and rEsolution (Section 2.2)

TIDE couples two complementary components:

#### Thought Templates
Templates are reusable schemas distilled from previously solved cases. Each template \(t_i\) is a tuple:

$$
t_i = (\text{name}_i, \text{pattern}_i, \text{evidence flow}_i)
\tag{1}
$$

- **Name**: labels a recurring class of hidden problem.
- **Pattern**: structural form of that class.
- **Evidence flow**: ordered sequence of contextual signals to attend to and how to connect them to infer instances.

Templates are constructed once by prompting an LLM to abstract away instance-specific details from training cases \(\langle D_{\text{train}}, p_{\text{train}}, r_{\text{train}} \rangle\). At inference, the full set \(T\) is supplied to the agent as a library.

Example (personal workspace):
> **Template**: Conflicting Source-of-Truth Blocks Sign-off Under Deadline
> **Pattern**: A shared source artifact exists in conflicting versions across channels, and an imminent deadline blocks sign-off until one is made authoritative.
> **Evidence flow**: (i) locate the deliverable and its cited source; (ii) find conflicting copies across channels and confirm a material discrepancy; (iii) tie the conflict to a time-bounded review and the owner who can resolve it.
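
One way to hold such a template in code is shown below. This is a sketch under assumptions: the `ThoughtTemplate` class and the `render_template` helper are hypothetical names, and the plain-text rendering is only one plausible way to present a template library in a prompt. The field values are copied from the workspace example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThoughtTemplate:
    """A thought template t_i = (name, pattern, evidence flow), as in Eq. (1)."""
    name: str                       # label for a recurring class of hidden problem
    pattern: str                    # structural form of that class
    evidence_flow: Tuple[str, ...]  # ordered signals to attend to and connect


def render_template(template: ThoughtTemplate) -> str:
    """Serialize one template as plain text (an assumed prompt format)."""
    steps = "\n".join(
        f"  {i}. {step}" for i, step in enumerate(template.evidence_flow, start=1)
    )
    return (
        f"Template: {template.name}\n"
        f"Pattern: {template.pattern}\n"
        f"Evidence flow:\n{steps}"
    )


conflicting_source = ThoughtTemplate(
    name="Conflicting Source-of-Truth Blocks Sign-off Under Deadline",
    pattern=(
        "A shared source artifact exists in conflicting versions across "
        "channels, and an imminent deadline blocks sign-off until one is "
        "made authoritative."
    ),
    evidence_flow=(
        "locate the deliverable and its cited source",
        "find conflicting copies across channels and confirm a material discrepancy",
        "tie the conflict to a time-bounded review and the owner who can resolve it",
    ),
)
print(render_template(conflicting_source))
```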

#### Iterative Discovery and Resolution
Even with templates, a single pass still concentrates capacity on the most salient cases. TIDE instead surfaces predictions over multiple rounds, each conditioned on what has already been found:

Let \(\hat{P}^{(t)}\) be the cumulative state after round \(t\), initialized as \(\hat{P}^{(0)} = \emptyset\). In round \(t\), the agent generates a batch of up to \(k\) new candidates:

$$
\Delta \hat{P}^{(t)} = \text{LLM}\left(D, T, \hat{P}^{(t-1)}, k\right)
\tag{2}
$$

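The state then updates as \(\hat{P}^{(t)} = \hat{P}^{(t-1)} \cup \Delta \hat{P}^{(t)}\). The loop terminates after a fixed maximum number of rounds (written \(T\) in the experimental setup, and distinct from the template library \(T\) in Eq. 2) or earlier if a round returns no candidates. Each prediction within a round is already an actionable plan, combining identification, evidence retrieval, and an action proposal.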
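
The loop around Eq. (2) can be sketched as follows. This is an illustration under assumptions, not the paper's code: `propose` is a hypothetical stand-in for the LLM call, `toy_proposer` is a deterministic stub so the example runs without a model, and exact-match de-duplication stands in for the set union (a round that yields nothing new is treated as empty).

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# (description, evidence IDs, action)
Prediction = Tuple[str, FrozenSet[str], str]
Proposer = Callable[
    [Sequence[str], Sequence[str], List[Prediction], int], List[Prediction]
]


def iterative_discovery(
    documents: Sequence[str],
    templates: Sequence[str],
    propose: Proposer,
    k: int,
    max_rounds: int,
) -> List[Prediction]:
    """Accumulate predictions over rounds, conditioning on what was found."""
    found: List[Prediction] = []  # cumulative state, initially empty
    for _ in range(max_rounds):
        # One round: ask for up to k candidates given the cumulative state.
        batch = propose(documents, templates, list(found), k)[:k]
        new: List[Prediction] = []
        for candidate in batch:
            if candidate not in found and candidate not in new:
                new.append(candidate)
        if not new:  # nothing new this round: stop early
            break
        found.extend(new)  # union with the previous state
    return found


def toy_proposer(documents, templates, found, k):
    """Stub for the LLM: one hidden problem per document, skipping known ones."""
    hidden = [(f"problem in {d}", frozenset({d}), f"fix {d}") for d in documents]
    return [p for p in hidden if p not in found][:k]


result = iterative_discovery(
    ["a", "b", "c", "d", "e"], [], toy_proposer, k=2, max_rounds=10
)
print(len(result))  # 5: three productive rounds, then an empty round stops the loop
```

Passing the cumulative state into each call is what separates this loop from running the same prompt several times independently, which is the Multi-Agent baseline described in the experimental setup.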

### Experimental Setup (Section 3)

- **Datasets**:
  - **Personal Workspace**: 150 problems across 30 multi-problem workspaces, with 4–6 problems and 88–113 artifacts per workspace, following the construction pipeline of Pasternak et al. (2025).
  - **Software Repository**: 146 problems across 20 multi-bug instances drawn from 11 projects (SWE-BENCH, TESTEXPLORA), with 2–41 problems and 6–646 candidate functions per instance.
- **Baselines**:
  - **Single-Agent**: one LLM pass over \(D\) to produce all predictions.
  - **Multi-Agent**: multiple independent LLM agents in parallel (same number of calls as TIDE’s rounds).
- **Evaluation Metrics**: Each matched (gold, prediction) pair is scored on three components (retrieval, identification, resolution), and each component is then aggregated into coverage and F1:
  - **Retrieval**: overlap between predicted and gold evidence IDs.
  - **Identification**: LLM-judge score (Likert rubric) comparing the predicted description with the gold description.
  - **Resolution**: LLM-judge score comparing the predicted action with the gold reference.
  - **Coverage**: the per-gold matched score, averaged over all gold problems.
  - **F1**: harmonic mean of coverage and the analogous per-prediction score, which penalizes extraneous predictions.
- **Implementation**: 4 LLM backbones (GPT-5 mini, Gemini 3.5 Flash, Claude Sonnet 4.5, Qwen 3.6 Flash). LLM judge fixed to GPT-5 mini. Templates: 40 (workspace), 108 (repository). Iteration rounds: \(T=10\) (workspace), \(T=3\) (repository).
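
The coverage and F1 aggregation described above can be sketched as below. Several details are assumptions rather than facts from the paper: scores are taken to be normalized to [0, 1], matching is one-to-one, unmatched golds and unmatched predictions contribute zero, and Jaccard is used as one possible reading of "overlap" for retrieval. The function names are hypothetical.

```python
from typing import Sequence, Set, Tuple


def evidence_overlap(predicted: Set[str], gold: Set[str]) -> float:
    """Jaccard overlap of evidence IDs (an assumed choice of overlap measure)."""
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 0.0


def coverage_and_f1(
    matched_scores: Sequence[float], n_gold: int, n_pred: int
) -> Tuple[float, float]:
    """Aggregate per-pair scores into coverage and F1.

    matched_scores holds one score in [0, 1] per matched (gold, prediction)
    pair; golds and predictions without a match implicitly score 0.
    """
    total = sum(matched_scores)
    coverage = total / n_gold if n_gold else 0.0  # averaged over all golds
    per_pred = total / n_pred if n_pred else 0.0  # averaged over all predictions
    denom = coverage + per_pred
    f1 = 2 * coverage * per_pred / denom if denom else 0.0  # harmonic mean
    return coverage, f1


# Four gold problems, three predictions, two matched pairs.
coverage, f1 = coverage_and_f1([1.0, 0.5], n_gold=4, n_pred=3)
print(coverage)  # 0.375
print(f1)
```

Under these assumptions, adding predictions that match nothing lowers the per-prediction score and therefore F1, while leaving coverage unchanged.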

## Empirical Validation / Results

### Main Results (Table 1)

Table 1 shows retrieval, identification, and resolution coverage and F1 across both settings and four LLMs. TIDE consistently outperforms Single-Agent and Multi-Agent.

| LLM | Method | Workspace Retrieval Cov. | Workspace Retrieval F1 | Workspace Identification Cov. | Workspace Identification F1 | Workspace Resolution Cov. | Workspace Resolution F1 | Repository Retrieval Cov. | Repository Retrieval F1 | Repository Identification Cov. | Repository Identification F1 | Repository Resolution Cov. | Repository Resolution F1 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **GPT** | Single-Agent | 47.60 | 54.32 | 47.85 | 54.63 | 49.67 | 56.14 | 8.66 | 10.34 | 11.15 | 12.92 | 12.19 | 13.27 | 31.56 |
| | Multi-Agent | 32.15 | 45.41 | 27.24 | 38.85 | 29.64 | 41.85 | 10.11 | 12.66 | 10.19 | 12.77 | 9.89 | 12.39 | 23.59 |
| | **TIDE (Ours)** | **69.06** | **70.46** | **67.64** | **68.76** | **76.08** | **77.32** | **16.82** | **18.61** | **17.29** | **19.73** | **15.52** | **17.39** | **44.56** |
| **Gemini** | Single-Agent | 50.01 | 61.48 | 42.99 | 52.95 | 35.95 | 44.31 | 13.08 | 17.20 | 13.34 | 16.90 | 15.08 | 18.71 | 31.83 |
| | Multi-Agent | 46.10 | 58.54 | 38.98 | 49.95 | 32.44 | 41.54 | 14.75 | 18.51 | 14.71 | 17.72 | 15.32 | 18.89 | 30.62 |
| | **TIDE (Ours)** | **83.84** | **84.91** | **70.11** | **71.05** | **54.37** | **55.08** | **22.55** | **25.14** | **21.93** | **24.22** | **24.32** | **26.98** | **47.04** |
| **Claude** | Single-Agent | 13.82 | 30.73 | 13.53 | 25.74 | 17.69 | 22.00 | 12.85 | 16.09 | 9.86 | 12.18 | 9.84 | 12.34 | 16.39 |
| | Multi-Agent | 21.60 | 43.33 | 18.04 | 36.04 | 23.99 | 31.04 | 9.37 | 13.36 | 7.96 | 11.18 | 8.88 | 11.01 | 19.65 |
| | **TIDE (Ours)** | **32.01** | **55.51** | **35.77** | **62.44** | **46.49** | **54.88** | **19.99** | **22.70** | **14.50** | **16.50** | **15.79** | **17.76** | **32.86** |
| **Qwen** | Single-Agent | 30.46 | 42.05 | 32.44 | 44.67 | 28.60 | 37.60 | 5.60 | 6.83 | 5.34 | 6.62 | 4.84 | 5.76 | 20.90 |
| | Multi-Agent | 39.34 | 52.12 | 31.04 | 42.27 | 26.21 | 35.37 | 5.00 | 6.72 | 6.58 | 7.50 | 5.83 | 6.50 | 22.04 |
| | **TIDE (Ours)** | **52.39** | **60.21** | **50.50** | **58.13** | **41.87** | **48.06** | **9.94** | **11.33** | **6.87** | **8.07** | **8.47** | **9.70** | **30.46** |

*Table 1: Main results. Bold = best per-LLM. Coverage and F1 over three runs.*
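
The Avg. column is not defined in this summary, but it appears to be the unweighted mean of the twelve coverage and F1 scores in each row. A quick check on the two GPT rows, with values copied from Table 1 (the helper name `row_average` is hypothetical):

```python
# Scores copied from Table 1: six workspace columns, then six repository columns.
gpt_single_agent = [47.60, 54.32, 47.85, 54.63, 49.67, 56.14,
                    8.66, 10.34, 11.15, 12.92, 12.19, 13.27]
gpt_tide = [69.06, 70.46, 67.64, 68.76, 76.08, 77.32,
            16.82, 18.61, 17.29, 19.73, 15.52, 17.39]


def row_average(scores):
    """Unweighted mean of a row's scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


print(row_average(gpt_single_agent))  # 31.56, matching the Avg. column
print(row_average(gpt_tide))          # 44.56, matching the Avg. column
print(round(row_average(gpt_tide) - row_average(gpt_single_agent), 2))  # 13.0
```

Because the mean mixes both settings, the much lower repository scores pull each row's average down relative to its workspace columns.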

### In-Depth Analyses (Section 4.2)

- **Multi-problem instances (Fig. 2)**: TIDE discovers 4+ problems per instance (vs. 1–2 for baselines) and scales well with the gold problem count.
- **Effectiveness of iterative discovery (Fig. 3)**: Multi-Agent re-discovers the same salient problems across rounds, whereas TIDE continues to surface *newly discovered* problems in later rounds.
- **Effect of LLM-call budget (Fig. 4)**: TIDE's performance rises steeply with the LLM-call budget \(k\), whereas Multi-Agent plateaus. Even TIDE at \(k=2\) outperforms Multi-Agent at \(k=10\).
- **Thought templates vs. few-shot (Table 2)**: Replacing templates with raw few-shot demonstrations (Iter. + Demos) yields scores close to Single-Agent and well below TIDE, suggesting that the templates' abstraction of reasoning patterns matters.

| Method | Retrieval Cov. | Retrieval F1 | Identification Cov. | Identification F1 | Resolution Cov. | Resolution F1 |
|---|---|---|---|---|---|---|
| Single-Agent | 8.66 | 10.34 | 11.15 | 12.92 | 12.19 | 13.27 |
| Iter. + Demos | 10.40 | 11.43 | 11.09 | 12.80 | 12.71 | 12.80 |
| **TIDE (Ours)** | **16.82** | **18.61** | **17.29** | **19.73** | **15.52** | **17.39** |

*Table 2: Repository setting with GPT. Templates outperform few-shot demonstrations.*

- **Template usage distribution (Fig. 5)**: GPT concentrates on fewer templates; Gemini spreads more evenly.
- **Cross-LLM template transferability (Table 3)**: Templates built by one backbone perform comparably when used with another backbone.

| Inference | Templates | Retrieval Cov. | Retrieval F1 | Identification Cov. | Identification F1 | Resolution Cov. | Resolution F1 |
|---|---|---|---|---|---|---|---|
| GPT | GPT | 16.82 | 12.12 | 17.29 | 11.81 | 15.52 | 11.31 |
| GPT | Gemini | 16.30 | 11.60 | 15.31 | 10.03 | 18.36 | 12.38 |
| Gemini | Gemini | 21.47 | 17.69 | 20.63 | 17.05 | 23.22 | 19.23 |
| Gemini | GPT | 24.03 | 19.03 | 22.84 | 17.86 | 24.70 | 19.19 |

*Table 3: Template transferability (Repository setting).*

- **Effect of template pool size (Fig. 7)**: Adding templates yields further gains over iteration alone, and performance grows with pool size.

### Qualitative Study (Section 4.3)

- **Workspace case**: The gold issue is a volunteer-tracking platform that double-counts check-ins, with the vendor patch blocked by pending IT security access. Single-Agent surfaces an unrelated stall. TIDE retrieves the gold documents and escalates to the right manager, citing the gating access ticket and the deadlines.
- **Repository case (mlxtend)**: The gold bug involves mirrored off-diagonal assignments in two paired McNemar table constructors. Single-Agent treats them as two isolated single-function bottlenecks. TIDE, guided by a mirrored-index-assignment template, retrieves both constructors as one coupled defect and frames the fix as a single multi-function repair.

## Theoretical and Practical Implications

- **Reframes proactive assistance**: The paper shifts the paradigm from anticipating a single user need to an explicit, multi-step discovery process over context. This challenges the reactive interaction model dominant in current LLM agent design.
- **Generalizable recipe**: TIDE provides a modular recipe (iterative discovery + reusable thought templates) applicable to any domain where problems manifest in unstructured evidence (workspaces, software repos, and potentially legal, medical, or enterprise settings).
- **Practical utility**: The framework yields actionable plans (description + evidence + concrete action), not just detection. This makes it suitable for real-world deployment where users need both awareness and a path to resolution.
- **Efficiency insights**: Iteration is more effective than scaling parallel agents at the same compute budget, suggesting that *conditioning on cumulative discovery* is the key lever, not raw parallelism.

## Conclusion

The authors presented TIDE (Template-guided Iterative Discovery and rEsolution), a framework for discovering multiple hidden problems from context. It combines iterative discovery – which broadens coverage by conditioning each round on what has already been found – with thought templates – reusable schemas that sharpen per-prediction fidelity by guiding the agent toward recognizable problem classes. Evaluated on personal workspace and software repository settings across four LLM backbones, TIDE consistently outperforms single-shot and parallel multi-agent baselines on retrieval, identification, and resolution. Analyses confirm that iteration and templates contribute complementary gains, templates transfer across backbones, and the approach scales favorably with compute budget.

**Future directions** include online template updating from agent interactions, automatic augmentation of the template pool, and deeper investigation of the iterative discovery paradigm. The work offers a general recipe for building agents that not only execute user commands but proactively surface what users would not have thought to ask.
