Summary (Overview)

  • Proactive multi-problem discovery task: The paper formalizes the problem of discovering multiple hidden problems from a document context (D) without explicit user requests, producing predictions as triples (description, evidence subset, action).
  • TIDE framework: Introduces two complementary mechanisms – iterative discovery (multi-round conditioning on cumulative state to broaden coverage) and thought templates (reusable schemas distilled from solved cases that specify what contextual signals to attend to and how to connect them).
  • Consistent gains across settings: Evaluated on personal workspace and software repository settings with four LLM backbones (GPT-5 mini, Gemini 3.5 Flash, Claude Sonnet 4.5, Qwen 3.6 Flash), TIDE substantially outperforms single-shot and parallel multi-agent baselines on retrieval, identification, and resolution coverage and F1.
  • Complementary contributions: Iteration drives coverage by redirecting capacity toward undiscovered problems; templates sharpen prediction precision by anchoring each discovery in a recognizable problem class.
  • Template transferability: Thought templates built by one backbone remain effective when used with another backbone, indicating domain-level abstraction.

Introduction and Theoretical Foundation

Large language model (LLM) agents are widely deployed as digital assistants that read documents, invoke tools, and operate over complex environments. However, they remain reactive: they act only after a user issues an explicit request. This interaction model presumes the user already knows what is wrong and what to ask. In practice, many consequential issues go unnoticed: a budget approval given verbally but not recorded, conflicting numbers in duplicate reports, a stale meeting still blocking a slot, etc. These hidden problems coexist in the same digital context, and none is articulated as a request.

Existing work on proactive agents has studied when to intervene (Liu et al., 2025; Zhang et al., 2025) or how to anticipate a single localized need (Lu et al., 2025; Yang et al., ; Pasternak et al., 2025), but largely sidesteps the multi-problem, context-wide setting that real workflows demand. The authors argue that proactive assistance is best framed as discovering multiple hidden problems from context, requiring both:

  • Broad coverage over coexisting problems that compete for attention with more salient ones.
  • Per-prediction fidelity – each candidate must be actionable, grounded in evidence, and paired with a concrete resolution.

Single-shot discovery fails on two fronts: (1) salient problems overshadow subtler ones, capping coverage; (2) without a prior on what evidence patterns constitute a problem, predictions drift to generic or speculative claims.

Methodology

Task Formulation (Section 2.1)

Consider an agent operating over a collection of documents (D) (artifacts in a workspace or functions in a codebase). Within (D) there exists a latent set of hidden problems:

P^\star = \{ p^\star_1, p^\star_2, \ldots, p^\star_n \}

where none is articulated as an explicit user request and cardinality (n) is unknown. The objective is to produce a predicted set (\hat{P}) approximating (P^\star), where each prediction is a triple:

\hat{p} = (b, \hat{D}, a)
  • (b): natural-language description of the candidate problem.
  • (\hat{D} \subseteq D): supporting subset that grounds the prediction in evidence.
  • (a): concrete action proposing a resolution.

Solution quality is measured along two axes: coverage (how many gold problems are recovered) and fidelity (correctness of description, evidence, action).
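The prediction triple above can be written down as a small data structure. This is a minimal sketch, not the authors' code: the class name `Prediction`, the helper `is_grounded`, and the document IDs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Prediction:
    """One predicted hidden problem, mirroring the triple (b, D_hat, a)."""
    description: str           # b: natural-language description of the problem
    evidence: FrozenSet[str]   # D_hat: IDs of the supporting documents
    action: str                # a: concrete proposed resolution


def is_grounded(pred: Prediction, document_ids: Set[str]) -> bool:
    """Check that the evidence is non-empty and a subset of the context D."""
    return bool(pred.evidence) and pred.evidence <= document_ids


# Hypothetical workspace with three artifacts.
docs = {"email_12", "sheet_3", "calendar_7"}
pred = Prediction(
    description="Budget approval was given verbally but never recorded.",
    evidence=frozenset({"email_12", "sheet_3"}),
    action="Ask the approver to confirm the budget in writing.",
)
print(is_grounded(pred, docs))
```

Keeping the evidence as a set of document IDs makes the subset constraint (\hat{D} \subseteq D) a one-line check and matches the ID-based retrieval metric described later.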

TIDE: Template-guided Iterative Discovery and rEsolution (Section 2.2)

TIDE couples two complementary components:

Thought Templates

Templates are reusable schemas distilled from previously solved cases. Each template (t_i) is a tuple:

t_i = (\text{name}_i, \text{pattern}_i, \text{evidence flow}_i) \tag{1}
  • Name: labels a recurring class of hidden problem.
  • Pattern: structural form of that class.
  • Evidence flow: ordered sequence of contextual signals to attend to and how to connect them to infer instances.

Templates are constructed once by prompting an LLM to abstract away instance-specific details from training cases (\langle D_{\text{train}}, p_{\text{train}}, r_{\text{train}} \rangle). At inference, the full set (T) is supplied to the agent as a library.

Example (personal workspace):

Template: Conflicting Source-of-Truth Blocks Sign-off Under Deadline
Pattern: A shared source artifact exists in conflicting versions across channels, and an imminent deadline blocks sign-off until one is made authoritative.
Evidence flow: (i) locate the deliverable and its cited source; (ii) find conflicting copies across channels and confirm a material discrepancy; (iii) tie the conflict to a time-bounded review and the owner who can resolve it.
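As an illustration of how such a template could be stored and supplied to an agent as a library, the sketch below encodes the tuple from Eq. (1) and renders it as prompt text. The class `ThoughtTemplate`, the function `render_library`, and the text layout are assumptions made for this example; the summary does not specify the actual prompt format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ThoughtTemplate:
    """A reusable schema: (name, pattern, evidence flow), as in Eq. (1)."""
    name: str
    pattern: str
    evidence_flow: Tuple[str, ...]  # ordered signals to attend to


def render_library(templates: List[ThoughtTemplate]) -> str:
    """Format the whole library as plain text (hypothetical layout)."""
    blocks = []
    for t in templates:
        steps = "\n".join(
            f"  ({i}) {step}" for i, step in enumerate(t.evidence_flow, start=1)
        )
        blocks.append(
            f"Template: {t.name}\nPattern: {t.pattern}\nEvidence flow:\n{steps}"
        )
    return "\n\n".join(blocks)


workspace_template = ThoughtTemplate(
    name="Conflicting Source-of-Truth Blocks Sign-off Under Deadline",
    pattern=(
        "A shared source artifact exists in conflicting versions across "
        "channels, and an imminent deadline blocks sign-off until one is "
        "made authoritative."
    ),
    evidence_flow=(
        "locate the deliverable and its cited source",
        "find conflicting copies across channels and confirm a material discrepancy",
        "tie the conflict to a time-bounded review and the owner who can resolve it",
    ),
)

library_text = render_library([workspace_template])
print(library_text)
```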

Iterative Discovery and Resolution

Even with templates, a single pass still concentrates capacity on the most salient cases. TIDE instead surfaces predictions over multiple rounds, each conditioned on what has already been found:

Let (\hat{P}^{(t)}) be the cumulative state after round (t), initialized as (\hat{P}^{(0)} = \emptyset). In round (t), the agent generates a batch of up to (k) new candidates:

\Delta \hat{P}^{(t)} = \text{LLM}\left(D, T, \hat{P}^{(t-1)}, k\right) \tag{2}

The state updates: (\hat{P}^{(t)} = \hat{P}^{(t-1)} \cup \Delta \hat{P}^{(t)}). The loop terminates after a fixed maximum number of rounds (also written (T) in the experimental setup, not to be confused with the template library (T) in Eq. 2) or earlier if a round returns empty. Each prediction within a round is already an actionable plan (identification + evidence retrieval + action proposal).
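The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` is a hypothetical stand-in for the LLM call in Eq. (2), and `make_stub_proposer` is a deterministic fake used only so that the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

# A prediction is the triple (description, evidence IDs, action).
Prediction = Tuple[str, Tuple[str, ...], str]
# Hypothetical stand-in for the LLM call: (D, T, state, k) -> new candidates.
Proposer = Callable[
    [Sequence[str], Sequence[str], List[Prediction], int], List[Prediction]
]


def iterative_discovery(
    docs: Sequence[str],
    templates: Sequence[str],
    propose: Proposer,
    k: int,
    max_rounds: int,
) -> List[Prediction]:
    state: List[Prediction] = []  # cumulative state, initially empty
    for _ in range(max_rounds):
        # Each round is conditioned on everything found so far.
        batch = propose(docs, templates, list(state), k)[:k]
        if not batch:  # early termination: the round returned nothing
            break
        for cand in batch:
            if cand not in state:  # set-union update, order preserved
                state.append(cand)
    return state


def make_stub_proposer(hidden: List[Prediction], calls: List[int]) -> Proposer:
    """Deterministic fake LLM revealing not-yet-found problems in a fixed order."""
    def propose(docs, templates, found, k):
        calls.append(len(found))  # record the state size seen by each call
        return [p for p in hidden if p not in found][:k]
    return propose


hidden = [(f"problem {i}", (f"doc_{i}",), f"fix {i}") for i in range(5)]
calls: List[int] = []
found = iterative_discovery(
    ["doc_0"], ["template A"], make_stub_proposer(hidden, calls), k=2, max_rounds=10
)
print(len(found), calls)
```

With five hidden problems and k = 2, the stub needs three productive rounds and a fourth, empty round that triggers the early stop, so the loop ends well before the ten-round cap.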

Experimental Setup (Section 3)

  • Datasets:
    • Personal Workspace: 150 problems across 30 multi-problem workspaces, with 4–6 problems and 88–113 artifacts per workspace. Adopted from the construction pipeline of Pasternak et al. (2025).
    • Software Repository: 146 problems across 20 multi-bug instances from 11 projects (SWE-BENCH, TESTEXPLORA), with 2–41 problems and 6–646 candidate functions per instance.
  • Baselines:
    • Single-Agent: one LLM pass over (D) to produce all predictions.
    • Multi-Agent: multiple independent LLM agents in parallel (same number of calls as TIDE’s rounds).
  • Evaluation Metrics: For each matched (gold, prediction) pair, three components are scored:
    • Retrieval: overlap between predicted and gold evidence IDs.
    • Identification: LLM judge score (Likert rubric) comparing the predicted description with the gold description.
    • Resolution: LLM judge score comparing the predicted action with the gold reference.
    Each component is then aggregated in two ways:
    • Coverage: per-gold matched score averaged over all golds.
    • F1: harmonic mean of coverage and the analogous per-prediction score (penalizing extraneous predictions).
  • Implementation: 4 LLM backbones (GPT-5 mini, Gemini 3.5 Flash, Claude Sonnet 4.5, Qwen 3.6 Flash). LLM judge fixed to GPT-5 mini. Templates: 40 (workspace), 108 (repository). Iteration rounds: (T=10) (workspace), (T=3) (repository).
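The coverage and F1 aggregation described in the metrics above can be made concrete with a short sketch. Several details are assumptions, since the summary does not give them: the overlap measure is taken to be Jaccard similarity, unmatched golds and unmatched predictions are assumed to score zero, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def evidence_overlap(pred_ids: Set[str], gold_ids: Set[str]) -> float:
    """Retrieval score for one matched pair (assumed here to be Jaccard)."""
    union = pred_ids | gold_ids
    return len(pred_ids & gold_ids) / len(union) if union else 0.0


def coverage_and_f1(
    matched_scores: Iterable[float], n_gold: int, n_pred: int
) -> Tuple[float, float]:
    """Aggregate per-pair scores in [0, 1] into coverage and F1.

    Coverage averages over all golds; the per-prediction score averages
    over all predictions, so extraneous predictions lower it.
    """
    total = sum(matched_scores)
    coverage = total / n_gold if n_gold else 0.0
    per_prediction = total / n_pred if n_pred else 0.0
    denom = coverage + per_prediction
    f1 = 2 * coverage * per_prediction / denom if denom else 0.0
    return coverage, f1


# Four gold problems, five predictions, two matched pairs scored 1.0 and 0.5.
cov, f1 = coverage_and_f1([1.0, 0.5], n_gold=4, n_pred=5)
print(round(cov, 3), round(f1, 3))
```

In this example the coverage is 1.5 / 4 = 0.375 and the per-prediction score is 1.5 / 5 = 0.3, so the harmonic mean is about 0.333. Adding more unmatched predictions would lower F1 while leaving coverage unchanged.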

Empirical Validation / Results

Main Results (Table 1)

Table 1 shows retrieval, identification, and resolution coverage and F1 across both settings and four LLMs. TIDE consistently outperforms Single-Agent and Multi-Agent.

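| LLM | Method | Workspace Retrieval Cov. | Workspace Retrieval F1 | Workspace Identification Cov. | Workspace Identification F1 | Workspace Resolution Cov. | Workspace Resolution F1 | Repository Retrieval Cov. | Repository Retrieval F1 | Repository Identification Cov. | Repository Identification F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT | Single-Agent | 47.60 | 54.32 | 47.85 | 54.63 | 49.67 | 56.14 | 8.66 | 10.34 | 11.15 | 12.92 |
| GPT | Multi-Agent | 32.15 | 45.41 | 27.24 | 38.85 | 29.64 | 41.85 | 10.11 | 12.66 | 10.19 | 12.77 |
| GPT | TIDE (Ours) | 69.06 | 70.46 | 67.64 | 68.76 | 76.08 | 77.32 | 16.82 | 18.61 | 17.29 | 19.73 |
| Gemini | Single-Agent | 50.01 | 61.48 | 42.99 | 52.95 | 35.95 | 44.31 | 13.08 | 17.20 | 13.34 | 16.90 |
| Gemini | Multi-Agent | 46.10 | 58.54 | 38.98 | 49.95 | 32.44 | 41.54 | 14.75 | 18.51 | 14.71 | 17.72 |
| Gemini | TIDE (Ours) | 83.84 | 84.91 | 70.11 | 71.05 | 54.37 | 55.08 | 22.55 | 25.14 | 21.93 | 24.22 |
| Claude | Single-Agent | 13.82 | 30.73 | 13.53 | 25.74 | 17.69 | 22.00 | 12.85 | 16.09 | 9.86 | 12.18 |
| Claude | Multi-Agent | 21.60 | 43.33 | 18.04 | 36.04 | 23.99 | 31.04 | 9.37 | 13.36 | 7.96 | 11.18 |
| Claude | TIDE (Ours) | 32.01 | 55.51 | 35.77 | 62.44 | 46.49 | 54.88 | 19.99 | 22.70 | 14.50 | 16.50 |
| Qwen | Single-Agent | 30.46 | 42.05 | 32.44 | 44.67 | 28.60 | 37.60 | 5.60 | 6.83 | 5.34 | 6.62 |
| Qwen | Multi-Agent | 39.34 | 52.12 | 31.04 | 42.27 | 26.21 | 35.37 | 5.00 | 6.72 | 6.58 | 7.50 |
| Qwen | TIDE (Ours) | 52.39 | 60.21 | 50.50 | 58.13 | 41.87 | 48.06 | 9.94 | 11.33 | 6.87 | 8.07 |

(The Repository Resolution columns are not present in the rows above.)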

Table 1: Main results. Bold = best per-LLM. Coverage and F1 over three runs.

In-Depth Analyses (Section 4.2)

  • Multi-problem instances (Fig. 2): TIDE discovers 4+ problems per instance (vs. 1–2 for baselines) and scales well with gold problem count.
  • Effectiveness of iterative discovery (Fig. 3): Multi-Agent re-discovers the same salient problems across rounds; TIDE continues to surface newly discovered problems in later rounds.
  • Effect of LLM-call budget (Fig. 4): TIDE scales steeply with budget (k); Multi-Agent plateaus. Even TIDE at (k=2) outperforms Multi-Agent at (k=10).
  • Thought templates vs. few-shot (Table 2): Replacing templates with raw few-shot demonstrations (Iter. + Demos) yields much lower performance, confirming that templates’ abstraction of reasoning patterns matters.
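
| Method | Retrieval Cov. | Retrieval F1 | Identification Cov. | Identification F1 | Resolution Cov. | Resolution F1 |
|---|---|---|---|---|---|---|
| Single-Agent | 8.66 | 10.34 | 11.15 | 12.92 | 12.19 | 13.27 |
| Iter. + Demos | 10.40 | 11.43 | 11.09 | 12.80 | 12.71 | 12.80 |
| TIDE (Ours) | 16.82 | 18.61 | 17.29 | 19.73 | 15.52 | 17.39 |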

Table 2: Repository setting with GPT. Templates outperform few-shot demonstrations.

  • Template usage distribution (Fig. 5): GPT concentrates on fewer templates; Gemini spreads more evenly.
  • Cross-LLM template transferability (Table 3): Templates built by one backbone perform comparably when used with another backbone.
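
| Inference | Templates | Retrieval Cov. | Retrieval F1 | Identification Cov. | Identification F1 | Resolution Cov. | Resolution F1 |
|---|---|---|---|---|---|---|---|
| GPT | GPT | 16.82 | 12.12 | 17.29 | 11.81 | 15.52 | 11.31 |
| GPT | Gemini | 16.30 | 11.60 | 15.31 | 10.03 | 18.36 | 12.38 |
| Gemini | Gemini | 21.47 | 17.69 | 20.63 | 17.05 | 23.22 | 19.23 |
| Gemini | GPT | 24.03 | 19.03 | 22.84 | 17.86 | 24.70 | 19.19 |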

Table 3: Template transferability (Repository setting).

  • Effect of template pool size (Fig. 7): Adding templates yields further gains over iteration alone, and performance grows with pool size.

Qualitative Study (Section 4.3)

  • Workspace case: In the gold issue, a volunteer-tracking platform double-counts check-ins, and the vendor patch is blocked by pending IT security access. Single-Agent surfaces an unrelated stall. TIDE retrieves the gold documents and escalates to the right manager, citing the gating access ticket and the deadlines.
  • Repository case (mlxtend): Gold bug involves mirrored off-diagonal assignments in two paired McNemar table constructors. Single-Agent treats them as two isolated single-function bottlenecks. TIDE, guided by a mirrored-index-assignment template, retrieves both constructors as one coupled defect and frames the fix as a single multi-function repair.

Theoretical and Practical Implications

  • Reframes proactive assistance: The paper shifts the paradigm from anticipating a single user need to an explicit, multi-step discovery process over context. This challenges the reactive interaction model dominant in current LLM agent design.
  • Generalizable recipe: TIDE provides a modular recipe (iterative discovery + reusable thought templates) applicable to any domain where problems manifest in unstructured evidence (workspaces, software repos, and potentially legal, medical, or enterprise settings).
  • Practical utility: The framework yields actionable plans (description + evidence + concrete action), not just detection. This makes it suitable for real-world deployment where users need both awareness and a path to resolution.
  • Efficiency insights: Iteration is more effective than scaling parallel agents at the same compute budget, suggesting that conditioning on cumulative discovery is the key lever, not raw parallelism.
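The efficiency point above can be illustrated with a deliberately simplified toy model. It is not a reproduction of the paper's experiments: the problem names, salience values, and the rule that an agent always returns the most salient problems it has not been told about are all assumptions made for illustration. Real parallel agents sample stochastically, so they would overlap less completely than this deterministic sketch suggests.

```python
from typing import Dict, List, Set


def most_salient(problems: Dict[str, float], exclude: Set[str], k: int) -> List[str]:
    """Toy agent: return the k most salient problems not listed in `exclude`."""
    ranked = sorted(
        (name for name in problems if name not in exclude),
        key=lambda name: problems[name],
        reverse=True,
    )
    return ranked[:k]


def parallel_agents(problems: Dict[str, float], calls: int, k: int) -> Set[str]:
    """Independent calls: no agent sees what the others found."""
    found: Set[str] = set()
    for _ in range(calls):
        found.update(most_salient(problems, exclude=set(), k=k))
    return found


def iterative_agent(problems: Dict[str, float], calls: int, k: int) -> Set[str]:
    """Sequential calls: each call is conditioned on the cumulative state."""
    found: Set[str] = set()
    for _ in range(calls):
        found.update(most_salient(problems, exclude=found, k=k))
    return found


# Hypothetical hidden problems with made-up salience scores.
problems = {
    "conflicting reports": 0.9,
    "unrecorded approval": 0.8,
    "stale meeting": 0.5,
    "expired access": 0.4,
    "duplicate invoice": 0.2,
    "missing owner": 0.1,
}

print(len(parallel_agents(problems, calls=3, k=2)))
print(len(iterative_agent(problems, calls=3, k=2)))
```

Under these assumptions the same three-call budget covers only the two most salient problems in the parallel setting, while the conditioned loop reaches all six, which is the mechanism the summary credits for TIDE's coverage gains.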

Conclusion

The authors presented TIDE (Template-guided Iterative Discovery and rEsolution), a framework for discovering multiple hidden problems from context. It combines iterative discovery – which broadens coverage by conditioning each round on what has already been found – with thought templates – reusable schemas that sharpen per-prediction fidelity by guiding the agent toward recognizable problem classes. Evaluated on personal workspace and software repository settings across four LLM backbones, TIDE consistently outperforms single-shot and parallel multi-agent baselines on retrieval, identification, and resolution. Analyses confirm that iteration and templates contribute complementary gains, templates transfer across backbones, and the approach scales favorably with compute budget.

Future directions include online template updating from agent interactions, automatic augmentation of the template pool, and deeper investigation of the iterative discovery paradigm. The work offers a general recipe for building agents that not only execute user commands but proactively surface what users would not have thought to ask.
