Summary (Overview)

  • Proposes a shortcut-aware difficulty framework for multi-constraint agentic retrieval, showing that realized search difficulty is determined by the cheapest identifying route rather than intended structural complexity.
  • Identifies four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding, which cause intended search processes to collapse.
  • Introduces FORT (Framework of Shortcut-Resistant Training-Data Synthesis), which controls these risks across entity selection, evidence graph construction, question formulation, and adversarial refinement.
  • Presents FORT-Searcher, trained with supervised fine-tuning (SFT) only on FORT trajectories, achieving the best overall performance (66.2) among comparable-size open-source search agents on challenging benchmarks (BrowseComp, BrowseComp-ZH, xbench-DeepSearch, Seal-0).
  • Demonstrates that FORT data induces longer pre-answer search (ŵΩ=141.0, T_hit=46.9) and lower prior-shortcut rates (11.0%) compared to existing open-source deep search datasets.

Introduction and Theoretical Foundation

Background. Large language models (LLMs) are increasingly deployed as tool-empowered search agents that interact with external environments, gather evidence over multiple turns, and synthesize scattered information. Training such agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods increase apparent difficulty by enriching graph structures (hop count, evidence dispersion, treewidth), but structural complexity does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route.

Shortcut-Aware Difficulty Framework. A task instance is represented as

q=(X,Cq,Σ)q = (\mathcal{X}, \mathcal{C}_q, \Sigma)

where (\mathcal{X}) is the answer space, (\mathcal{C}_q) is the set of constraints, and (\Sigma) is the retrieval interface. For any subset of constraints (\mathcal{P} \subseteq \mathcal{C}_q),

Ans(P)={xX:ciPci(x)=1}\text{Ans}(\mathcal{P}) = \left\{ x \in \mathcal{X} : \bigwedge_{c_i \in \mathcal{P}} c_i(x) = 1 \right\}

denotes the remaining candidate pool. The question is well-posed: (\text{Ans}(\mathcal{C}_q) = {y^\star}).

The pure-posterior cost is defined as the minimum expected retrieval turns needed by a no-prior, no-guessing solver:

Dpost(q)=infπΠpostEτπ[τ]D_{\text{post}}(q) = \inf_{\pi \in \Pi_{\text{post}}} \mathbb{E}_{\tau \sim \pi}[|\tau|]

This is lower-bounded by the cheapest identifying route:

QΣ=minP:Ans(P)={y}QΣ(P)Q^\star_\Sigma = \min_{\mathcal{P}: \text{Ans}(\mathcal{P}) = \{y^\star\}} Q_\Sigma(\mathcal{P})

where (Q_\Sigma(\mathcal{P})) is the shortest valid evidence-acquisition route that verifies (\mathcal{P}) for (y^\star). The bound depends on three structural quantities:

  • Subset selectivity: (s(\mathcal{P}) = |\text{Ans}(\mathcal{P})|)
  • Evidence dispersion: (M_{\text{ev}}(\mathcal{P})) = minimum number of queries verifying ((\mathcal{P}, y^\star))
  • Dependency depth: (\text{dep}(\mathcal{P})) = length of longest dependency chain in a valid route

Proposition 1 (Structural lower bound and realized cost). Every identifying subset satisfies:

QΣ(P)max{Mev(P),dep(P)}Q_\Sigma(\mathcal{P}) \ge \max\{ M_{\text{ev}}(\mathcal{P}), \text{dep}(\mathcal{P}) \}

so (D_{\text{post}}(q) \ge Q^\star_\Sigma \ge \min_{\mathcal{P}} \max{M_{\text{ev}}(\mathcal{P}), \text{dep}(\mathcal{P})}). The realized successful-trajectory cost for a concrete solver (\pi_0) is (\Omega(q,\pi_0) = D_{\text{post}}(q) - U_{\pi_0}(q)), where (U_{\pi_0}(q)) is the solver-side cost reduction from prior knowledge.

Shortcut Patterns. Four actionable risks:

ShortcutTypeEffect
Single-clue selectivityRoute-levelSmall (s(\mathcal{P})) allows early answer identification
Evidence co-coverageRoute-levelSmall (M_{\text{ev}}(\mathcal{P})) compresses evidence acquisition
Exposed constantsRoute-levelSmall (\text{dep}(\mathcal{P})) from premature surface constants
Prior-knowledge bindingSolver-levelPositive (U_{\pi_0}(q)) reduces realized retrieval

Trajectory Signatures. Observable diagnostics: realized solving cost (\hat{\Omega}), answer hit time (T_{\text{hit}} = \min{T_{\text{tool}}, T_{\text{model}}}), and prior-shortcut rate (\hat{p}_{\text{prior}}).

Methodology

FORT constructs shortcut-resistant training data across four stages, controlling the shortcut risks identified in Section 2.

![Overview of FORT pipeline](Figure 2) (described in text)

1. Graph Initialization. Selects a long-tail root entity from Wikidata (limited prior-knowledge binding) and initializes a seed subgraph (\mathcal{G}_0). Uses a pre-mined cycle structure to avoid linear exposure of downstream constants, reducing exposed constants risk.

2. Graph Construction. Expands the seed into an evidence graph (\mathcal{G}) by:

  • Multi-source enrichment: collecting facts from heterogeneous sources (Wikidata, web pages, Google Scholar, Maps) to reduce evidence co-coverage.
  • Derived fact construction: creating facts from multiple atomic facts (coincidence bridging, count aggregation, numerical relation, meta-fact extraction) that are less likely to appear verbatim in a single source.
  • Fact verification: filtering inconsistent or mismatched facts (Table 3).
  • Generic fact selection: preferring facts that are reliable but less characteristic of the target entity, reducing single-clue selectivity (Table 4).

3. Question Formulation. Selects an answer-bearing subgraph (\mathcal{G}^* \subseteq \mathcal{G}) and renders it as natural language with:

  • Withholding intermediate entity names (e.g., "the artist", "the institution") to preserve dependency depth.
  • Exact-value fuzzing (Table 5): category generalization, range relaxation, meta-attribute description, arithmetic encoding, contrastive exclusion—reducing exposed constants.

4. Adversarial Refinement. Runs a strong adversary on each draft question. Repairs shortcut-prone drafts (early answer hit, prior binding) and over-fuzzed/underspecified drafts (failed to solve). Calibrates toward solvable but search-heavy difficulty.

Table 1: From shortcut risks to FORT controls.

Shortcut riskCollapsed quantityFORT controlImplemented in
Single-clue selectivitySmall (\mathcal{P}): (s(\mathcal{P})\downarrow)Generic fact selection, low-specificity cluesGraph construction, Question formulation
Evidence co-coverage(M_{\text{ev}}(\mathcal{P})\downarrow)Multi-source enrichment, derived factsGraph construction
Exposed constants(\text{dep}(\mathcal{P})\downarrow)Cycle initialization, name withholding, fuzzingGraph initialization, Question formulation
Prior-knowledge binding(U_{\pi_0}(q)\uparrow, \Omega\downarrow)Long-tail root selection, adversarial refinementGraph initialization, Adversarial refinement

Agent Training. FORT-Searcher starts from Qwen3-30B-A3B-Thinking-2507 (30B total, ~3B active parameters). Trained with supervised fine-tuning (SFT) only on FORT-generated trajectories (6 epochs, bf16, Adam, cosine LR with peak 2e-5). At inference, uses context management: on reaching a turn limit, the interaction history is cleared and the agent restarts from the original question.

Empirical Validation / Results

Benchmarks. FORT-Searcher evaluated on BrowseComp, BrowseComp-ZH, xbench-DeepSearch-2505, xbench-DeepSearch-2510, and Seal-0. Compared against proprietary, large-scale open-source, and comparable-size open-source agents.

Table 6: Benchmark performance (selected rows).

ModelBrowseCompBC-ZHxbench-05xbench-10Seal-0Overall
Comparable-size open-source
MiroThinker-1.7-mini67.972.377.257.248.264.6
Qwen3.5-35B-A3B61.069.577.450.341.459.9
FORT-Searcher72.275.080.857.246.066.2
Large-scale open-source
GLM-575.972.7
Qwen3.5-397B-A17B78.670.346.9

FORT-Searcher achieves best overall score (66.2) among comparable-size agents and remains competitive with large-scale models despite SFT-only training and ~3B active parameters.

Context Management Impact (Table 7). Without context management, scores drop substantially on BrowseComp (55.9→72.2) and BrowseComp-ZH (62.1→75.0). Gains on other benchmarks are smaller.

Training Data Difficulty Analysis (Table 8). Comparing four training sets (12K examples each):

Data(\hat{\Omega})(T_{\text{hit}})(\hat{p}_{\text{prior}}(%))BrowseCompBC-ZH
Open-source (low)40.09.410.547.154.9
Open-source (mid)85.016.015.648.454.6
Open-source (high)140.022.318.149.558.1
FORT140.047.011.452.960.3

FORT data at the same trajectory length induces later answer exposure and lower prior binding, yielding better downstream performance.

Shortcut-Resistance Ablation (Table 9). Cumulative removal of components monotonically increases accuracy (29.0→81.6) while decreasing (\hat{\Omega}) (141.9→43.7) and moving (T_{\text{hit}}) earlier. Fuzzing removal produces the largest difficulty drop.

Dataset Difficulty Comparison (Table 11). Under the same strong agent and retrieval budget, FORT achieves the highest solving cost and latest answer exposure:

Source(\hat{\Omega})(T_{\text{hit}})(\hat{p}_{\text{prior}}(%))
REDSearcher92.118.711.8
FORT141.046.911.0

Trajectory-Level Proxies (Table 12). On 200 successful pairs, FORT consistently improves four proxies compared to open-source data: higher retrieval effort, less selective individual clues, more dispersed evidence, costlier dependency chains, and fewer prior-bound answers.

Theoretical and Practical Implications

  • The shortcut-aware difficulty framework provides a rigorous lens for understanding why apparent structural complexity does not guarantee realized search difficulty. It introduces a lower bound based on the cheapest identifying route and attributes gaps to named factors (selectivity, dispersion, depth, prior utility).
  • FORT operationalizes this theoretical analysis into concrete construction controls, demonstrating that data synthesis should check not only intended task structure but also whether the resulting questions force substantial evidence acquisition in actual agent trajectories.
  • The effectiveness of SFT-only training on shortcut-resistant data shows that high-quality supervision can compensate for the lack of reinforcement learning, achieving strong results with relatively small active parameters (~3B).
  • Practical implications: the trajectory signatures (solving cost, answer hit time, prior-shortcut rate) serve as actionable diagnostics for evaluating and improving deep search training data.

Conclusion

FORT-Searcher introduces a shortcut-resistant data synthesis framework (FORT) for training deep search agents. The core insight is that realized search difficulty is governed by the cheapest identifying route, which can collapse through evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. FORT controls these risks across entity selection, evidence graph construction, question formulation, and adversarial refinement, producing trajectories with extended pre-answer search and reduced shortcut patterns. Using SFT only, FORT-Searcher achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Future work includes integrating reinforcement learning with FORT trajectories, richer tool-augmented search, and more complex search-grounded tasks requiring heterogeneous evidence integration and conflict resolution.

Related papers