Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction - Summary
Summary (Overview)
- Bi-Level Multi-Agent Framework: Introduces Web2BigTable, a novel system for web-to-table search featuring an upper-level orchestrator for task decomposition and lower-level parallel worker agents for execution.
- Self-Evolving via External Memory: Employs a closed-loop run–verify–reflect process to jointly improve decomposition (S_o) and execution (S_w) skills through persistent, human-readable external memory (`SKILL.md` files), without fine-tuning the underlying LLMs.
- Dynamic Coordination via Shared Workboard: Workers coordinate asynchronously through a shared Markdown workboard (m_e), enabling redundancy avoidance, conflict reconciliation, and coverage gap detection during search.
- State-of-the-Art Performance: Achieves new SOTA on the WideSearch benchmark (Avg@4 Success Rate: 38.50, 7.5× the second best) and strong generalization to XBench-DeepSearch (73.0 accuracy).
- Framework-Driven Gains: Ablation studies confirm that performance stems from the architecture (bi-level learning, coordination) rather than the capability of the underlying, cost-efficient LLMs (GPT-5 mini, Gemini 3 Flash).
Introduction and Theoretical Foundation
Agentic web search faces two distinct demands: deep reasoning over a single target and structured aggregation across many entities and heterogeneous sources (wide search). The web-to-table search task formalizes the latter: given a natural-language query and target schema, produce a structured table by searching the open web, where each row is a distinct entity and each column a requested attribute.
Current systems struggle with both regimes. Monolithic agents suffer from context saturation and error compounding at scale. While hierarchical frameworks and workflow pipelines offer decomposition, they often rely on fixed planning with limited feedback. Self-evolving and memory-augmented agents improve execution but typically adapt at a single level without jointly refining decomposition and execution.
Web2BigTable addresses these challenges with a bi-level architecture and a memory-mediated self-evolving mechanism. The core idea is to factorize the monolithic policy into two levels that co-adapt through external memory, enabling scalable, reliable extraction for both breadth-oriented (WideSearch) and depth-oriented (XBench-DeepSearch) tasks.
Formal Problem Definition: An instance is a pair (q, E), where q is the user query specifying the schema and E is the open web environment. A valid output is a table T. The policy unfolds as an action–observation loop:

a_t ~ π(· | q, h_t),  o_t = E(a_t),  h_{t+1} = h_t ∪ {(a_t, o_t)}   (Equation 1)

where h_t is the interaction history of actions and observations up to step t.
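The loop above can be sketched as follows; `policy` and `web_env` are illustrative stand-ins for the agent and environment interfaces, not the paper's actual API:

```python
# Minimal sketch of the action-observation loop (Equation 1).
# The action schema ("search" / "emit_table") is an assumption for illustration.

def run_episode(policy, web_env, query, max_steps=50):
    history = []  # h_t: accumulated (action, observation) pairs
    for _ in range(max_steps):
        action = policy(query, history)      # a_t ~ pi(. | q, h_t)
        if action["type"] == "emit_table":
            return action["table"]           # terminal action emits the table T
        observation = web_env.step(action)   # o_t returned by the web environment E
        history.append((action, observation))
    return None  # step budget exhausted without emitting a table
```

A single-agent policy run this way is exactly what the bi-level factorization later replaces.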
Methodology
2.2 System Architecture
Web2BigTable replaces the single-agent policy in Equation (1) with a bi-level factorization:

π(a | q, h) = π_o({q_i} | q, S_o) · ∏_i π_w(a_i | q_i, m_e, s_i)   (Equation 2)

where:
- π_o: Orchestrator policy that decomposes query q into subtasks {q_1, …, q_n} using orchestrator skills S_o.
- π_w: Worker policy for subtask q_i, conditioned on the shared workboard m_e and a retrieved execution skill s_i from the shared bank S_w.
The architecture externalizes state into two complementary memory regimes:
- Long-term semantic memory: Persistent skill banks S_o (decomposition strategies) and S_w (execution skills), evolved only during training and frozen at inference.
- Short-term working memory: A transient workboard m_e, a scratchpad for coordination within a single episode.
Memory Updates:
- Workboard (short-term): m_e^{t+1} = Update(m_e^t, a_t, o_t) (Equation 3), updated at every inner step.
- Skill Banks (long-term): Updated only across training episodes via reflect operators, S_o ← Reflect_o(S_o, ε) and S_w ← Reflect_w(S_w, ε), where ε is the structured error report.
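The two memory regimes can be sketched as plain file operations; the tag layout and function names below are assumptions for illustration, not the paper's exact implementation:

```python
# Sketch of the two memory regimes: a tag-partitioned Markdown workboard
# (short-term) and a monotone-append SKILL.md skill bank (long-term).
from pathlib import Path

def update_workboard(workboard: str, worker_id: str, result_md: str) -> str:
    """Short-term memory: write a result into this worker's tag-partitioned slot."""
    open_tag, close_tag = f"<{worker_id}_result>", f"</{worker_id}_result>"
    before, _, rest = workboard.partition(open_tag)
    _, _, after = rest.partition(close_tag)
    return f"{before}{open_tag}\n{result_md}\n{close_tag}{after}"

def append_skill(bank_dir: Path, name: str, body_md: str) -> None:
    """Long-term memory: monotone append to a human-readable SKILL.md file."""
    with (bank_dir / "SKILL.md").open("a", encoding="utf-8") as f:
        f.write(f"\n## {name}\n{body_md}\n")
```

The monotone-append discipline means earlier skills are never rewritten, which keeps the bank auditable and version-controllable.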
2.3 Training Phase: Self-Evolving the Skill Banks
The training phase runs on a small set of queries paired with gold-standard tables. The objective is to maximize the expected global utility U(T, T*) of the produced table T against the gold table T*.
Run–Verify–Reflect Loop (Algorithm 1):
- Run: Execute one inference pass with the current skills.
- Verify: Compare the output table T against the gold table T*, producing a structured error report ε.
- Reflect: Distill the error report into skill updates via Reflect_o and Reflect_w.
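The outer training loop can be sketched as below; the callables are placeholders for the run, verify, and reflect stages, and the epoch structure is an assumption:

```python
# Sketch of the run-verify-reflect training loop (Algorithm 1).
# `run`, `verify`, `reflect_o`, `reflect_w` stand in for the paper's stages.

def train_skills(queries_with_gold, run, verify, reflect_o, reflect_w,
                 S_o, S_w, epochs=3):
    for _ in range(epochs):
        for query, gold_table in queries_with_gold:
            table, traces = run(query, S_o, S_w)       # Run: one inference pass
            error_report = verify(table, gold_table)   # Verify: structured report
            S_o = reflect_o(S_o, error_report)         # Reflect: evolve both banks
            S_w = reflect_w(S_w, error_report, traces)
    return S_o, S_w  # frozen at inference time
```

Note that only the skill banks change between episodes; the underlying LLM weights are never updated.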
Orchestrator Skills Evolution (S_o):
- Run: Archive outputs and per-worker JSONL trajectories.
- Verify: Compress trajectories, compare outputs at cell level using type-specific scoring (exact match, numeric tolerance, etc.), aggregate error report.
- Reflect: Cluster training queries by structural pattern, synthesize generalizable decomposition skills (e.g., `split-by-entity`, `split-by-time-period`) and a router skill. Skills are persisted as human-readable `SKILL.md` files via monotone appends.
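The cell-level, type-specific scoring used in the Verify step can be sketched as follows; the tolerance value and normalization rules are assumptions, not the paper's exact settings:

```python
# Sketch of type-specific cell comparators: numeric tolerance for numbers,
# normalized exact match otherwise, aggregated into an item-level F1.

def cell_match(pred, gold, numeric_tol=0.01):
    """Compare one cell with a type-specific rule."""
    try:
        p = float(str(pred).replace(",", ""))
        g = float(str(gold).replace(",", ""))
        return abs(p - g) <= numeric_tol * max(abs(g), 1.0)  # relative tolerance
    except ValueError:
        return str(pred).strip().lower() == str(gold).strip().lower()

def item_f1(pred_cells, gold_cells):
    """Item-level F1 over aligned cell lists."""
    tp = sum(cell_match(p, g) for p, g in zip(pred_cells, gold_cells))
    prec = tp / len(pred_cells) if pred_cells else 0.0
    rec = tp / len(gold_cells) if gold_cells else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Failed comparisons of this kind are what get aggregated into the structured error report that drives the Reflect step.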
Worker Skills Evolution (S_w):
- Skill Resolution & Creation: `SkillResolver` searches for skills via (1) exact-name matching, (2) semantic retrieval (BM25 + embedding search fused with RRF), (3) on-demand creation by `SkillCreator` (function or knowledge skills).
- Error-Driven Self-Repair: On execution failure, an autonomous reflection loop synthesizes corrected skill code, validated via AST checks.
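The RRF fusion of the two retrieval channels can be sketched as below; `k=60` is the conventional RRF constant, an assumption here rather than the paper's stated value:

```python
# Sketch of reciprocal rank fusion (RRF) over the BM25 and embedding-search
# rankings used by SkillResolver's semantic retrieval step.

def rrf_fuse(bm25_ranking, embed_ranking, k=60):
    """Fuse two ranked lists of skill names into one RRF-scored ranking."""
    scores = {}
    for ranking in (bm25_ranking, embed_ranking):
        for rank, skill in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank); high ranks dominate.
            scores[skill] = scores.get(skill, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between the two channels, which makes it a natural choice for mixing lexical and embedding retrieval.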
2.4 Inference Phase: Query-Time Execution
With the skill banks S_o and S_w frozen, inference runs a single forward pass (Algorithm 2).
Orchestrator: Classifies the query via the task-router skill, invokes the corresponding decomposition skill from S_o to produce subtasks q_1, …, q_n, and initializes the shared workboard m_e.
Workers: Each active worker i at step t samples an action a_t^i ~ π_w(· | q_i, m_e^t, s_i).
Workers are independent Memento-Skills agents (powered by Gemini 3 Flash) executing a ReAct loop with access to built-in and dynamically discovered tools.
Shared Workboard (m_e): A structured Markdown document partitioned into:
- Task checklist: Global progress visibility.
- Worker slots: Tag-partitioned regions for writing results (e.g., `<t1_result>`).
- Shared context: Background information from the orchestrator.
Workers interact via `read_workboard` and `edit_workboard` tools. Write operations are protected by file locks. This read–write asymmetry enables dynamic coordination: redundancy avoidance, coverage gap detection, and strategy adaptation.
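A lock-protected write alongside a lock-free read can be sketched as below; the use of POSIX `fcntl` locks and the replace-based edit semantics are implementation assumptions:

```python
# Sketch of the workboard tools: reads are lock-free, writes take an
# exclusive file lock so concurrent workers cannot clobber each other.
import fcntl

def read_workboard(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def edit_workboard(path, old: str, new: str) -> bool:
    """Replace `old` with `new` under an exclusive lock; True on success."""
    with open(path, "r+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # serialize concurrent writers
        try:
            text = f.read()
            if old not in text:
                return False            # slot changed under us; caller can re-read
            f.seek(0)
            f.write(text.replace(old, new, 1))
            f.truncate()
            return True
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Returning `False` instead of raising lets a worker re-read the board and reconcile, which is the conflict-reconciliation behavior described above.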
Empirical Validation / Results
3.1 Datasets & Metrics
- WideSearch: 200 tasks for broad-coverage structured extraction. Metrics: Success Rate (SR) (100% match), Row-level F1, Item-level F1 (cell-level, using type-specific comparators). Reported as Avg@4 and Max@4 over 4 runs.
- XBench-DeepSearch: Chinese benchmark for deep, multi-hop reasoning. Metric: Accuracy (LLM-as-judge).
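The reporting scheme can be sketched as follows; the helper names are illustrative, but the semantics match the metrics above (SR counts only 100% matches, and each metric is averaged and maxed over 4 runs):

```python
# Sketch of the reported aggregates: strict Success Rate over tasks,
# and Avg@4 / Max@4 aggregation of any metric over 4 independent runs.

def success_rate(task_scores):
    """Percentage of tasks whose table matched the gold table 100%."""
    return 100.0 * sum(s == 100.0 for s in task_scores) / len(task_scores)

def avg_and_max_at_4(per_run_metric):
    """Aggregate one metric over exactly 4 independent runs."""
    assert len(per_run_metric) == 4
    return sum(per_run_metric) / 4, max(per_run_metric)
```

The strictness of SR (a single wrong cell fails the whole task) is what makes the 38.50 vs. 5.10 gap later in the results so large relative to the F1 gaps.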
3.4 Performance Gain Analysis
Ablation studies isolate component contributions (Table 1) and framework vs. LLM capability (Table 2).
Table 1: Contribution of each system component (WideSearch, Avg@4).

| Configuration | SR | Row F1 |
|---|---|---|
| Full system | 38.50 | 63.53 |
| w/o learned orch. skills | 7.00 | 45.23 |
| w/o workboard | 27.50 | 54.81 |
| w/o worker skill evolution | 33.00 | 59.67 |
Key Findings:
- Learned orchestrator skills are critical: Largest performance drop across all metrics (e.g., SR from 38.50 to 7.00).
- Workboard coordination enables dynamic gap recovery: Disabling it reduces Row F1 from 63.53 to 54.81.
- Worker skill evolution provides complementary gains.
- Performance stems from the framework, not the underlying LLMs: The same backbone models (GPT-5 mini, Gemini 3 Flash) achieve far lower scores as single agents (Table 2). The full framework provides gains of +46.84 Item F1 and +38.0 Accuracy.
3.5 Benchmark Comparison
WideSearch Results (Table 3, Figure 5):
Web2BigTable achieves state-of-the-art performance:
- Success Rate (Avg@4): 38.50 (7.5× the second best at 5.10).
- Row F1 (Avg@4): 63.53 (+25.03 over the second best).
- Item F1 (Avg@4): 80.12 (+14.42 over the second best).
Table 3: Performance comparison on the WideSearch benchmark (selected multi-agent frameworks).

| Model / System | SR (Avg@4) | SR (Max@4) | Row F1 (Avg@4) |
|---|---|---|---|
| Claude Sonnet 4 (Thinking) | 3.60 | 6.50 | 38.50 |
| OpenAI o3-high | 5.10 | 9.50 | 37.80 |
| Kimi K2 | 3.00 | 6.50 | 36.20 |
| Web2BigTable (Ours) | 38.50 | 40.00 | 63.53 |
| ∆ vs. second best | +33.40 | +30.50 | +25.03 |
XBench-DeepSearch Results (Table 4, Figure 6):
Web2BigTable achieves 73.0 accuracy, surpassing frontier proprietary systems like Minimax-M2 and MiroFlow (both at 72.0).
Table 4: Performance comparison on XBench-DeepSearch (Selected Results).
| Model / System | Accuracy |
|---|---|
| Minimax-M2 | 72.0 |
| MiroFlow (GPT-5) | 72.0 |
| Kimi-Researcher | 69.0 |
| DeepMiner-32B-RL | 62.0 |
| Web2BigTable (Ours, GPT-5 mini + Gemini 3 Flash) | 73.0 |
Case Studies
The paper includes detailed case studies illustrating the impact of learned decomposition skills:
- Case A (ws_en_006): Listing Taylor Swift concerts (534 ground-truth rows). The learned `split-by-entity` (tour name) skill outperforms the default `split-by-time-period`, achieving 93.8% Row F1 vs. 26.8% without skills.
- Case B (ws_en_091): Listing AMD Zen processors (331 rows, 12 columns). The learned `split-by-category` (product line) skill achieves 96% Item F1 vs. 32% without skills.
- Case C (ws_zh_069): Compiling papers from two organizations. The learned `split-by-source` skill with a dedicated verification worker achieves 94% Item F1 vs. 58% without skills.
Theoretical and Practical Implications
Theoretical Implications:
- Provides a formal bi-level factorization of the web-to-table search policy, decoupling decomposition from execution.
- Demonstrates effective self-evolution and coordination through non-parametric, external memory, offering a training-free alternative to gradient-based fine-tuning.
- The framework is generalized in ongoing work as Memento-Team, formalizing the orchestrator-worker interaction as a Stackelberg game and establishing convergence guarantees for the multi-agent memory system.
Practical Implications:
- Scalability: Enables reliable extraction of hundreds of rows across heterogeneous sources, a task monolithic agents fail at.
- Cost-Efficiency: Achieves SOTA results using cost-efficient, lightweight LLMs (GPT-5 mini, Gemini 3 Flash), highlighting the value of architectural design over raw model scale.
- Generalizability: The same framework excels at both breadth-oriented (WideSearch) and depth-oriented (XBench-DeepSearch) tasks, showing versatility.
- Interpretability & Control: Skills are stored as human-readable `SKILL.md` files, allowing inspection, editing, and version control.
Conclusion
Web2BigTable is a bi-level multi-agent LLM framework that addresses the scalability and reliability challenges of web-to-table search through memory-mediated self-evolution and asynchronous coordination. Its key innovations are:
- Jointly evolving task decomposition and worker execution skills via a run–verify–reflect loop over external memory.
- Enabling dynamic coordination among parallel workers through a shared Markdown workboard.
- Achieving state-of-the-art performance on WideSearch and strong generalization to XBench-DeepSearch, with gains attributable to the framework design rather than underlying LLM capability.
The work demonstrates that bi-level memory-mediated coordination provides a powerful, scalable, and training-free paradigm for large-scale structured information extraction from the web.