Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction - Summary

Summary (Overview)

  • Bi-Level Multi-Agent Framework: Introduces Web2BigTable, a novel system for web-to-table search featuring an upper-level orchestrator for task decomposition and lower-level parallel worker agents for execution.
  • Self-Evolving via External Memory: Employs a closed-loop run–verify–reflect process to jointly improve decomposition (S_o) and execution (S_w) skills through persistent, human-readable external memory (SKILL.md files), without fine-tuning the underlying LLMs.
  • Dynamic Coordination via Shared Workboard: Workers coordinate asynchronously through a shared Markdown workboard (m_e), enabling redundancy avoidance, conflict reconciliation, and coverage gap detection during search.
  • State-of-the-Art Performance: Achieves new SOTA on the WideSearch benchmark (Avg@4 Success Rate: 38.50, 7.5× the second best) and strong generalization to XBench-DeepSearch (73.0 accuracy).
  • Framework-Driven Gains: Ablation studies confirm performance stems from the architecture (bi-level learning, coordination) rather than the capability of the underlying, cost-efficient LLMs (GPT-5 mini, Gemini 3 Flash).

Introduction and Theoretical Foundation

Agentic web search faces two distinct demands: deep reasoning over a single target (deep search) and structured aggregation across many entities and heterogeneous sources (wide search). The web-to-table search task formalizes the latter: given a natural-language query and target schema, produce a structured table by searching the open web, where each row is a distinct entity and each column a requested attribute.

Current systems struggle with both regimes. Monolithic agents suffer from context saturation and error compounding at scale. While hierarchical frameworks and workflow pipelines offer decomposition, they often rely on fixed planning with limited feedback. Self-evolving and memory-augmented agents improve execution but typically adapt at a single level without jointly refining decomposition and execution.

Web2BigTable addresses these challenges with a bi-level architecture and a memory-mediated self-evolving mechanism. The core idea is to factorize the monolithic policy into two levels that co-adapt through external memory, enabling scalable, reliable extraction for both breadth-oriented (WideSearch) and depth-oriented (XBench-DeepSearch) tasks.

Formal Problem Definition: An instance is a pair $T = \langle q, W \rangle$, where $q$ is the user query specifying the schema and $W$ is the open web environment. A valid output is a table $X \in \mathcal{X}$. The policy $\pi$ unfolds as an action–observation loop:

$$x_{t+1} \sim \pi(\cdot \mid q, h_t) \tag{1}$$

where $h_t = (o_1, x_1, \dots, o_t, x_t)$ is the interaction history.
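The loop in Equation (1) can be sketched as follows; `policy` and `web_env` are hypothetical stand-ins for the LLM policy and the web environment (these names and the `"DONE"` sentinel are not from the paper):

```python
def run_episode(policy, web_env, query, max_steps=10):
    """Unroll the action-observation loop x_{t+1} ~ pi(. | q, h_t)."""
    history = []                          # h_t = (o_1, x_1, ..., o_t, x_t)
    action = policy(query, history)       # first action from the empty history
    for _ in range(max_steps):
        observation = web_env(action)     # o: search results, fetched pages, ...
        history.append((observation, action))
        action = policy(query, history)   # x_{t+1} sampled given q and h_t
        if action == "DONE":              # assumed termination signal
            break
    return history
```

A monolithic agent runs this single loop end to end, which is exactly where the context-saturation problem discussed below arises.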

Methodology

2.2 System Architecture

Web2BigTable replaces the single-agent policy in Equation (1) with a bi-level factorization:

$$\begin{aligned} \tau &= (\tau_1, \dots, \tau_N) \sim \pi_o(\cdot \mid q, S_o), \\ x_i &\sim \pi_w^{(i)}(\cdot \mid \tau_i, m_e, s_i), \quad s_i \in S_w, \end{aligned} \tag{2}$$

where:

  • $\pi_o$: Orchestrator policy that decomposes query $q$ into $N$ subtasks $\tau$ using orchestrator skills $S_o$.
  • $\pi_w^{(i)}$: Worker policy for subtask $i$, conditioned on the shared workboard $m_e$ and a retrieved execution skill $s_i$ from the shared bank $S_w$.
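A minimal sketch of the factorization in Equation (2): the orchestrator emits subtasks, then workers run in parallel, each retrieving a skill and reading the shared workboard. The interface (`orchestrator`, `worker`, `retrieve`) is assumed for illustration, not specified by the paper:

```python
from concurrent.futures import ThreadPoolExecutor

def bi_level_search(orchestrator, worker, query, skill_bank_o, skill_bank_w, workboard):
    """tau ~ pi_o(. | q, S_o); then x_i ~ pi_w^(i)(. | tau_i, m_e, s_i) in parallel."""
    subtasks = orchestrator(query, skill_bank_o)     # (tau_1, ..., tau_N)

    def run_worker(i, subtask):
        skill = skill_bank_w.retrieve(subtask)       # s_i from shared bank S_w
        return worker(subtask, workboard, skill)     # worker's partial result x_i

    with ThreadPoolExecutor() as pool:               # parallel lower level
        rows = list(pool.map(run_worker, range(len(subtasks)), subtasks))
    return rows                                      # later merged into table X
```

`ThreadPoolExecutor.map` preserves subtask order, so the merged results line up with the decomposition even though workers finish asynchronously.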

The architecture externalizes state into two complementary memory regimes:

  1. Long-term semantic memory: Persistent skill banks $S_o$ (decomposition strategies) and $S_w$ (execution skills), evolved only during training and frozen at inference.
  2. Short-term working memory: Transient workboard $m_e$, a scratchpad for coordination within a single episode.

Memory Updates:

  • Workboard (short-term): $m_e^{t+1} = \mathcal{M}_e(m_e^t, \{h_i^{t+1}\}_{i \in A_t})$ (Equation 3), updated at every inner step.
  • Skill Banks (long-term): Updated only across training episodes via reflect operators: $S_o^{k+1} = \mathcal{M}_o(S_o^k, r_o^{k+1})$ (Equation 4) and $S_w^{k+1} = \mathcal{M}_w(S_w^k, r_o^{k+1})$ (Equation 5), where $r_o^{k+1}$ is the structured error report.
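The two update timescales can be read as a sketch like the following; the dict-based workboard and list-based skill banks are assumptions for illustration, not the paper's data structures:

```python
def update_workboard(workboard, worker_histories):
    """Inner-step update M_e: fold each active worker's latest history in."""
    for worker_id, entry in worker_histories.items():
        workboard.setdefault(worker_id, []).append(entry)
    return workboard

def update_skill_banks(S_o, S_w, error_report, reflect_o, reflect_w):
    """Cross-episode updates M_o, M_w: applied only between training episodes."""
    return S_o + reflect_o(error_report), S_w + reflect_w(error_report)
```

The key asymmetry: `update_workboard` runs at every inner step of an episode, while `update_skill_banks` runs only after a full run has been verified.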

2.3 Training Phase: Self-Evolving the Skill Banks

The training phase runs on a small set of queries paired with gold-standard tables. The objective is to maximize the expected global utility $U(X) = \text{Item-F1}(X, X_{\text{gold}})$.

Run–Verify–Reflect Loop (Algorithm 1):

  1. Run: Execute one inference pass with current skills.
  2. Verify: Compare output $X^k$ against gold $X^k_{\text{gold}}$, produce an error report.
  3. Reflect: Distill the error report into skill updates via $\mathcal{M}_o$ and $\mathcal{M}_w$.
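The outer loop of Algorithm 1 might look like the following sketch, with `run`, `verify`, and the reflect operators passed in as callables (an assumed interface; `verify` is taken to return a falsy value on a perfect match):

```python
def train(queries, gold_tables, run, verify, reflect_o, reflect_w, S_o, S_w, epochs=1):
    """Run-verify-reflect: evolve skill banks S_o, S_w across training episodes."""
    for _ in range(epochs):
        for query, gold in zip(queries, gold_tables):
            table = run(query, S_o, S_w)       # 1. Run: one inference pass
            report = verify(table, gold)       # 2. Verify: structured error report
            if report:                         # 3. Reflect: distill into skill updates
                S_o = S_o + reflect_o(report)
                S_w = S_w + reflect_w(report)
    return S_o, S_w                            # frozen (S_o*, S_w*) for inference
```

Note that no model weights change anywhere in this loop; all learning is absorbed by the external skill banks.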

Orchestrator Skills Evolution ($\mathcal{M}_o$):

  • Run: Archive outputs and per-worker JSONL trajectories.
  • Verify: Compress trajectories, compare outputs at cell level using type-specific scoring (exact match, numeric tolerance, etc.), and aggregate the results into an error report.
  • Reflect: Cluster training queries by structural pattern, synthesize generalizable decomposition skills (e.g., split-by-entity, split-by-time-period) and a router skill. Skills are persisted as human-readable SKILL.md files via monotone appends.
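The monotone-append persistence of SKILL.md files could be as simple as the sketch below; the file name comes from the paper, but the `## name` section layout is an assumption:

```python
def append_skill(path, name, description):
    """Monotone append: skills are only ever added, never rewritten in place."""
    entry = f"\n## {name}\n\n{description}\n"
    with open(path, "a", encoding="utf-8") as f:   # append mode preserves history
        f.write(entry)
```

Because the file only grows, the skill bank stays trivially diffable and version-controllable, which is what makes the interpretability claims in Section "Practical Implications" possible.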

Worker Skills Evolution ($\mathcal{M}_w$):

  • Skill Resolution & Creation: SkillResolver searches for skills via (1) exact-name matching, (2) semantic retrieval (BM25 + embedding search with RRF), (3) on-demand creation by SkillCreator (function or knowledge skills).
  • Error-Driven Self-Repair: On execution failure, an autonomous reflection loop synthesizes corrected skill code, validated via AST.
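Two of the pieces above are standard enough to sketch concretely: reciprocal rank fusion (RRF) over the BM25 and embedding rankings, and the AST validation gate for self-repaired skill code. The fusion constant `k=60` is the conventional RRF default, not a value from the paper:

```python
import ast

def rrf_merge(bm25_ranking, embedding_ranking, k=60):
    """RRF: score(d) = sum over rankings of 1 / (k + rank(d)), ranks 1-based."""
    scores = {}
    for ranking in (bm25_ranking, embedding_ranking):
        for rank, skill_name in enumerate(ranking, start=1):
            scores[skill_name] = scores.get(skill_name, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def is_valid_python(skill_code):
    """AST gate for self-repaired skill code: it must at least parse."""
    try:
        ast.parse(skill_code)
        return True
    except SyntaxError:
        return False
```

RRF rewards skills that appear near the top of either ranking without needing the two scoring scales to be comparable, which is why it pairs well with mixing lexical (BM25) and dense (embedding) retrieval.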

2.4 Inference Phase: Query-Time Execution

With frozen skill banks $(S_o^*, S_w^*)$, inference runs a single forward pass (Algorithm 2).

Orchestrator: Classifies the query via the task-router skill, invokes the corresponding decomposition skill from $S_o^*$ to produce subtasks $\tau$, and initializes the shared workboard $m_e$.

Workers: Each active worker ii at step tt samples an action:

$$x_i^{t+1} \sim \pi_w^{(i)}(\cdot \mid \tau_i, m_e^t, s_i) \tag{6}$$

Workers are independent Memento-Skills agents (powered by Gemini 3 Flash) executing a ReAct loop with access to built-in and dynamically discovered tools.

Shared Workboard (m_e): A structured Markdown document partitioned into:

  1. Task checklist: Global progress visibility.
  2. Worker slots: Tag-partitioned regions for writing results (e.g., <t1_result>).
  3. Shared context: Background information from the orchestrator.

Workers interact via read_workboard and edit_workboard tools; write operations are protected by file locks. The read–write asymmetry enables dynamic coordination: redundancy avoidance, coverage gap detection, and strategy adaptation.
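A minimal sketch of the read_workboard / edit_workboard tools over a tag-partitioned Markdown file: the tool names, tag slots, and write locks come from the paper, but this implementation (in-process lock, regex slot replacement) is purely illustrative:

```python
import re
import threading

_lock = threading.Lock()  # stand-in for the file lock guarding writes

def read_workboard(path):
    """Reads are unguarded: any worker may scan global progress at any time."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def edit_workboard(path, worker_tag, result_markdown):
    """Write only inside this worker's tag-partitioned slot, e.g. <t1_result>."""
    pattern = re.compile(f"(<{worker_tag}>).*?(</{worker_tag}>)", re.DOTALL)
    with _lock:  # serialise writes so concurrent workers cannot lose updates
        board = read_workboard(path)
        board = pattern.sub(
            lambda m: m.group(1) + "\n" + result_markdown + "\n" + m.group(2),
            board,
        )
        with open(path, "w", encoding="utf-8") as f:
            f.write(board)
```

Confining each worker's writes to its own tagged slot is what lets reads stay lock-free: a reader may see a slightly stale board, but never a torn or conflicting one.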

Empirical Validation / Results

3.1 Datasets & Metrics

  • WideSearch: 200 tasks for broad-coverage structured extraction. Metrics: Success Rate (SR) (100% match), Row-level F1, Item-level F1 (cell-level, using type-specific comparators). Reported as Avg@4 and Max@4 over 4 runs.
  • XBench-DeepSearch: Chinese benchmark for deep, multi-hop reasoning. Metric: Accuracy (LLM-as-judge).
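Cell-level scoring with type-specific comparators might look like the sketch below; the 1% relative tolerance and the flat `(row, column)`-keyed cell layout are assumptions, not the benchmark's exact definitions:

```python
def cells_match(pred, gold, rel_tol=0.01):
    """Type-specific comparator: numeric tolerance for numbers, exact match otherwise."""
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= rel_tol * max(abs(g), 1e-9)
    except (TypeError, ValueError):
        return str(pred).strip().lower() == str(gold).strip().lower()

def item_f1(pred_cells, gold_cells):
    """Cell-level F1 over aligned (row, column) items, keyed by cell id."""
    tp = sum(1 for k, v in pred_cells.items()
             if k in gold_cells and cells_match(v, gold_cells[k]))
    precision = tp / len(pred_cells) if pred_cells else 0.0
    recall = tp / len(gold_cells) if gold_cells else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Under this scoring, Success Rate is the strictest view (F1 must be 1.0 for every cell), while Item F1 gives partial credit for partially filled tables.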

3.4 Performance Gain Analysis

Ablation studies isolate component contributions (Table 1) and framework vs. LLM capability (Table 2).

Table 1: Contribution of each system component.

| Configuration | WideSearch SR (Avg@4) | WideSearch Row F1 (Avg@4) |
|---|---|---|
| Full system | 38.50 | 63.53 |
| w/o learned orch. skills | 7.00 | 45.23 |
| w/o workboard | 27.50 | 54.81 |
| w/o worker skill evolution | 33.00 | 59.67 |

Key Findings:

  1. Learned orchestrator skills are critical: Largest performance drop across all metrics (e.g., SR from 38.50 to 7.00).
  2. Workboard coordination enables dynamic gap recovery: Disabling it reduces Row F1 from 63.53 to 54.81.
  3. Worker skill evolution provides complementary gains.
  4. Performance stems from the framework, not the underlying LLMs: The same backbone models (GPT-5 mini, Gemini 3 Flash) achieve far lower scores as single agents (Table 2). The full framework provides gains of +46.84 Item F1 and +38.0 Accuracy.

3.5 Benchmark Comparison

WideSearch Results (Table 3, Figure 5): Web2BigTable achieves state-of-the-art performance:

  • Success Rate (Avg@4): 38.50 (7.5× the second best at 5.10).
  • Row F1 (Avg@4): 63.53 (+25.03 over the second best).
  • Item F1 (Avg@4): 80.12 (+14.42 over the second best).

Table 3: Performance comparison on the WideSearch benchmark (Selected Multi-Agent Frameworks).

| Model / System | SR (Avg@4) | SR (Max@4) | Row F1 (Avg@4) |
|---|---|---|---|
| Claude Sonnet 4 (Thinking) | 3.60 | 6.50 | 38.50 |
| OpenAI o3-high | 5.10 | 9.50 | 37.80 |
| Kimi K2 | 3.00 | 6.50 | 36.20 |
| Web2BigTable (Ours) | 38.50 | 40.00 | 63.53 |
| ∆ vs. second best | +33.40 | +30.50 | +25.03 |

XBench-DeepSearch Results (Table 4, Figure 6): Web2BigTable achieves 73.0 accuracy, surpassing frontier proprietary systems like Minimax-M2 and MiroFlow (both at 72.0).

Table 4: Performance comparison on XBench-DeepSearch (Selected Results).

| Model / System | Accuracy |
|---|---|
| Minimax-M2 | 72.0 |
| MiroFlow (GPT-5) | 72.0 |
| Kimi-Researcher | 69.0 |
| DeepMiner-32B-RL | 62.0 |
| Web2BigTable (Ours, GPT-5 mini + Gemini 3 Flash) | 73.0 |

Case Studies

The paper includes detailed case studies illustrating the impact of learned decomposition skills:

  • Case A (ws_en_006): Listing Taylor Swift concerts (534 ground-truth rows). The learned split-by-entity (tour name) skill outperforms the default split-by-time-period, achieving 93.8% Row F1 vs. 26.8% without skills.
  • Case B (ws_en_091): Listing AMD Zen processors (331 rows, 12 columns). The learned split-by-category (product line) skill achieves 96% Item F1 vs. 32% without skills.
  • Case C (ws_zh_069): Compiling papers from two organizations. The learned split-by-source skill with a dedicated verification worker achieves 94% Item F1 vs. 58% without skills.

Theoretical and Practical Implications

Theoretical Implications:

  • Provides a formal bi-level factorization of the web-to-table search policy, decoupling decomposition from execution.
  • Demonstrates effective self-evolution and coordination through non-parametric, external memory, offering a training-free alternative to gradient-based fine-tuning.
  • The framework is generalized in ongoing work as Memento-Team, formalizing the orchestrator-worker interaction as a Stackelberg game and establishing convergence guarantees for the multi-agent memory system.

Practical Implications:

  • Scalability: Enables reliable extraction of hundreds of rows across heterogeneous sources, a task monolithic agents fail at.
  • Cost-Efficiency: Achieves SOTA results using cost-efficient, lightweight LLMs (GPT-5 mini, Gemini 3 Flash), highlighting the value of architectural design over raw model scale.
  • Generalizability: The same framework excels at both breadth-oriented (WideSearch) and depth-oriented (XBench-DeepSearch) tasks, showing versatility.
  • Interpretability & Control: Skills are stored as human-readable SKILL.md files, allowing for inspection, editing, and version control.

Conclusion

Web2BigTable is a bi-level multi-agent LLM framework that addresses the scalability and reliability challenges of web-to-table search through memory-mediated self-evolution and asynchronous coordination. Its key innovations are:

  1. Jointly evolving task decomposition and worker execution skills via a run–verify–reflect loop over external memory.
  2. Enabling dynamic coordination among parallel workers through a shared Markdown workboard.
  3. Achieving state-of-the-art performance on WideSearch and strong generalization to XBench-DeepSearch, with gains attributable to the framework design rather than underlying LLM capability.

The work demonstrates that bi-level memory-mediated coordination provides a powerful, scalable, and training-free paradigm for large-scale structured information extraction from the web.