Visual Summary | K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Summary (Overview)

K-BrowseComp is a new web-browsing agent benchmark with 400 problems grounded in Korean contexts, comprising a 300-question human-verified subset (K-BrowseComp-Verified) and a 100-question synthetic diagnostic split.
GPT-5.5, the strongest model evaluated, reaches only 45.67% accuracy on the verified subset, a substantial drop from the ~84% it scores on the original English BrowseComp, while DeepSeek-V4-Pro and GLM-5.1 obtain 30.00% and 30.67%, respectively.
Korean open-weight models developed through government funding score only 0.00–10.33%, revealing a large gap between global and Korean models for agentic tasks.
A novel synthetic generation pipeline using hard few-shot exemplars and failure-mode-targeted creation produces challenging diagnostic problems on which the strongest model reaches only 26.00%.
Trajectory analysis shows failures are not due to insufficient search but to poor state maintenance: models lose track of candidates, constraints, role bindings, and the final-answer state across multiple turns.

Introduction and Theoretical Foundation

The paper addresses a critical gap in Korean AI evaluation. While the evaluation of frontier models (from the US and China) is shifting from static benchmarks toward compositional, agentic settings such as web browsing, the Korean AI community remains anchored to static benchmarks that measure foundational capabilities like instruction following and reasoning. Korean agentic benchmarks are virtually nonexistent.

The motivation is twofold:

For Korean developers and users: Korea’s smaller language population creates a structural disadvantage for AI sovereignty when queries require local and cultural knowledge (Kim et al., 2024; Son et al., 2025).
For the broader research community: As frontier models saturate existing benchmarks, agentic benchmarks grounded in linguistically and culturally distinct contexts provide a principled testbed for measuring generalization (Romanou et al., 2025; Whitehouse et al., 2026).

The authors focus on browsing agents because:

They are uniquely dependent on local/cultural knowledge—their core function retrieves region-specific information from the web.
They are inherently compositional, combining instruction following, tool calling, and multi-turn interaction.

The paper draws on prior work including BrowseComp (Wei et al., 2025) and BrowseComp-ZH (Zhou et al., 2025), but emphasizes that Korean web environments (search conventions, local entities, semi-structured pages, culturally grounded clues) require a dedicated benchmark. It extends Korean evaluation resources—KorQuAD, KoBEST, KMMLU, CLIcK—into the agentic domain.

Methodology

Dataset Construction

K-BrowseComp-Verified (300 problems):

Constructed by 17 human annotators (no LLM assistance allowed).
Each question must (a) be grounded in Korean contexts with public web evidence, (b) be difficult to answer via direct search but easy to verify once the answer is found, and (c) require multi-hop reasoning (≥4 steps) or parallel constraint satisfaction (≥4 constraints).
Answers must be unique and temporally stable.
All items manually verified by authors—checking gold answer, intermediate entities, cited sources, and consistency.

Synthetic split (100 problems):

Uses a generation pipeline with Claude Code (claude-opus-4.7) as a proposer.
Leverages the failure taxonomy derived from K-BrowseComp-Verified (Table 1) to target specific weakness modes (F1–F8, excluding F0).
Generation process: the proposer opens a seed web page, constructs a question backwards from it (multi-hop or parallel-constraint), withholds the answer, source, and identifying entity, and iteratively refines the question over 4 rounds (draft → test → revise).
Three successive filters: (1) searchability—gold answer should not appear directly in search results; (2) well-formedness—reference solver must recover answer from source page; (3) adversarial difficulty—both gpt-5.4-mini and gemini-3-flash-preview must fail on the question, and failure must be attributable to one of F1–F8.
268 candidates generated; 100 accepted (37.3% yield).
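The three filters above can be read as a cascade in which a candidate survives only if every check passes. The following is a minimal sketch of that idea, not the authors' implementation: the `Candidate` fields and function names are hypothetical, and each boolean flag stands in for a check that the real pipeline would perform with search calls and model runs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# F1..F8 are the targeted modes; F0 is excluded, as in the source.
TARGET_MODES = {f"F{i}" for i in range(1, 9)}


@dataclass
class Candidate:
    """Hypothetical record for one generated question (field names are assumptions)."""
    question: str
    gold_answer: str
    answer_in_search_results: bool        # gold answer surfaces in direct search results
    reference_solver_recovers: bool       # reference solver recovers the answer from the source page
    filter_model_correct: Dict[str, bool]  # filter model name -> answered correctly?
    failure_mode: Optional[str] = None    # attributed mode, e.g. "F3"


def passes_searchability(c: Candidate) -> bool:
    return not c.answer_in_search_results


def passes_well_formedness(c: Candidate) -> bool:
    return c.reference_solver_recovers


def passes_adversarial(c: Candidate) -> bool:
    # Every filter model must fail, and the failure must map to F1..F8.
    all_failed = bool(c.filter_model_correct) and not any(c.filter_model_correct.values())
    return all_failed and c.failure_mode in TARGET_MODES


FILTERS = (passes_searchability, passes_well_formedness, passes_adversarial)


def run_cascade(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep only candidates that pass all three filters, applied in order."""
    return [c for c in candidates if all(check(c) for check in FILTERS)]


def yield_rate(accepted: int, generated: int) -> float:
    """Acceptance rate as a percentage."""
    return 100.0 * accepted / generated
```

With the counts reported above, `yield_rate(100, 268)` rounds to 37.3, matching the stated yield.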

Failure-Mode Taxonomy

Mode	Name	Definition
F0	Incomplete trajectory or malformed output	Model produces incomplete trajectory, malformed output, or no valid final answer.
F1	Ineffective initial search direction	Model fails to choose a useful initial search strategy.
F2	Search-access structure failure	Model fails to access evidence hidden behind difficult page structures.
F3	Cross-source hopping failure	Model fails to connect evidence across weakly linked sources or entity contexts.
F4	Semi-structured parsing failure	Model misreads tables, lists, rankings, databases, or institutional pages.
F5	Search-result selection failure	Model retrieves relevant evidence but selects the wrong source or candidate.
F6	Sparse entity normalization failure	Model fails to resolve rare names, aliases, spelling variants, or historical names.
F7	Constraint-tracking failure	Model finds partial candidates but fails to satisfy all constraints.
F8	Intermediate reasoning failure	Model fails at date arithmetic, ordering, counting, comparison, or filtering.

Evaluation Protocol

Built on the search_evals framework (Perplexity Research, 2025) with the deep-research agent and Perplexity Search backend.
Budget: 10 search calls per question.
Models evaluated include proprietary (GPT-5.5, GPT-5.4-mini, Gemini-3.1-Flash-Lite) and open-weight (DeepSeek-V4-Pro, GLM-5.1, Qwen3.6-35B-A3B, Gemma-4-31B-it, and Korean models K-EXAONE-236B-A23B, A.X-4.0, HyperCLOVAX-SEED-Think-32B, Kanana-2-30B-A3B-Thinking-2601).
Final answer extracted by GPT-5.4-mini and matched against gold answer (pass@1).
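As a rough illustration of pass@1 scoring, the sketch below computes accuracy over one attempt per question. It is an assumption-laden stand-in: the benchmark extracts and matches answers with GPT-5.4-mini, whereas this example substitutes a simple normalized string comparison, and the function names are hypothetical.

```python
import unicodedata
from typing import Iterable, Optional, Tuple


def normalize(text: str) -> str:
    """NFKC-normalize, casefold, and drop whitespace so spacing variants compare equal."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(folded.split())


def is_correct(predicted: Optional[str], gold: str) -> bool:
    """A missing final answer (None) always counts as wrong."""
    return predicted is not None and normalize(predicted) == normalize(gold)


def pass_at_1(records: Iterable[Tuple[Optional[str], str]]) -> float:
    """Percentage of questions answered correctly on a single attempt.

    records: (predicted_answer, gold_answer) pairs, one per question.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    n_correct = sum(is_correct(pred, gold) for pred, gold in records)
    return 100.0 * n_correct / len(records)
```

On a 300-question set each correct answer is worth one third of a percentage point, which is why scores such as 30.00% and 30.67% differ by exactly two questions.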

Empirical Validation / Results

Main Results on K-BrowseComp-Verified

Model	Access	Pass@1 Acc. (%)	Calib. Err. (%)
GPT-5.5	Closed	45.67	31.86
GPT-5.4-mini	Closed	30.67	37.88
DeepSeek-V4-Pro	Open	30.00	17.72
GLM-5.1	Open	30.67	27.07
Qwen3.6-35B-A3B	Open	12.00	47.89
Gemini-3.1-Flash-Lite	Closed	11.33	56.55
Gemma-4-31B-it	Open	23.33	23.66
K-EXAONE-236B-A23B	Open	10.33	24.09
A.X-4.0	Open	5.33	47.89
HCX-SEED-Think-32B	Open	2.33	77.37
Kanana-2-30B-A3B-Think	Open	0.00	–

Key findings:

GPT-5.5 leads at 45.67%, far below its 84.4% on original BrowseComp.
Korean open-weight models score only 0.00–10.33%.
Calibration error varies widely; Korean models show poor confidence alignment (e.g., HCX-SEED-Think-32B at 77.37%).
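This summary does not state how calibration error is computed. One common choice is a binned expected calibration error, sketched below under the assumption that each answer comes with a self-reported confidence in [0, 1]; the function name and the 10-bin default are illustrative, not taken from the paper.

```python
from typing import Sequence


def calibration_error(confidences: Sequence[float],
                      correct: Sequence[bool],
                      n_bins: int = 10) -> float:
    """Binned expected calibration error, returned as a percentage.

    Each bin contributes |mean confidence - accuracy|, weighted by its share of samples.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        error += (len(members) / total) * abs(mean_conf - accuracy)
    return 100.0 * error
```

Under this definition, a model that reports 90% confidence on every answer but is right only a quarter of the time would score about 65, which is the kind of overconfidence a value like 77.37% points to.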

Results on Synthetic Split

Model	Pass@1 Acc. (%)
GPT-5.5	26.00
DeepSeek-V4-Pro	22.00
GLM-5.1	19.00
Gemma-4-31B-it	17.00
Qwen3.6-35B-A3B	15.00
K-EXAONE-236B-A23B	13.00
Gemini-3.1-Flash-Lite	11.00
A.X-4.0	1.00
HCX-SEED-Think-32B	2.00
Kanana-2-30B-A3B-Think	0.00

No model exceeds 30%. GPT-5.4-mini is omitted from the table: it scores 0.00% by construction, because every accepted problem had to defeat it during adversarial filtering.

Trajectory-Level Failure Patterns

Three recurrent post-retrieval failures identified:

Candidate capture (F5+F7): Model commits to a plausible entity before verifying all constraints; subsequent searches become confirmatory (e.g., searching OST lyrics before identifying the correct drama).
Unmerged evidence branches (F7): Model searches relevant clues but never joins them as filters over a shared candidate set (e.g., K-pop group constraint intersection yields wrong answer Winner instead of Ladies’ Code).
Misbound evidence chains (F3): Model binds intermediate results to wrong roles, swapping entity types across steps (e.g., university cheer song phrase from wrong institution).
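The first two patterns both come down to not keeping an explicit, shared candidate set. The sketch below shows one way such state could be tracked: each constraint records the candidates that satisfy it, the surviving set is their intersection, and the agent answers only once every constraint has been checked and exactly one candidate remains. The class and its names are hypothetical and are not drawn from the paper or any evaluated agent.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Set


@dataclass
class SearchState:
    """Hypothetical trajectory state: which candidates satisfy which constraint."""
    constraints: FrozenSet[str]
    support: Dict[str, Set[str]] = field(default_factory=dict)

    def record(self, constraint: str, candidates: Iterable[str]) -> None:
        """Store the candidates found to satisfy one constraint."""
        if constraint not in self.constraints:
            raise KeyError(f"unknown constraint: {constraint}")
        self.support.setdefault(constraint, set()).update(candidates)

    def surviving(self) -> Set[str]:
        """Candidates that satisfy every constraint checked so far (the merged branches)."""
        sets = list(self.support.values())
        return set.intersection(*sets) if sets else set()

    def ready_to_answer(self) -> bool:
        """Commit only when all constraints are checked and one candidate remains."""
        all_checked = set(self.support) == set(self.constraints)
        return all_checked and len(self.surviving()) == 1
```

For example, with constraints `c1`, `c2`, `c3`, recording `{"A", "B"}` for `c1` and `{"B", "C"}` for `c2` leaves `{"B"}` as the only survivor, but `ready_to_answer()` stays false until `c3` has also been checked. Committing at that intermediate point is the candidate-capture pattern described above.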

Search Effort Analysis

Model	Correct avg. calls	Wrong avg. calls	Δ calls (wrong - correct)
GPT-5.5	7.08	9.30	+2.22
DeepSeek-V4-Pro	7.47	9.80	+2.33
Gemma-4-31B-it	5.20	8.10	+2.90
A.X-4.0	2.38	1.43	-0.95

Incorrect trials generally use more search calls, not fewer, indicating the bottleneck is not retrieval volume but state maintenance. A.X-4.0 is the exception: it issues very few searches overall (2.38 calls on correct trials and 1.43 on wrong ones), far below the 10-call budget.
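The table's columns can be derived from per-trial logs as in the following sketch. The input format, a list of (search calls, correct?) pairs per model, is an assumption for illustration; the paper's actual log schema is not described here.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple


def search_effort(trials: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Average search calls on correct vs. wrong trials, plus their difference.

    trials: (num_search_calls, is_correct) pairs for a single model.
    delta is wrong minus correct, so a positive value means wrong trials searched more.
    """
    trials = list(trials)
    correct_calls = [calls for calls, ok in trials if ok]
    wrong_calls = [calls for calls, ok in trials if not ok]
    if not correct_calls or not wrong_calls:
        raise ValueError("need at least one correct and one wrong trial")
    correct_avg = mean(correct_calls)
    wrong_avg = mean(wrong_calls)
    return {
        "correct_avg": correct_avg,
        "wrong_avg": wrong_avg,
        "delta": wrong_avg - correct_avg,
    }
```

A positive delta, as for GPT-5.5 (9.30 - 7.08 = +2.22), means failed trajectories spent more of the search budget than successful ones.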

Theoretical and Practical Implications

Theoretical: The benchmark demonstrates that agentic evaluation in non-English contexts exposes weaknesses that English-centric benchmarks hide. GPT-5.5's drop from 84.4% on BrowseComp to 45.67% here suggests that frontier models do not yet generalize robustly to Korean web environments. The failure taxonomy provides a principled framework for diagnosing agent weaknesses at the trajectory level.
Practical for Korean AI ecosystem: Korean open-weight models, despite their training on Korean language, fail dramatically on agentic tasks. This suggests that static benchmarks for Korean are insufficient; investment in tool-use training and long-horizon state tracking is needed. The benchmark serves as a diagnostic target for the government-funded “Proprietary AI Foundation Model Project.”
Synthetic generation methodology: The paper exploits the asymmetry between solving and creating browsing problems—it is easier to verify an answer than to find it, and similarly easier to create hard problems when failure modes are known. This pipeline can scale agentic evaluation for other languages and domains.
Calibration issues: High calibration errors (e.g., 77.37% for HCX-SEED-Think-32B) indicate models are overconfident when wrong, which is especially problematic for deployment.

Conclusion

K-BrowseComp is a Korean web-browsing agent benchmark with 400 problems (300 human-verified, 100 synthetic). Even the strongest models evaluated score only 30.00–45.67% on the verified subset, and Korean open-weight models lag substantially behind global counterparts. The synthetic split is at least as difficult, with a top score of 26.00%, which supports the failure-mode-targeted generation pipeline.

Key takeaway: Many failures occur after models retrieve relevant Korean web evidence—they fail to maintain candidates, constraints, source pointers, or final-answer state across the trajectory. Progress requires stronger trajectory-level state maintenance, not just broader language coverage or larger model scale.

Limitations acknowledged:

Modest scale (300 verified) with uneven domain coverage (entertainment/media overrepresented).
Performance measured under a single harness and search backend.
Synthetic split differs in surface form and domain composition from verified set.
Web evidence changes over time, requiring continued revalidation.
No multilingual comparison or human baseline provided (noted as future work).

The authors release the data and code publicly at https://github.com/prometheus-eval/K-BrowseComp to encourage development of reliable web-browsing agents for Korean web environments.