Summary (Overview)
- K-BrowseComp is a new web-browsing agent benchmark with 400 problems grounded in Korean contexts, comprising a 300-question human-verified subset (K-BrowseComp-Verified) and a 100-question synthetic diagnostic split.
- The strongest model evaluated, GPT-5.5, achieves only 45.67% accuracy on the verified subset, a substantial drop from its 84.4% on the original English BrowseComp, while DeepSeek-V4-Pro and GLM-5.1 obtain 30.00% and 30.67%, respectively.
- Korean open-weight models developed through government funding score only 0.00–10.33%, revealing a large gap between global and Korean models for agentic tasks.
- A novel synthetic generation pipeline using hard few-shot exemplars and failure-mode-targeted creation produces challenging diagnostic problems on which the strongest model reaches only 26.00%.
- Trajectory analysis shows failures are not due to insufficient search but to poor state maintenance: models lose track of candidates, constraints, role bindings, and the final-answer state across multiple turns.
Introduction and Theoretical Foundation
The paper addresses a critical gap in Korean AI evaluation. While frontier models (from the US and China) are shifting from static benchmarks toward compositional, agentic evaluation—including web browsing—the Korean AI community remains anchored to static benchmarks that measure foundational capabilities like instruction following and reasoning. Korean agentic benchmarks are virtually nonexistent.
The motivation is twofold:
- For Korean developers and users: Korea’s smaller language population creates a structural disadvantage for AI sovereignty when queries require local and cultural knowledge (Kim et al., 2024; Son et al., 2025).
- For the broader research community: As frontier models saturate existing benchmarks, agentic benchmarks grounded in linguistically and culturally distinct contexts provide a principled testbed for measuring generalization (Romanou et al., 2025; Whitehouse et al., 2026).
The authors focus on browsing agents because:
- They are uniquely dependent on local/cultural knowledge—their core function retrieves region-specific information from the web.
- They are inherently compositional, combining instruction following, tool calling, and multi-turn interaction.
The paper draws on prior work including BrowseComp (Wei et al., 2025) and BrowseComp-ZH (Zhou et al., 2025), but emphasizes that Korean web environments (search conventions, local entities, semi-structured pages, culturally grounded clues) require a dedicated benchmark. It extends Korean evaluation resources—KorQuAD, KoBEST, KMMLU, CLIcK—into the agentic domain.
Methodology
Dataset Construction
K-BrowseComp-Verified (300 problems):
- Constructed by 17 human annotators (no LLM assistance allowed).
- Each question must (a) be grounded in Korean contexts with public web evidence, (b) be difficult to answer via direct search but easy to verify once the answer is found, and (c) require multi-hop reasoning (≥4 steps) or parallel constraint satisfaction (≥4 constraints).
- Answers must be unique and temporally stable.
- All items manually verified by authors—checking gold answer, intermediate entities, cited sources, and consistency.
Synthetic split (100 problems):
- Uses a generation pipeline with Claude Code (claude-opus-4.7) as a proposer.
- Leverages the failure taxonomy derived from K-BrowseComp-Verified (Table 1) to target specific weakness modes (F1–F8, excluding F0).
- Generation process: the proposer opens a seed web page, constructs a question backwards (multi-hop or parallel-constraint), withholds the answer, source, and identifying entity, and iteratively refines the question over 4 rounds (draft → test → revise).
- Three successive filters: (1) searchability—gold answer should not appear directly in search results; (2) well-formedness—reference solver must recover answer from source page; (3) adversarial difficulty—both gpt-5.4-mini and gemini-3-flash-preview must fail on the question, and failure must be attributable to one of F1–F8.
- 268 candidates generated; 100 accepted (37.3% yield).
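The three successive filters and the reported yield can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` record, its field names, and the substring and exact-match checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Failure modes the synthetic split targets (F1-F8; F0 is excluded).
TARGET_MODES = {f"F{i}" for i in range(1, 9)}


@dataclass
class Candidate:
    """Hypothetical record for one generated question (field names are illustrative)."""
    gold_answer: str
    search_snippets: List[str]            # text of search results for the question
    reference_solver_answer: str          # answer recovered from the source page
    filter_model_answers: Dict[str, str]  # filter model name -> its final answer
    failure_mode: Optional[str] = None    # attributed mode, e.g. "F7"


def passes_searchability(c: Candidate) -> bool:
    # (1) The gold answer should not appear directly in search results.
    return all(c.gold_answer not in snippet for snippet in c.search_snippets)


def passes_well_formedness(c: Candidate) -> bool:
    # (2) A reference solver must recover the answer from the source page.
    return c.reference_solver_answer == c.gold_answer


def passes_adversarial(c: Candidate) -> bool:
    # (3) Every filter model must fail, and the failure must map to F1-F8.
    all_fail = all(a != c.gold_answer for a in c.filter_model_answers.values())
    return all_fail and c.failure_mode in TARGET_MODES


def apply_filters(candidates: List[Candidate]) -> List[Candidate]:
    """Apply the three filters in order; `and` short-circuits on the first failure."""
    return [
        c for c in candidates
        if passes_searchability(c) and passes_well_formedness(c) and passes_adversarial(c)
    ]


def yield_rate(accepted: int, generated: int) -> float:
    """Percentage of generated candidates that survive all filters."""
    return round(100 * accepted / generated, 1)


print(yield_rate(100, 268))  # 37.3
```

Ordering the cheap checks first means the more expensive adversarial runs are only needed for candidates that are already searchable-safe and well-formed.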
Failure-Mode Taxonomy
| Mode | Name | Definition |
|---|---|---|
| F0 | Incomplete trajectory or malformed output | Model produces incomplete trajectory, malformed output, or no valid final answer. |
| F1 | Ineffective initial search direction | Model fails to choose a useful initial search strategy. |
| F2 | Search-access structure failure | Model fails to access evidence hidden behind difficult page structures. |
| F3 | Cross-source hopping failure | Model fails to connect evidence across weakly linked sources or entity contexts. |
| F4 | Semi-structured parsing failure | Model misreads tables, lists, rankings, databases, or institutional pages. |
| F5 | Search-result selection failure | Model retrieves relevant evidence but selects the wrong source or candidate. |
| F6 | Sparse entity normalization failure | Model fails to resolve rare names, aliases, spelling variants, or historical names. |
| F7 | Constraint-tracking failure | Model finds partial candidates but fails to satisfy all constraints. |
| F8 | Intermediate reasoning failure | Model fails at date arithmetic, ordering, counting, comparison, or filtering. |
Evaluation Protocol
- Built on the search_evals framework (Perplexity Research, 2025) with the deep-research agent and Perplexity Search backend.
- Budget: 10 search calls per question.
- Models evaluated include proprietary (GPT-5.5, GPT-5.4-mini, Gemini-3.1-Flash-Lite) and open-weight (DeepSeek-V4-Pro, GLM-5.1, Qwen3.6-35B-A3B, Gemma-4-31B-it, and Korean models K-EXAONE-236B-A23B, A.X-4.0, HyperCLOVAX-SEED-Think-32B, Kanana-2-30B-A3B-Thinking-2601).
- Final answer extracted by GPT-5.4-mini and matched against gold answer (pass@1).
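As a rough sketch of the scoring step, pass@1 accuracy is the share of questions whose single attempt yields the gold answer. The normalized string comparison below is an assumption standing in for the actual matching step, and `normalize` and `pass_at_1` are hypothetical helper names.

```python
import unicodedata
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Illustrative normalization: Unicode NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", answer).casefold()
    return " ".join(text.split())


def pass_at_1(results: List[Tuple[str, str]]) -> float:
    """Percentage of questions whose single extracted answer matches the gold answer.

    `results` holds (extracted_answer, gold_answer) pairs, one attempt per question.
    """
    if not results:
        raise ValueError("no results to score")
    correct = sum(normalize(pred) == normalize(gold) for pred, gold in results)
    return round(100 * correct / len(results), 2)


# Toy example: one match out of three attempts.
toy = [("Ladies' Code", "ladies'  code"), ("Winner", "Ladies' Code"), ("", "x")]
print(pass_at_1(toy))  # 33.33
```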
Empirical Validation / Results
Main Results on K-BrowseComp-Verified
| Model | Access | Pass@1 Acc. (%) | Calib. Err. (%) |
|---|---|---|---|
| GPT-5.5 | Closed | 45.67 | 31.86 |
| GPT-5.4-mini | Closed | 30.67 | 37.88 |
| DeepSeek-V4-Pro | Open | 30.00 | 17.72 |
| GLM-5.1 | Open | 30.67 | 27.07 |
| Qwen3.6-35B-A3B | Open | 12.00 | 47.89 |
| Gemini-3.1-Flash-Lite | Closed | 11.33 | 56.55 |
| Gemma-4-31B-it | Open | 23.33 | 23.66 |
| K-EXAONE-236B-A23B | Open | 10.33 | 24.09 |
| A.X-4.0 | Open | 5.33 | 47.89 |
| HCX-SEED-Think-32B | Open | 2.33 | 77.37 |
| Kanana-2-30B-A3B-Think | Open | 0.00 | – |
Key findings:
- GPT-5.5 leads at 45.67%, far below its 84.4% on original BrowseComp.
- Korean open-weight models score only 0.00–10.33%.
- Calibration error varies widely; Korean models show poor confidence alignment (e.g., HCX-SEED-Think-32B at 77.37%).
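The reported percentages can be read as counts out of 300. The sketch below assumes each accuracy is simply correct answers divided by 300 and rounded to two decimals; the summary reports only percentages, so the counts are inferred, not reported.

```python
VERIFIED_SIZE = 300  # questions in K-BrowseComp-Verified

# Pass@1 accuracies (%) copied from the main results table.
REPORTED_ACC = {
    "GPT-5.5": 45.67,
    "GPT-5.4-mini": 30.67,
    "DeepSeek-V4-Pro": 30.00,
    "GLM-5.1": 30.67,
    "Qwen3.6-35B-A3B": 12.00,
    "Gemini-3.1-Flash-Lite": 11.33,
    "Gemma-4-31B-it": 23.33,
    "K-EXAONE-236B-A23B": 10.33,
    "A.X-4.0": 5.33,
    "HCX-SEED-Think-32B": 2.33,
    "Kanana-2-30B-A3B-Think": 0.00,
}


def implied_correct(acc_pct: float, n: int = VERIFIED_SIZE) -> int:
    """Number of correct answers implied by an accuracy percentage (inferred)."""
    return round(acc_pct * n / 100)


def gap_points(english_acc: float, korean_acc: float) -> float:
    """Drop in percentage points between two accuracies."""
    return round(english_acc - korean_acc, 2)


for model, acc in REPORTED_ACC.items():
    print(f"{model}: about {implied_correct(acc)} of {VERIFIED_SIZE} correct")

# GPT-5.5: 84.4% on the original BrowseComp vs. 45.67% here.
print(gap_points(84.4, 45.67))  # 38.73
```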
Results on Synthetic Split
| Model | Pass@1 Acc. (%) |
|---|---|
| GPT-5.5 | 26.00 |
| DeepSeek-V4-Pro | 22.00 |
| GLM-5.1 | 19.00 |
| Gemma-4-31B-it | 17.00 |
| Qwen3.6-35B-A3B | 15.00 |
| K-EXAONE-236B-A23B | 13.00 |
| Gemini-3.1-Flash-Lite | 11.00 |
| A.X-4.0 | 1.00 |
| HCX-SEED-Think-32B | 2.00 |
| Kanana-2-30B-A3B-Think | 0.00 |
No model exceeds 30%. GPT-5.4-mini is omitted from the table: it scores 0.0% by construction, since every accepted question had to defeat it during adversarial filtering.
Trajectory-Level Failure Patterns
Three recurrent post-retrieval failures identified:
- Candidate capture (F5+F7): Model commits to a plausible entity before verifying all constraints; subsequent searches become confirmatory (e.g., searching OST lyrics before identifying the correct drama).
- Unmerged evidence branches (F7): Model searches relevant clues but never joins them as filters over a shared candidate set (e.g., K-pop group constraint intersection yields wrong answer Winner instead of Ladies’ Code).
- Misbound evidence chains (F3): Model binds intermediate results to wrong roles, swapping entity types across steps (e.g., university cheer song phrase from wrong institution).
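The "unmerged evidence branches" pattern is easiest to see by contrast with what merging would look like: each clue yields a set of supported candidates, and the answer must lie in their intersection. The sketch below is an illustration of that idea only; the function names and the placeholder entities (`group_a` and so on) are hypothetical and do not come from the benchmark.

```python
from typing import Dict, List, Set


def merge_branches(branches: Dict[str, Set[str]]) -> Set[str]:
    """Keep only candidates supported by every constraint's evidence branch."""
    if not branches:
        return set()
    return set.intersection(*(set(b) for b in branches.values()))


def unmet_constraints(candidate: str, branches: Dict[str, Set[str]]) -> List[str]:
    """List the constraints a candidate fails, guarding against early commitment."""
    return [name for name, supported in branches.items() if candidate not in supported]


# Placeholder evidence: constraint name -> candidates the retrieved pages support.
branches = {
    "debut_year": {"group_a", "group_b", "group_c"},
    "member_count": {"group_b", "group_c"},
    "label": {"group_c", "group_d"},
}

print(merge_branches(branches))                # {'group_c'}
print(unmet_constraints("group_b", branches))  # ['label']
```

A candidate such as `group_b` looks plausible after two clues, which is the point at which candidate capture occurs; checking `unmet_constraints` before answering exposes the remaining violation.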
Search Effort Analysis
| Model | Correct avg. calls | Wrong avg. calls | Δ calls |
|---|---|---|---|
| GPT-5.5 | 7.08 | 9.30 | +2.22 |
| DeepSeek-V4-Pro | 7.47 | 9.80 | +2.33 |
| Gemma-4-31B-it | 5.20 | 8.10 | +2.90 |
| A.X-4.0 | 2.38 | 1.43 | -0.95 |
Incorrect trials generally use more search calls, not fewer, indicating that the bottleneck is state maintenance rather than retrieval volume. A.X-4.0 is the exception: it issues very few calls overall and uses fewer on wrong trials (1.43) than on correct ones (2.38).
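The Δ column is the wrong-trial average minus the correct-trial average, which can be checked directly from the table. The helper names below are illustrative.

```python
from typing import Dict, List, Tuple

# (correct avg. calls, wrong avg. calls) copied from the search effort table.
SEARCH_CALLS: Dict[str, Tuple[float, float]] = {
    "GPT-5.5": (7.08, 9.30),
    "DeepSeek-V4-Pro": (7.47, 9.80),
    "Gemma-4-31B-it": (5.20, 8.10),
    "A.X-4.0": (2.38, 1.43),
}


def delta_calls(correct_avg: float, wrong_avg: float) -> float:
    """Extra search calls spent on wrong trials relative to correct ones."""
    return round(wrong_avg - correct_avg, 2)


def models_searching_more_when_wrong(table: Dict[str, Tuple[float, float]]) -> List[str]:
    """Models whose incorrect trials use more calls than their correct trials."""
    return [m for m, (c, w) in table.items() if delta_calls(c, w) > 0]


for model, (correct_avg, wrong_avg) in SEARCH_CALLS.items():
    print(f"{model}: {delta_calls(correct_avg, wrong_avg):+.2f}")

print(models_searching_more_when_wrong(SEARCH_CALLS))
```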
Theoretical and Practical Implications
- Theoretical: The benchmark demonstrates that agentic evaluation in non-English contexts reveals capabilities hidden by English-centric benchmarks. The substantial drop from BrowseComp (84.4%→45.67%) shows frontier models are not robustly generalizing to Korean web environments. The failure taxonomy provides a principled framework for diagnosing agent weaknesses at the trajectory level.
- Practical for Korean AI ecosystem: Korean open-weight models, despite their training on Korean language, fail dramatically on agentic tasks. This suggests that static benchmarks for Korean are insufficient; investment in tool-use training and long-horizon state tracking is needed. The benchmark serves as a diagnostic target for the government-funded “Proprietary AI Foundation Model Project.”
- Synthetic generation methodology: The paper exploits the asymmetry between solving and creating browsing problems—it is easier to verify an answer than to find it, and similarly easier to create hard problems when failure modes are known. This pipeline can scale agentic evaluation for other languages and domains.
- Calibration issues: High calibration errors (e.g., 77.37% for HCX-SEED-Think-32B) indicate models are overconfident when wrong, which is especially problematic for deployment.
Conclusion
K-BrowseComp is a Korean web-browsing agent benchmark with 400 problems (300 human-verified, 100 synthetic). Even the strongest models score only 30.00–45.67% on the verified subset, and Korean open-weight models lag substantially behind their global counterparts. The synthetic split is similarly difficult, which supports the failure-mode-targeted generation pipeline.
Key takeaway: Many failures occur after models retrieve relevant Korean web evidence—they fail to maintain candidates, constraints, source pointers, or final-answer state across the trajectory. Progress requires stronger trajectory-level state maintenance, not just broader language coverage or larger model scale.
Limitations acknowledged:
- Modest scale (300 verified) with uneven domain coverage (entertainment/media overrepresented).
- Performance measured under a single harness and search backend.
- Synthetic split differs in surface form and domain composition from verified set.
- Web evidence changes over time requiring continued revalidation.
- No multilingual comparison or human baseline provided (noted as future work).
The authors release the data and code publicly at https://github.com/prometheus-eval/K-BrowseComp to encourage development of reliable web-browsing agents for Korean web environments.