RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

Summary (Overview)

  • Winning Judge-Orchestrated Ensemble: The authors' system achieved 1st place in SemEval-2026 Task 8 (MTRAGEval) using a heterogeneous ensemble of seven LLMs with two prompting variants. A GPT-4o-mini judge selects the best candidate per instance based on faithfulness, achieving a conditioned harmonic mean HM_3 = 0.7827 and outperforming the strongest baseline (0.6390).
  • Diversity Over Scale: The ensemble's success is attributed to complementary failure modes across diverse model families, scales, and prompting strategies. The ensemble consistently outperformed any single model, including the 357B-parameter GLM-4.6 (HM_3 = 0.748).
  • Domain-Adapted Lightweight Model: The authors introduce Meno-Lite-0.1, a cost-effective 7B model adapted for RAG tasks. While its direct contribution to the ensemble was minor, it demonstrated a strong cost-performance trade-off, matching 70B models on Russian benchmarks and achieving HM_3 = 0.681 on answerable instances.
  • Benchmark Analysis: The paper identifies critical limitations in the MTRAGEval benchmark, most notably a target leakage shortcut where all unanswerable questions have empty reference passages, allowing trivial abstention. It recommends adding distractor passages for future iterations.
  • Prompting Strategy: Category-aware few-shot prompting consistently outperformed an iteratively refined system prompt, especially for handling challenging underspecified questions, demonstrating the value of concrete behavioral examples over abstract instructions.

Introduction and Theoretical Foundation

The paper addresses the challenge of faithful multi-turn Retrieval-Augmented Generation (RAG), where a system must answer a user's query by grounding its response strictly in provided reference passages, while also incorporating the context of a multi-turn dialogue history. This requires handling coreference resolution, intent tracking, and abstention for unanswerable or underspecified queries.

SemEval-2026 Task 8 (MTRAGEval) formalizes this challenge. In Task B (generation with reference passages), the system receives a dialogue history and reference passages and must generate a final-turn response. The authors' core hypothesis is that a heterogeneous ensemble of LLMs, combined with an LLM-based judge for per-instance selection, can exploit complementary strengths to outperform any single model.

The theoretical foundation combines two lines of work:

  1. LLM Ensembles: Exploiting complementary strengths across different model families, scales, and prompting strategies (Jiang et al., 2023; Wang et al., 2024).
  2. LLM-as-a-Judge: Using LLMs for scalable, reference-free evaluation of qualities like faithfulness and appropriateness (Zheng et al., 2023; Liu et al., 2023).

Methodology

The system follows a three-stage pipeline:

  1. Prompt Construction: Two distinct strategies are used.
    • Group 1 (System Prompt Only): Models use an iteratively refined system prompt (PP) designed via a Gemini-based procedure (see Appendix B). The prompt emphasizes strict grounding, explicit abstention, concise formatting, and dialogue history use.
    • Group 2 (Few-Shot): Models use a simpler base prompt augmented with category-aware few-shot examples. Training instances are clustered into three categories (full context, empty context, empty history), and medoid exemplars are selected via embedding similarity using gte-multilingual-base; see the sketch after this pipeline.
  2. Candidate Generation: A heterogeneous ensemble of seven LLMs generates candidates in parallel. The ensemble is intentionally diverse in providers, training pipelines, and scales.
| Model | Size | Type | Group (Prompting) |
|---|---|---|---|
| Gemini-3-Pro-Preview | — | Proprietary | 1 (System Prompt) |
| GLM-4.6 | 357B | Open | 1 (System Prompt) |
| Llama-3.3-70B-Instruct | 70B | Open | 1 (System Prompt) |
| Qwen3-235B-A22B-Instruct-2507 | 235B | Open | 1 (System Prompt) |
| Claude 4.5 Haiku | — | Proprietary | 2 (Few-Shot) |
| Qwen2.5-32B-Instruct | 32B | Open | 2 (Few-Shot) |
| Meno-Lite-0.1 | 7B | Open | 2 (Few-Shot) |

  3. Judge-Based Selection: For each instance, GPT-4o-mini evaluates all seven candidate responses for faithfulness (whether every claim is supported by the provided passages) and assigns each a score from 0 to 1; the top-ranked candidate is selected. A deterministic post-processing step then replaces the ensemble output with "I don't have an answer" on instances whose reference context is empty (sketched below).
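
The Group 2 exemplar selection (step 1 above) reduces to picking, per category, the training instance closest on average to its peers in embedding space. Below is a minimal sketch assuming the sentence-transformers library and the public gte-multilingual-base checkpoint; the paper's exact clustering procedure and data fields may differ.

```python
# Minimal sketch of category-aware medoid selection (Group 2 few-shot prompts).
# Category names come from the paper; data contents here are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("Alibaba-NLP/gte-multilingual-base",
                              trust_remote_code=True)

def medoid(examples: list[str]) -> str:
    """Return the example whose embedding is, on average, closest to the rest."""
    emb = encoder.encode(examples, normalize_embeddings=True)
    sims = emb @ emb.T  # cosine similarities, since embeddings are L2-normalized
    return examples[int(np.argmax(sims.mean(axis=1)))]

# Training instances bucketed into the three categories named in the paper.
categories = {
    "full_context":  ["<instances with dialogue history and passages>"],
    "empty_context": ["<instances with no reference passages>"],
    "empty_history": ["<first-turn instances>"],
}
exemplars = {name: medoid(batch) for name, batch in categories.items()}
```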
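
The judge stage (step 3) can likewise be sketched with the OpenAI SDK. The rubric wording below is a simplified stand-in rather than the authors' actual prompt; only the 0-to-1 faithfulness scoring and the empty-context override come from the paper.

```python
# Minimal sketch of per-instance judge selection. The rubric wording is an
# assumption; the 0-to-1 faithfulness score and the empty-context override
# are taken from the paper.
from openai import OpenAI

client = OpenAI()

RUBRIC = ("Score the response's faithfulness to the passages from 0 to 1: "
          "every claim must be supported by the passages. "
          "Reply with a number only.\n\n"
          "Passages:\n{passages}\n\nResponse:\n{response}")

def faithfulness(passages: str, response: str) -> float:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": RUBRIC.format(passages=passages, response=response)}],
    )
    return float(out.choices[0].message.content.strip())

def select_response(passages: str, candidates: list[str]) -> str:
    # Deterministic post-processing: abstain whenever the reference context is empty.
    if not passages.strip():
        return "I don't have an answer"
    return max(candidates, key=lambda c: faithfulness(passages, c))
```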

Meno-Lite-0.1 Details (Appendix A): This 7B model is derived from Qwen2.5-7B-Instruct via a two-stage pipeline:

  1. Continued Pretraining (1.3B tokens): On a balanced bilingual mix of educational data (English FineWeb-Edu, Russian RuLM subset, Russian educational PDFs).
  2. Supervised Fine-Tuning (50M tokens): On a bilingual instruction set focused on extraction, normalization, summarization, and multi-hop QA skills, including data from MTRAGEval itself. Its design philosophy prioritizes language skills (comprehension, extraction) over world knowledge (facts, dates), aiming for competitive performance on context-grounded tasks with minimal parameters.

Empirical Validation / Results

Dataset: MTRAGEval Task B evaluation set (507 instances across four domains: FiQA, IBMCloud, CLAPnq, Govt). Answerability distribution: 56.2% answerable, 19.1% unanswerable, 15.4% underspecified, and 9.3% partially answerable.

Metrics: The official score is the conditioned harmonic mean of three metrics: RB_alg (itself a harmonic mean of BERTScore Recall, BERT-K-Precision, and ROUGE-L), RB_llm (an LLM-judge score), and RL_F (reference-less faithfulness). HM_3 denotes this final harmonic mean.
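
Because a harmonic mean is dominated by its weakest component, a low score on any one metric drags down the whole result. A minimal sketch of the unconditioned combination follows; the "conditioning" applied in the official scorer is not reproduced here.

```python
def hm_3(rb_alg: float, rb_llm: float, rl_f: float) -> float:
    """Harmonic mean of the three component metrics; zero if any component is zero."""
    vals = (rb_alg, rb_llm, rl_f)
    return 0.0 if min(vals) == 0 else 3 / sum(1 / v for v in vals)

# Illustration with the ensemble's component scores from the results table below:
print(round(hm_3(0.64, 0.84, 0.93), 3))  # -> 0.784, matching the reported ~0.78
```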

Main Results

The ensemble achieved 1st place out of 26 teams.

| System | HM_3 (cond.) |
|---|---|
| Our ensemble (1st place) | 0.7827 |
| Baseline: gpt-oss-120b | 0.6390 |

Individual Model and Ensemble Performance

The ensemble outperformed all individual models. The best single model was GLM-4.6.

| Model | Grp | RB_a_idk | RB_l_idk | RL_F_idk | HM_3 |
|---|---|---|---|---|---|
| GLM-4.6 | 1 | 0.63 | 0.77 | 0.89 | 0.75 |
| Gemini-3-Pro-Preview | 1 | 0.63 | 0.75 | 0.90 | 0.74 |
| Llama-3.3-70B-Instruct | 1 | 0.63 | 0.75 | 0.86 | 0.73 |
| Qwen3-235B-A22B | 1 | 0.53 | 0.85 | 0.87 | 0.72 |
| Claude 4.5 Haiku | 2 | 0.58 | 0.76 | 0.84 | 0.71 |
| Qwen2.5-32B-Instruct | 2 | 0.48 | 0.67 | 0.72 | 0.61 |
| Meno-Lite-0.1 | 2 | 0.50 | 0.59 | 0.70 | 0.59 |
| Ensemble (MTRAGEval) | — | 0.64 | 0.84 | 0.93 | 0.78 |
| Ensemble (our judge) | — | 0.71 | 0.82 | 0.98 | 0.82 |

Ablation Studies

Key findings from ablation experiments (unconditioned HM_3 scores; ANS = answerable, UND = underspecified instances):

| Ablation Configuration | ANS | UND |
|---|---|---|
| (i) Ensemble vs. individual models | | |
| Full ensemble (judge) | 0.79 | 0.37 |
| Best single: GLM + FS | 0.78 | 0.38 |
| (ii) Judge vs. random selection | | |
| Full ensemble (judge) | 0.79 | 0.37 |
| Full ensemble (random) | 0.76 | 0.35 |
| (iii) Contribution of Meno-Lite-0.1 | | |
| With Meno-Lite-0.1 | 0.79 | 0.37 |
| Without Meno-Lite-0.1 | 0.79 | 0.36 |
| (iv) Few-shot (FS) vs. system prompt (SP) | | |
| GLM-4.6 + FS | 0.78 | 0.38 |
| GLM-4.6 + SP | 0.78 | 0.34 |

  • Diversity and Selection: The ensemble beats any single model overall. Judge selection provides a clear gain over random selection, especially on faithfulness (RL_F).
  • Meno-Lite-0.1 Contribution: Its direct contribution to the ensemble was minimal (selected in only 2 of 424 instances), but its selected responses were high-quality (HM ≈ 0.707). Its value lies in cost-effective standalone performance.
  • Few-Shot Superiority: Few-shot prompting consistently outperformed the refined system prompt, with substantial gains on underspecified questions.

Meno-Lite-0.1 Benchmark Performance

The domain-adapted 7B model showed strong performance on related benchmarks:

  • MERA (Russian): Overall score 0.555, matching Llama-3.3-70B-Instruct (0.555) and surpassing base Qwen2.5-7B-Instruct (0.482).
  • NEREL-bench (Knowledge Graph): Highest harmonic mean (0.468), outperforming Qwen2.5-32B-Instruct (0.416).
  • LIBRA (Long-Context): Maintained near-perfect passkey retrieval and led 7B-class models on real-world multi-hop QA.

Theoretical and Practical Implications

Theoretical Implications:

  1. Complementarity > Scale: For grounded generation tasks, diversity in model architectures and failure modes can outweigh raw parameter count. A carefully orchestrated ensemble of smaller models can surpass a single, much larger model.
  2. Example-Based Learning: For complex, edge-case behaviors like handling underspecified queries, providing concrete few-shot examples is more effective than refining abstract, instructional system prompts.
  3. Faithfulness vs. Appropriateness: The judge optimized for faithfulness, but this sometimes came at the cost of pragmatic appropriateness (e.g., failing to ask for clarification). This highlights the need for multi-objective selection criteria in real-world applications.

Practical Implications:

  1. Cost-Effective RAG Models: The Meno-Lite-0.1 case study demonstrates that targeted domain adaptation and skill training can produce small (7B) models that are highly competent at context-grounded tasks, offering a viable path for resource-constrained deployments.
  2. Ensemble Design: Practitioners building similar systems should prioritize architectural and behavioral diversity in their candidate pool. A simple, lightweight LLM judge can effectively harness this diversity for per-instance improvement.
  3. Benchmark Design: The critical analysis reveals pitfalls in current RAG evaluation. Future benchmarks must avoid shortcuts (like empty contexts for unanswerable questions; a quick check for this is sketched below) and should align the guidelines used by both response generators and evaluators to ensure fair and meaningful assessment.
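
As an illustration of how easy the flagged shortcut is to detect (and exploit), the check below suffices; the instance field names are hypothetical.

```python
# Hypothetical field names; the check asks whether abstention is predictable
# from passage count alone, i.e., the target leakage the paper identifies.
def empty_context_rate(instances: list[dict]) -> float:
    unanswerable = [x for x in instances if x["answerability"] == "unanswerable"]
    empty = sum(1 for x in unanswerable if not x["passages"])
    return empty / len(unanswerable)  # 1.0 reproduces the reported shortcut
```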

Conclusion

The paper presents a winning solution for multi-turn faithful RAG based on a judge-orchestrated, heterogeneous LLM ensemble. The core takeaways are:

  • The ensemble's success stems from diversity in models and prompting, combined with informed per-instance selection.
  • Meno-Lite-0.1 proves that small, domain-adapted models can achieve strong cost-performance trade-offs for RAG.
  • Current evaluation benchmarks have critical limitations (e.g., the empty-context shortcut) that must be addressed to drive genuine progress.

Future work includes exploring adaptive routing to invoke specialized model subsets per instance, multi-objective judge selection, scaling domain adaptation, and improving calibration for underspecified queries. The authors also advocate for a fully open-weight pipeline to reduce dependency on proprietary APIs.