OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

Summary (Overview)

OCC-RAG is a family of small language models (0.6B and 1.7B parameters) specifically designed for faithful, context-grounded question answering.
The models are mid-trained on a novel synthetic corpus of >3M examples that targets multi-hop reasoning, strict context faithfulness, and calibrated abstention.
OCC-RAG produces structured reasoning traces with source citations (Query Analysis, Source Analysis, Reasoning, Status, Answer) anchored to literal quotes from the provided context.
Despite being 2–6× smaller, OCC-RAG matches or exceeds general-purpose models up to 4B parameters on multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA) and achieves the best faithfulness and refusal performance across all evaluated scales (ConFiQA, MuSiQue-Un).
The work demonstrates that faithfulness does not depend on scale alone; it can be learned effectively through the right training curriculum and supervision format in compact, task-specialized architectures.

Introduction and Theoretical Foundation

Background

Frontier language models grow larger and absorb more world knowledge, but many practical applications benefit more from compact, task-specialized architectures (Small Language Models – SLMs). SLMs have shown competitive performance on commonsense reasoning, mathematical reasoning, tool calling, and retrieval-augmented generation.

Context QA and Faithfulness

The paper focuses on Context Question Answering (Context QA): models answer questions based exclusively on a provided context, generating responses strictly derived from that input. A central requirement is faithfulness:

Outputs must be aligned with evidence from the context.
The model must avoid hallucination and ignore parametric knowledge when it conflicts with the context.

Faithfulness thus measures both the alignment of the answer with the evidence and the absence of hallucinated content.

Three Core Capabilities for Context QA

OCC-RAG is built around three capabilities:

Multi-hop inference and commonsense reasoning – synthesizing information across disparate parts of the context and bridging logical gaps.
Avoidance of memorization – pretraining knowledge must not override or interfere with the provided context.
Safe abstention – declining to answer when the context is insufficient, ambiguous, or lacks the necessary information.

Why Mid-training?

Mid-training is a core stage that explicitly shapes the SLM’s reasoning architecture for Context QA. It enables:

Strong multi-hop reasoning by training on reasoning-trace datasets that internalize the functional structure of multi-hop inference (subquestion decomposition, verification).
Faithful, non-memorized QA by tying every reasoning step back to provided evidence.
Calibrated abstention by including “context-insufficient” examples with explicit reasoning-trace patterns.

Methodology

Training Data Generation Pipeline

A synthetic corpus of ~3.25M QA pairs was created, targeting three properties: reasoning over context, strict faithfulness (answer recoverable from context alone), and a fraction of unanswerable examples.

1. Single-hop QA Generation (largest subset: 2.78M pairs)

Ingest and chunk English Wikipedia XML – paragraphs become candidate chunks.
QA Generation – for each gold paragraph, gpt-oss-120B generates ten short QA pairs (JSON array, answers must be extractive).
Distractor mining – up to 1,000 child pages are taken from the Wikipedia link graph, chunked, and scored by TF-IDF cosine similarity against the gold paragraph; the top 20 chunks are kept.
Filtration – LLM-as-judge removes inaccurate or illogical QA pairs.
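
The distractor-mining step can be sketched as follows. This is a minimal standard-library illustration of TF-IDF cosine ranking, not the authors' implementation: the tokenizer, the smoothed IDF formula, and the function names (`tfidf_vectors`, `mine_distractors`) are assumptions, and only the top-20 cutoff comes from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_distractors(gold, candidates, top_k=20):
    """Rank candidate chunks by similarity to the gold paragraph; keep top_k."""
    vectors = tfidf_vectors([gold] + candidates)
    gold_vec, cand_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(candidates, (cosine(gold_vec, v) for v in cand_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:top_k]]


gold = "The Eiffel Tower is a wrought-iron lattice tower in Paris."
candidates = [
    "Paris is the capital of France and home to the Eiffel Tower.",
    "The mitochondrion is an organelle found in cells.",
    "Gustave Eiffel designed the iron tower for the 1889 fair.",
]
top_two = mine_distractors(gold, candidates, top_k=2)
```

Ranking by similarity to the gold paragraph yields topically close distractors, which are harder to tell apart from the gold context than random chunks would be.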

2. Multi-hop QA Generation (262k single-context, 165k multi-context pairs)

Requires synthesizing multiple facts. Uses a Knowledge Graph (KG) extracted from context (Wikontic pipeline) to condition generation.
Path sampling – adopts DRAGOn benchmark taxonomy: simple questions, two-hop families (set, multi-hop, condition), three-hop bamboo-style questions. Each type corresponds to a SPARQL template selecting a sub-graph.
For each sampled path, gpt-oss-120B receives a type-specific prompt with gold path and supporting paragraphs, generating one QA pair per path.
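
As an illustration of path sampling, the sketch below enumerates two-hop paths over a list of (subject, relation, object) triples. It is a plain-Python stand-in for the SPARQL templates mentioned above, not the Wikontic or DRAGOn code; the triple format, the relation names, and the helper names are hypothetical.

```python
import random
from collections import defaultdict


def two_hop_paths(triples):
    """Enumerate paths (s, r1, m, r2, o): the first triple's object is the second's subject."""
    by_subject = defaultdict(list)
    for s, r, o in triples:
        by_subject[s].append((r, o))
    paths = []
    for s, r1, m in triples:
        for r2, o in by_subject.get(m, []):
            if o != s:  # skip trivial cycles back to the start entity
                paths.append((s, r1, m, r2, o))
    return paths


def sample_path(triples, seed=0):
    """Pick one two-hop path at random, or None if the graph has none."""
    paths = two_hop_paths(triples)
    return random.Random(seed).choice(paths) if paths else None


triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
]
paths = two_hop_paths(triples)
# paths == [("Marie Curie", "born_in", "Warsaw", "capital_of", "Poland")]
```

A sampled path like this one would then be passed to the generator together with its supporting paragraphs, so that the resulting question requires both facts.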

3. Unanswerable Question Construction (43k abstain pairs)

Uses a DeBERTa model fine-tuned on SQuAD to answer each question from reduced subsets of its gold contexts. If the predicted answer no longer matches the original, the reduced context is treated as missing critical information, and the example is labeled unanswerable so that the target behavior is abstention.
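
A minimal sketch of this construction, assuming a leave-one-out reduction of the gold paragraphs (the text only says "reduced subsets"). The `reader` argument stands in for the extractive QA model; `toy_reader` is a hypothetical stub used only to make the example runnable.

```python
def normalize(text):
    """Case- and whitespace-insensitive comparison key."""
    return " ".join(text.lower().split())


def build_abstain_examples(question, answer, gold_paragraphs, reader):
    """Drop one gold paragraph at a time; keep reductions that break the answer."""
    examples = []
    for i in range(len(gold_paragraphs)):
        reduced = gold_paragraphs[:i] + gold_paragraphs[i + 1:]
        predicted = reader(question, reduced)
        if normalize(predicted) != normalize(answer):
            examples.append({
                "question": question,
                "context": reduced,
                "status": "UNANSWERABLE",
            })
    return examples


def toy_reader(question, paragraphs):
    """Stub reader: returns 'Poland' only when both supporting facts are present."""
    text = " ".join(paragraphs)
    if "born in Warsaw" in text and "capital of Poland" in text:
        return "Poland"
    return ""


paragraphs = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Curie won two Nobel Prizes.",
]
abstain = build_abstain_examples(
    "Marie Curie was born in the capital of which country?",
    "Poland",
    paragraphs,
    toy_reader,
)
# Removing either supporting fact breaks the answer; removing the third paragraph does not.
```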

4. Structured Reasoning Traces

Every QA pair is enriched with an explicit reasoning trace (generated by Qwen3.5-27B in non-thinking mode):

Query Analysis – what the question asks.
Source Analysis – which sources are relevant.
Reasoning – how facts combine.
Status – explicit ANSWERABLE/UNANSWERABLE verdict.
Answer – final answer string.

Filtering checks: format, answer match (exact match), LLM-as-judge (Qwen3-4B for borderline cases), and overthinking (traces > 1256 tokens or >10 thinking markers dropped).
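
The rule-based part of these checks can be sketched as below. Several details are assumptions made for illustration: the `Section Name:` header syntax, whitespace splitting as a stand-in for the real tokenizer, and the list of thinking markers. The LLM-as-judge step is omitted. Only the five section names and the two thresholds come from the text.

```python
SECTIONS = ["Query Analysis", "Source Analysis", "Reasoning", "Status", "Answer"]
MAX_TRACE_TOKENS = 1256
MAX_THINKING_MARKERS = 10
THINKING_MARKERS = ("wait", "hmm")  # hypothetical marker list


def normalize(text):
    return " ".join(text.lower().split())


def parse_trace(trace):
    """Return {section: body} if all five headers appear in order, else None."""
    positions = []
    cursor = 0
    for name in SECTIONS:
        header = name + ":"
        start = trace.find(header, cursor)
        if start == -1:
            return None
        positions.append((name, start, start + len(header)))
        cursor = start + len(header)
    sections = {}
    for i, (name, _, body_start) in enumerate(positions):
        body_end = positions[i + 1][1] if i + 1 < len(positions) else len(trace)
        sections[name] = trace[body_start:body_end].strip()
    return sections


def keep_trace(trace, gold_answer):
    """Apply format, answer-match, and overthinking checks; gold_answer=None means unanswerable."""
    sections = parse_trace(trace)
    if sections is None:
        return False
    expected_status = "UNANSWERABLE" if gold_answer is None else "ANSWERABLE"
    if sections["Status"] != expected_status:
        return False
    if gold_answer is not None and normalize(sections["Answer"]) != normalize(gold_answer):
        return False
    if len(trace.split()) > MAX_TRACE_TOKENS:  # whitespace tokens, an approximation
        return False
    lowered = trace.lower()
    marker_count = sum(lowered.count(m) for m in THINKING_MARKERS)
    return marker_count <= MAX_THINKING_MARKERS


good_trace = (
    "Query Analysis: asks for the country of the birth city.\n"
    "Source Analysis: [1] 'born in Warsaw'; [2] 'capital of Poland'.\n"
    "Reasoning: Warsaw is the capital of Poland.\n"
    "Status: ANSWERABLE\n"
    "Answer: Poland"
)
```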

Dataset Statistics

Subset	Pairs	Tokens (Qwen3)
Single-hop	2.78M	7.76B
Multi-hop single-context	262k	0.16B
Multi-hop multi-context	165k	0.21B
Abstain (unanswerable)	43k	0.029B
Total	~3.25M	~8.16B

Distractor contexts consume the largest share of tokens (35%–75%), followed by gold contexts and reasoning chains.

Mid-training Procedure

Base models: Qwen3-0.6B-Base and Qwen3-1.7B-Base (selected over Gemma3 and SmolLM3 based on a held-out QA slice).
Objective: Supervised fine-tuning, loss applied only to response tokens. Prompt includes question + context passages (random order, numeric source identifiers). Response is the structured reasoning trace.
Special tokens delimit prompt elements and response sections; their embeddings are initialized from the mean of subword embeddings of natural-language names.
Data mixing: Multi-hop examples are oversampled 3× per epoch; single-hop shown once. No curriculum schedule used.
Training: ~9 × 10⁹ tokens, ~17 hours (0.6B) and ~28 hours (1.7B) on 8× NVIDIA H100 (80 GB).
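
Two of the mechanics above, response-only loss and mean-initialized special-token embeddings, can be sketched in plain Python. This is an illustrative sketch rather than the authors' training code: the function names and the toy token ids are hypothetical, and the `-100` label value follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def mean_embedding(subword_ids, embedding_table):
    """Initialize a new token's vector as the mean of its name's subword vectors."""
    rows = [embedding_table[i] for i in subword_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


input_ids, labels = build_labels([5, 6, 7], [8, 9])
# labels == [-100, -100, -100, 8, 9]

toy_table = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
new_vector = mean_embedding([0, 2], toy_table)
# new_vector == [2.0, 3.0]
```

As a rough consistency check, assuming a single epoch with abstain examples shown once, the data mix gives 7.76 + 3 × (0.16 + 0.21) + 0.029 ≈ 8.9B tokens, close to the reported ~9 × 10⁹.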

Empirical Validation / Results

Benchmarks

Dataset	# Samples	# Sources	Task	Metric
HotpotQA	7,405	10	Multi-hop reasoning	In-Acc ↑
MuSiQue	2,417	10	Hard multi-hop	In-Acc ↑
TAT-QA	906	1	Table multi-hop	F1 ↑
ConFiQA	6,000×3	1	Faithfulness	In-Acc ↑, MR ↓
MuSiQue-Un	2,417	10	Refusal	R-Acc ↑

ConFiQA subsets: QA (single counterfactual triple), MR (multi-hop chain, one counterfactual), MC (multi-hop chain, all counterfactual). The MR subset is distinct from the Memorization Ratio metric, which shares the abbreviation.
Memorization Ratio (MR): $M_R = \frac{P_o}{P_o + P_c}$, where $P_o$ is the rate of original (memorized) answers and $P_c$ is the rate of counterfactual (context-grounded) answers. Lower is better.
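
A short worked example of the memorization ratio, computed from answer counts. The counts are made up for illustration and the function name is hypothetical; how an output is classified as "original" or "counterfactual" is not specified here and is left outside the sketch.

```python
def memorization_ratio(n_original, n_counterfactual, n_total):
    """M_R = P_o / (P_o + P_c), with rates taken over n_total evaluated questions."""
    p_o = n_original / n_total
    p_c = n_counterfactual / n_total
    if p_o + p_c == 0:
        raise ValueError("No answer matched either the original or the counterfactual.")
    return p_o / (p_o + p_c)


# Illustrative counts: 40 memorized answers, 760 context-grounded answers, 1000 questions.
mr = memorization_ratio(40, 760, 1000)
print(f"{100 * mr:.1f}")  # prints 5.0
```

Outputs that match neither answer lower both rates but do not enter the ratio, so MR isolates how often parametric memory wins when the model does commit to one of the two candidates.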

Main Results

Model	HotpotQA In-Acc ↑	MuSiQue In-Acc ↑	TAT-QA F1 ↑	ConFiQA In-Acc ↑	ConFiQA MR ↓	MuSiQue-Un R-Acc ↑
Gemma3-1B-it	30.8	12.8	53.6	62.1	7.7	2.2
Gemma3-4B-it	55.8	30.1	65.3	69.8	8.9	55.8
Qwen3-0.6B	34.8 (41.8)	13.2 (17.2)	62.5 (66.3)	59.7 (64.5)	9.0 (8.2)	6.3 (70.0)
Qwen3-1.7B	47.7 (60.9)	20.1 (30.7)	74.4 (74.8)	64.8 (70.4)	12.7 (8.3)	54.7 (82.8)
Qwen3-4B	60.6 (67.1)	33.1 (41.5)	76.9 (79.1)	69.7 (74.1)	10.3 (7.5)	64.1 (84.0)
Qwen3-8B	68.7 (70.3)	39.3 (43.9)	72.9 (74.5)	75.9 (77.6)	9.2 (6.9)	90.7 (90.3)
Qwen3-32B	70.9 (71.4)	49.7 (49.3)	75.9 (76.7)	72.0 (75.8)	11.5 (8.5)	80.7 (87.0)
SmolLM3-3B	49.9 (56.5)	21.5 (29.4)	71.1 (69.7)	58.6 (60.5)	15.4 (13.3)	32.1 (77.1)
Pleias-RAG-1.2B	48.5	15.0	8.4	37.3	25.3	21.9
OCC-RAG-0.6B	57.6	36.6	75.0	79.9	5.2	86.9
OCC-RAG-1.7B	60.9	38.2	81.0	81.4	5.0	87.2

Best per column in bold, second-best underline.
Parentheses for Qwen3/SmolLM3 indicate thinking mode results.
OCC-RAG-0.6B exceeds Qwen3-1.7B in thinking mode (2.8× larger) by 9.5 points on ConFiQA In-Acc (79.9 vs. 70.4) and reduces MR from 8.3 to 5.2.
OCC-RAG-1.7B achieves the highest ConFiQA accuracy (81.4) and lowest memorization ratio (5.0) across all models.
On refusal (MuSiQue-Un), OCC-RAG-1.7B attains 87.2 R-Acc, on par with models 8B+.
At 2–6× smaller size, OCC-RAG models are competitive with general-purpose models of up to 4B parameters (e.g., Qwen3-4B) on multi-hop reasoning while surpassing them on faithfulness and refusal.
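
The size ratios and the ConFiQA gap quoted above can be reproduced from the Main Results table. The sketch below assumes that the "2–6×" range is measured against a 4B-parameter model and that the ConFiQA comparison uses the thinking-mode figure for Qwen3-1.7B; both are readings of the table, not statements from the authors.

```python
def size_ratio(larger_b, smaller_b):
    """How many times larger one parameter count (in billions) is than another."""
    return larger_b / smaller_b


ratio_17_vs_06 = size_ratio(1.7, 0.6)  # Qwen3-1.7B vs. OCC-RAG-0.6B
confiqa_gap = 79.9 - 70.4              # OCC-RAG-0.6B vs. Qwen3-1.7B (thinking), In-Acc
ratio_4_vs_17 = size_ratio(4.0, 1.7)   # 4B model vs. OCC-RAG-1.7B
ratio_4_vs_06 = size_ratio(4.0, 0.6)   # 4B model vs. OCC-RAG-0.6B

print(f"{ratio_17_vs_06:.1f}x, {confiqa_gap:.1f} points, "
      f"{ratio_4_vs_17:.1f}x to {ratio_4_vs_06:.1f}x")
# prints: 2.8x, 9.5 points, 2.4x to 6.7x
```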

Key Observations

Faithfulness is not a function of scale alone: OCC-RAG’s training curriculum explicitly teaches context grounding, yielding MR values far below those of much larger models (e.g., Qwen3-32B MR=11.5).
Structured reasoning traces (with explicit ANSWERABLE/UNANSWERABLE status) enable calibrated abstention behavior.
Multi-hop reasoning benefits from the oversampling of multi-hop examples in the training data.

Theoretical and Practical Implications

Theoretical Implications

The results challenge the assumption that large parametric knowledge is necessary for robust QA. Faithfulness and reliable reasoning can be instilled in small models through meticulously designed training data and supervision formats.
Mid-training on synthetic reasoning traces serves as an effective transfer mechanism for task-specific capabilities (multi-hop inference, evidence grounding, abstention) without requiring massive scale.
The explicit ANSWERABLE/UNANSWERABLE status as a supervised target turns abstention into a structured behavior, not an ad-hoc heuristic.

Practical Implications

Deployment efficiency: OCC-RAG’s compact size (0.6B–1.7B) makes it suitable for resource-constrained environments, while delivering performance competitive with or exceeding models 6× larger on key metrics.
Trustworthiness: The model’s strict adherence to context, low memorization ratio, and learned abstention make it well-suited for high-stakes applications (e.g., legal, medical, financial QA) where hallucination is unacceptable.
Transparency: Structured reasoning traces with source citations provide chain-of-thought-level interpretability at a fraction of the computational cost of full thinking-mode inference.
Reusability: The synthetic data generation pipeline (single-hop, multi-hop, unanswerable) and the mid-training recipe provide a reusable blueprint for building compact, faithful QA systems.

Conclusion

OCC-RAG demonstrates that a compact, task-specialized small language model can match or exceed much larger general-purpose models on faithful context-grounded question answering. By combining:

Large-scale synthetic mid-training covering multi-hop reasoning, context faithfulness, and calibrated abstention,
Structured reasoning traces with explicit evidence citations,
and