KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Summary (Overview)

  • Key Insight: Proposes a minimal-sufficiency perspective for hint-based Reinforcement Learning for Verifiable Reasoning (RLVR), identifying that performance improves sharply with a critical segment of knowledge (critical-segment effect) rather than monotonically with hint length.
  • Core Method: Introduces KnowRL, a framework that decomposes hints into atomic Knowledge Points (KPs) and uses Constrained Subset Search (CSS) to select compact, interaction-aware subsets for training, addressing the pruning interaction paradox (removing one KP may help, but removing multiple together can hurt).
  • Main Result: Trains KnowRL-Nemotron-1.5B, achieving a new state-of-the-art average accuracy of 74.16 (with KPs) and 70.08 (without KPs) across eight mathematical reasoning benchmarks at the 1.5B scale, significantly outperforming strong baselines.

Introduction and Theoretical Foundation

Reinforcement Learning for Verifiable Reasoning (RLVR) improves LLM reasoning by optimizing for rule-based correctness but suffers from reward sparsity on hard problems. Recent hint-based RL methods inject partial solutions or abstract templates to mitigate this. However, they treat hint design as a quantity expansion problem, leading to three key challenges:

  1. Guidance Redundancy: Only a small subset of information is needed to trigger successful reasoning.
  2. Cross-Hint Inconsistency: Longer hints can introduce branching or ambiguity.
  3. Guidance-Efficiency Trade-off: Abstraction-based hints often rely on costly teacher models.

The paper argues that the core challenge is selecting minimal, coherent knowledge units sufficient to overcome reward sparsity. It introduces a minimal-sufficiency perspective, empirically demonstrating the critical-segment effect where accuracy exhibits a sharp jump once key knowledge is provided.

Methodology

KnowRL follows a workflow of constructing candidate KPs, selecting a minimal-sufficient subset, and using it for RL training.

1. KP Curation Pipeline

For each training problem, a three-stage pipeline constructs candidate KPs:

  1. Generating Correct Solutions: Sample from a strong model (DeepSeek-R1) until a correct solution is obtained.
  2. Extracting Raw Knowledge Points: Prompt the model to extract only indispensable mathematical principles from the correct solution, yielding an initial set K = \{ k_1, k_2, ..., k_n \}.
  3. Leakage Verification: Automatically verify each KP to ensure it is generalizable and not instance-bound.
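The three stages can be sketched as a plain function; `sample_solution`, `extract_kps`, and `check_leakage` are hypothetical stand-ins for the DeepSeek-R1 prompting and verification steps described above, not actual APIs from the paper.

```python
def curate_kps(problem, answer, sample_solution, extract_kps, check_leakage, max_tries=8):
    """Three-stage KP curation sketch: solve, extract, verify."""
    # Stage 1: sample until a verifiably correct solution is obtained.
    solution = None
    for _ in range(max_tries):
        cand = sample_solution(problem)
        if cand["final_answer"] == answer:
            solution = cand["text"]
            break
    if solution is None:
        return []  # no correct solution found: the problem yields no KPs
    # Stage 2: extract only indispensable principles as atomic KPs.
    raw_kps = extract_kps(problem, solution)
    # Stage 3: keep only KPs that generalize (no instance-bound leakage).
    return [kp for kp in raw_kps if check_leakage(problem, kp)]
```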

2. Problem-wise KP Subset Selection

The goal is to select the most beneficial KP configuration K^* \subseteq K for each problem. Performance is estimated via offline accuracy A under different configurations: A_\emptyset (no KPs), A_K (all KPs), and A_{-i} = A(K \setminus \{k_i\}) (leave-one-out).
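As a sketch of how these configurations might be tabulated, the helper below scores the empty set, the full set, and every leave-one-out subset; `rollout_accuracy` is a hypothetical evaluator (e.g., mean pass rate over a batch of rollouts with the subset injected into the prompt).

```python
def score_configurations(kset, rollout_accuracy):
    """Estimate offline accuracy for the empty set (A_empty), the full set (A_K),
    and each leave-one-out subset (A_{-i})."""
    scores = {frozenset(): rollout_accuracy(set())}        # A_empty
    scores[frozenset(kset)] = rollout_accuracy(set(kset))  # A_K
    for k in kset:                                         # each A_{-i}
        sub = set(kset) - {k}
        scores[frozenset(sub)] = rollout_accuracy(sub)
    return scores
```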

Several selection strategies are explored:

  • Max-Score: Selects the configuration with the highest accuracy among \{\emptyset, K, K \setminus \{k_i\}\}.
  • Leave-One-Out (LOO) Strategies: A parameterized operator \Phi_\epsilon(K) selects based on a tolerance \epsilon.
    • S-LOO (\epsilon = 0): Strict selection.
    • T-LOO (\epsilon = 1/32): Tolerant selection allowing one-sample-scale rollback.
  • Consensus-Based Robust Selection (CBRS): Treats each of 8 evaluation runs independently. Defines near-optimal configurations for run j as O^{(j)} = \{ c \mid A^{(j)}(c) \geq \max_{c'} A^{(j)}(c') - \delta \} with \delta = 1/32. The robust consensus O^* is the intersection of these sets, falling back to the most-voted configuration when the intersection is empty.
  • Constrained Subset Search (CSS): Designed to address the pruning interaction paradox. It first identifies:
    • H = \{ k_i \mid A_{-i} \geq \max(A_K, A_\emptyset) \} (non-degrading KPs).
    • N = \{ k_i \in H \mid A_{-i} \geq A_{\max} \} (near-optimal removals), where A_{\max} = \max_i A_{-i}.
    KPs in N are removed directly. Let C = H \setminus N. CSS enumerates subsets only within C (search space size 2^{|C|}) and chooses the final configuration via S^* = \arg\max_S A(S) over all constrained candidates plus \emptyset and K.
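A minimal sketch of CSS under these definitions, assuming an `acc` oracle that returns the offline accuracy of any KP subset (the paper estimates it from rollouts; the function name is illustrative):

```python
from itertools import combinations

def css_select(kset, acc):
    """Constrained Subset Search sketch. acc(S) -> offline accuracy of KP subset S."""
    K = frozenset(kset)
    a_full, a_empty = acc(K), acc(frozenset())
    loo = {k: acc(K - {k}) for k in K}            # leave-one-out accuracies A_{-i}
    # H: KPs whose individual removal does not fall below either anchor config.
    H = {k for k, a in loo.items() if a >= max(a_full, a_empty)}
    a_max = max(loo.values(), default=a_full)     # A_max = max_i A_{-i}
    N = {k for k in H if loo[k] >= a_max}         # near-optimal removals: drop outright
    C = H - N                                     # interacting KPs: enumerate 2^|C| subsets
    base = K - N
    candidates = [frozenset(), K]                 # always include the empty and full sets
    for r in range(len(C) + 1):
        for drop in combinations(C, r):
            candidates.append(base - set(drop))
    return max(candidates, key=acc)               # S* = argmax_S A(S)
```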

3. RL Training Integration

The curated KP subsets (using CSS) are integrated into RL training via difficulty-aware prompt injection (added under a ## Hint header). The training uses GRPO-style group-based optimization with entropy annealing for faster convergence.
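The injection step might look like the sketch below; the `## Hint` header comes from the paper, but the bullet formatting and the accuracy-based difficulty `threshold` are illustrative assumptions, not the published template.

```python
def inject_kps(problem, kps, base_accuracy, threshold=0.5):
    """Difficulty-aware prompt injection sketch: append a ## Hint block
    only when the unaided policy struggles on this problem."""
    prompt = problem
    if kps and base_accuracy < threshold:  # hypothetical difficulty criterion
        hint = "\n".join(f"- {kp}" for kp in kps)
        prompt += f"\n\n## Hint\n{hint}"
    return prompt
```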

Empirical Validation / Results

Offline KP Selection Evaluation

The table below compares selection strategies on the base Nemotron-1.5B model across eight benchmarks. CSS achieves the best trade-off between accuracy and KP compactness.

Table 1: Offline KP selection strategies on Nemotron-1.5B. Avg. #KP denotes the average number of selected knowledge points per problem. Positive deltas indicate improvements over the w/o KP baseline.

| Selection Strategy | AIME24 | AIME25 | BRUMO25 | HMMT-Feb-25 | AMC23 | CMIMC25 | MATH-500 | Olympiad-Bench | Avg. | Avg. #KP |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o KP | 58.75 | 48.44 | 61.67 | 30.10 | 90.55 | 30.08 | 92.40 | 71.70 | 60.46 | 0.00 |
| All KP | 60.90 | 49.01 | 61.11 | 32.46 | 89.67 | 32.32 | 92.22 | 70.55 | 61.03 | 5.86 |
| Max-Score | 62.63 | 49.79 | 64.27 | 34.79 | 90.94 | 32.99 | 92.52 | 73.89 | 62.73 | 2.61 |
| S-LOO | 62.71 | 49.22 | 63.88 | 33.54 | 91.71 | 33.52 | 92.90 | 73.70 | 62.65 | 1.72 |
| T-LOO | 62.11 | 49.27 | 64.20 | 33.65 | 91.25 | 33.67 | 92.40 | 73.46 | 62.50 | 1.20 |
| CBRS | 63.02 | 49.90 | 64.17 | 34.79 | 91.56 | 33.57 | 92.65 | 73.89 | 62.94 | 2.60 |
| CSS | 64.44 (+5.69) | 50.57 (+2.13) | 65.03 (+3.36) | 35.77 (+5.67) | 91.71 (+1.16) | 36.70 (+6.62) | 92.90 (+0.50) | 74.11 (+2.41) | 63.90 (+3.44) | 2.57 |

Final RL Training Results

KnowRL-Nemotron-1.5B was trained on the QuestA dataset using CSS-selected KPs.

Table 3: Evaluation results of RL training with CSS-selected KP data under different test-time prompting strategies (with and without KPs).

| Model | Hint Setting | AIME24 | AIME25 | BRUMO25 | HMMT25 | AMC23 | CMIMC25 | MATH | OlyBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| KnowRL-Nemotron-1.5B | w/o KP | 69.79 (+10.73) | 64.69 (+16.36) | 69.48 (+8.75) | 41.04 (+10.41) | 95.55 (+4.85) | 44.14 (+14.06) | 95.70 (+3.35) | 80.23 (+8.53) | 70.08 (+9.63) |
| KnowRL-Nemotron-1.5B | CBRS | 75.52 (+12.50) | 65.00 (+16.00) | 78.33 (+14.16) | 45.00 (+10.21) | 95.78 (+4.22) | 49.22 (+15.65) | 96.45 (+3.80) | 82.34 (+8.45) | 73.46 (+10.52) |
| KnowRL-Nemotron-1.5B | CSS | 74.58 (+10.52) | 65.21 (+15.11) | 78.12 (+13.09) | 48.75 (+12.98) | 95.70 (+5.23) | 52.19 (+15.49) | 96.20 (+3.30) | 82.44 (+8.35) | 74.16 (+10.52) |
| QuestA | w/o KP | 71.56 | 62.08 | 67.50 | 40.94 | 93.44 | 41.48 | 92.95 | 72.28 | 67.78 |
| JustRL | w/o KP | 69.69 | 62.92 | 66.88 | 40.63 | 96.02 | 41.72 | 94.15 | 76.59 | 68.58 |

Key Findings:

  1. State-of-the-Art Performance: KnowRL achieves the highest average scores, establishing a new SOTA at the 1.5B scale.
  2. Internalized Reasoning: The substantial improvement without KP hints at inference (70.08) shows that KnowRL improves the underlying policy itself, not just hint-conditioned behavior.
  3. Effectiveness on Hard Problems: Gains are particularly large on challenging competition-style benchmarks (e.g., +15.11 on AIME25, +15.49 on CMIMC25 with CSS).

Analysis of Training Data Improvement

Training set analysis shows KnowRL effectively overcomes reward sparsity:

  • Base Model: 41.21% of queries had zero correct answers (mean accuracy 22.40%).
  • KnowRL (w/o inference KPs): Zero-correct fraction reduced to 13.00%, all-correct bucket raised to 34.28% (mean accuracy 64.30%).
  • KnowRL (w/ inference KPs): All-correct bucket further concentrated to 51.07% (mean accuracy 77.04%).

Theoretical and Practical Implications

  • Theoretical: Introduces and formalizes the minimal-sufficiency perspective and the pruning interaction paradox for hint-based RL. Provides a principled framework (CSS) for interaction-aware subset selection.
  • Practical: Demonstrates that compact, structured guidance is more effective and efficient than longer hints or heavy abstractions. The method reduces computational overhead by minimizing hint length and avoiding reliance on teacher models during online RL.

Conclusion

KnowRL presents an effective framework for RLVR that uses minimal-sufficient knowledge guidance. By decomposing hints into KPs and selecting robust subsets via CSS, it achieves new state-of-the-art results while maintaining efficiency. The work positions structured, compact guidance as a practical scaling principle for sparse-reward RL and opens directions for extending KP curation to broader reasoning domains.