KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
Summary (Overview)
- Key Insight: Proposes a minimal-sufficiency perspective for hint-based Reinforcement Learning for Verifiable Reasoning (RLVR), identifying that performance improves sharply with a critical segment of knowledge (critical-segment effect) rather than monotonically with hint length.
- Core Method: Introduces KnowRL, a framework that decomposes hints into atomic Knowledge Points (KPs) and uses Constrained Subset Search (CSS) to select compact, interaction-aware subsets for training, addressing the pruning interaction paradox (removing one KP may help, but removing multiple together can hurt).
- Main Result: Trains KnowRL-Nemotron-1.5B, achieving a new state-of-the-art average accuracy of 74.16 (with KPs) and 70.08 (without KPs) across eight mathematical reasoning benchmarks at the 1.5B scale, significantly outperforming strong baselines.
Introduction and Theoretical Foundation
Reinforcement Learning for Verifiable Reasoning (RLVR) improves LLM reasoning by optimizing for rule-based correctness but suffers from reward sparsity on hard problems. Recent hint-based RL methods inject partial solutions or abstract templates to mitigate this. However, they treat hint design as a quantity expansion problem, leading to three key challenges:
- Guidance Redundancy: Only a small subset of information is needed to trigger successful reasoning.
- Cross-Hint Inconsistency: Longer hints can introduce branching or ambiguity.
- Guidance-Efficiency Trade-off: Abstraction-based hints often rely on costly teacher models.
The paper argues that the core challenge is selecting minimal, coherent knowledge units sufficient to overcome reward sparsity. It introduces a minimal-sufficiency perspective, empirically demonstrating the critical-segment effect where accuracy exhibits a sharp jump once key knowledge is provided.
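The critical-segment effect can be illustrated with a small sketch: sweep the number of revealed knowledge segments and locate the single largest accuracy jump. The accuracy numbers below are invented for illustration, not taken from the paper.

```python
# Locate the "critical segment": the single KP whose addition produces the
# largest jump in accuracy. Accuracy values here are made up for illustration.

def critical_segment(accuracies):
    """Given accuracy after revealing 0..n KPs, return (index of the KP whose
    addition gives the largest jump, size of that jump)."""
    jumps = [accuracies[i + 1] - accuracies[i] for i in range(len(accuracies) - 1)]
    best = max(range(len(jumps)), key=lambda i: jumps[i])
    return best + 1, jumps[best]

# Accuracy after revealing 0, 1, 2, 3, 4 KPs: nearly flat, then a sharp jump
# once the third KP is revealed -- the critical-segment pattern.
acc_by_prefix = [0.10, 0.12, 0.13, 0.55, 0.57]
idx, jump = critical_segment(acc_by_prefix)
print(idx, round(jump, 2))  # the third KP triggers the jump
```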
Methodology
KnowRL follows a workflow of constructing candidate KPs, selecting a minimal-sufficient subset, and using it for RL training.
1. KP Curation Pipeline
For each training problem, a three-stage pipeline constructs candidate KPs:
- Generating Correct Solutions: Sample from a strong model (DeepSeek-R1) until a correct solution is obtained.
- Extracting Raw Knowledge Points: Prompt the model to extract only the indispensable mathematical principles from the correct solution, yielding an initial candidate KP set for the problem.
- Leakage Verification: Automatically verify each KP to ensure it is generalizable and not instance-bound.
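The three stages above can be sketched as follows. The model-calling functions (`sample_solution`, `extract_kps`, `is_instance_bound`) are hypothetical stand-ins for prompting a strong model such as DeepSeek-R1; they are stubbed here so the control flow is runnable.

```python
# Minimal sketch of the three-stage KP curation pipeline under stated
# assumptions: the actual prompts and verification logic live in the stubbed
# callables, which are not part of the paper's released artifacts.

def curate_kps(problem, answer, sample_solution, extract_kps, is_instance_bound,
               max_tries=8):
    """Stage 1: sample until a correct solution; Stage 2: extract raw KPs;
    Stage 3: drop KPs that leak instance-specific details."""
    solution = None
    for _ in range(max_tries):
        cand = sample_solution(problem)
        if cand["answer"] == answer:       # rule-based correctness check
            solution = cand
            break
    if solution is None:
        return []                          # no correct solution found
    raw_kps = extract_kps(problem, solution)
    # Leakage verification: keep only generalizable, non-instance-bound KPs.
    return [kp for kp in raw_kps if not is_instance_bound(kp)]

# Stubbed example run: the second sampled solution is correct, and one
# extracted KP is rejected for being instance-bound.
sols = iter([{"answer": 1, "text": "wrong"}, {"answer": 2, "text": "ok"}])
kps = curate_kps(
    problem="p", answer=2,
    sample_solution=lambda p: next(sols),
    extract_kps=lambda p, s: ["sum of geometric series", "use x=3 directly"],
    is_instance_bound=lambda kp: "x=3" in kp,
)
print(kps)  # ['sum of geometric series']
```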
2. Problem-wise KP Subset Selection
The goal is to select the most beneficial KP configuration for each problem. Performance is estimated via offline accuracy under different configurations: the empty set (no KPs), the full set (all KPs), and each leave-one-out set (all KPs minus one).
Several selection strategies are explored:
- Max-Score: Selects the configuration with the highest offline accuracy among the evaluated configurations (empty, full, and each leave-one-out).
- Leave-One-Out (LOO) Strategies: A parameterized operator that decides which KPs to prune based on a tolerance for accuracy loss:
  - S-LOO (strict, zero tolerance): a KP is pruned only if its removal does not reduce accuracy.
  - T-LOO (tolerant): allows a rollback of up to one sample's worth of accuracy when pruning.
- Consensus-Based Robust Selection (CBRS): Treats each of the 8 evaluation runs independently. For each run it defines a set of near-optimal configurations (those whose accuracy is within a small tolerance of that run's best); the robust consensus is the intersection of these per-run sets, or, if the intersection is empty, the configuration receiving the most votes.
- Constrained Subset Search (CSS): Designed to address the pruning interaction paradox. Based on leave-one-out accuracies, it first identifies:
  - Non-degrading KPs, whose individual removal does not reduce accuracy; these are removed directly.
  - Near-optimal removals, whose individual removal costs at most a small accuracy tolerance; these form an ambiguous set U.
  CSS then enumerates subsets only within U (a search space of size 2^|U|, which is small in practice) and chooses the final configuration by its estimated offline accuracy.
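The selection logic can be sketched as follows, assuming offline accuracy estimates are available for the empty, full, and leave-one-out configurations. The tie-breaking rule (fewer KPs at equal accuracy) is an illustrative assumption, not necessarily the paper's exact objective.

```python
# Illustrative sketch of Constrained Subset Search (CSS). `acc` maps a
# frozenset of KPs to its estimated offline accuracy; the tolerance and
# tie-breaking rule are assumptions for this sketch.
from itertools import combinations

def css_select(kps, acc, tol=0.05):
    full = frozenset(kps)
    base = acc(full)
    # Non-degrading KPs: individual removal does not reduce accuracy -> drop.
    removable = {k for k in kps if acc(full - {k}) >= base}
    # Near-optimal removals: individual removal costs at most `tol` accuracy.
    ambiguous = {k for k in kps if base - tol <= acc(full - {k}) < base}
    core = full - removable - ambiguous  # clearly useful KPs, always kept
    # Enumerate subsets only within the ambiguous set (2^|ambiguous| options),
    # taking the most accurate configuration; ties go to fewer KPs.
    best_cfg, best_key = None, None
    for r in range(len(ambiguous) + 1):
        for dropped in combinations(sorted(ambiguous), r):
            cfg = core | (ambiguous - set(dropped))
            key = (acc(cfg), -len(cfg))
            if best_key is None or key > best_key:
                best_cfg, best_key = cfg, key
    return best_cfg

# Toy accuracy table over KPs {a, b, c}: removing `a` helps, removing `b` is
# a near-optimal (ambiguous) removal, removing `c` clearly hurts.
table = {
    frozenset("abc"): 0.60, frozenset("bc"): 0.62, frozenset("ac"): 0.58,
    frozenset("ab"): 0.40, frozenset("c"): 0.65,
}
chosen = css_select("abc", table.__getitem__, tol=0.05)
print(sorted(chosen))  # ['c']
```

Note how the example exhibits the pruning interaction paradox: removing `b` alone hurts slightly, yet removing both `a` and `b` together gives the best accuracy, which is exactly why CSS enumerates interactions inside the ambiguous set rather than pruning KPs independently.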
3. RL Training Integration
The curated KP subsets (selected via CSS) are integrated into RL training via difficulty-aware prompt injection (appended under a `## Hint` header). Training uses GRPO-style group-based optimization with entropy annealing for faster convergence.
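A minimal sketch of the prompt-injection step, assuming a simple pass-rate-based difficulty gate; the threshold, pass-rate signal, and prompt template are illustrative assumptions, with only the `## Hint` header format taken from the description above.

```python
# Sketch of difficulty-aware KP injection into the training prompt. The gate
# (inject only when the policy's offline pass rate is low) and the template
# are assumptions for illustration.

def build_prompt(question, kps, pass_rate, hard_threshold=0.25):
    """Append selected KPs under a '## Hint' header, but only for problems
    the current policy finds hard (low offline pass rate)."""
    prompt = f"Solve the following problem.\n\n{question}"
    if kps and pass_rate < hard_threshold:
        hint = "\n".join(f"- {kp}" for kp in kps)
        prompt = f"{prompt}\n\n## Hint\n{hint}"
    return prompt

# A hard problem receives the hint; an easy one does not.
hard = build_prompt("Sum the first 100 positive integers.",
                    ["pair terms symmetrically"], pass_rate=0.10)
easy = build_prompt("Compute 2+2.", ["addition facts"], pass_rate=0.90)
print("## Hint" in hard, "## Hint" in easy)  # True False
```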
Empirical Validation / Results
Offline KP Selection Evaluation
The table below compares selection strategies on the base Nemotron-1.5B model across eight benchmarks. CSS achieves the best trade-off between accuracy and KP compactness.
Table 1: Offline KP selection strategies on Nemotron-1.5B. Avg. #KP denotes the average number of selected key knowledge points per problem. Numbers in parentheses indicate improvements over w/o KP.
| Selection Strategy | AIME24 | AIME25 | BRUMO25 | HMMT-Feb-25 | AMC23 | CMIMC25 | MATH-500 | Olympiad-Bench | Avg. | Avg. #KP |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o KP | 58.75 | 48.44 | 61.67 | 30.10 | 90.55 | 30.08 | 92.40 | 71.70 | 60.46 | 0.00 |
| All KP | 60.90 | 49.01 | 61.11 | 32.46 | 89.67 | 32.32 | 92.22 | 70.55 | 61.03 | 5.86 |
| Max-Score | 62.63 | 49.79 | 64.27 | 34.79 | 90.94 | 32.99 | 92.52 | 73.89 | 62.73 | 2.61 |
| S-LOO | 62.71 | 49.22 | 63.88 | 33.54 | 91.71 | 33.52 | 92.90 | 73.70 | 62.65 | 1.72 |
| T-LOO | 62.11 | 49.27 | 64.20 | 33.65 | 91.25 | 33.67 | 92.40 | 73.46 | 62.50 | 1.20 |
| CBRS | 63.02 | 49.90 | 64.17 | 34.79 | 91.56 | 33.57 | 92.65 | 73.89 | 62.94 | 2.60 |
| CSS | 64.44 (+5.69) | 50.57 (+2.13) | 65.03 (+3.36) | 35.77 (+5.67) | 91.71 (+1.16) | 36.70 (+6.62) | 92.90 (+0.50) | 74.11 (+2.41) | 63.90 (+3.44) | 2.57 |
Final RL Training Results
KnowRL-Nemotron-1.5B was trained on the QuestA dataset using CSS-selected KPs.
Table 3: Evaluation results of RL training with CSS-selected KP data under different test-time prompting strategies (with and without KPs).
| Model | Hint Setting | AIME24 | AIME25 | BRUMO25 | HMMT25 | AMC23 | CMIMC25 | MATH | OlyBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| KnowRL-Nemotron-1.5B | w/o KP | 69.79 (+10.73) | 64.69 (+16.36) | 69.48 (+8.75) | 41.04 (+10.41) | 95.55 (+4.85) | 44.14 (+14.06) | 95.70 (+3.35) | 80.23 (+8.53) | 70.08 (+9.63) |
| | CBRS | 75.52 (+12.50) | 65.00 (+16.00) | 78.33 (+14.16) | 45.00 (+10.21) | 95.78 (+4.22) | 49.22 (+15.65) | 96.45 (+3.80) | 82.34 (+8.45) | 73.46 (+10.52) |
| | CSS | 74.58 (+10.52) | 65.21 (+15.11) | 78.12 (+13.09) | 48.75 (+12.98) | 95.70 (+5.23) | 52.19 (+15.49) | 96.20 (+3.30) | 82.44 (+8.35) | 74.16 (+10.52) |
| QuestA | w/o KP | 71.56 | 62.08 | 67.50 | 40.94 | 93.44 | 41.48 | 92.95 | 72.28 | 67.78 |
| JustRL | w/o KP | 69.69 | 62.92 | 66.88 | 40.63 | 96.02 | 41.72 | 94.15 | 76.59 | 68.58 |
Key Findings:
- State-of-the-Art Performance: KnowRL achieves the highest average scores, establishing a new SOTA at the 1.5B scale.
- Internalized Reasoning: The substantial improvement even when no KPs are provided at inference time (70.08 avg.) shows that KnowRL improves the underlying policy itself, not just hint-conditioned behavior.
- Effectiveness on Hard Problems: Gains are particularly large on challenging competition-style benchmarks (e.g., +15.11 on AIME25, +15.49 on CMIMC25 with CSS).
Analysis of Training Data Improvement
Training set analysis shows KnowRL effectively overcomes reward sparsity:
- Base Model: 41.21% of queries had zero correct answers (mean accuracy 22.40%).
- KnowRL (w/o inference KPs): Zero-correct fraction reduced to 13.00%, all-correct bucket raised to 34.28% (mean accuracy 64.30%).
- KnowRL (w/ inference KPs): All-correct bucket further concentrated to 51.07% (mean accuracy 77.04%).
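The bucket statistics above can be computed from per-problem rollout correctness as in this sketch; the rollout data below is invented for illustration.

```python
# Compute the zero-correct fraction, all-correct fraction, and mean accuracy
# from per-problem rollout results, mirroring the bucket analysis above.

def bucket_stats(rollouts):
    """rollouts: list of per-problem lists of 0/1 correctness flags."""
    n = len(rollouts)
    zero = sum(1 for r in rollouts if sum(r) == 0) / n      # reward-sparse
    full = sum(1 for r in rollouts if sum(r) == len(r)) / n # fully solved
    mean_acc = sum(sum(r) / len(r) for r in rollouts) / n
    return zero, full, mean_acc

# Four toy problems with four rollouts each: two unsolved, one fully solved,
# one half solved.
rollouts = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
zero, full, mean_acc = bucket_stats(rollouts)
print(zero, full, mean_acc)  # 0.5 0.25 0.375
```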
Theoretical and Practical Implications
- Theoretical: Introduces and formalizes the minimal-sufficiency perspective and the pruning interaction paradox for hint-based RL. Provides a principled framework (CSS) for interaction-aware subset selection.
- Practical: Demonstrates that compact, structured guidance is more effective and efficient than longer hints or heavy abstractions. The method reduces computational overhead by minimizing hint length and avoiding reliance on teacher models during online RL.
Conclusion
KnowRL presents an effective framework for RLVR that uses minimal-sufficient knowledge guidance. By decomposing hints into KPs and selecting robust subsets via CSS, it achieves new state-of-the-art results while maintaining efficiency. The work positions structured, compact guidance as a practical scaling principle for sparse-reward RL and opens directions for extending KP curation to broader reasoning domains.