Visual Summary | Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Summary (Overview)

ZPPO (Zone of Proximal Policy Optimization) introduces a novel post-training paradigm where teacher knowledge is injected only into the prompt, never into the student’s policy gradient, overcoming the brittleness of logit distillation and the zero-advantage failure mode of RL on hard questions.
For questions where the student’s mean rollout accuracy is below 50%, two prompt reformulations are constructed: BCQ (pairs one correct teacher response with one wrong student response as anonymized candidates for the student to discriminate) and NCQ (aggregates all of the student’s wrong rollouts into a single prompt to surface shared failure patterns).
A prompt replay buffer re‑exposes each hard question until it either graduates (student accuracy ≥ 50%) or is FIFO‑evicted, amplifying BCQ/NCQ inside the student’s current zone of proximal development.
On a 31‑benchmark suite (16 VLM, 10 LLM, 5 Video) with Qwen3.5 students from 0.8B to 9B and a 27B teacher, ZPPO outperforms off‑/on‑policy distillation and GRPO at every scale, with the largest gains at the smallest scale (e.g., +9.3 pp on VLM benchmarks for the 0.8B student).
Distillation degrades generalization on LLM/Video benchmarks beyond the training corpus (–2.5 pp at 0.8B), whereas ZPPO improves generalization on the same families (+6.8 pp at 0.8B).

Introduction and Theoretical Foundation

Knowledge distillation forces a small student to imitate a large teacher’s logits, but in the small‑student regime this creates a “mode‑seeking bias”: the student concentrates on the teacher’s sharpest peaks, memorizes answers, and generalizes poorly on benchmarks beyond the training corpus. Reinforcement learning (e.g., GRPO) avoids logit imitation by training on the student’s own rollouts, but it has a blind spot: when every rollout in a group is wrong, the group‑relative advantage is zero and the question contributes no gradient signal. Injecting a teacher’s correct response directly into the policy gradient breaks the on‑policy assumption and induces drift.

ZPPO is inspired by Vygotsky’s zone of proximal development – the band of tasks a learner cannot solve alone but can solve with a small amount of guidance from a more capable peer. The key insight: keep the teacher inside the prompt (as guidance) rather than inside the gradient (as a target to imitate).

Methodology

Preliminaries. For each question (x), a group of (G_S) student rollouts (\{y_\text{S}^{(g)}\}_{g=1}^{G_S}) is drawn from the policy (\pi_\theta) and scored with a binary reward (r(x, y_\text{S}^{(g)}) \in \{0,1\}). The standard group‑relative advantage (used in GRPO) is:

[ A^{(g)} = \frac{r(x, y_\text{S}^{(g)}) - \bar{r}_x}{\text{std}_x + \epsilon} \tag{1} ]

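Here (\bar{r}_x) and (\text{std}_x) are the mean and standard deviation of the rewards within the group for (x). When (\bar{r}_x = 0) (all rollouts wrong), every advantage is zero and the question yields no gradient. ZPPO targets hard questions, defined as (\bar{r}_x < 0.5), which include this all‑wrong case.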
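The zero‑advantage case can be checked numerically. The following is a minimal sketch of Equation (1), assuming a population standard deviation and (\epsilon = 10^{-6}); the source only writes (\text{std}_x) and (\epsilon), so both choices, along with the function name, are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Equation (1): (r - group mean) / (group std + eps) for one rollout group."""
    r_bar = mean(rewards)
    std = pstdev(rewards)  # assumption: population std; the source only writes std_x
    return [(r - r_bar) / (std + eps) for r in rewards]


all_wrong = group_relative_advantages([0, 0, 0, 0])  # every rollout graded wrong
mixed = group_relative_advantages([1, 0, 0, 0])      # one correct rollout out of four

print(all_wrong)  # all entries are zero, so the group contributes no gradient
print(mixed)      # positive for the correct rollout, negative for the wrong ones
```

A group with at least one correct and one wrong rollout produces non‑zero advantages, which is why the reformulations below aim to get a hard question into that regime.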

Prompt Reformulations.

BCQ (Binary Candidate‑included Question): Append one correct teacher response (from a frozen teacher (\pi_T)) and one wrong student response as shuffled, anonymized candidates inside <candidate> tags, with the instruction: “One is correct and another is wrong.” The student samples a new rollout group from this prompt; all tokens are on‑policy, so the gradient stays on the student’s own policy.
NCQ (Negative Candidate‑included Question): Aggregates all of the student’s wrong rollouts, explicitly lists the parsed wrong answers, and appends each compressed reasoning trace as a <candidate> block. The student is cued to recognize shared failure patterns.

Both BCQ and NCQ are constructed only on hard questions and with fresh teacher rollouts each time.
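The two reformulations can be sketched as simple prompt builders. This is an illustration only: the <candidate> tags and the BCQ instruction sentence come from the description above, while the function names, the layout of the prompt, and the NCQ wording that introduces the wrong answers are assumptions, not the paper's templates.

```python
import random

BCQ_INSTRUCTION = "One is correct and another is wrong."  # quoted in the text above


def build_bcq(question, teacher_correct, student_wrong, rng=random):
    """Pair one correct teacher response with one wrong student response, unlabeled."""
    candidates = [teacher_correct, student_wrong]
    rng.shuffle(candidates)  # hide which candidate came from the teacher
    blocks = "\n".join(f"<candidate>\n{c}\n</candidate>" for c in candidates)
    return f"{question}\n\n{blocks}\n\n{BCQ_INSTRUCTION}"


def build_ncq(question, wrong_rollouts):
    """wrong_rollouts: list of (parsed_answer, compressed_trace) pairs, all graded wrong."""
    answers = ", ".join(answer for answer, _ in wrong_rollouts)
    blocks = "\n".join(f"<candidate>\n{trace}\n</candidate>" for _, trace in wrong_rollouts)
    # Hypothetical wording: the source only says the wrong answers are listed explicitly.
    return f"{question}\n\nThese answers are known to be wrong: {answers}\n\n{blocks}"
```

In both cases the teacher or the failed rollouts appear only as prompt text; the tokens that receive gradient are the ones the student samples afterwards.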

Prompt Replay Buffer. The buffer stores only the question (image + text), never responses. Admission criterion: (\bar{r}_x < 0.5). Graduation: (\bar{r}_x \ge 0.5). FIFO eviction at capacity. Each training batch combines new questions with replayed ones (fraction (\rho_{\text{replay}})), and BCQ/NCQ are applied to the hardest questions first.
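A minimal sketch of such a buffer, assuming questions are keyed by a hypothetical question_id and that replayed questions are taken oldest first; the source specifies admission, graduation, and FIFO eviction but not the sampling order, so that part is a guess.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Holds hard questions only, never responses; insertion order gives FIFO eviction."""

    def __init__(self, capacity, threshold=0.5):
        self.capacity = capacity
        self.threshold = threshold
        self._questions = OrderedDict()  # question_id -> question payload

    def update(self, question_id, question, mean_accuracy):
        """Admit, keep, or graduate a question after its rollout group is scored."""
        if mean_accuracy >= self.threshold:
            self._questions.pop(question_id, None)   # graduation
            return
        if question_id not in self._questions:
            if len(self._questions) >= self.capacity:
                self._questions.popitem(last=False)  # FIFO: evict the oldest entry
            self._questions[question_id] = question

    def sample_replay(self, batch_size, replay_fraction):
        """Return up to replay_fraction * batch_size stored questions (oldest first)."""
        n = min(int(batch_size * replay_fraction), len(self._questions))
        return list(self._questions.values())[:n]

    def __len__(self):
        return len(self._questions)


buf = PromptReplayBuffer(capacity=2)
buf.update("q1", "text1", 0.0)
buf.update("q2", "text2", 0.25)
buf.update("q3", "text3", 0.125)  # buffer is full, so q1 is evicted
buf.update("q2", "text2", 0.75)   # q2 reaches the threshold and graduates
replayed = buf.sample_replay(batch_size=8, replay_fraction=0.5)
```

Because responses are never stored, every re‑exposure forces fresh rollouts, and the BCQ/NCQ prompts are rebuilt from them.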

RL Backbone. ZPPO builds on GRPO with three DAPO ingredients (clip‑higher, token‑level policy gradient loss, no KL penalty). Additionally:

Iterations per step: (I = 4) (instead of the default 16) – limits drift away from the on‑policy regime while still providing multiple updates per rollout.
Batch‑level advantage normalization with zero‑advantage groups excluded (Norm w/o Zero) – prevents all‑wrong/all‑correct groups from shrinking the batch standard deviation and inflating advantages.
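One way to read the Norm w/o Zero rule is sketched below: rewards are centered within each group, then divided by a single batch‑level standard deviation computed only over groups that are not uniformly right or wrong. This is an interpretation of the one‑line description above, not the paper's implementation; the function name, the population standard deviation, and the epsilon value are assumptions.

```python
from statistics import mean, pstdev


def batch_advantages(groups, exclude_zero_groups=True, eps=1e-6):
    """groups: list of per-question reward lists. Returns advantages with the same shape."""
    centered = [[r - mean(g) for r in g] for g in groups]
    # A group is "zero-advantage" when all its rewards are equal (all wrong or all correct).
    pool = [a for c in centered for a in c
            if not exclude_zero_groups or any(x != 0 for x in c)]
    std = pstdev(pool) if pool else 0.0
    return [[a / (std + eps) for a in c] for c in centered]


batch = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
without_zero = batch_advantages(batch, exclude_zero_groups=True)
with_zero = batch_advantages(batch, exclude_zero_groups=False)
print(without_zero[0][0], with_zero[0][0])
```

In this toy batch the correct rollout's advantage is about 1.73 with zero‑advantage groups excluded and about 3.0 with them included: the two uniform groups add only zeros to the pool, which shrinks the batch standard deviation and inflates the remaining advantages.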

Empirical Validation / Results

Main Results (Tables 1 & 2). ZPPO is evaluated on four student scales (0.8B, 2B, 4B, 9B) with a 27B teacher. Key excerpts (0.8B and 2B) from the 16 VLM benchmarks:

Method	AI2D	BabyV	CharXiv	DynaM	EmbSp	InfoVQA	MVerse	MVision	MVista	MMMU Pro	MM-Vet	OCR EN	OCR ZH	VisP	VBlind	WeMath	Avg
0.8B Base	65.6	6.7	54.3	17.8	67.9	68.6	43.5	16.4	60.7	26.8	53.2	40.0	17.0	20.5	42.8	54.4	41.0
+ GRPO †	71.2	9.8	59.9	23.6	69.4	72.4	51.1	20.9	68.3	30.5	57.5	41.3	17.5	27.8	43.6	62.5	45.4
+ ZPPO	76.5	13.9	63.9	31.1	71.5	75.3	59.3	29.2	73.2	37.6	59.9	42.5	18.7	35.0	44.7	71.7	50.3
Δ	+5.3	+4.1	+4.0	+7.5	+2.1	+2.9	+8.2	+8.3	+4.9	+7.1	+2.4	+1.2	+1.2	+7.2	+1.1	+9.2	+4.9
2B Base	81.9	11.6	71.6	41.1	78.2	81.2	69.7	38.4	78.6	46.2	69.7	44.7	24.0	38.3	55.2	77.9	56.8
+ ZPPO	85.3	18.6	73.9	52.7	79.5	84.6	76.0	50.5	80.5	53.2	77.1	48.8	26.0	42.0	60.5	82.6	62.0
Δ	+1.5	+4.2	+0.9	+6.8	+0.8	+1.5	+3.2	+7.1	+1.2	+3.6	+3.1	+2.4	+0.8	+1.7	+3.5	+1.9	+2.8

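Table 1 excerpt; the full table continues for 4B/9B in the paper. Δ is ZPPO minus GRPO † (the 0.8B rows confirm this column by column); the 2B GRPO † row is not shown in this excerpt.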

On the LLM and Video benchmarks (generalization beyond training corpus), ZPPO also gains while distillation loses (e.g., 0.8B: ZPPO +6.8 pp vs. distillation –2.5 pp on LLM+Video average).

Component Ablation (Table 3). Each component is isolated at 0.8B and 2B on VLM benchmarks:

Replay alone (GRPO †) gives marginal gains.
Reformulation alone (GRPO + Both) also moves little.
But BCQ + Buffer and NCQ + Buffer are super‑additive, and ZPPO (full combination) is strongest at every scale.

RL Recipe Choices (Figure 6). At 2B, sweeping iterations per step (I) shows (I = 4) as the sweet spot across all benchmark families. Batch‑level normalization with zero‑advantage groups excluded (Norm w/o Zero) is critical; Norm w/ Zero degrades performance.

Comparison with Hint/Prefix (Table 4). ZPPO (or even BCQ alone) outperforms Hint (directional cue without answer) and Prefix (off‑policy teacher prefix) on both VLM and generalization benchmarks.

Graduation Dynamics (Figures 4 & 5). On questions admitted at 0% rollout accuracy, ZPPO graduates 28% vs. GRPO †’s 4%. As student scale grows, BCQ’s contribution shrinks (teacher also fails) and NCQ’s grows; large students graduate most questions before eviction, while small students rely more on repeated re‑exposures.

Theoretical and Practical Implications

Teacher in prompts, not gradients: ZPPO provides a principled way to transfer teacher knowledge to small students without violating on‑policy assumptions or requiring the student to match a broad logit distribution. This preserves the student’s own exploration and generalization.
Bridging the small‑model gap: The largest gains are at the smallest scales (0.8B, 2B), which are precisely the scales needed for deployment on mobile, AR/VR, and robotics. ZPPO makes post‑training viable for compute‑constrained settings.
Limitation: teacher‑bounded zone. BCQ requires the teacher to succeed on hard questions; if both teacher and student fail, only NCQ operates, yielding limited gains. Extending the zone beyond the current teacher (e.g., via synthetic prompts or ensembles) is an open problem.
Recipe lessons: Two low‑cost choices, (I = 4) iterations per step and batch‑level normalization excluding zero‑advantage groups, materially affect small‑model dynamics and are easy to reproduce.

Conclusion

ZPPO answers how to transfer teacher knowledge to a small student without imitating logits or injecting teacher responses into the policy gradient. By placing the teacher only inside the prompt (via BCQ and NCQ) and amplifying those reformulations with a prompt replay buffer, ZPPO achieves strong gains on 31 benchmarks across three families, especially at sub‑2B scales where distillation fails. The key open challenge is extending the student’s zone beyond what the current teacher covers – a direction for future work with ensembles, curriculum‑aware sampling, or synthetic data. ZPPO is orthogonal to initial model construction and can be stacked as a later post‑training stage.