Visual Summary | Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Summary (Overview)

Z-Reward is a teacher-student framework that decouples reasoning-heavy judgment from efficient reward deployment for text-to-image generation.
The teacher (27B VLM) uses reasoning via Group-wise Direct Score Optimization (GDSO) to infer rubric-aligned score distributions, combining policy-gradient rewards with direct pointwise and pairwise supervision.
The student (9B VLM) uses Reasoning-Internalized Score Distillation (RISD) to internalize the teacher’s reasoning-conditioned score distribution into a compact, non-reasoning model for fast, differentiable scoring.
On an internally annotated test set, the 27B GDSO teacher achieves 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO; the 9B RISD student reaches 88.6%, closely matching the teacher and outperforming on-policy distillation (OPD) baselines.
Z-Reward serves as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over an SFT baseline.

Introduction and Theoretical Foundation

Reward models are essential for text-to-image post-training, but visual preference is inherently subjective — different annotators may assign different scores to the same image, especially for aesthetics, realism, and fine-grained prompt alignment. Human evaluation is therefore better viewed as a distribution of judgments rather than a deterministic scalar score.

Existing reward modeling paradigms have key limitations:

Scalar, score-token, and pairwise reward models compress preference into a single value or comparison, discarding annotator uncertainty and fine-grained differences among plausible scores.
Reasoning-based generative reward models can produce higher-quality judgments by leveraging world knowledge and explicit rationales, but are expensive at inference time and their textual outputs are less suitable for gradient-based optimization.
Explicit distribution modeling can represent uncertainty but typically requires repeated annotations per sample, which is difficult to scale.

Z-Reward resolves this tension by decoupling judgment quality from reward efficiency: the large teacher uses reasoning to produce a calibrated score distribution, while the compact student internalizes this reasoning-enhanced distribution and directly predicts scores without generating reasoning chains at inference time.

Key insight: Reward models do not need to reproduce how a teacher reasons — they need to reproduce how a reasoning teacher judges. Therefore, instead of forcing the student to imitate the sequential process of reasoning, Z-Reward allows the compact model to internalize the teacher’s reasoning-conditioned judgment directly into score distributions.

Methodology

Annotation and Datasets

The annotation document defines four critical dimensions: Text–Image Alignment, Realism, Aesthetics, and Physical Plausibility. Each dimension uses a five-level rubric, but final annotations are recorded on a nine-level half-point scale ($\hat{s} \in \{1.0, 1.5, \dots, 5.0\}$). This allows annotators to capture fine-grained quality differences. The workflow involves:

Pointwise scoring according to the rubric and document examples.
Comparison of candidates under the same prompt to shift scores by $\pm 0.5$.
Quality control review.

Annotation prompts come from three sources: internal captions, real-world user prompts, and LLM-expanded compositional concepts.

Teacher Model: Group-wise Direct Score Optimization (GDSO)

Given a prompt $p$, image $I$, and reward dimension $d \in \mathcal{D}$, the teacher generates a reasoning trace $\rho$ and predicts a distribution over score bins $s \in \mathcal{S}$:

$$q_\theta(s | p, I, d, \rho), \quad s \in \mathcal{S}$$

The expected scalar score is:

$$\mu_\theta(p, I, d, \rho) = \sum_{s \in \mathcal{S}} s \, q_\theta(s | p, I, d, \rho)$$
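The expectation above can be illustrated with a minimal Python sketch. It assumes the nine half-point bins from the annotation scale; the function name `expected_score` and the example distribution are hypothetical and not taken from the paper.

```python
# Nine half-point score bins from the annotation scale: 1.0, 1.5, ..., 5.0.
SCORE_BINS = [1.0 + 0.5 * k for k in range(9)]


def expected_score(probs, bins=SCORE_BINS):
    """Return sum_s s * q(s) for a probability vector over the score bins."""
    if len(probs) != len(bins):
        raise ValueError("probs and bins must have the same length")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1")
    return sum(s * q for s, q in zip(bins, probs))


# Hypothetical distribution: most mass on 4.0, the rest on its two neighbours.
q = [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.6, 0.2, 0.0]
mu = expected_score(q)  # symmetric around 4.0, so the expectation is 4.0
```

Because the score is a weighted sum over all bins, two distributions with the same most-likely bin can still yield different scalar scores.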

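GDSO augments GRPO-style policy-gradient optimization with direct supervised gradients on score distributions and score gaps. Each training instance consists of a winning sample $x_w = (p, I_w, d)$ and a losing sample $x_l = (p, I_l, d)$ with ground-truth scores $\hat{s}_w$ and $\hat{s}_l$. In the formulas below, $G$ is the number of outputs in the group for each sample, $q_{j,i}$ and $\mu_{j,i}$ are the score distribution and expected score of the $i$-th output for sample $j \in \{w, l\}$, $\bar{j}$ is the other sample in the pair, $\Delta\hat{s}_{j,\bar{j}}$ is the annotated score gap between the two, and $S_{\max}$ and $S_{\min}$ are the largest and smallest score bins. GDSO defines four terms: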

Pointwise reward (Eq. 7): measures closeness of predicted expected score to annotated score:
$r^\text{pt}_{j,i} = 1 - \frac{|\mu_{j,i} - \hat{s}_j|}{S_{\max} - S_{\min}}$
Pointwise cross-entropy loss (Eq. 8): directly supervises score-bin probabilities:
$\mathcal{L}^\text{pt}_\text{CE} = -\frac{1}{2G} \sum_{j \in \{w,l\}} \sum_{i=1}^G \log q_{j,i}(\hat{s}_j)$
Pairwise reward (Eq. 10): matches score gap between same-prompt candidates:
$r^\text{pw}_{j,i} = 1 - \frac{1}{G(S_{\max} - S_{\min})} \sum_{k=1}^G \left| (\mu_{j,i} - \mu_{\bar{j},k}) - \Delta\hat{s}_{j,\bar{j}} \right|$
Pairwise loss (Eq. 11): direct supervised gap loss:
$\mathcal{L}^\text{pw} = \frac{1}{2G^2(S_{\max} - S_{\min})} \sum_{j \in \{w,l\}} \sum_{i=1}^G \sum_{k=1}^G \left| (\mu_{j,i} - \mu_{\bar{j},k}) - \Delta\hat{s}_{j,\bar{j}} \right|$
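The four terms above can be sketched in plain Python as follows. This is an illustrative reading of Eqs. 7, 8, 10, and 11, not the authors' implementation. It assumes $S_{\min} = 1.0$, $S_{\max} = 5.0$, and $\Delta\hat{s}_{j,\bar{j}} = \hat{s}_j - \hat{s}_{\bar{j}}$; the dictionary layout keyed by `"w"` and `"l"` and all function names are hypothetical.

```python
import math

S_MIN, S_MAX = 1.0, 5.0          # assumed ends of the half-point scale
OTHER = {"w": "l", "l": "w"}     # maps sample j to its counterpart j-bar


def pointwise_reward(mu_ji, s_hat_j):
    """Eq. 7: closeness of one expected score to the annotated score."""
    return 1.0 - abs(mu_ji - s_hat_j) / (S_MAX - S_MIN)


def pointwise_ce(q_true):
    """Eq. 8: q_true[j][i] is the probability q_{j,i} puts on the annotated bin."""
    G = len(q_true["w"])
    total = sum(math.log(p) for j in ("w", "l") for p in q_true[j])
    return -total / (2 * G)


def gap_error(mu, s_hat, j, i):
    """Mean over k of |(mu_{j,i} - mu_{jbar,k}) - (s_hat_j - s_hat_jbar)|."""
    jb = OTHER[j]
    target_gap = s_hat[j] - s_hat[jb]  # assumed definition of the score gap
    return sum(abs((mu[j][i] - m) - target_gap) for m in mu[jb]) / len(mu[jb])


def pairwise_reward(mu, s_hat, j, i):
    """Eq. 10: reward for matching the annotated gap to the other sample."""
    return 1.0 - gap_error(mu, s_hat, j, i) / (S_MAX - S_MIN)


def pairwise_loss(mu, s_hat):
    """Eq. 11: gap error averaged over both samples and all output pairs."""
    G = len(mu["w"])
    total = sum(gap_error(mu, s_hat, j, i) for j in ("w", "l") for i in range(G))
    return total / (2 * G * (S_MAX - S_MIN))


# Hypothetical group of G = 2 outputs per sample.
mu = {"w": [4.5, 4.0], "l": [3.0, 3.5]}
s_hat = {"w": 4.5, "l": 3.0}
q_true = {"w": [0.5, 0.25], "l": [0.5, 0.25]}

r_pt = pointwise_reward(mu["w"][1], s_hat["w"])   # 1 - 0.5 / 4 = 0.875
r_pw = pairwise_reward(mu, s_hat, "w", 0)         # 1 - 0.25 / 4 = 0.9375
loss_ce = pointwise_ce(q_true)
loss_pw = pairwise_loss(mu, s_hat)                # 2.0 / 16 = 0.125
```

The rewards are computed per output and feed the policy-gradient term, while the two losses are averaged over the whole group and supervise the score distribution directly.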

The combined reward is $r_{j,i} = \lambda_\text{pt} r^\text{pt}_{j,i} + \lambda_\text{pw} r^\text{pw}_{j,i}$, and the overall objective (Eq. 14) is:

$$\mathcal{L}_\text{GDSO} = \mathcal{L}_\text{GRPO}(\{r_{j,i}\}) + \alpha_\text{pt} \mathcal{L}^\text{pt}_\text{CE} + \alpha_\text{pw} \mathcal{L}^\text{pw}$$
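How the pieces of the objective fit together can be sketched as below. The weight values are placeholders, since the summary does not state $\lambda$ or $\alpha$. The group-normalized advantage is the usual GRPO normalization and is assumed here rather than confirmed by the source; the GRPO loss itself is treated as an already-computed number.

```python
import math


def combined_reward(r_pt, r_pw, lam_pt=0.5, lam_pw=0.5):
    """r = lam_pt * r_pt + lam_pw * r_pw (weights are placeholder values)."""
    return lam_pt * r_pt + lam_pw * r_pw


def group_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style normalization: (r - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def gdso_objective(grpo_loss, ce_loss, pw_loss, alpha_pt=1.0, alpha_pw=1.0):
    """Eq. 14: policy-gradient term plus the two direct supervised terms."""
    return grpo_loss + alpha_pt * ce_loss + alpha_pw * pw_loss


# Hypothetical (pointwise, pairwise) rewards for a group of three outputs.
reward_pairs = [(0.875, 0.9375), (1.0, 0.8125), (0.75, 0.75)]
rewards = [combined_reward(pt, pw) for pt, pw in reward_pairs]
advantages = group_advantages(rewards)
total = gdso_objective(grpo_loss=0.2, ce_loss=1.0, pw_loss=0.125,
                       alpha_pt=0.5, alpha_pw=2.0)
```

In this reading, the rewards only rank outputs within a group, while the supervised terms give a gradient even when every output in the group receives a similar reward.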

Student Model: Reasoning-Internalized Score Distillation (RISD)

The student $q_\phi(s | p, I, d)$ is trained to predict the teacher’s reasoning-conditioned distribution directly, without generating reasoning tokens. RISD uses a KL divergence loss (Eq. 15):

$$\mathcal{L}_\text{RISD} = \mathbb{E}_{(p, I, d)} \left[ D_\text{KL}\big( q_T(s | p, I, d, \rho_T) \,\|\, q_\phi(s | p, I, d) \big) \right]$$

The student’s deployable score (Eq. 16) is:

$$\mu_\phi(p, I, d) = \sum_{s \in \mathcal{S}} s \, q_\phi(s | p, I, d)$$
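A minimal sketch of the distillation loss and the deployable score follows. It assumes the student exposes one logit per score bin and that the teacher distribution has already been computed from a reasoning trace; the function names and the softmax parameterization are illustrative assumptions, not details given in the source.

```python
import math

SCORE_BINS = [1.0 + 0.5 * k for k in range(9)]


def softmax(logits):
    """Numerically stable softmax over the score-bin logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(q_teacher, q_student, eps=1e-12):
    """D_KL(q_T || q_phi); bins where the teacher puts zero mass contribute 0."""
    return sum(
        t * (math.log(t) - math.log(max(s, eps)))
        for t, s in zip(q_teacher, q_student)
        if t > 0.0
    )


def risd_loss(teacher_dists, student_logits):
    """Eq. 15 as a batch average of the per-example KL divergence."""
    losses = [kl_divergence(t, softmax(z))
              for t, z in zip(teacher_dists, student_logits)]
    return sum(losses) / len(losses)


def student_score(logits):
    """Eq. 16: expected score under the student's distribution."""
    return sum(s * p for s, p in zip(SCORE_BINS, softmax(logits)))


# Hypothetical teacher distribution and an untrained (uniform) student.
teacher = [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.6, 0.2, 0.0]
uniform_logits = [0.0] * 9
loss = risd_loss([teacher], [uniform_logits])
score = student_score(uniform_logits)  # uniform over the bins gives 3.0
```

The student never sees the reasoning trace $\rho_T$; only the resulting distribution enters the loss, which is what lets inference use a single forward pass.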

Empirical Validation / Results

Main Results

Table 2 presents key results on the internally annotated test set:

Method	PLCC	SRCC	HPA	Margin HPA
27B
Zero-shot	0.6301	0.5816	74.38%	95.38%
SFT	0.6458	0.5914	81.35%	96.44%
RewardDance	0.6667	0.6207	84.25%	97.06%
GRPO	0.7200	0.6832	86.04%	98.27%
GDSO	0.7620	0.7132	89.56%	98.85%
9B
SFT	0.5296	0.4942	74.59%	84.01%
RewardDance	0.5182	0.4338	78.17%	89.72%
GRPO	0.5340	0.5072	77.03%	90.76%
GDSO	0.6341	0.5665	83.95%	95.99%
RISD	0.7391	0.6882	88.64%	98.01%

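The 27B GDSO teacher outperforms every 27B baseline on all four metrics (89.56% HPA versus 86.04% for GRPO, the strongest baseline). The 9B RISD student reaches 88.64% HPA, within one percentage point of the teacher and well above the 9B model trained directly with GDSO (83.95%).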
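For readers unfamiliar with the column headers, the sketch below shows how such metrics are commonly computed. PLCC and SRCC are taken here to be the Pearson linear and Spearman rank correlation coefficients between predicted and annotated scores. HPA is assumed to be the fraction of annotated pairs in which the model scores the human-preferred image higher; the summary does not give its exact definition, nor that of Margin HPA, so the latter is omitted. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def preference_accuracy(pairs):
    """Assumed HPA: share of (preferred, rejected) score pairs ranked correctly."""
    return sum(1 for w, l in pairs if w > l) / len(pairs)


# Hypothetical predicted scores for (human-preferred, rejected) images.
pairs = [(4.2, 3.1), (3.0, 3.5), (4.0, 2.0), (2.5, 2.0)]
hpa = preference_accuracy(pairs)  # 3 of 4 pairs ordered correctly
```

The correlation metrics evaluate pointwise calibration against annotated scores, whereas the pairwise accuracy only checks ordering, which is why a model can improve on one without improving on the other.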

Ablation Studies

Distribution-based reward extraction vs. text parsing: Using the expectation of the decoded score distribution consistently improves HPA and margin HPA for both GRPO and GDSO, because it preserves fine-grained scoring signals that text parsing quantizes (e.g., scores 3.8 and 4.2 both emitted as token "4").
Distillation method comparison (Table 3): RISD (88.64% HPA, single output token) outperforms OPD (83.11% HPA, ~750 output tokens) and achieves superior efficiency. OPD is bounded by the student’s own weak exploration, while RISD directly matches the teacher’s reasoning-conditioned distribution.
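The quantization effect described in the first ablation can be reproduced with a small sketch. It assumes the nine half-point bins and models text parsing as taking the single most likely score bin, which is a simplification of decoding a score token; the two distributions are hypothetical.

```python
SCORE_BINS = [1.0 + 0.5 * k for k in range(9)]


def expected_score(probs):
    """Distribution-based extraction: the mean of the score distribution."""
    return sum(s * q for s, q in zip(SCORE_BINS, probs))


def parsed_score(probs):
    """Stand-in for text parsing: keep only the most likely score bin."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return SCORE_BINS[best]


# Two hypothetical outputs that both peak at 4.0 but lean in opposite directions.
leans_low = [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.0, 0.0]   # mass on 3.5 and 4.0
leans_high = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.4, 0.0]  # mass on 4.0 and 4.5

parsed = (parsed_score(leans_low), parsed_score(leans_high))
expected = (expected_score(leans_low), expected_score(leans_high))
```

Here both parsed scores equal 4.0, while the expectations are roughly 3.8 and 4.2, so only the distribution-based reward can tell the two outputs apart.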

Validating as Optimizable Reward Signal

Using the student reward model in a ReFL-style differentiable reward backpropagation scheme across multiple dimensions, validation reward scores steadily improve over 10k RL iterations. Blind human evaluation using the GSB metric shows a 41.3% net human-preference improvement over the SFT baseline, confirming that the reward improvements translate to human-perceived quality gains.
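One reason an expected-score reward suits backpropagation is that it is a smooth function of the score-bin logits. The sketch below covers only this last link of the chain, assuming the student's distribution is a softmax over nine logits; it does not model the generator or the vision-language backbone. Under that assumption the gradient is $\partial \mu / \partial z_k = q_k (s_k - \mu)$, which the code checks against finite differences. All names are hypothetical.

```python
import math

SCORE_BINS = [1.0 + 0.5 * k for k in range(9)]


def softmax(logits):
    """Numerically stable softmax over the score-bin logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_from_logits(logits):
    """Expected score mu = sum_k s_k * softmax(z)_k."""
    return sum(s * p for s, p in zip(SCORE_BINS, softmax(logits)))


def score_grad(logits):
    """Analytic gradient d mu / d z_k = q_k * (s_k - mu)."""
    q = softmax(logits)
    mu = sum(s * p for s, p in zip(SCORE_BINS, q))
    return [p * (s - mu) for s, p in zip(SCORE_BINS, q)]


def finite_difference(logits, k, h=1e-6):
    """Central-difference estimate of d mu / d z_k, used as a check."""
    up, down = list(logits), list(logits)
    up[k] += h
    down[k] -= h
    return (score_from_logits(up) - score_from_logits(down)) / (2 * h)


# Arbitrary example logits, one per score bin.
logits = [0.3 * k - 1.0 for k in range(9)]
analytic = score_grad(logits)
numeric = [finite_difference(logits, k) for k in range(9)]
```

Every bin contributes a nonzero gradient in proportion to its probability and its distance from the current mean, in contrast to a parsed score token, which gives no gradient at all.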

Theoretical and Practical Implications

Uncertainty-aware reward modeling: Representing human preference as a distribution over rubric scores, rather than a scalar, better captures the inherent subjectivity of visual evaluation and preserves fine-grained differences between neighboring score bins.
Reasoning-internalized efficiency: The teacher-student decoupling enables high-quality judgment from a large reasoning model while deploying a compact, fast, differentiable student — solving the central tension between judgment quality and deployment efficiency.
Practical impact for text-to-image optimization: The differentiable nature of Z-Reward allows direct gradient backpropagation through the generator, providing dense and stable optimization signals that yield significant human-perceived improvements.
General framework: The formulation extends beyond image generation to any sequence-to-score task, including video quality assessment, caption evaluation, and unified reward modeling across modalities with pointwise distributions, pairwise comparisons, and calibrated score gaps.

Conclusion

Z-Reward is a teacher-student reward modeling framework that represents human preference as a reasoning-conditioned score distribution. The teacher is trained with Group-wise Direct Score Optimization (GDSO), combining policy-gradient learning with direct supervision on score distributions and score gaps. The student uses Reasoning-Internalized Score Distillation (RISD) to internalize the teacher’s reasoning-based distribution and provide efficient, direct, differentiable scoring without explicit reasoning chains.

Experiments show that the 27B GDSO teacher outperforms SFT, RewardDance, and GRPO, while the 9B RISD student closely matches the larger teacher and serves as an effective reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.

Future directions include:

Strengthening reasoning-score coupling to ensure scores are tightly grounded in rationales.
Extending the decoupled teacher-student design to unified reward modeling across image, video, text, and multimodal generation tasks.
Applying the framework to general sequence-to-score evaluator roles for quality assessment and caption evaluation.