Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Summary (Overview)
- Primary Contribution: This paper is the first to successfully extend the MeanFlow framework for one-step image generation from class-conditional to text-conditional tasks, enabling efficient and high-quality text-to-image (T2I) synthesis.
- Key Insight: The authors identify that effective one-step generation requires text representations with high discriminability (to capture subtle semantic differences) and disentanglement (to preserve linguistic structure), properties that are crucial when refinement steps are extremely limited.
- Method: They propose EMF (Extending MeanFlow to T2I), which adapts the MeanFlow training objective to a T2I model (BLIP3o-NEXT) by modifying its temporal conditioning mechanism to handle interval-based flow maps.
- Empirical Result: The proposed EMF model achieves competitive performance with very few sampling steps (e.g., GenEval score of 0.90 at 4 steps), outperforming other distilled models and approaching the performance of the full 30-step baseline.
- General Finding: The success of MeanFlow adaptation is highly dependent on the quality of the underlying text encoder; the BLIP3o-NEXT encoder was validated to possess the necessary properties, while other encoders (e.g., SANA-1.5) failed even after domain matching.
Introduction and Theoretical Foundation
Recent advancements in generative models, particularly diffusion and flow matching models, have enabled high-quality image creation. However, generating high-quality images typically requires many iterative denoising steps, making efficiency a key challenge. Few-step generation aims to reduce this computational cost. Among these methods, MeanFlow has emerged as a principled framework for one-step generation by learning a flow map that predicts the average velocity between two time steps, achieving performance comparable to standard multi-step models.
Existing research on MeanFlow has primarily focused on class-conditional image generation (e.g., on ImageNet). A natural and impactful extension is to move from fixed class labels to flexible text inputs, which would enable richer and more diverse content creation. However, text conditions pose a greater challenge to a model's semantic understanding capabilities.
The authors attempted to integrate powerful LLM-based text encoders (common in modern T2I models like SANA-1.5) into the MeanFlow framework. Surprisingly, conventional training strategies failed to yield satisfactory performance. Analysis revealed that the extremely limited number of refinement steps in MeanFlow (often just one) demands text feature representations with very high discriminability. This explains why discrete, easily distinguishable class features work well in the original MeanFlow framework. The paper thus investigates the properties required of text encoders for successful adaptation and proposes a method to achieve efficient text-conditioned synthesis.
Methodology
3.1. Preliminary: MeanFlow
MeanFlow learns a flow map $u_\theta(z_t, r, t)$ that directly predicts the transition from the state $z_t$ at time $t$ to the state $z_r$ at time $r$:

$$z_r = z_t - (t - r)\, u_\theta(z_t, r, t).$$

For the true ODE trajectory, the ideal flow map is the average velocity over $[r, t]$, $u(z_t, r, t) = \frac{1}{t-r} \int_r^t v(z_\tau, \tau)\, d\tau$. MeanFlow derives a self-consistent target by differentiating the transition equation along the trajectory:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t).$$

Here, the total derivative is $\frac{d}{dt} u = v(z_t, t)\, \partial_z u + \partial_t u$, implemented via a Jacobian-vector product (JVP). The training objective is:

$$\mathcal{L} = \mathbb{E}\, \big\| u_\theta(z_t, r, t) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|^2,$$

where $\mathrm{sg}(\cdot)$ is the stop-gradient operator and $u_{\mathrm{tgt}} = v(z_t, t) - (t - r) \big( v(z_t, t)\, \partial_z u_\theta + \partial_t u_\theta \big)$.
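The objective can be sketched numerically. The snippet below is a minimal toy, assuming a hypothetical linear flow map so that the JVP term can be written analytically instead of with an autodiff framework; `w` and `b` are stand-ins for network parameters, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar flow map u_theta(z, r, t) = w*z + b*(t - r); w and b are
# hypothetical stand-ins for network parameters.
w, b = 0.9, 0.1

def u_theta(z, r, t):
    return w * z + b * (t - r)

def meanflow_target(z_t, v_t, r, t):
    # u_tgt = v - (t - r) * (v * du/dz + du/dt); for this linear toy model
    # the JVP is analytic: du/dz = w and du/dt = b (holding r fixed).
    total_dudt = v_t * w + b
    return v_t - (t - r) * total_dudt  # stop-gradient would be applied here

# One training pair on a straight-line path z_t = (1 - t)*x0 + t*x1, v_t = x1 - x0
x0, x1 = rng.normal(), rng.normal()
r, t = 0.2, 0.8
z_t = (1 - t) * x0 + t * x1
v_t = x1 - x0

loss = (u_theta(z_t, r, t) - meanflow_target(z_t, v_t, r, t)) ** 2
```

Note that when $r = t$ the target collapses to the instantaneous velocity $v(z_t, t)$, recovering ordinary flow matching as a special case.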
3.2. & 3.3. Analysis of Text Representations
The authors evaluate two mainstream T2I models (SANA-1.5 and BLIP3o-NEXT) under constrained-iteration settings. They find that BLIP3o-NEXT maintains basic semantic integrity even at 1 step, while SANA-1.5 suffers substantial semantic loss. This indicates that the text representation from BLIP3o-NEXT yields a higher-quality velocity field, better aligned with target semantics, and is more suitable for MeanFlow.
To understand this difference, they analyze two key properties of text encoders:
- Discriminability: The ability to generate representations well-aligned with corresponding image embeddings. They conduct an image-text retrieval experiment on COCO 2017. Token-level text embeddings $\{e_i\}_{i=1}^{N}$ are mean-pooled into a single vector, $\bar{e} = \frac{1}{N}\sum_{i=1}^{N} e_i$, and cosine similarity to each candidate image embedding $e_{\mathrm{img}}$ is computed: $s = \frac{\bar{e} \cdot e_{\mathrm{img}}}{\|\bar{e}\|\, \|e_{\mathrm{img}}\|}$.
The retrieved images are re-encoded with DINOv3, and their cosine similarity to the caption's ground-truth image is aggregated into the score.
- Disentanglement: The ability of the text embedding to retain the linguistic structure of the original prompt. They evaluate on DPG-Bench by creating ablated (shortened) prompts and computing the cosine distance between the embeddings of the original and ablated versions.
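Both metrics boil down to cosine comparisons over pooled embeddings. The sketch below illustrates the mechanics with synthetic vectors standing in for real encoder outputs; the gallery, caption tokens, and prompt embeddings are all hypothetical.

```python
import numpy as np

def mean_pool(token_embs):
    """Collapse (num_tokens, dim) token embeddings into one (dim,) vector."""
    return token_embs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
dim = 8

# Discriminability: retrieve the gallery image closest to a caption embedding.
gallery = rng.normal(size=(5, dim))                             # 5 image embeddings
caption_tokens = gallery[2] + 0.05 * rng.normal(size=(3, dim))  # caption "near" image 2
query = mean_pool(caption_tokens)
retrieved = int(np.argmax([cosine(query, img) for img in gallery]))

# Disentanglement: cosine similarity between a full prompt's embedding and an
# ablated (shortened) version; a structure-preserving encoder scores high.
full_prompt = mean_pool(rng.normal(size=(10, dim)))
ablated = mean_pool(rng.normal(size=(6, dim)))
disentanglement_score = cosine(full_prompt, ablated)
```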
Key Tables from Analysis:
Table 1: DINO evaluation of image-feature similarity for text-retrieved images (quantifying Discriminability).
| Model | BLIP3o-NEXT | CLIP | Gemma | T5 |
|---|---|---|---|---|
| Score | 0.734 | 0.730 | 0.713 | 0.634 |
Table 2: Evaluation of Text Encoder Disentanglement via Subsequence Similarity.
| Model | BLIP3o-NEXT | CLIP | Gemma | T5 |
|---|---|---|---|---|
| Score | 0.999 | 0.967 | 0.987 | 0.893 |
BLIP3o-NEXT excels in both properties.
3.4. Extending MeanFlow to T2I Generation (EMF)
Given a pre-trained flow matching backbone conditioned on textual embeddings, the architecture is modified to support MeanFlow's bidirectional time conditioning.
- Temporal Conditioning Adaptation: Standard flow matching uses a single temporal embedding layer. EMF duplicates its parameters into two separate layers: one encodes the interval length $t - r$, and the other encodes the segment end time $t$; their outputs are combined into a single conditional temporal embedding.
- Conditioning: The temporal embedding and the text features $c_{\mathrm{text}}$ (from the BLIP3o-NEXT encoder) jointly condition the velocity network $u_\theta(z_t, r, t, c_{\mathrm{text}})$.
- Training: Timestep pairs $(r, t)$ are adaptively sampled from uniform and logit-normal distributions whose parameters are interpolated based on training progress $p$. This ensures exposure to both short- and long-range segments.

The model is trained with the MeanFlow objective adapted for text conditioning, $\mathcal{L} = \mathbb{E}\, \big\| u_\theta(z_t, r, t, c_{\mathrm{text}}) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|^2$, with the target defined analogously to the class-conditional case: $u_{\mathrm{tgt}} = v(z_t, t) - (t - r)\, \frac{d}{dt}\, u_\theta(z_t, r, t, c_{\mathrm{text}})$.
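The two architectural ingredients can be sketched as follows. Everything here is an assumption-laden toy: the sinusoidal embedding stands in for the model's actual time-embedding layers, combining the two branches by addition is a guess (the paper only specifies two separate layers), and the interpolation schedule in `sample_times` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16

# Hypothetical sinusoidal embedding, standing in for the time-embedding MLP.
def sinusoidal(x, dim):
    freqs = np.exp(np.linspace(0.0, 4.0, dim // 2))
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs)])

# EMF-style conditioning: one branch for the interval (t - r), one for the
# end time t. Summation is an assumed way of combining them.
def temporal_embedding(r, t):
    return sinusoidal(t - r, dim) + sinusoidal(t, dim)

# Timestep sampling: logit-normal over t, with the location parameter
# interpolated by training progress p (hypothetical schedule).
def sample_times(p, mu0=-0.4, mu1=0.0, sigma=1.0):
    mu = (1 - p) * mu0 + p * mu1
    t = 1.0 / (1.0 + np.exp(-(mu + sigma * rng.normal())))
    r = t * rng.uniform()  # r < t, so the interval length t - r is positive
    return r, t
```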
Empirical Validation / Results
4.2. Comparison with the State of the Art
The model is evaluated on GenEval, DPG-Bench, and HPS-v2.
Table 3: GenEval results for pretrained, unified, distilled models, and few-step comparisons.
| Model | #Params | Steps | Single Object | Two Objects | Counting | Colors | Position | Color Attribution | Overall |
|---|---|---|---|---|---|---|---|---|---|
| BLIP3o-NEXT [51] | 3B | 30 | 0.99 | 0.95 | 0.88 | 0.90 | 0.92 | 0.79 | 0.91 |
| EMF (Ours) | 3B | 1 | 0.98 | 0.86 | 0.66 | 0.69 | 0.80 | 0.47 | 0.74 |
| EMF (Ours) | 3B | 2 | 0.99 | 0.91 | 0.81 | 0.86 | 0.86 | 0.66 | 0.85 |
| EMF (Ours) | 3B | 4 | 1.00 | 0.94 | 0.88 | 0.92 | 0.91 | 0.76 | 0.90 |
- EMF achieves a GenEval score of 0.90 with just 4 steps, nearly matching BLIP3o-NEXT's 30-step score of 0.91, and outperforming other distilled models.
- On the more challenging DPG-Bench and HPS-v2, EMF with 4 steps closely matches the 30-step BLIP3o-NEXT baseline (see Table 4).
Table 4: DPG-Bench and HPS-v2.1 results (partial). Parenthesized deltas are improvements over the BLIP3o-NEXT baseline at the same number of sampling steps.
| Model | Steps | DPG-Bench Overall | HPS-v2.1 Average |
|---|---|---|---|
| BLIP3o-NEXT | 4 | 78.15 | 26.96 |
| BLIP3o-NEXT | 30 | 82.05 | 29.42 |
| EMF | 1 | 77.36 (+20.31) | 25.77 (+7.23) |
| EMF | 2 | 79.44 (+12.06) | 27.21 (+4.76) |
| EMF | 4 | 81.20 (+3.05) | 29.25 (+2.29) |
- EMF shows dramatic improvements over the same-step baseline at very few steps (1 and 2) and nearly matches the 30-step baseline at 4 steps.
4.3. Ablation of Sampling Steps
Performance under different sampling steps is monitored during training (Figure 4).
- 4-step sampling achieves high quality (~0.90 GenEval) within ~10k training steps.
- 2-step sampling reaches 0.85 at 70k steps.
- 1-step sampling reaches 0.74 at 90k steps. This demonstrates stable convergence and the benefit of additional steps in the MeanFlow framework.
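The step-count ablation exercises the same sampler at different budgets. Few-step sampling with a flow map iterates the transition $z_r = z_t - (t - r)\, u(z_t, r, t)$ over a shrinking time grid; the sketch below uses a hypothetical stand-in for the trained text-conditioned network.

```python
import numpy as np

# Hypothetical stand-in velocity field for the trained network.
def u_toy(z, r, t):
    return -0.5 * z

def sample(z_init, num_steps):
    # Each step jumps from time t to r via z_r = z_t - (t - r) * u(z_t, r, t).
    times = np.linspace(1.0, 0.0, num_steps + 1)  # t = 1 (noise) -> t = 0 (data)
    z = np.asarray(z_init, dtype=float)
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u_toy(z, r, t)
    return z
```

With `num_steps = 1` the entire trajectory is covered by a single flow-map evaluation, which is the one-step regime the paper targets.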
5. Discussion & Additional Experiments
- Scalability Beyond 2 Steps: Unlike consistency-distilled models which often saturate, EMF performance continues to improve with more steps (e.g., DPG-Bench score increases from 81.20 at 4-step to 81.94 at 8-step).
- Dependency on Text Encoder: Attempts to apply MeanFlow to SANA-1.5 failed, even after fine-tuning its text encoder on the same SFT data used for BLIP3o-NEXT (Table 5). This confirms that the success of MeanFlow adaptation is critically dependent on the inherent properties of the text encoder.
Table 5: GenEval scores of SANA-1.5 experiments.
| Sample Method | Encoder-SFT | MeanFlow Train | Sampling Steps | GenEval |
|---|---|---|---|---|
| Flow Matching | | | 20 | 0.81 |
| Flow Matching | ✓ | | 20 | 0.85 |
| MeanFlow | | ✓ | 4 | 0.50 |
| MeanFlow | | ✓ | 20 | 0.83 |
| MeanFlow | ✓ | ✓ | 4 | 0.47 |
| MeanFlow | ✓ | ✓ | 20 | 0.82 |
- Training Stability: Figure 6 shows that BLIP3o-NEXT (both RL and SFT variants) converges stably under MeanFlow training, while SANA-1.5 exhibits instability regardless of encoder fine-tuning.
Theoretical and Practical Implications
- Theoretical Insight: The paper provides a principled explanation for why MeanFlow succeeds in class-conditional settings but fails with poor text encoders: few-step generation requires highly discriminative and disentangled text representations to compensate for the limited refinement capacity.
- Practical Guidance: It offers a concrete recipe for adapting MeanFlow to T2I generation: 1) Select a text encoder validated to possess high discriminability and disentanglement (BLIP3o-NEXT is identified as such), and 2) Adapt the temporal conditioning mechanism to handle interval-based flow maps.
- Impact on Efficiency: The proposed EMF model enables high-quality T2I generation with very few steps (1-4), significantly reducing inference time while maintaining competitive performance with models requiring 20-30 steps.
- Broader Applicability: The findings and methodology could serve as a reference for integrating efficient one-step generation frameworks into other conditional generation tasks beyond T2I.
Conclusion
This work successfully extends MeanFlow-based one-step generation from class labels to flexible text inputs, enabling efficient and high-quality T2I synthesis. The key discovery is that successful adaptation hinges on text representations with strong semantic discriminability and disentanglement. Guided by this insight, the authors leverage the BLIP3o-NEXT text encoder and adapt the MeanFlow framework, resulting in the EMF model. Empirical results validate the approach, showing competitive performance with very few sampling steps. This work provides practical guidance and a strong reference for future research on text-conditioned MeanFlow generation.