# Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

> The paper successfully extends the MeanFlow framework to text-to-image generation by showing that one-step synthesis requires text encoders with high discriminability and disentanglement.

- **Source:** [arXiv](https://arxiv.org/abs/2604.18168)
- **Published:** 2026-04-22
- **Permalink:** https://picx.dev/p/iKbI5Y
- **Whiteboard:** https://picx.dev/p/iKbI5Y/image

## Summary (Overview)
*   **Primary Contribution:** This paper is the first to successfully extend the MeanFlow framework for one-step image generation from class-conditional to text-conditional tasks, enabling efficient and high-quality text-to-image (T2I) synthesis.
*   **Key Insight:** The authors identify that effective one-step generation requires text representations with high **discriminability** (to capture subtle semantic differences) and **disentanglement** (to preserve linguistic structure), properties that are crucial when refinement steps are extremely limited.
*   **Method:** They propose EMF (Extending MeanFlow to T2I), which adapts the MeanFlow training objective to a T2I model (BLIP3o-NEXT) by modifying its temporal conditioning mechanism to handle interval-based flow maps.
*   **Empirical Result:** The proposed EMF model achieves competitive performance with very few sampling steps (e.g., GenEval score of 0.90 at 4 steps), outperforming other distilled models and approaching the performance of the full 30-step baseline.
*   **General Finding:** The success of MeanFlow adaptation depends heavily on the quality of the underlying text encoder; the BLIP3o-NEXT encoder was validated to possess the necessary properties, while MeanFlow training on SANA-1.5 failed even after its text encoder was fine-tuned on the same SFT data.

## Introduction and Theoretical Foundation
Recent advancements in generative models, particularly diffusion and flow matching models, have enabled high-quality image creation. However, generating high-quality images typically requires many iterative denoising steps, making efficiency a key challenge. **Few-step generation** aims to reduce this computational cost. Among these methods, **MeanFlow** has emerged as a principled framework for **one-step generation** by learning a flow map that predicts the average velocity between two time steps, achieving performance comparable to standard multi-step models.

Existing research on MeanFlow has primarily focused on **class-conditional image generation** (e.g., on ImageNet). A natural and impactful extension is to move from fixed class labels to **flexible text inputs**, which would enable richer and more diverse content creation. However, text conditions pose a greater challenge to a model's semantic understanding capabilities.

The authors attempted to integrate powerful LLM-based text encoders (common in modern T2I models like SANA-1.5) into the MeanFlow framework. Surprisingly, conventional training strategies failed to yield satisfactory performance. Analysis revealed that the **extremely limited number of refinement steps** in MeanFlow (often just one) demands text feature representations with very high **discriminability**. This explains why discrete, easily distinguishable class features work well in the original MeanFlow framework. The paper thus investigates the properties required of text encoders for successful adaptation and proposes a method to achieve efficient text-conditioned synthesis.

## Methodology

### 3.1. Preliminary: MeanFlow
MeanFlow learns a flow map $u_\theta(z_t, t, r)$ that directly predicts the transition from state $z_t$ at time $t$ to $z_r$ at time $r$:

$$
z_r = z_t + (r - t) u_\theta(z_t, t, r), \quad r > t.
$$

For the true ODE trajectory, the ideal flow map is the average velocity over $[t, r]$. MeanFlow derives a self-consistent target by differentiating the transition equation along the trajectory:

$$
u_\theta(z_t, t, r) = v(z_t, t) + (r - t) \frac{d}{dt} u_\theta(z_t, t, r).
$$

Here, the total derivative is $\frac{d}{dt} u_\theta = \partial_t u_\theta + (\nabla_{z_t} u_\theta) v(z_t, t)$, implemented via Jacobian-vector product (JVP). The training objective is:

$$
\mathcal{L}_{MF}(\theta) = \mathbb{E}_{t, z_t, r} \left[ \| u_\theta(z_t, t, r) - \text{sg}(\tilde{u}(z_t, t, r)) \|^2 \right],
$$

where $\text{sg}(\cdot)$ is the stop-gradient operator and $\tilde{u}(z_t, t, r) = v(z_t, t) + (r - t) \frac{d}{dt} u_\theta(z_t, t, r)$.
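
The target can be checked numerically on a toy problem. The sketch below is not the paper's implementation: it assumes a hypothetical 1-D linear ODE $v(z, t) = a z$ (with `A = -0.7` chosen arbitrarily), for which the average velocity has a closed form, and it replaces the JVP with a central finite difference along the tangent direction $(v, 1)$. The identity itself follows from writing $(r - t)\,u = \int_t^r v\,d\tau$ and differentiating both sides with respect to $t$ at fixed $r$, which gives $-u + (r - t)\frac{d}{dt}u = -v$.

```python
import math

A = -0.7  # hypothetical decay rate of the toy ODE dz/dt = A * z


def velocity(z, t):
    """Instantaneous velocity v(z_t, t) of the toy ODE."""
    return A * z


def avg_velocity(z, t, r):
    """Closed-form average velocity over the segment from t to r (toy ODE only)."""
    s = r - t
    return z * math.expm1(A * s) / s


def total_derivative(u, z, t, r, eps=1e-5):
    """d/dt u(z_t, t, r) = partial_t u + (partial_z u) * v.

    Central difference along the tangent (dz, dt) = (v, 1); this stands in
    for the Jacobian-vector product used in practice.
    """
    v = velocity(z, t)
    forward = u(z + eps * v, t + eps, r)
    backward = u(z - eps * v, t - eps, r)
    return (forward - backward) / (2.0 * eps)


def meanflow_target(u, z, t, r):
    """u_tilde = v(z_t, t) + (r - t) * d/dt u(z_t, t, r)."""
    return velocity(z, t) + (r - t) * total_derivative(u, z, t, r)


z_t, t, r = 1.3, 0.2, 0.9
target = meanflow_target(avg_velocity, z_t, t, r)
exact = avg_velocity(z_t, t, r)

# One application of the flow map lands on the ODE solution at time r.
one_step = z_t + (r - t) * exact
print(target, exact, one_step)
```

In training, `u` would be the network and the target would sit behind a stop-gradient. Here `u` is already the exact flow map, so the target reproduces it up to finite-difference error.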

### 3.2. & 3.3. Analysis of Text Representations
The authors evaluate two mainstream T2I models (SANA-1.5 and BLIP3o-NEXT) under constrained-iteration settings. They find that **BLIP3o-NEXT maintains basic semantic integrity even at 1 step**, while SANA-1.5 suffers substantial semantic loss. This indicates that the text representation from BLIP3o-NEXT yields a higher-quality velocity field, better aligned with target semantics, and is more suitable for MeanFlow.

To understand this difference, they analyze two key properties of text encoders:
1.  **Discriminability:** The ability to generate representations well-aligned with corresponding image embeddings. They conduct an image-text retrieval experiment on COCO 2017. Text embeddings $e(x)$ are mean-pooled:

$$
h(x) = \frac{1}{L_{seq}} \sum_{t=1}^{L_{seq}} e(x)_t,
$$

and the cosine similarity between pooled embeddings is computed:

$$
\cos(x, y) = \frac{h(x)^\top h(y)}{\|h(x)\|_2 \|h(y)\|_2}.
$$

The retrieved images are re-encoded with DINOv3, and the cosine similarity to the query image is aggregated.

2.  **Disentanglement:** The ability of the text embedding to retain the linguistic structure of the original prompt. They evaluate on DPG-Bench by creating ablated (shortened) prompts and computing the cosine similarity between the embeddings of the original and ablated versions (reported as subsequence similarity in Table 2, where higher is better).
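
The two measurements share the same building blocks: mean pooling over tokens and cosine similarity between pooled vectors. The following is a minimal sketch with made-up 3-dimensional token embeddings. The helper names, the toy vectors, and the way candidates are ranked are all assumptions for illustration, not the paper's evaluation code.

```python
import math


def mean_pool(token_embeddings):
    """Average a list of per-token vectors into a single vector h(x)."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Index of the candidate vector most similar to the query."""
    return max(range(len(candidates)),
               key=lambda i: cosine_similarity(query, candidates[i]))


# Hypothetical token embeddings for a prompt and a shortened version of it.
full_prompt = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.9, 0.1, 0.1]]
ablated_prompt = full_prompt[:2]

query = mean_pool(full_prompt)

# Discriminability-style check: which pooled candidate is closest?
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
best = retrieve(query, candidates)

# Disentanglement-style check: original vs. ablated prompt.
sim_ablated = cosine_similarity(query, mean_pool(ablated_prompt))
print(best, round(sim_ablated, 4))
```

In the paper's setup the pooled vectors would come from the text encoder under test, and retrieved images would additionally be scored with DINOv3 features, which this toy omits.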

**Key Tables from Analysis:**

**Table 1:** DINO evaluation of image-feature similarity for text-retrieved images (quantifying Discriminability).
| Model          | BLIP3o-NEXT | CLIP | Gemma | T5   |
|----------------|-------------|------|-------|------|
| Score          | 0.734       | 0.730| 0.713 | 0.634|

**Table 2:** Evaluation of Text Encoder Disentanglement via Subsequence Similarity.
| Model          | BLIP3o-NEXT | CLIP | Gemma | T5   |
|----------------|-------------|------|-------|------|
| Score          | 0.999       | 0.967| 0.987 | 0.893|

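BLIP3o-NEXT scores highest on both properties (0.734 and 0.999), although its margin over CLIP on discriminability is small (0.734 vs. 0.730).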

### 3.4. Extending MeanFlow to T2I Generation (EMF)
Given a pre-trained flow matching backbone conditioned on textual embeddings, the architecture is modified so that the network is conditioned on both times $(t, r)$ of a MeanFlow interval rather than on a single timestep.

*   **Temporal Conditioning Adaptation:** Standard flow matching uses a single temporal embedding layer $\phi_{\text{time}}(t)$. EMF duplicates these parameters into two separate layers: $\phi_{\text{interval}}(\cdot)$ encodes the interval length $(t - r)$, and $\phi_{\text{end}}(\cdot)$ encodes the segment end time $t$. The conditional temporal embedding is constructed as:

$$
\phi_{\text{cond}}(t, r) = \phi_{\text{interval}}(t - r) + \phi_{\text{end}}(t).
$$

*   **Conditioning:** The conditioning embedding $\phi_{\text{cond}}$ and text features $\psi_{\text{text}}(x_{\text{text}})$ (from the BLIP3o-NEXT encoder) jointly condition the velocity network:

$$
u_\theta(z_t, t, r, \psi_{\text{text}}) = f_\theta(z_t, \phi_{\text{cond}}(t, r), \psi_{\text{text}}).
$$

*   **Training:** Timesteps $(t, r)$ are adaptively sampled from a uniform or logit-normal distribution:

$$
t, r \sim p(\cdot; \mu(p), \sigma(p)), \quad t \neq r,
$$

where parameters $\mu(p), \sigma(p)$ are interpolated based on training progress $p \in [0,1]$. This ensures exposure to both short- and long-range segments.
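
The temporal conditioning and the progress-dependent timestep sampling can be sketched as follows. Everything concrete here is an assumption made for illustration: the `TimeEmbedding` class (sinusoidal features followed by a linear map) stands in for the backbone's real embedding layer, the interpolation of $\mu(p)$ and $\sigma(p)$ is taken to be linear, and the endpoint values in `mu_range` and `sigma_range` are invented. Only the logit-normal case is shown.

```python
import copy
import math
import random


def sinusoidal_features(x, dim=8):
    """Sinusoidal features of a scalar input; dim must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(x * f) for f in freqs] + [math.cos(x * f) for f in freqs]


class TimeEmbedding:
    """Hypothetical stand-in for phi_time: sinusoidal features, then a linear map."""

    def __init__(self, dim=8, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.weight = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                       for _ in range(dim)]

    def __call__(self, x):
        feats = sinusoidal_features(x, self.dim)
        return [sum(w * f for w, f in zip(row, feats)) for row in self.weight]


# Pretend this is the pre-trained single-time embedding layer.
phi_time = TimeEmbedding()
# Duplicate its parameters into two layers that can then be trained separately.
phi_interval = copy.deepcopy(phi_time)
phi_end = copy.deepcopy(phi_time)


def phi_cond(t, r):
    """phi_cond(t, r) = phi_interval(t - r) + phi_end(t)."""
    return [a + b for a, b in zip(phi_interval(t - r), phi_end(t))]


def lerp(start, end, p):
    """Linear interpolation between start and end for p in [0, 1]."""
    return start + (end - start) * p


def sample_time_pair(p, rng, mu_range=(-0.4, 0.0), sigma_range=(1.0, 1.6)):
    """Draw (t, r) from a logit-normal whose parameters depend on progress p."""
    mu = lerp(mu_range[0], mu_range[1], p)
    sigma = lerp(sigma_range[0], sigma_range[1], p)
    while True:
        t = 1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma)))
        r = 1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma)))
        if t != r:  # the source only requires t != r
            return t, r


rng = random.Random(0)
t, r = sample_time_pair(p=0.5, rng=rng)
embedding = phi_cond(t, r)
print(t, r, len(embedding))
```

Copying the pre-trained weights into both layers means the adapted model starts from embeddings the backbone already understands, instead of from a randomly initialized interval branch.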

The model is trained with the MeanFlow objective adapted for text conditioning:

$$
\mathcal{L}_{MF}(\theta) = \mathbb{E}_{z_t, t, r} \left[ \| u_\theta(z_t, t, r, \psi_{\text{text}}) - \text{sg}(u_{\text{tgt}}) \|^2 \right],
$$

with the target defined as:

$$
u_{\text{tgt}} = v_\theta(z_t, t, \psi_{\text{text}}) + (r - t) \frac{d}{dt} u_\theta(z_t, t, r, \psi_{\text{text}}).
$$

## Empirical Validation / Results

### 4.2. Comparison with State-of-the-arts
The model is evaluated on **GenEval**, **DPG-Bench**, and **HPS-v2**.

**Table 3:** GenEval results (excerpt; the rows for other pretrained, unified, and distilled models are not reproduced here).
| Model                | #Params | Steps | Single Object | Two Objects | Counting | Colors | Position | Color Attribution | **Overall** |
|----------------------|---------|-------|---------------|-------------|----------|--------|----------|-------------------|-------------|
| **BLIP3o-NEXT [51]** | 3B      | 30    | 0.99          | 0.95        | 0.88     | 0.90    | 0.92     | 0.79              | **0.91**    |
| **EMF (Ours)**       | 3B      | 1     | 0.98          | 0.86        | 0.66     | 0.69    | 0.80     | 0.47              | **0.74**    |
| **EMF (Ours)**       | 3B      | 2     | 0.99          | 0.91        | 0.81     | 0.86    | 0.86     | 0.66              | **0.85**    |
| **EMF (Ours)**       | 3B      | 4     | 1.00          | 0.94        | 0.88     | 0.92    | 0.91     | 0.76              | **0.90**    |

*   EMF achieves a GenEval score of **0.90 with just 4 steps**, nearly matching BLIP3o-NEXT's 30-step score of 0.91, and outperforming other distilled models.
*   On the more challenging **DPG-Bench** and **HPS-v2**, EMF with 4 steps closely matches the 30-step BLIP3o-NEXT baseline (see Table 4).

**Table 4:** DPG-Bench and HPS-v2.1 results (partial).
| Model          | Steps | DPG-Bench Overall | HPS-v2.1 Average |
|----------------|-------|-------------------|------------------|
| BLIP3o-NEXT    | 4     | 78.15             | 26.96            |
| BLIP3o-NEXT    | 30    | 82.05             | 29.42            |
| **EMF**        | **1** | **77.36 (+20.31)**| **25.77 (+7.23)**|
| **EMF**        | **2** | **79.44 (+12.06)**| **27.21 (+4.76)**|
| **EMF**        | **4** | **81.20 (+3.05)** | **29.25 (+2.29)**|

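*   The parenthesized gains appear to be measured against BLIP3o-NEXT at the same step count: the 4-step row gives 81.20 − 78.15 = 3.05 on DPG-Bench and 29.25 − 26.96 = 2.29 on HPS-v2.1. The gains are largest at 1 and 2 steps, and at 4 steps EMF comes within 0.85 DPG-Bench points and 0.17 HPS-v2.1 points of the 30-step baseline.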

### 4.3. Ablation of Sampling Steps
Performance under different sampling steps is monitored during training (Figure 4).
*   **4-step** sampling achieves high quality (~0.90 GenEval) within ~10k training steps.
*   **2-step** sampling reaches 0.85 at 70k steps.
*   **1-step** sampling reaches 0.74 at 90k steps.
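
These curves indicate stable convergence at every step count. Settings with fewer sampling steps need more training to plateau and end at a lower score, which shows the benefit of additional steps in the MeanFlow framework.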
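
Multi-step sampling with a flow map is a short loop: split the time range into segments and apply $z_r = z_t + (r - t)\,u(z_t, t, r)$ on each. The sketch below uses a hypothetical 1-D linear ODE and two stand-in flow maps, one exact and one deliberately crude, to illustrate why extra steps help an imperfect model. The time range, the toy dynamics, and all function names are assumptions, not the paper's sampler.

```python
import math

A = -0.7  # hypothetical decay rate of the toy ODE dz/dt = A * z


def exact_flow_map(z, t, r):
    """True average velocity over the segment from t to r for the toy ODE."""
    s = r - t
    return z * math.expm1(A * s) / s


def crude_flow_map(z, t, r):
    """Imperfect model: ignores the interval and returns the instantaneous velocity."""
    return A * z


def sample(flow_map, z_start, num_steps, t_start=0.0, t_end=1.0):
    """Apply z_r = z_t + (r - t) * u(z_t, t, r) over num_steps equal segments."""
    z = z_start
    times = [t_start + (t_end - t_start) * k / num_steps
             for k in range(num_steps + 1)]
    for t, r in zip(times[:-1], times[1:]):
        z = z + (r - t) * flow_map(z, t, r)
    return z


z0 = 2.0
truth = z0 * math.exp(A)  # ODE solution at t_end = 1

# A perfect flow map is already exact in one step.
one_step_exact = sample(exact_flow_map, z0, num_steps=1)

# An imperfect flow map gets closer to the truth as steps are added.
errors = {k: abs(sample(crude_flow_map, z0, k) - truth) for k in (1, 2, 4)}
print(one_step_exact, truth, errors)
```

A trained network sits between these two extremes, which is consistent with 1-step sampling being usable while 2 and 4 steps still score higher.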

### 5. Discussion & Additional Experiments
*   **Scalability Beyond 2 Steps:** Unlike consistency-distilled models which often saturate, EMF performance continues to improve with more steps (e.g., DPG-Bench score increases from 81.20 at 4-step to 81.94 at 8-step).
*   **Dependency on Text Encoder:** Applying MeanFlow to SANA-1.5 failed in the few-step regime: GenEval at 4 steps is 0.50, and 0.47 after additionally fine-tuning its text encoder on the same SFT data used for BLIP3o-NEXT, while 20-step sampling stays at 0.82 to 0.83 (Table 5). This supports the claim that successful MeanFlow adaptation depends critically on the inherent properties of the text encoder.

**Table 5:** GenEval scores of SANA-1.5 experiments.
| Sample Method | Encoder-SFT | MeanFlow Train | Sampling Steps | GenEval |
|---------------|-------------|----------------|----------------|---------|
| Flow Matching |             |                | 20             | 0.81    |
| Flow Matching | ✓           |                | 20             | 0.85    |
| MeanFlow      |             | ✓              | 4              | 0.50    |
| MeanFlow      |             | ✓              | 20             | 0.83    |
| MeanFlow      | ✓           | ✓              | 4              | 0.47    |
| MeanFlow      | ✓           | ✓              | 20             | 0.82    |

*   **Training Stability:** Figure 6 shows that BLIP3o-NEXT (both RL and SFT variants) converges stably under MeanFlow training, while SANA-1.5 exhibits instability regardless of encoder fine-tuning.

## Theoretical and Practical Implications
*   **Theoretical Insight:** The paper provides a principled explanation for why MeanFlow succeeds in class-conditional settings but fails with poor text encoders: **few-step generation requires highly discriminative and disentangled text representations** to compensate for the limited refinement capacity.
*   **Practical Guidance:** It offers a concrete recipe for adapting MeanFlow to T2I generation: 1) Select a text encoder validated to possess high discriminability and disentanglement (BLIP3o-NEXT is identified as such), and 2) Adapt the temporal conditioning mechanism to handle interval-based flow maps.
*   **Impact on Efficiency:** The proposed EMF model enables **high-quality T2I generation with very few steps (1-4)**, significantly reducing inference time while maintaining competitive performance with models requiring 20-30 steps.
*   **Broader Applicability:** The findings and methodology could serve as a reference for integrating efficient one-step generation frameworks into other conditional generation tasks beyond T2I.

## Conclusion
This work successfully extends MeanFlow-based one-step generation from class labels to flexible text inputs, enabling efficient and high-quality T2I synthesis. The key discovery is that successful adaptation hinges on text representations with strong **semantic discriminability** and **disentanglement**. Guided by this insight, the authors leverage the BLIP3o-NEXT text encoder and adapt the MeanFlow framework, resulting in the **EMF** model. Empirical results validate the approach, showing competitive performance with very few sampling steps. This work provides practical guidance and a strong reference for future research on text-conditioned MeanFlow generation.

