FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization - Summary

Summary (Overview)

Real-Time Interactive Garment Switching: Presents FashionChameleon, the first framework enabling real-time (23.8 FPS) and interactive human-garment video customization where users can switch garments during generation while preserving coherent human motion.
Three Key Technical Innovations: Introduces: 1) Teacher Model with In-Context Learning trained on single-garment data to preserve coherence; 2) Streaming Distillation with In-Context Learning for efficient, consistent long-video generation; 3) Training-Free KV Cache Rescheduling for interactive multi-garment control.
Superior Performance & Efficiency: Outperforms existing subject-to-video (S2V) baselines in garment consistency and temporal smoothness while being 30-180× faster, achieving real-time 720p generation on a single GPU.
Novel Data & Benchmark: Develops a high-quality four-stage data curation pipeline and proposes HGC-Bench, a dedicated benchmark for evaluating human-garment video customization.
Practical Applications: Uniquely supports interactive multi-garment switching and consistent long-video extrapolation, demonstrating high value for e-commerce, content creation, and entertainment.

Introduction and Theoretical Foundation

Human-centric video customization, especially at the garment level, holds significant commercial value in e-commerce, filmmaking, and entertainment. However, existing subject-to-video (S2V) customization methods primarily focus on overall subject identity preservation, suffer from high inference latency, and lack support for fine-grained, interactive control like real-time garment switching. This paper addresses the challenge of achieving interactive multi-garment video customization using only single-garment video data.

The work is motivated by the success of hybrid autoregressive video generation paradigms (e.g., CausVid, Self-Forcing), which combine diffusion models with autoregressive prediction for efficient, streaming generation. The authors ask: Can this paradigm be extended to the customization domain? They identify three core challenges:

Single-to-Multiple Generalization: How to leverage readily available single-garment video data for interactive multi-garment tasks.
Consistency and Efficiency: How to maintain identity and motion consistency during efficient, autoregressive "self-rollout" generation.
Coherent Interaction: How to enable seamless garment transitions while preserving continuous human motion during generation.

FashionChameleon is proposed as a solution, formulating a new task: streaming and interactive human-garment video customization.

Methodology

The overall pipeline comprises three core components, supported by a dedicated data curation pipeline.

1. Teacher Model with In-Context Learning

Instead of training on scarce multi-garment data, a bidirectional teacher model is trained using in-context learning on single-garment samples, each pairing one reference image with one garment image.

Shared Latent Space: The VAE encoder $E$ is reused to encode the video $V$, reference image $I_{src}$, and garment image $I_{gar}$ into a shared latent space: $z^v_0 = E(V); \quad z^{src}_0 = E(I_{src}); \quad z^{gar}_0 = E(I_{gar})$
Key Training Strategy: The image-to-video (I2V) training paradigm is retained, ensuring the first generated frame matches the reference except for the garment. Crucially, the garment worn by the reference person is mismatched with the target garment $I_{gar}$. This forces the model to learn implicit coherence during single-garment switching.
Multi-Modal Attention: The clean reference latent $z^{src}_0$, clean garment latent $z^{gar}_0$, and noisy video latent $z^v_t$ are concatenated and processed via standard multi-head attention within a single backbone, eliminating the need for auxiliary encoders.
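As a rough illustration of the in-context conditioning described above, the sketch below joins the three token groups along the sequence axis so that one attention stack can see all of them. It is not the authors' implementation: the function name `build_in_context_sequence`, the token counts, and the feature width are all assumptions made for the example, and the attention layers themselves are omitted.

```python
from typing import Dict, List, Tuple

Token = List[float]  # one latent token as a feature vector (assumed layout)


def build_in_context_sequence(
    src: List[Token], gar: List[Token], video: List[Token]
) -> Tuple[List[Token], Dict[str, slice]]:
    """Join the three token groups along the sequence axis.

    Returns the joint sequence plus the slice each group occupies, so the
    video tokens can be read back out after attention.
    """
    widths = {len(tok) for tok in src + gar + video}
    if len(widths) != 1:
        raise ValueError("all tokens must share one feature width")
    sequence = src + gar + video
    n_src, n_gar = len(src), len(gar)
    segments = {
        "src": slice(0, n_src),
        "gar": slice(n_src, n_src + n_gar),
        "video": slice(n_src + n_gar, len(sequence)),
    }
    return sequence, segments


# Toy sizes (assumed): 2 reference tokens, 2 garment tokens, 6 video tokens.
z_src = [[0.1] * 4 for _ in range(2)]  # stands in for clean z^{src}_0
z_gar = [[0.2] * 4 for _ in range(2)]  # stands in for clean z^{gar}_0
z_vid = [[0.3] * 4 for _ in range(6)]  # stands in for noisy z^v_t
sequence, segments = build_in_context_sequence(z_src, z_gar, z_vid)
print(len(sequence), segments["video"])
```

Because the conditions occupy their own sequence positions, the sequence grows while the per-token feature width stays the same, which is the distinction from the channel-wise concatenation variant compared in the ablations.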

2. Streaming Distillation with In-Context Learning

The teacher model is distilled into a few-step autoregressive student for real-time generation.

In-Context Teacher Forcing Mask: A masking strategy is designed for fine-tuning, allowing the model to attend to conditional signals ($z^{src}_0$, $z^{gar}_0$) and ground-truth history when predicting the next frame/chunk, aligning training with the autoregressive inference.
Gradient-Reweighted Distribution Matching Distillation (GR-DMD): To address error accumulation in long-video extrapolation, an aesthetic reward model $R$ reweights the DMD gradient frame by frame, so that low-quality frames contribute more. The gradient of the loss is: $\nabla_\theta \mathcal{L}_{\text{GR-DMD}} = -\mathbb{E}_t \left[ \int \mathbf{A}^{1:f}(G_\theta(\epsilon)) \cdot \left( \mathbf{s}^{1:f}_{\text{real}}(\phi(G_\theta(\epsilon), t), t) - \mathbf{s}^{1:f}_{\text{fake}}(\phi(G_\theta(\epsilon), t), t) \right) \cdot \frac{dG_\theta(\epsilon)}{d\theta} \, d\epsilon \right],$ where $G_\theta$ is the student generator, $f$ is the number of frames, and $\mathbf{A}^{1:f} = (A_1, \dots, A_f)$ collects the per-frame adaptive weights: $A_i(G_\theta(\epsilon)) = \frac{\exp(-R(G_{\theta,i}(\epsilon)) / \tau)}{\sum_{j=1}^f \exp(-R(G_{\theta,j}(\epsilon)) / \tau)}, \quad i = 1,\dots,f.$ Here, $G_{\theta,i}(\epsilon)$ is the $i$-th generated frame and $\tau$ is a temperature coefficient (set to 0.2).
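The adaptive weight $A_i$ above is a softmax over negated per-frame rewards with temperature $\tau$. The following minimal sketch computes only those weights; the function name `gr_dmd_weights` and the reward values are invented for illustration, and the reward model $R$ and the DMD gradient itself are not modeled.

```python
import math
from typing import List


def gr_dmd_weights(rewards: List[float], tau: float = 0.2) -> List[float]:
    """Per-frame weights A_i = exp(-R_i / tau) / sum_j exp(-R_j / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    logits = [-r / tau for r in rewards]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up aesthetic scores for three frames; quality degrades over time.
rewards = [0.8, 0.6, 0.4]
weights_sharp = gr_dmd_weights(rewards, tau=0.2)
weights_soft = gr_dmd_weights(rewards, tau=0.5)
print(weights_sharp)
print(weights_soft)
```

The lowest-reward frame receives the largest weight, and a smaller $\tau$ concentrates more of the weight on it, which is one way to read the temperature ablation in Table 3.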

3. Training-Free KV Cache Rescheduling

Enables interactive garment switching during inference by manipulating the Key-Value (KV) cache.

Garment KV Refresh: To switch to a new garment $I_{gar2}$, its encoded KV ($KV_{gar2}$) replaces the old $KV_{gar}$ in the cache.
Historical KV Withdraw: Analysis shows the model relies more on historical context than conditional signals. Withdrawing historical KV entries containing the old garment forces attention to the new garment KV.
Reference KV Disentangle: To preserve motion coherence across the switch, the reference KV $KV_{src}$ is replaced with the KV from the last historical frame. A VAE decode-encode process is applied to this frame to disentangle a single-frame representation, matching the training distribution.
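The three rescheduling steps can be sketched as a single cache update. This is a simplified illustration, not the paper's code: `KVCache`, `switch_garment`, `encode_kv`, and `vae_roundtrip` are hypothetical names, the cache entries are opaque placeholders rather than tensors, and the sketch assumes every historical entry contains the old garment and is therefore withdrawn.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class KVCache:
    src: Any  # KV of the reference image (KV_src)
    gar: Any  # KV of the active garment (KV_gar)
    history: List[Any] = field(default_factory=list)  # KV of generated chunks


def switch_garment(
    cache: KVCache,
    new_garment: Any,
    last_frame: Any,
    encode_kv: Callable[[Any], Any],
    vae_roundtrip: Callable[[Any], Any],
) -> None:
    """Update the cache in place so generation continues with a new garment."""
    # Garment KV Refresh: the new garment's KV replaces the old one.
    cache.gar = encode_kv(new_garment)
    # Reference KV Disentangle: the last generated frame, passed through a
    # VAE decode-encode round trip, becomes the new reference.
    cache.src = encode_kv(vae_roundtrip(last_frame))
    # Historical KV Withdraw: drop entries that still carry the old garment
    # (assumed here to be all of them).
    cache.history.clear()


# Stand-in callables so the sketch runs without a model.
encode_kv = lambda x: ("kv", x)
vae_roundtrip = lambda frame: ("reencoded", frame)

cache = KVCache(
    src=("kv", "reference"),
    gar=("kv", "garment_1"),
    history=[("kv", "chunk_0"), ("kv", "chunk_1")],
)
switch_garment(cache, "garment_2", "last_frame", encode_kv, vae_roundtrip)
print(cache)
```

No weights are touched in this update, which is what makes the approach training-free: only the contents of the cache change between two generation steps.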

4. High-Quality Data Curation Pipeline

A four-stage pipeline curates training triplets (reference image, garment image, video):

General Coarse-to-Fine Video Filtering: Filters raw videos for single-person clips with moderate motion, high aesthetics, and quality.
Static-Dynamic Video Captioning: Uses a VLM (Gemini-3.1) to generate decoupled static (scene, appearance) and dynamic (action, motion) captions.
Fine-Grained Garment Image Extraction: Applies an image "try-off" model (Qwen-Image-Edit) to the first video frame, with VLM validation for semantic/textural consistency.
Adaptive Reference Image Construction: Based on the extracted garment type, a compatible garment is retrieved and "tried-on" onto the first frame to create the reference image, ensuring a garment mismatch for training.
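The data flow through the four stages can be sketched as a generator that turns raw videos into training triplets. Every callable here (`keep`, `caption`, `try_off`, `validate`, `try_on_mismatched`) is a hypothetical placeholder for the corresponding filter, VLM, or editing model; none of them reflects a real API, and the string values are stand-ins for images and videos.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class Triplet:
    reference_image: str
    garment_image: str
    video: str
    static_caption: str
    dynamic_caption: str


def curate(
    videos: Iterable[str],
    keep: Callable[[str], bool],
    caption: Callable[[str], Tuple[str, str]],
    try_off: Callable[[str], str],
    validate: Callable[[str, str], bool],
    try_on_mismatched: Callable[[str, str], str],
) -> Iterator[Triplet]:
    """Yield one triplet per video that survives filtering and validation."""
    for video in videos:
        if not keep(video):  # stage 1: coarse-to-fine filtering
            continue
        static_cap, dynamic_cap = caption(video)  # stage 2: decoupled captions
        garment = try_off(video)  # stage 3: garment from the first frame
        if not validate(video, garment):  # stage 3: VLM consistency check
            continue
        reference = try_on_mismatched(video, garment)  # stage 4: mismatched ref
        yield Triplet(reference, garment, video, static_cap, dynamic_cap)


# Toy stand-ins so the sketch runs end to end.
triplets = list(
    curate(
        videos=["clip_a", "clip_b_multi_person", "clip_c"],
        keep=lambda v: "multi_person" not in v,
        caption=lambda v: (f"static:{v}", f"dynamic:{v}"),
        try_off=lambda v: f"garment_of_{v}",
        validate=lambda v, g: v != "clip_c",
        try_on_mismatched=lambda v, g: f"{v}_wearing_other_garment",
    )
)
print(triplets)
```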

Empirical Validation / Results

Experimental Setup

Baselines: Compared against state-of-the-art S2V methods: VACE, Kaleido, MAGREF, SkyReels-A2, Phantom (1.3B & 14B), and an Edit+I2V pipeline.
Metrics: ID consistency (Cur.), text alignment (GME), motion magnitude (Amp.), smoothness (Smoo.), visual quality (VQ), and three new garment consistency scores evaluated by Gemini-3.0: High-Level (HGC), Low-Level (LGC), and Non-Target Preservation (NTP). Frames per second (FPS) measures speed.
HGC-Bench: A new benchmark of 240 samples for evaluation.

Key Results

Table 1: Quantitative comparison for short (81 frames) video generation.

Methods	Params ↓	Cur. ↑	GME ↑	Amp. ↑	Smoo. ↑	VQ ↑	HGC ↑	LGC ↑	NTP ↑	FPS ↑
Edit[49]+I2V[5]	20B+5B	0.4094	0.6741	0.8636	0.9898	0.7482	4.5417	3.9167	4.4583	0.76
VACE[14]	14B	0.2746	0.6962	0.4054	0.9764	0.7409	4.3708	3.5458	4.6417	0.23
Kaleido[20]	14B	0.3676	0.6882	0.2675	0.9935	0.7478	4.1708	3.5500	4.7167	0.13
MAGREF[18]	14B	0.0459	0.7138	0.2571	0.9436	0.7301	3.6000	2.2000	2.6875	0.27
SkyReels-A2[19]	14B	0.3689	0.6550	0.5205	0.9424	0.7241	3.3625	2.6958	4.6458	0.54
Phantom[13]	1.3B	0.5507	0.6855	0.1144	0.9668	0.7338	4.3292	3.6417	4.6875	0.77
Phantom[13]	14B	0.4911	0.6972	0.2086	0.9932	0.7446	4.5375	3.8333	4.6417	0.15
FashionChameleon	5B	0.4911	0.6839	0.7771	0.9969	0.7483	4.6833	3.9250	4.7625	23.8

Performance: FashionChameleon (5B params) achieves the best scores in Smoothness, Visual Quality, and all three Garment Consistency metrics. It ranks second in ID Consistency (0.4911, tied with Phantom 14B and behind Phantom 1.3B) and second in Motion Magnitude (behind Edit+I2V).
Efficiency: It achieves 23.8 FPS, which is 30-180 times faster than all baselines (0.13-0.77 FPS), enabling real-time generation.
Qualitative Results: Visual comparisons show FashionChameleon better preserves subject identity, garment details, and produces more natural motions compared to baselines, which often show garment mismatch or degradation.
Additional Capabilities:
- Long-Video Extrapolation: Generates coherent, consistent videos well beyond the training sequence length (e.g., 154+ frames).
- Interactive Customization: Successively switches garments during generation while maintaining motion coherence.
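The 30-180× speedup range can be checked directly from the FPS column of Table 1. The short calculation below uses only the table's values; the assumption that generation time for a clip equals frame count divided by FPS is ours and ignores any fixed startup cost.

```python
OURS_FPS = 23.8  # FashionChameleon, Table 1

# FPS of each baseline, copied from Table 1.
baseline_fps = {
    "Edit+I2V": 0.76,
    "VACE": 0.23,
    "Kaleido": 0.13,
    "MAGREF": 0.27,
    "SkyReels-A2": 0.54,
    "Phantom 1.3B": 0.77,
    "Phantom 14B": 0.15,
}

speedups = {name: OURS_FPS / fps for name, fps in baseline_fps.items()}
min_speedup = min(speedups.values())
max_speedup = max(speedups.values())

# Time to produce an 81-frame clip, assuming time = frames / FPS.
FRAMES = 81
ours_seconds = FRAMES / OURS_FPS
slowest_seconds = FRAMES / min(baseline_fps.values())

print(f"speedup range: {min_speedup:.1f}x to {max_speedup:.1f}x")
print(f"81 frames: {ours_seconds:.1f} s vs {slowest_seconds:.0f} s")
```

The smallest ratio comes from the fastest baseline (Phantom 1.3B) and the largest from the slowest (Kaleido).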

Ablation Studies

Table 2: Ablation of teacher training strategies.

Variants	Cur. ↑	GME ↑	Amp. ↑	Smoo. ↑	VQ ↑	HGC ↑	LGC ↑	NTP ↑
Chan.-Concat + Full FT	0.1811	0.6874	0.3748	0.9266	0.7404	4.4917	3.1667	4.4667
Ours (In-Context) + Full FT	0.4602	0.6972	0.5625	0.9936	0.7473	4.8583	4.1583	4.7792
Ours + Attn FT	0.4348	0.6900	0.6350	0.9881	0.7471	4.8500	4.0625	4.7750
Ours + LoRA FT	0.4046	0.6928	0.6448	0.9777	0.7437	4.7292	3.9458	4.7042

In-Context Learning vs. Channel Concatenation: In-context learning significantly outperforms simple channel-wise concatenation across all metrics.
Full Fine-Tuning: Full fine-tuning of the teacher model yields the best score on every metric except motion magnitude (Amp.), where LoRA (0.6448) and attention-only (0.6350) fine-tuning score higher than full fine-tuning (0.5625).

Table 3: Ablation of Gradient-Reweighted DMD (GR-DMD) for long-video (165 frames) generation.

Variants	Cur. ↑	GME ↑	Amp. ↑	Smoo. ↑	VQ ↑	HGC ↑	LGC ↑	NTP ↑
Naive DMD	0.4232	0.6700	0.8026	0.9932	0.7419	4.6958	3.8958	4.7125
GR-DMD ($\tau=0.2$)	0.4265	0.6732	0.8395	0.9975	0.7480	4.7000	3.9042	4.7333
GR-DMD ($\tau=0.3$)	0.4111	0.6786	0.5106	0.9933	0.7465	4.7583	3.9375	4.6958
GR-DMD ($\tau=0.4$)	0.4047	0.6696	0.7869	0.9872	0.7424	4.7125	3.9022	4.7208
GR-DMD ($\tau=0.5$)	0.4252	0.6774	0.7907	0.9953	0.7421	4.7083	3.8833	4.7058

GR-DMD Effectiveness: GR-DMD ($\tau=0.2$) improves upon naive DMD on every metric in Table 3, most visibly in motion amplitude (Amp., 0.8026 to 0.8395) and smoothness (Smoo., 0.9932 to 0.9975), alleviating motion collapse during extrapolation.
KV Cache Rescheduling: Qualitative ablations confirm that Reference KV Disentangle is crucial for maintaining temporal coherence during garment switching.

Theoretical and Practical Implications

Theoretical Contribution: Demonstrates how the hybrid autoregressive generation paradigm can be successfully adapted for discrete-control customization tasks, moving beyond continuous signals (audio, motion). Introduces novel techniques for in-context learning with diffusion transformers and training-free inference-time control via KV cache manipulation.
Practical Impact: FashionChameleon's real-time speed (23.8 FPS) and interactive garment-switching capability unlock immediate applications in live e-commerce showcases, interactive content creation tools, and virtual try-on systems, where low latency and user control are paramount.
Benchmarking: The introduction of HGC-Bench provides a standardized evaluation suite for future research in garment-level video customization.

Conclusion

FashionChameleon achieves real-time, interactive human-garment video customization. Its three core components (in-context teacher training, streaming distillation with GR-DMD, and training-free KV cache rescheduling) enable it to outperform existing methods on smoothness, visual quality, and garment consistency while running 30-180× faster. The work bridges the gap between high-fidelity customization and practical, interactive generation, offering significant value for human-centric applications. Future work may focus on scaling the training data with more garment variety and integrating stronger video generation backbones to handle more complex motions.