DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Summary (Overview)
- Unified Framework: DataFlex is a comprehensive framework built upon LLaMA-Factory that unifies three major paradigms of data-centric LLM training—dynamic data selection, domain mixture adjustment, and sample reweighting—into a single, reproducible system.
- Drop-in Design & Modularity: It features a modular architecture with extensible trainer abstractions (Select Trainer, Mix Trainer, Weight Trainer) and pluggable algorithm components, enabling it to serve as a drop-in replacement for standard LLM training workflows in LLaMA-Factory.
- Empirical Effectiveness: Experiments show that dynamic data-centric methods implemented in DataFlex consistently outperform static full-data training. For selection, methods such as LESS improve MMLU accuracy; for mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity.
- Improved Efficiency & Scalability: DataFlex provides runtime improvements over original implementations of algorithms like LESS and TSDS, and supports large-scale training settings (e.g., DeepSpeed ZeRO-3) by unifying model-dependent operations like gradient computation and embedding extraction.
Introduction and Theoretical Foundation
The remarkable progress of Large Language Models (LLMs) is driven not only by architectural advances but also by the scale, quality, and composition of training data. This has spurred interest in data-centric training methods, which aim to optimize not just model parameters but also the selection, composition, and weighting of training data during optimization.
Recent work explores diverse strategies: online methods (e.g., LESS, NICE, ODM) adapt data decisions within the training loop using signals like gradients or losses, while offline methods (e.g., TSDS, DoReMi) precompute data-centric strategies before training begins. However, these approaches are often released as isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration.
To address this fragmentation, the paper introduces the concept of a Data-Centric Dynamic Training System and presents DataFlex as its realization. The term "dynamic" refers to the system's capability to flexibly orchestrate data usage throughout the training lifecycle, accommodating both online and offline algorithms. The core motivation is to elevate data from a static resource to a first-class optimization variable, providing a unified infrastructure for systematic study and deployment.
Methodology
3.1 Goals and Design Philosophy
DataFlex is guided by three principles:
- Unification: Support representative data-centric paradigms under a common framework.
- Compatibility: Integrate seamlessly with existing large-scale training infrastructure (LLaMA-Factory).
- Extensibility: Enable researchers to implement and compare new algorithms with minimal overhead.
3.2 Data Module Design
DataFlex features a modular three-layer architecture built on LLaMA-Factory:
- Base Layer: Inherited from LLaMA-Factory, providing standard model management, data processing, and optimization.
- Trainer Layer: The core abstraction with three dynamic training modes:
  - Select Trainer: dynamically selects a subset of samples.
  - Mix Trainer: dynamically adjusts mixture ratios across domains.
  - Weight Trainer: dynamically modifies per-sample training weights.
- Component Layer: Pluggable strategy components (selectors, mixers, weighters) encapsulate algorithm-specific logic. A centralized registry allows new algorithms to be registered and discovered at runtime.
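A registry of this kind can be sketched in a few lines; the names below (`COMPONENTS`, `register_component`, `get_component`, `LessSelector`) are illustrative and are not DataFlex's actual API:

```python
# Minimal sketch of a runtime component registry (illustrative names,
# not DataFlex's actual API).
COMPONENTS = {"selector": {}, "mixer": {}, "weighter": {}}

def register_component(kind, name):
    """Decorator that registers a strategy class under (kind, name)."""
    def decorator(cls):
        COMPONENTS[kind][name] = cls
        return cls
    return decorator

@register_component("selector", "less")
class LessSelector:
    def select(self, model, dataset, k):
        # Algorithm-specific scoring would go here; return top-k indices.
        return list(range(k))

def get_component(kind, name, **kwargs):
    """Look up and instantiate a registered component at runtime."""
    return COMPONENTS[kind][name](**kwargs)

selector = get_component("selector", "less")
```

The decorator pattern lets a new algorithm become discoverable by name (e.g. from a config file) without touching framework code.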
3.3 Algorithm Integration and Extensibility
DataFlex abstracts a common interaction pattern: each method observes the current model state, computes a data-centric decision, and feeds it back into optimization. The invocation frequency is configurable via parameters like warmup_step and update_step.
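As a concrete illustration of how such parameters could gate invocation, consider this hypothetical helper (a sketch of the scheduling logic, not DataFlex's implementation):

```python
def is_update_step(step, warmup_step=100, update_step=50, update_times=30):
    """Illustrative helper: return True when a data-centric decision
    should be recomputed. The first update fires after `warmup_step`
    optimizer steps, then every `update_step` steps, for at most
    `update_times` updates in total."""
    if step < warmup_step:
        return False
    offset = step - warmup_step
    if offset % update_step != 0:
        return False
    return offset // update_step < update_times

# With the defaults above, updates fire at steps 100, 150, ..., 1550.
```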
Supported Algorithms (as summarized in Table 1):
| Attribute | Data Selection | Data Mixture | Data Reweighting |
|---|---|---|---|
| Methods | LESS, NICE, Loss, Delta Loss, NEAR, TSDS | DoReMi, ODM | Loss Reweighting |
| Category | Gradient, Loss, Distribution | Offline, Online | Loss |
| Model-in-the-Loop | ✔ / ✘ | ✔ | ✔ |
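Loss-based reweighting can be illustrated with a minimal sketch: compute per-sample losses, derive weights from them (here via a softmax, which upweights harder samples), and optimize the weighted mean. This is a simplified stand-in, not necessarily DataFlex's exact scheme:

```python
import math

def loss_reweighted_mean(per_sample_losses, temperature=1.0):
    """Illustrative loss-based reweighting (simplified): weight samples
    by a softmax over their losses, so higher-loss samples contribute
    more, then return the weighted mean loss and the weights."""
    scaled = [l / temperature for l in per_sample_losses]
    m = max(scaled)                              # stabilize the softmax
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    weights = [e / z for e in exps]
    weighted_loss = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_loss, weights

loss, w = loss_reweighted_mean([0.5, 1.0, 2.0])
# Higher-loss samples receive larger weights.
```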
Configuration: Converting a standard LLaMA-Factory config to use DataFlex only requires adding a dataflex section. For example:
```yaml
### dataflex
train_type: dynamic_select   # or dynamic_mix, dynamic_weight
component_name: less         # or doremi, etc.
warmup_step: 100
update_step: 50
update_times: 30
```
3.4 System Efficiency and Scalability
DataFlex addresses key challenges for scalability:
- Gradient Acquisition: Implements a distributed gradient collection mechanism compatible with DeepSpeed ZeRO-3, using interfaces like safe_get_full_grad to reconstruct full gradients from partitioned shards.
- Overhead Reduction: Executes gradient-based operations at configurable intervals, caches decisions for reuse, and supports lightweight proxy signals.
- Inherited Infrastructure: Builds upon LLaMA-Factory's support for mixed-precision training, distributed data parallelism, and DeepSpeed.
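Conceptually, reconstructing a full gradient under ZeRO-3 amounts to gathering each rank's contiguous shard and concatenating them (dropping any alignment padding); DeepSpeed's safe_get_full_grad performs the real collective per parameter. A toy, single-process sketch of the idea:

```python
def reconstruct_full_grad(shards, pad_len=0):
    """Toy sketch of ZeRO-3-style gradient reconstruction: each rank
    holds one contiguous shard of a parameter's gradient; the full
    gradient is their concatenation, minus trailing padding. (In real
    training, DeepSpeed's safe_get_full_grad does the all-gather.)"""
    flat = [g for shard in shards for g in shard]
    return flat[: len(flat) - pad_len] if pad_len else flat

# Two "ranks" each hold half of a 7-element gradient padded to length 8.
full = reconstruct_full_grad([[1, 2, 3, 4], [5, 6, 7, 0]], pad_len=1)
```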
Empirical Validation / Results
4.2 Data Selection and Reweighting
Setup: Fine-tuning on a 100k-sample subset of Open-Hermes-2.5, evaluated on MMLU. Models: Mistral-7B and Llama-3.2-3B, fine-tuned with LoRA. Online methods follow the schedule warmup_step=100, then update every update_step=50 steps for update_times=30 updates.
Key Findings (Figure 3):
- Dynamic methods consistently outperform the static full-data baseline on both models.
- Mistral-7B: LESS achieved the best final accuracy (0.452), outperforming the static baseline (0.394) by 5.8 percentage points. Reweight, TSDS, and NEAR also showed strong performance.
- Llama-3.2-3B: The performance gap was even larger. Reweight achieved the best accuracy (0.453), followed by LESS (0.450). All online methods exceeded 0.427, significantly above the static baseline (0.319). Offline methods (NEAR, TSDS) performed worse on this smaller model.
4.3 Data Mixture
Setup: Pretraining Qwen2.5-1.5B from scratch on SlimPajama subsets (6B and 30B tokens). Compared DoReMi (offline), ODM (online), and a static baseline with default domain proportions.
Key Findings (Table 2): Table 2 compares data mixing strategies by MMLU accuracy (Acc, ↑) and overall perplexity (PPL, ↓).

| Method | MMLU Acc ↑ | Overall PPL ↓ |
|---|---|---|
| SlimPajama-6B | | |
| Baseline | 25.27 | 4.217 |
| DoReMi | 25.84 | 4.134 |
| ODM | 26.04 | 4.244 |
| SlimPajama-30B | | |
| Baseline | 25.51 | 3.584 |
| DoReMi | 25.97 | 3.562 |
| ODM | 25.63 | 3.429 |
- Both DoReMi and ODM improved MMLU accuracy over the static baseline at both scales. DoReMi also lowered overall perplexity at both scales, while ODM lowered it at the 30B scale (its 6B overall perplexity was slightly above the baseline).
- Complementary Strengths: DoReMi tended to improve perplexity on high-resource domains (CC, C4), while ODM more aggressively improved specialized domains (StackExchange, ArXiv, Book).
- At the 30B scale, ODM achieved the best overall perplexity (3.429) and best perplexity on 5 of 7 domains.
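The DoReMi-style adjustment behind such results can be sketched as a multiplicative-weights step: domains with higher excess loss (proxy loss minus reference loss) have their sampling probability multiplied up, after which the weights are renormalized and optionally smoothed toward uniform. The function and parameter names below are illustrative, and the scheme is simplified relative to the published algorithm:

```python
import math

def update_domain_weights(weights, excess_losses, lr=0.1, smoothing=0.0):
    """Simplified DoReMi-style multiplicative-weights update (sketch):
    upweight domains with higher excess loss, renormalize to a
    probability distribution, then mix with uniform for stability."""
    raw = [w * math.exp(lr * e) for w, e in zip(weights, excess_losses)]
    z = sum(raw)
    norm = [r / z for r in raw]
    u = 1.0 / len(norm)
    return [(1 - smoothing) * p + smoothing * u for p in norm]

w = update_domain_weights([0.5, 0.3, 0.2], excess_losses=[0.0, 0.5, 1.0])
# The highest-excess-loss domain gains probability mass.
```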
4.4 Efficiency of DataFlex
Online Selection (LESS) Efficiency (Table 3): Table 3 compares efficiency and accuracy between DataFlex and the original LESS implementation.

| Sample Ratio | Method | Accuracy (%) | Training Time (s) | Time Reduction ↓ |
|---|---|---|---|---|
| 0.05 | LESS | 34.91 | 1,640 | – |
| 0.05 | DataFlex | 38.35 | 1,579 | 3.72% |
| 0.1 | LESS | 37.97 | 3,735 | – |
| 0.1 | DataFlex | 40.25 | 3,573 | 4.34% |
| 0.5 | LESS | 41.57 | 14,398 | – |
| 0.5 | DataFlex | 40.93 | 13,377 | 7.09% |
| 1.0 | LESS | 40.38 | 30,239 | – |
| 1.0 | DataFlex | 42.37 | 28,734 | 4.98% |
| 1.0 | DataFlex (8 GPUs) | 43.01 | 12,965 | 57.13%* |
*Compared to the single-GPU LESS run at ratio 1.0 (30,239 s → 12,965 s).
- DataFlex consistently reduced runtime while maintaining or improving accuracy compared to the original LESS implementation.
- DataFlex demonstrated excellent parallel scalability, reducing runtime by 57.13% when using 8 GPUs.
Offline Selection (TSDS) Efficiency (Figure 4):
- DataFlex's reimplementation showed a consistent 1–3.5% runtime improvement across varying training and validation set sizes, making it more suitable for iterative experiments.
Theoretical and Practical Implications
- Reproducible Research Infrastructure: DataFlex provides a unified platform for systematic, fair comparison of data-centric algorithms, addressing the fragmentation in current research.
- Practical Training System: Its drop-in design and compatibility with LLaMA-Factory significantly lower the engineering barrier for practitioners to adopt advanced data-centric methods in real-world LLM training pipelines.
- Validation of Data-Centric Paradigms: The consistent outperformance of dynamic methods over static baselines across selection and mixture tasks provides strong empirical evidence for the value of treating data as an optimization variable.
- Foundation for Future Work: The modular, extensible architecture serves as a flexible foundation for developing and integrating new data-model interaction algorithms.
Conclusion
DataFlex is presented as a unified data-centric dynamic training framework that treats data as a first-class optimization object. By unifying data selection, mixture, and reweighting paradigms under a common, modular architecture built on LLaMA-Factory, it reduces implementation fragmentation and improves reproducibility.
The framework's key strength is its system design, which replaces the training layer with extensible abstractions rather than introducing an isolated pipeline. Experimental results demonstrate that methods in DataFlex consistently outperform static training and achieve runtime improvements over original codebases.
The authors hope DataFlex will serve as a useful infrastructure for reproducible research and facilitate future studies on more adaptive, efficient, and principled data-model interactions for LLM training.