Summary (Overview)
- Core Finding: For high-resource non-English languages like German, aggressive quality filtering of web corpora followed by multi-epoch training on the resulting high-signal subset consistently outperforms single-pass training on larger, less filtered datasets, even under a fixed token budget.
- Key Methodology: A hierarchical filtering framework was applied to 500M German web documents (FineWeb-2), creating subsets based on Coherence, Information Value, and Educational Quality, culminating in a Dense Core (intersection of all three).
- Empirical Results: Models trained for multiple epochs on the 28B-token Dense Core (up to 7.2 epochs) outperformed models trained on a single pass of 100B+ tokens of diverse data. This "density advantage" held across model scales (350M and 1B parameters) and persisted through instruction tuning.
- Community Contribution: The paper releases:
- BOLDT models: German language models achieving state-of-the-art results for their size despite being trained on 10-360x fewer tokens than comparable models.
- Cleaned benchmarks: Corrected German translations of key evaluation datasets (ARC-Challenge, HellaSwag, LAMBADA, OpenBookQA) to address translation artifacts.
Introduction and Theoretical Foundation
The prevailing scaling paradigm for Large Language Models (LLMs) emphasizes more parameters, compute, and data. However, recent work on English corpora has challenged this view, showing that filtering massive web crawls for high-quality content significantly improves training efficiency (a "quality-first" approach).
For high-resource non-English languages (e.g., German, French, Japanese), which possess substantial but not trillion-token web corpora, this creates a strategic dilemma:
- Prioritize Diversity: Apply light filters to maintain a large token pool for single-pass training.
- Prioritize Semantic Density (Quality): Apply strict filters to create a smaller, high-signal subset and repeat it over multiple epochs.
This paper investigates this trade-off using German as a case study. The core research questions are:
- Does repeating high-quality data outperform maximizing unique data volume under a fixed compute budget?
- Where is the point of diminishing returns for multi-epoch training on filtered data?
- Does the advantage of high-density pre-training translate to improved instruction-tuning performance?
The theoretical basis is that for data-constrained non-English settings, maximizing the expected training signal per token (semantic density) may be more efficient than maximizing token volume, even if it necessitates data repetition.
Methodology
1. Data and Hierarchical Filtering
The foundation is the German split of FineWeb-2 (FW2-DE), containing ~500M documents. A three-tier hierarchical filtering framework is applied using document-level classifiers:
- Coherence: Targets basic linguistic/structural integrity to remove "word-salad" and fragments.
- Information Value: Selects fact-bearing, content-rich documents (e.g., reports, news).
- Educational Quality: The most restrictive tier, prioritizing textbook-like clarity and pedagogical value (modeled after FineWeb-Edu).
The intersection of all three filters defines the Dense Core, representing the upper bound of semantic density.
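As a rough illustration of this tiered selection, the sketch below composes three document-level quality scores and takes their intersection. The classifier names, score ranges, and thresholds are assumptions for illustration only, not the paper's actual models or cut-offs.

```python
from dataclasses import dataclass

@dataclass
class ScoredDoc:
    """A web document with scores from three hypothetical quality classifiers."""
    text: str
    coherence: float    # structural/linguistic integrity
    info_value: float   # fact-bearing, content-rich
    edu_quality: float  # textbook-like, pedagogical

# Illustrative thresholds; the paper's actual cut-offs are not reproduced here.
THRESHOLDS = {"coherence": 0.5, "info_value": 0.5, "edu_quality": 0.5}

def passed_tiers(doc: ScoredDoc) -> set[str]:
    """Return the set of filter tiers a document passes."""
    tiers = set()
    if doc.coherence >= THRESHOLDS["coherence"]:
        tiers.add("coherence")
    if doc.info_value >= THRESHOLDS["info_value"]:
        tiers.add("info_value")
    if doc.edu_quality >= THRESHOLDS["edu_quality"]:
        tiers.add("edu_quality")
    return tiers

def dense_core(docs: list[ScoredDoc]) -> list[ScoredDoc]:
    """Dense Core = documents passing all three filters (the intersection)."""
    return [d for d in docs if passed_tiers(d) == {"coherence", "info_value", "edu_quality"}]
```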
Table 1: Dataset statistics for the German split of FineWeb-2 (FW2-DE) and derived subsets.
| Subset | N Docs (Millions) | Yield (%) | Token Count | Doc Length (mean ± std) | Tokenizer Fertility (Train/Test/Benchmarks) |
|---|---|---|---|---|---|
| FW2-DE (Full Pool) | 496.0 | 100.0 | - | - | - |
| Hierarchical Tiers | |||||
| RANDOM (Baseline) | 128.9* | 26.0* | 100B* | 786 ± 1725 | 1.49 / 1.48 / 1.57 |
| COHERENCE | 300.6 (138.3*) | 60.6 (27.8*) | 100B* | 730 ± 1540 | 1.48 / 1.50 / 1.56 |
| INFORMATION VALUE | 43.5 | 8.8 | 65B | 1494 ± 2561 | 1.36 / 1.42 / 1.42 |
| EDUCATIONAL QUALITY | 30.2 | 6.1 | 33B | 1087 ± 2103 | 1.33 / 1.40 / 1.38 |
| Target Core | |||||
| DENSE CORE (Intersection) | 24.5 | 5.1 | 28B | 1150 ± 2193 | 1.32 / 1.40 / 1.38 |
| External Baselines | |||||
| FW HQ (Messmer et al., 2025) | 43.2 | 8.7 | 35B | 823 ± 1895 | 1.33 / 1.40 / 1.38 |
| AA High (Burns et al., 2025) | 70.4 | 14.2 | 21B | 296 ± 368 | 1.35 / 1.40 / 1.43 |
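Tokenizer fertility, as reported in the last column of Table 1, is commonly defined as the average number of subword tokens produced per whitespace-separated word (lower means the tokenizer represents German text more compactly). A minimal sketch under that assumption; the tokenizer name in the comment is illustrative, not necessarily the one used in the paper.

```python
from transformers import AutoTokenizer

def fertility(texts: list[str], tokenizer) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

# Example usage (model name is a placeholder):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# print(fertility(["Der schnelle braune Fuchs springt über den faulen Hund."], tok))
```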
2. Model Training & Evaluation
- Architecture: Decoder-only transformer following Llama, with primary sizes of 350M and 1B non-embedding parameters.
- Training: AdamW optimizer with weight decay and a cosine learning rate schedule; batch size of 0.5M tokens (a hedged configuration sketch follows this list).
- Evaluation Suite: A cleaned and modernized German benchmark suite including:
- Factual Knowledge: Global MMLU (German subset)
- Reasoning: ARC-Easy, ARC-Challenge, OpenBookQA
- Commonsense & Context: HellaSwag, LAMBADA
- Instruction Tuning Evaluation: Fine-tuned on German SmolTalk2, evaluated via LLM-as-a-judge (Llama-3.3-70B-Instruct) for correctness and helpfulness.
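For reference, the training setup above can be collected into a single configuration object. The sketch below is hedged: the AdamW betas, weight decay value, and peak learning rate are placeholder defaults, since the exact values are not reproduced in this summary.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Pre-training setup as summarized above; values marked (*) are
    illustrative placeholders, not confirmed by the paper."""
    n_params: str = "350M"               # primary sizes: 350M and 1B non-embedding params
    optimizer: str = "AdamW"
    betas: tuple = (0.9, 0.95)           # (*) assumed typical AdamW betas
    weight_decay: float = 0.1            # (*) assumed typical value
    lr_schedule: str = "cosine"
    peak_lr: float = 3e-4                # (*) assumed typical peak LR for this scale
    batch_size_tokens: int = 500_000     # 0.5M tokens per batch (from the summary)
    token_budget: int = 100_000_000_000  # 100B-token budget (Experiment I)
```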
Empirical Validation / Results
Experiment I: Token Allocation Strategies (100B Budget)
350M models were trained on a fixed 100B token budget using different data strategies.
Table 3: Benchmark results for 350M models (100B token budget). Tokens column shows unique tokens; the parenthesized value is the number of epochs needed to reach 100B (a short arithmetic check follows the table).
| Subset | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Baseline | ||||||||
| RANDOM | 100B (1.0x) | 27.13 | 26.15 | 41.10 | 37.18 | 33.52 | 41.01 | 34.35 |
| Single Filters | ||||||||
| COHERENCE | 100B (1.0x) | 27.06 | 27.65 | 42.74 | 40.45 | 38.33 | 42.63 | 36.48 |
| INFORMATION VALUE | 65B (1.5x) | 28.29 | 30.46 | 46.20 | 40.71 | 38.52 | 44.04 | 38.04 |
| EDUCATIONAL QUALITY | 33B (3.0x) | 28.64 | 31.49 | 50.91 | 40.57 | 36.49 | 43.64 | 38.62 |
| Filter Combinations | ||||||||
| DENSE-CORE | 28B (3.6x) | 28.97 | 31.40 | 50.55 | 41.10 | 37.55 | 45.86 | 39.24 |
| MKC (External) | 35B (2.9x) | 28.00 | 27.37 | 46.37 | 39.49 | 40.43 | 42.02 | 37.28 |
| AA HIGH (External) | 21B (4.8x) | 26.55 | 25.31 | 39.83 | 37.39 | 29.75 | 40.00 | 33.14 |
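The parenthesized epoch counts follow directly from dividing the fixed 100B-token budget by each subset's unique-token count; a quick check:

```python
budget = 100e9  # fixed training budget in tokens
unique_tokens = {
    "RANDOM": 100e9, "COHERENCE": 100e9, "INFORMATION VALUE": 65e9,
    "EDUCATIONAL QUALITY": 33e9, "DENSE-CORE": 28e9, "MKC": 35e9, "AA HIGH": 21e9,
}
for name, n in unique_tokens.items():
    print(f"{name}: {budget / n:.1f} epochs")  # e.g. DENSE-CORE -> 3.6
```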
Key Findings:
- Dense Core (3.6 epochs) outperforms the Random baseline by +4.89 avg. points.
- Performance improves with filter strictness (Information Value > Coherence).
- The advantage is consistent throughout training, not just at convergence.
- External dataset AA High performed poorly despite high repetition (4.8x), indicating repetition only benefits sufficiently high-quality data.
- The Dense Core model also outperforms external reference models (LLäMmlein-120M, Gemma-3-270M, Qwen-3-0.6B) trained on 10-360x more tokens.
Experiment II: Parameter Scaling (350M → 1B)
Scaling to 1B parameters with a 100B token budget.
Finding: The performance gap widens with scale. The 1B Dense Core model leads its Random baseline by +5.14 avg. points (vs. +4.89 for 350M).
Experiment III: Exploring Repetition Limits (200B Budget)
Extending training to 200B tokens to find diminishing returns.
Table 7: Benchmark results of 350M models trained on 200B tokens.
| Subset | Token Count | Avg. Score |
|---|---|---|
| RANDOM | 200B (1.0x) | 34.75 |
| DENSE-CORE | 28B (7.2x) | 40.16 |
| PHASED (Curriculum) | 128B (1.6x) | 39.07 |
Key Findings:
- Benefits of multi-epoch training on Dense Core persist beyond 7 epochs, contradicting earlier findings of saturation at ~4 epochs.
- Dense Core (7.2 epochs) maintains a substantial lead over Random, even though the Random baseline is still seeing fresh, unrepeated data at this budget.
- For a 1B model, extending Dense Core training from 100B to 200B tokens yielded a +2.08 avg. point gain, more than double the 350M model's gain, suggesting larger models can extract more value from repeated high-quality data.
Experiment IV: Generalization to Instruction Tuning
Models were instruction-tuned and evaluated for correctness.
Table 4: LLM-as-a-Judge results for instruction-tuned models.
| Subset | Epochs | Score (1-10) | Correct (of 1,000) |
|---|---|---|---|
| 350M @ 100B tokens | |||
| RANDOM | 1.0x | 5.25 | 178 |
| DENSE-CORE | 3.6x | 5.74 | 253 |
| 1B @ 100B tokens | |||
| RANDOM 1B | 1.0x | 5.87 | 293 |
| DENSE-CORE 1B | 3.6x | 6.13 | 338 |
| 350M @ 200B tokens | |||
| DENSE-CORE | 7.2x | 5.96 | 278 |
Finding: The "density advantage" persists through instruction tuning. The 350M Dense Core model (7.2 epochs) achieves 278 correct answers, nearly matching the 1B Random model (293 correct) despite having 3x fewer parameters.
Released Model Performance
The released BOLDT models, trained on significantly fewer tokens, are competitive with or outperform larger multilingual models.
Table 5: Benchmark results of BOLDT models vs. reference models.
| Model | Tokens Trained | Avg. Score |
|---|---|---|
| BOLDT-DC-1B | 200B | 44.05 |
| BOLDT-1B | 230B | 44.52 |
| LLäMmlein-1B | 1T | 40.78 |
| Gemma-3-1B | 2T* | 39.77 |
| Qwen3-1.7B-Base | 36T* | 44.89 |
Theoretical and Practical Implications
- Challenging the Data Volume Paradigm: For non-English high-resource languages, semantic concentration through aggressive quality filtering is a more viable path to efficient language modeling than simply maximizing unique data volume. The "more is better" dogma is context-dependent.
- Re-evaluating Repetition Risks: The cautious approach to multi-epoch training may be unwarranted when data is optimized for knowledge density. High-quality data can be repeated for many epochs (7+) without performance saturation, especially for larger models.
- Practical Guidance for Practitioners: The study provides a clear recipe: define and apply strict, hierarchical quality filters to create a high-signal "core" dataset, and train on it for multiple epochs. This is more effective than hybrid curricula that start with low-quality data.
- Importance of Data Cleaning: The work highlights the critical need for cleaned, task-preserving benchmarks in non-English NLP to ensure reliable evaluation.
- Sample Efficiency: The released BOLDT models demonstrate that state-of-the-art performance for a given model size can be achieved with orders of magnitude less pre-training data, reducing computational costs and barriers to entry for non-English LLM development.
Conclusion
This work demonstrates that for high-resource non-English languages like German, aggressive quality filtering and subsequent multi-epoch training on the resulting high-density core is a superior strategy to single-pass training on larger, noisier datasets, even when total available text is limited.
Main Takeaways:
- Quality filtering remains beneficial despite smaller non-English data pools.
- The benefits of repeating high-quality data persist well beyond previously assumed epoch limits and scale with model size.
- The advantage of high-density pre-training translates directly to improved instruction-following capabilities.
- Careful filtering based on strong annotator models offers a practical path to sample-efficient pre-training in data-constrained, non-English settings.
Future Directions & Limitations:
- Language Scope: Findings need validation across other language families and lower-resource languages.
- Scale: Experiments are limited to ≤1B parameters and ≤200B tokens; trade-offs may differ at industry-scale.
- Architecture: Focus is on dense transformers; Mixture-of-Experts models may behave differently.
- Safety & Bias: The study does not assess toxicity, demographic biases, or safety implications of aggressive filtering and repeated training.