Summary (Overview)
- Core Finding: For high-resource non-English languages like German, aggressive quality filtering of web corpora followed by multi-epoch training on the resulting high-signal subset consistently outperforms single-pass training on larger, less filtered datasets, even under a fixed token budget.
- Key Methodology: A hierarchical filtering framework was applied to 500M German web documents (FineWeb-2), creating subsets based on Coherence, Information Value, and Educational Quality, culminating in a Dense Core (intersection of all three).
- Empirical Results: Models trained for multiple epochs on the 28B-token Dense Core (up to 7.2 epochs) outperformed models trained on a single pass of 100B+ tokens of diverse data. This "density advantage" held across model scales (350M and 1B parameters) and persisted through instruction tuning.
- Community Contribution: The paper releases:
- BOLDT models: German language models achieving state-of-the-art results for their size despite being trained on 10-360x fewer tokens than comparable models.
- Cleaned benchmarks: Corrected German translations of key evaluation datasets (ARC-Challenge, HellaSwag, LAMBADA, OpenBookQA) to address translation artifacts.
Introduction and Theoretical Foundation
The prevailing scaling paradigm for Large Language Models (LLMs) emphasizes more parameters, compute, and data. However, recent work on English corpora has challenged this view, showing that filtering massive web crawls for high-quality content significantly improves training efficiency (a "quality-first" approach).
For high-resource non-English languages (e.g., German, French, Japanese), which possess substantial but not trillion-token web corpora, this creates a strategic dilemma:
- Prioritize Diversity: Apply light filters to maintain a large token pool for single-pass training.
- Prioritize Semantic Density (Quality): Apply strict filters to create a smaller, high-signal subset and repeat it over multiple epochs.
This paper investigates this trade-off using German as a case study. The core research questions are:
- Does repeating high-quality data outperform maximizing unique data volume under a fixed compute budget?
- Where is the point of diminishing returns for multi-epoch training on filtered data?
- Does the advantage of high-density pre-training translate to improved instruction-tuning performance?
The theoretical basis is that for data-constrained non-English settings, maximizing the expected training signal per token (semantic density) may be more efficient than maximizing token volume, even if it necessitates data repetition.
Methodology
1. Data and Hierarchical Filtering
The foundation is the German split of FineWeb-2 (FW2-DE), containing ~500M documents. A three-tier hierarchical filtering framework is applied using document-level classifiers:
- Coherence: Targets basic linguistic/structural integrity to remove "word-salad" and fragments.
- Information Value: Selects fact-bearing, content-rich documents (e.g., reports, news).
- Educational Quality: The most restrictive tier, prioritizing textbook-like clarity and pedagogical value (modeled after FineWeb-Edu).
The intersection of all three filters defines the Dense Core, representing the upper bound of semantic density.
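As a rough illustration of this tiered selection, the sketch below composes three document-level quality scores and takes their intersection. The classifier names, score ranges, and thresholds are assumptions for illustration only, not the paper's actual models or cut-offs.

```python
from dataclasses import dataclass

@dataclass
class ScoredDoc:
    """A web document with scores from three hypothetical quality classifiers."""
    text: str
    coherence: float    # structural/linguistic integrity
    info_value: float   # fact-bearing, content-rich
    edu_quality: float  # textbook-like, pedagogical

# Illustrative thresholds; the paper's actual cut-offs are not reproduced here.
THRESHOLDS = {"coherence": 0.5, "info_value": 0.5, "edu_quality": 0.5}

def passed_tiers(doc: ScoredDoc) -> set[str]:
    """Return the set of filter tiers a document passes."""
    tiers = set()
    if doc.coherence >= THRESHOLDS["coherence"]:
        tiers.add("coherence")
    if doc.info_value >= THRESHOLDS["info_value"]:
        tiers.add("info_value")
    if doc.edu_quality >= THRESHOLDS["edu_quality"]:
        tiers.add("edu_quality")
    return tiers

def dense_core(docs: list[ScoredDoc]) -> list[ScoredDoc]:
    """Dense Core = documents passing all three filters (the intersection)."""
    return [d for d in docs if passed_tiers(d) == {"coherence", "info_value", "edu_quality"}]
```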
Table 1: Dataset statistics for the German split of FineWeb-2 (FW2-DE) and derived subsets.
| Subset | N Docs (Millions) | Yield (%) | Token Count | Doc Length (mean ± std) | Tokenizer Fertility (Train/Test/Benchmarks) |
|---|---|---|---|---|---|
| FW2-DE (Full Pool) | 496.0 | 100.0 | - | - | - |
| Hierarchical Tiers | |||||
| RANDOM (Baseline) | 128.9* | 26.0* | 100B* | 786 ± 1725 | 1.49 / 1.48 / 1.57 |
| COHERENCE | 300.6 (138.3*) | 60.6 (27.8*) | 100B* | 730 ± 1540 | 1.48 / 1.50 / 1.56 |
| INFORMATION VALUE | 43.5 | 8.8 | 65B | 1494 ± 2561 | 1.36 / 1.42 / 1.42 |
| EDUCATIONAL QUALITY | 30.2 | 6.1 | 33B | 1087 ± 2103 | 1.33 / 1.40 / 1.38 |
| Target Core | |||||
| DENSE CORE (Intersection) | 24.5 | 5.1 | 28B | 1150 ± 2193 | 1.32 / 1.40 / 1.38 |
| External Baselines | |||||
| FW HQ (Messmer et al., 2025) | 43.2 | 8.7 | 35B | 823 ± 1895 | 1.33 / 1.40 / 1.38 |
| AA High (Burns et al., 2025) | 70.4 | 14.2 | 21B | 296 ± 368 | 1.35 / 1.40 / 1.43 |
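Tokenizer fertility, as reported in the last column of Table 1, is commonly defined as the average number of subword tokens produced per whitespace-separated word (lower means the tokenizer represents German text more compactly). A minimal sketch under that assumption; the tokenizer name in the comment is illustrative, not necessarily the one used in the paper.

```python
from transformers import AutoTokenizer

def fertility(texts: list[str], tokenizer) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

# Example usage (model name is a placeholder):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# print(fertility(["Der schnelle braune Fuchs springt über den faulen Hund."], tok))
```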
2. Model Training & Evaluation
- Architecture: Decoder-only transformer following Llama, with primary sizes of 350M and 1B non-embedding parameters.
- Training: AdamW optimizer with weight decay and a cosine learning rate schedule; batch size of 0.5M tokens (a hedged configuration sketch follows this list).
- Evaluation Suite: A cleaned and modernized German benchmark suite including:
- Factual Knowledge: Global MMLU (German subset)
- Reasoning: ARC-Easy, ARC-Challenge, OpenBookQA
- Commonsense & Context: HellaSwag, LAMBADA
- Instruction Tuning Evaluation: Fine-tuned on German SmolTalk2, evaluated via LLM-as-a-judge (Llama-3.3-70B-Instruct) for correctness and helpfulness.
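For reference, the training setup above can be collected into a single configuration object. The sketch below is hedged: the AdamW betas, weight decay value, and peak learning rate are placeholder defaults, since the exact values are not reproduced in this summary.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Pre-training setup as summarized above; values marked (*) are
    illustrative placeholders, not confirmed by the paper."""
    n_params: str = "350M"               # primary sizes: 350M and 1B non-embedding params
    optimizer: str = "AdamW"
    betas: tuple = (0.9, 0.95)           # (*) assumed typical AdamW betas
    weight_decay: float = 0.1            # (*) assumed typical value
    lr_schedule: str = "cosine"
    peak_lr: float = 3e-4                # (*) assumed typical peak LR for this scale
    batch_size_tokens: int = 500_000     # 0.5M tokens per batch (from the summary)
    token_budget: int = 100_000_000_000  # 100B-token budget (Experiment I)
```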
Empirical Validation / Results
Experiment I: Token Allocation Strategies (100B Budget)
350M models were trained on a fixed 100B token budget using different data strategies.
Table 3: Benchmark results for 350M models (100B token budget). Tokens column shows unique tokens; the parenthesized value is the number of epochs needed to reach 100B (a short arithmetic check follows the table).
| Subset | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Baseline | ||||||||
| RANDOM | 100B (1.0x) | 27.13 | 26.15 | 41.10 | 37.18 | 33.52 | 41.01 | 34.35 |
| Single Filters | ||||||||
| COHERENCE | 100B (1.0x) | 27.06 | 27.65 | 42.74 | 40.45 | 38.33 | 42.63 | 36.48 |
| INFORMATION VALUE | 65B (1.5x) | 28.29 | 30.46 | 46.20 | 40.71 | 38.52 | 44.04 | 38.04 |
| EDUCATIONAL QUALITY | 33B (3.0x) | 28.64 | 31.49 | 50.91 | 40.57 | 36.49 | 43.64 | 38.62 |
| Filter Combinations | ||||||||
| DENSE-CORE | 28B (3.6x) | 28.97 | 31.40 | 50.55 | 41.10 | 37.55 | 45.86 | 39.24 |
| MKC (External) | 35B (2.9x) | 28.00 | 27.37 | 46.37 | 39.49 | 40.43 | 42.02 | 37.28 |
| AA HIGH (External) | 21B (4.8x) | 26.55 | 25.31 | 39.83 | 37.39 | 29.75 | 40.00 | 33.14 |
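The parenthesized epoch counts follow directly from dividing the fixed 100B-token budget by each subset's unique-token count; a quick check:

```python
budget = 100e9  # fixed training budget in tokens
unique_tokens = {
    "RANDOM": 100e9, "COHERENCE": 100e9, "INFORMATION VALUE": 65e9,
    "EDUCATIONAL QUALITY": 33e9, "DENSE-CORE": 28e9, "MKC": 35e9, "AA HIGH": 21e9,
}
for name, n in unique_tokens.items():
    print(f"{name}: {budget / n:.1f} epochs")  # e.g. DENSE-CORE -> 3.6
```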
Key Findings:
- Dense Core (3.6 epochs) outperforms the Random baseline by +4.89 avg. points.
- Performance improves with filter strictness (Information Value > Coherence).
- The advantage is consistent throughout training, not just at convergence.
- External dataset AA High performed poorly despite high repetition (4.8x), indicating repetition only benefits sufficiently high-quality data.
- The Dense Core model also outperforms external reference models (LLäMmlein-120M, Gemma-3-270M, Qwen-3-0.6B) trained on 10-360x more tokens.
Experiment II: Parameter Scaling (350M → 1B)
Scaling to 1B parameters with a 100B token budget.
Finding: The performance gap widens with scale. The 1B Dense Core model leads its Random baseline by +5.14 avg. points (vs. +4.89 for 350M).
Experiment III: Exploring Repetition Limits (200B Budget)
Extending training to 200B tokens to find diminishing returns.
Table 7: Benchmark results of 350M models trained on 200B tokens.
| Subset | Token Count | Avg. Score |
|---|---|---|
| RANDOM | 200B (1.0x) | 34.75 |
| DENSE-CORE | 28B (7.2x) | 40.16 |
| PHASED (Curriculum) | 128B (1.6x) | 39.07 |
Key Findings:
- Benefits of multi-epoch training on Dense Core persist beyond 7 epochs, contradicting earlier findings of saturation at ~4 epochs.
- Dense Core (7.2 epochs) maintains a substantial lead over Random, even though the Random baseline is still seeing fresh, unrepeated data at this budget.
- For a 1B model, extending Dense Core training from 100B to 200B tokens yielded a +2.08 avg. point gain, more than double the 350M model's gain, suggesting larger models can extract more value from repeated high-quality data.
Experiment IV: Generalization to Instruction Tuning
Models were instruction-tuned and evaluated for correctness.
Table 4: LLM-as-a-Judge results for instruction-tuned models.
| Subset | Epochs | Score (1-10) | Correct (of 1,000) |
|---|---|---|---|
| 350M @ 100B tokens | |||
| RANDOM | 1.0x | 5.25 | 178 |
| DENSE-CORE | 3.6x | 5.74 | 253 |
| 1B @ 100B tokens | |||
| RANDOM 1B | 1.0x | 5.87 | 293 |
| DENSE-CORE 1B | 3.6x | 6.13 | 338 |
| 350M @ 200B tokens | |||
| DENSE-CORE | 7.2x | 5.96 | 278 |
Finding: The "density advantage" persists through instruction tuning. The 350M Dense Core model (7.2 epochs) achieves 278 correct answers, nearly matching the 1B Random model (293 correct) despite having 3x fewer parameters.
Released Model Performance
The released BOLDT models, trained on significantly fewer tokens, are competitive with or outperform larger multilingual models.
Table 5: Benchmark results of BOLDT models vs. reference models.
| Model | Tokens Trained | Avg. Score |
|---|---|---|
| BOLDT-DC-1B | 200B | 44.05 |
| BOLDT-1B | 230B | 44.52 |
| LLäMmlein-1B | 1T | 40.78 |
| Gemma-3-1B | 2T* | 39.77 |
| Qwen3-1.7B-Base | 36T* | 44.89 |
Theoretical and Practical Implications
- Challenging the Data Volume Paradigm: For non-English high-resource languages, semantic concentration through aggressive quality filtering is a more viable path to efficient language modeling than simply maximizing unique data volume. The "more is better" dogma is context-dependent.
- Re-evaluating Repetition Risks: The cautious approach to multi-epoch training may be unwarranted when data is optimized for knowledge density. High-quality data can be repeated for many epochs (7+) without performance saturation, especially for larger models.
- Practical Guidance for Practitioners: The study provides a clear recipe: define and apply strict, hierarchical quality filters to create a high-signal "core" dataset, and train on it for multiple epochs. This is more effective than hybrid curricula that start with low-quality data.
- Importance of Data Cleaning: The work highlights the critical need for cleaned, task-preserving benchmarks in non-English NLP to ensure reliable evaluation.
- Sample Efficiency: The released BOLDT models demonstrate that state-of-the-art performance for a given model size can be achieved with orders of magnitude less pre-training data, reducing computational costs and barriers to entry for non-English LLM development.
Conclusion
This work demonstrates that for high-resource non-English languages like German, aggressive quality filtering and subsequent multi-epoch training on the resulting high-density core is a superior strategy to single-pass training on larger, noisier datasets, even when total available text is limited.
Main Takeaways:
- Quality filtering remains beneficial despite smaller non-English data pools.
- The benefits of repeating high-quality data persist well beyond previously assumed epoch limits and scale with model size.
- The advantage of high-density pre-training translates directly to improved instruction-following capabilities.
- Careful filtering based on strong annotator models offers a practical path to sample-efficient pre-training in data-constrained, non-English settings.
Future Directions & Limitations:
- Language Scope: Findings need validation across other language families and lower-resource languages.
- Scale: Experiments are limited to ≤1B parameters and ≤200B tokens; trade-offs may differ at industry-scale.
- Architecture: Focus is on dense transformers; Mixture-of-Experts models may behave differently.
- Safety & Bias: The study does not assess toxicity, demographic biases, or safety implications of aggressive filtering and repeated training.