Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Summary (Overview)

  • Introduces "Programming with Data" (ProDa), a novel paradigm that establishes a structural correspondence between test-driven software engineering and data engineering for LLMs. It treats training data as executable source code, model training as compilation, and benchmarks as unit tests, enabling a closed-loop compile-test-debug cycle.
  • Relies on a shared three-level knowledge structure (L1/L2/L3) extracted from raw corpora. This structure serves as a shared specification, linking training data synthesis and benchmark construction, which enables precise traceability from model failures back to specific data deficiencies (concept gaps or reasoning deficits).
  • Demonstrates systematic, traceable improvement across 16 disciplines and multiple model families/scales (Llama, Qwen from 3B to 32B). Iterative data debugging consistently improves domain performance without degrading general capabilities (e.g., a 32B open-source model surpasses GPT-5.4 after one round).
  • Releases ProDaLib, an open-source resource suite containing the extracted knowledge base (227k concepts, 186k relations, 44k chains), a benchmark (ProDa-16 with 16k items), and synthesized training data (160k samples).
  • Shows exceptional sample efficiency: targeted repair data (1K samples) outperforms baseline synthesis methods at ten times the scale (10K samples), showing that diagnostic-driven data beats indiscriminate data scaling.

Introduction and Theoretical Foundation

Reliably encoding specialized human knowledge from unstructured text corpora (textbooks, manuals, research) into LLMs remains a fundamental challenge. Current domain-specific fine-tuning workflows are open-loop: when a model fails, there is no principled mechanism to trace the failure back to a specific deficiency in the training data. The standard remedy is undirected data augmentation, which is computationally expensive and provides no guarantee of improvement.

The paper identifies the root cause: the structural decoupling of training data and evaluation. Benchmarks are often constructed independently from the training data, so failures diagnose symptoms but cannot identify the "pathology" in the training signal.

The core insight is that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the data-engineering lifecycle maps precisely onto the software development lifecycle:

  • Raw Corpus → Requirements Specification
  • Synthesized Training Data → Source Code
  • Model Training → Compilation
  • Fine-Tuned Model → Compiled Binary
  • Benchmark → Unit Test Suite
  • Failure Diagnosis & Data Repair → Debugging

This correspondence, formalized as Programming with Data, provides the missing structural link. A benchmark failure can be traced through the shared knowledge structure to specific deficits in the training data and repaired with targeted patches, converting evaluation from a terminal judgment into an actionable diagnostic.

Methodology

The ProDa framework operationalizes Programming with Data through a pipeline with three core components (Builder, Tester, Debugger) governed by the CORE engineering standards: Contextualized, Organized, Rigorous, Evolving.
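
To make the cycle concrete, here is a minimal Python sketch of the compile-test-debug loop. The `Builder`/`Tester`/`Debugger` interfaces and all method names are assumptions for illustration, not the paper's published API:

```python
# Minimal sketch of ProDa's closed loop (hypothetical interfaces;
# the paper does not publish this API).

def proda_loop(corpus, builder, tester, debugger, train_fn, max_rounds=2):
    """Extract spec -> build tests -> synthesize data -> train ->
    diagnose -> patch -> retrain."""
    K = builder.extract_knowledge(corpus)           # shared L1/L2/L3 specification
    benchmark = tester.build_benchmark(K.chains)    # test-first: built from L3 chains
    data = builder.synthesize(K.concepts, K.relations)  # initial (V1) training data

    model = None
    for _ in range(max_rounds):
        model = train_fn(data)                      # "compilation"
        failures = tester.evaluate(model, benchmark)
        if not failures:
            break                                   # all unit tests pass
        patches = debugger.repair(failures, K)      # trace errors through K
        data = debugger.mix_with_replay(patches, data)  # Evolving standard
    return model
```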

1. Knowledge Structure Extraction (Builder)

A three-level knowledge structure $K = (K_1, K_2, K_3)$ is extracted top-down from curated corpus chunks to guarantee every node is testable (zero orphan rate).

  • L1 Key Concepts ($K_1$): Atomic domain concepts. Each $e_i = (\text{term}_i, \text{type}_i, \text{def}_i, \text{src}_i)$.
  • L2 Knowledge Relations ($K_2$): Typed relational triples. Each $r_j = (e_s, \phi_j, e_o)$ where $\phi_j \in \Phi = \{\text{SPECIALIZATION}, \text{CAUSAL}, \text{PREREQUISITE}, \text{CONTRAST}, \ldots\}$.
  • L3 Reasoning Chains ($K_3$): Multi-step inferential pathways. Each $g_k = (e_1 \xrightarrow{\phi_{1,2}} e_2 \xrightarrow{\phi_{2,3}} \cdots \xrightarrow{\phi_{T-1,T}} e_T)$.

The top-down order enforces a critical reachability property:

$$\forall e \in K_1,\ \exists g \in K_3 \text{ s.t. } e \in \text{nodes}(g); \qquad \forall r \in K_2,\ \exists g \in K_3 \text{ s.t. } r \in \text{edges}(g)$$
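
A minimal sketch of this structure and the reachability check, with field names taken from the definitions above (the concrete schema in ProDaLib may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:                       # L1: e_i = (term, type, def, src)
    term: str
    type: str
    definition: str
    src: str

@dataclass(frozen=True)
class Relation:                      # L2: r_j = (e_s, phi_j, e_o)
    subject: str                     # term of e_s
    phi: str                         # e.g. "CAUSAL", "PREREQUISITE"
    object: str                      # term of e_o

@dataclass(frozen=True)
class Chain:                         # L3: e_1 -phi-> e_2 -> ... -> e_T
    edges: tuple[Relation, ...]      # ordered relations along the chain

def orphan_free(concepts, relations, chains):
    """Reachability property above: every L1 node and every L2 edge
    appears in at least one L3 chain (zero orphan rate)."""
    covered_edges = {e for g in chains for e in g.edges}
    covered_nodes = {t for e in covered_edges for t in (e.subject, e.object)}
    return (all(c.term in covered_nodes for c in concepts)
            and all(r in covered_edges for r in relations))
```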

2. Benchmark and Training Data Synthesis (Builder & Tester)

  • Benchmark Construction (Tester): Built from L3 chains before any training (test-first principle). Each item $b_k = (x_k, A_k, y_k, \mu_k)$ requires multi-step reasoning. The Rigorous standard enforces:
    • Adversarial Distractors: Generated via perturbation operators, e.g. $\text{SUBST}_{\text{ADJ}}$: replace $e_i \in \text{nodes}(g_k)$ with $e'_i \in N(e_i)$ (see the sketch after this list).
    • Instance-Level Orthogonality: $B = f_{\text{bench}}(K_3)$ and $S = f_{\text{syn}}(K_1, K_2)$, so no benchmark item is answerable by verbatim recall of training data.
  • Initial Training Data Synthesis (Builder): Synthesized from L1 concepts and L2 relations in three formats (open-ended QA, multiple-choice, true/false) under the Contextualized and Organized standards.
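
As referenced above, here is a sketch of the $\text{SUBST}_{\text{ADJ}}$ perturbation operator, reusing the `Relation` and `Chain` types from the earlier sketch; the `neighbors` helper and the distractor format are assumptions for illustration:

```python
import random

def neighbors(term, relations):
    """N(e): concepts adjacent to `term` in the L2 relation graph."""
    out = {r.object for r in relations if r.subject == term}
    inc = {r.subject for r in relations if r.object == term}
    return out | inc

def subst_adj_distractor(chain, relations, rng=random):
    """SUBST_ADJ: swap one concept on the gold chain for a graph
    neighbor, yielding a plausible-but-wrong answer option."""
    terms = [chain.edges[0].subject] + [e.object for e in chain.edges]
    candidates = [(t, n) for t in terms
                  for n in neighbors(t, relations) if n not in terms]
    if not candidates:
        return None                  # no adjacent substitute available
    victim, substitute = rng.choice(candidates)
    return [substitute if t == victim else t for t in terms]
```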

3. Failure Diagnosis and Data Repair (Debugger)

After model training and evaluation, the Debugger classifies each error $b_k \in E$:

  • Concept Gap: Model lacks/confuses specific L1/L2 knowledge.
  • Reasoning Deficit: Model possesses knowledge but fails to compose it correctly along an L3 chain.

Targeted patches are generated conditioned on the error type:

$$S^{\text{patch}}_k = \begin{cases} f_{\text{refine}}(\kappa_k, N(\kappa_k), K) & \text{if concept gap at } \kappa_k \in K_1, \\ f_{\text{cot}}(g_k, K) & \text{if reasoning deficit along } g_k \in K_3, \end{cases}$$

where $f_{\text{refine}}$ produces contrastive reinforcement samples and $f_{\text{cot}}$ produces chain-of-thought scaffolding.
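
A sketch of this dispatch, assuming a hypothetical `error` object that carries the diagnosis and treating $f_{\text{refine}}$ and $f_{\text{cot}}$ as opaque generator callbacks (e.g. LLM-backed prompt templates); it reuses the `neighbors` helper from the earlier sketch:

```python
def generate_patch(error, K, f_refine, f_cot):
    """Dispatch on the diagnosed failure mode, mirroring the cases above.
    `error.kind`, `error.concept`, and `error.chain` are illustrative
    field names, not the paper's schema."""
    if error.kind == "concept_gap":            # gap at kappa_k in K1
        kappa = error.concept
        return f_refine(kappa, neighbors(kappa.term, K.relations), K)
    if error.kind == "reasoning_deficit":      # deficit along g_k in K3
        return f_cot(error.chain, K)
    raise ValueError(f"undiagnosed error type: {error.kind!r}")
```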

The Evolving standard is operationalized by mixing new patches with a replay subset of original data under an L2-ID disjoint constraint to prevent catastrophic forgetting, completing the debugging loop.
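
One way this replay mix could look, assuming each sample records the L2 relation IDs it was synthesized from (the `l2_ids` field and the 20% replay fraction are illustrative assumptions, not from the paper):

```python
import random

def mix_with_replay(patches, original_data, replay_frac=0.2, rng=random):
    """Combine targeted patches with a replay subset of the original data.
    The replay subset is kept disjoint from the patches by L2 relation ID,
    so repairs are not diluted while prior knowledge is rehearsed."""
    patched = {i for p in patches for i in p.l2_ids}
    pool = [s for s in original_data if not (set(s.l2_ids) & patched)]
    k = min(len(pool), int(replay_frac * len(original_data)))
    return list(patches) + rng.sample(pool, k)
```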

Empirical Validation / Results

1. Knowledge Structure Extraction

Starting from 117k documents (~15B tokens), quality filtering yielded 48k high-quality chunks (~1.5B tokens). Top-down extraction produced a large, connected knowledge structure:

| Layer | Count | Description |
| --- | --- | --- |
| L3 Reasoning Chains | 43,953 | Multi-step inferential pathways |
| L2 Relational Statements | 186,784 | Atomic subject-predicate-object triples |
| L1 Atomic Concepts | 227,869 | Canonicalized domain terms with definitions |
| Total nodes | 458,622 | Zero orphan rate guaranteed by top-down extraction |

The structure spans 16 disciplines with high connectivity (largest connected component >99.3% in every discipline).

2. Benchmark Validation (ProDa-16)

The ProDa-16 benchmark, derived from L3 chains, shows strong construct validity and discriminative power.

  • External Alignment: High mean Spearman correlation ($\rho = 0.847$) with 11 established benchmarks (e.g., $\rho = 0.943$ with GPQA); a minimal computation sketch follows this list.
  • Internal Discrimination: Forms a smooth "adjacent-score ladder" across models, from 3B open-source models up to frontier closed-source models (which top out around 76% accuracy).
  • Per-Discipline Coverage: All 16 disciplines have median accuracy well above chance (25%) and below ceiling, providing room for measurable gains.
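
The external-alignment figure is a rank correlation over per-model scores; a minimal computation with `scipy` (the score vectors below are placeholders, not the paper's data):

```python
from scipy.stats import spearmanr

# Accuracy (%) of the same five models on ProDa-16 and on one external
# benchmark -- placeholder numbers for illustration only.
proda16_scores  = [41.2, 55.0, 63.8, 70.1, 76.3]
external_scores = [33.5, 48.9, 57.2, 66.0, 71.8]

rho, p_value = spearmanr(proda16_scores, external_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```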

3. Initial Fine-Tuning (V1) and Debugging (V2)

Models were fine-tuned on the initial 160K synthesized samples (V1), then debugged with targeted patches (V2). Results across two model families are shown in Table 1 (Panels B, C, D). Key findings:

Table 1 (Excerpt): Performance gains from debugging (Panel D: Δ = Round 3 − Round 2)

| Model | Physics (001) | Engineering (002) | Medicine (003) | ... | Avg Gain |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | +31.80 | +36.90 | +30.80 | ... | +32.67 |
| Qwen-2.5-3B | +13.10 | +15.60 | +16.20 | ... | +17.45 |
| Qwen-2.5-7B | +6.60 | +4.60 | +5.20 | ... | +4.93 |
| Qwen-2.5-14B | +7.30 | +5.20 | +9.70 | ... | +6.18 |
| Qwen-2.5-32B | +2.20 | +1.90 | +1.10 | ... | +2.30 |
| Qwen-3-4B | +0.60 | +0.90 | +3.60 | ... | +1.67 |
| Qwen-3-8B | +2.90 | +2.50 | +3.00 | ... | +3.72 |
| Qwen-3-14B | -0.60 | -0.20 | +1.60 | ... | +0.77 |
| Qwen-3-32B | +0.50 | +2.80 | +1.80 | ... | +2.17 |

  • Systematic Gains: Every model improved on average after debugging. Gains were inversely related to V1 performance (weaker models improved more).
  • Competitive Performance: After debugging, open-source models surpassed their official Instruct counterparts and competed with frontier closed-source models. Qwen-2.5-32B-V2 reached 78.84% and Qwen-3-32B-V2 reached 79.52%.
  • Preserved General Capabilities: Debugging recovered the modest general-capability tax introduced by initial domain fine-tuning. On MMLU subsets, 7 of 9 V2 models matched or exceeded their Base scores (median Base-to-V2 change: +0.27 points).

4. Sample Efficiency and Comparison with Baselines

ProDa's diagnostic-driven repair exhibits exceptional sample efficiency compared to baseline synthesis methods (Alpaca, EasyDataset, DataFlow).

Table 3: Feature Comparison of Data Generation Methods

| Method | Specification | Traceability | Debugging Loop |
| --- | --- | --- | --- |
| Alpaca | ✗ | ✗ | ✗ |
| EasyDataset | ✗ | ✗ | ✗ |
| DataFlow | ✗ | ✗ | ✗ |
| ProDa (V1) | ✓ | ✓ | ✗ |
| ProDa (V2) | ✓ | ✓ | ✓ |

  • At the 1K scale, ProDa V2 (68.72%) outperformed the peak performance of all baseline methods at any scale.
  • At the 5K scale, ProDa V2 (72.11%) surpassed the strongest baseline (DataFlow Filter, 56.18%) by nearly 16 absolute points.

5. Case Studies

Three case studies (Physics/Optics, Economics/Law, Biomedicine) illustrate the closed-loop debugging process. In each case, a V1 model failure was diagnosed via the knowledge structure (as a concept gap or reasoning deficit), a targeted patch was generated, and the V2 model corrected the error.

Theoretical and Practical Implications

  • Paradigm Shift: ProDa moves data engineering from an open-loop, artisanal practice to a closed-loop, rigorous engineering discipline analogous to software development. It demonstrates that the relationship between training data and model behavior is structurally traceable and systematically repairable.
  • Principled Foundation for Domain Expertise: Provides a blueprint for reliably converting raw textual knowledge into verifiable model competence, applicable across any discipline.
  • Beyond Scale: Challenges the prevailing reliance on indiscriminate data scaling. Shows that high-quality, diagnostic-driven targeted data is fundamentally superior for capability enhancement.
  • Enabling Self-Improving Systems: The closed-loop architecture paves the way for more autonomous, self-evolving LLMs where the model's own failures drive iterative data refinement.

Conclusion

The paper establishes Programming with Data as a principled paradigm for test-driven data engineering. By introducing a shared knowledge specification that links training data, evaluation, and diagnosis, it closes the feedback loop that has been missing in domain-specific LLM fine-tuning. The empirical validation across 16 disciplines and multiple model scales demonstrates that model failures can be decomposed, traced, and repaired with consistent improvements and preserved general capabilities.

The work offers not a finished system but a general-purpose blueprint and an open resource suite (ProDaLib). Future directions include integration with retrieval-augmented generation for better grounding and with mechanistic interpretability for finer-grained diagnosis. By showing that the entanglement between data and behavior is reducible through structural mediation, this work provides a foundation for the reliable engineering of human expertise into language models.