Summary (Overview)

  • Program-as-Weights (PAW) introduces a new programming paradigm for "fuzzy functions": tasks that are naturally specified in natural language but resist clean rule-based implementation (e.g., log triage, malformed JSON repair, intent-based search).
  • PAW compiles a natural-language specification into a hybrid neural program consisting of a discrete pseudo-program (a clean textual restatement) and a continuous PEFT module (LoRA). This program runs on a small, frozen neural interpreter.
  • A 4B-parameter Qwen3 compiler trained on FuzzyBench, a newly released 10M-example dataset, generates per-task LoRA adapters. A 0.6B Qwen3 interpreter executing PAW programs outperforms direct prompting of Qwen3-32B on FuzzyBench (73.78% vs. 68.70% exact match) while using approximately 50× less inference memory (~1.2 GB vs. ~60 GB) and running at about 30 tokens/s on a MacBook M3.
  • The paradigm reframes foundation models from per-input problem solvers into tool builders: heavy computation is performed once at compile time; subsequent function calls are cheap, offline, and self-contained.
  • PAW generalizes to multimodal (image-conditioned) fuzzy functions by swapping only the text compiler for a vision-language compiler while keeping the same small interpreter.
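
The memory figures quoted above are consistent with a simple back-of-the-envelope estimate. The sketch below is illustrative only: it assumes bf16 weights (2 bytes per parameter), uses the nominal parameter counts from the model names, and ignores activations and KV cache. None of these assumptions comes from the source.

```python
# Rough weight-memory estimate (assumptions: bf16 weights, nominal parameter counts).
BYTES_PER_PARAM = 2  # assumption: 2 bytes per parameter (bf16)


def weight_memory_gb(num_params: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), ignoring activations and KV cache."""
    return num_params * BYTES_PER_PARAM / 1e9


interpreter_gb = weight_memory_gb(0.6e9)  # Qwen3-0.6B interpreter
baseline_gb = weight_memory_gb(32e9)      # Qwen3-32B prompting baseline
ratio = baseline_gb / interpreter_gb

print(f"{interpreter_gb:.1f} GB vs {baseline_gb:.0f} GB, ratio ~{ratio:.0f}x")
```

Under these assumptions the estimate lands close to the reported ~1.2 GB, ~60 GB, and ~50× figures.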

Introduction and Theoretical Foundation

The Problem of Fuzzy Functions

Traditional programming relies on explicit symbolic rules. Many real-world tasks—filtering important log lines, repairing malformed JSON, ranking search results by intent—are inherently fuzzy: they are intuitive to humans but cannot be fully captured by crisp rules. Real-world inputs also suffer from noise (typos, format drift) that breaks hand-written code and regular expressions.

Currently, developers outsource such fuzziness to LLM APIs (e.g., gpt("extract answer", text)). This approach is convenient but costly, fragile (models are silently updated), and undermines reproducibility and software self-containment.

Proposed Paradigm: Program-as-Weights

PAW proposes a three-step paradigm:

  1. The developer describes the function in natural language.
  2. A neural compiler turns that description into a small neural binary (the "program").
  3. A frozen, lightweight neural interpreter (installed once on the user's device) runs that binary like a user-defined function.

Formally, let $f: X \to Y$ be a fuzzy function and $s$ a user specification. A neural Compiler maps $s$ to a program $p$. A small, fixed neural Interpreter executes $p$ on inputs $x$ to produce outputs $\hat{y}$:

$$p = \text{Compiler}(s), \quad \hat{y} = \text{Interpreter}(p, x) \approx f(x). \tag{1}$$

Hybrid Programs

The program $p$ is a hybrid of discrete and continuous components:

$$p = \langle p_{\text{discrete}}, p_{\text{continuous}} \rangle. \tag{2}$$

  • Discrete component $p_{\text{discrete}}$: a variable-length token sequence (the "pseudo-program"), presented to the interpreter as part of its input. It acts as a clean restatement of the task, shielding the interpreter from typos and ambiguity.
  • Continuous component $p_{\text{continuous}}$: a PEFT module (e.g., LoRA) injected into the interpreter, supplying fine-grained behavioral control.

Why "Program"?

A compiled PAW program is a single file (~23 MB for a 0.6B interpreter) that can be saved, version-controlled, distributed, and called with a two-line API. The compiler does the heavy lifting; the interpreter is a fixed runtime, analogous to a CPU or byte-code interpreter.
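
To illustrate what "a single file" holding both components could look like, here is a minimal sketch that stores a pseudo-program string and a set of adapter arrays in one NumPy `.npz` archive. The class name `HybridProgram`, its fields, and the `.npz` layout are all hypothetical; the source does not describe PAW's actual file format or API.

```python
import io
from dataclasses import dataclass
from typing import Dict

import numpy as np


@dataclass
class HybridProgram:
    """Hypothetical container for p = <p_discrete, p_continuous>."""

    pseudo_program: str             # p_discrete: text shown to the interpreter
    adapter: Dict[str, np.ndarray]  # p_continuous: adapter arrays keyed by name

    def save(self, file) -> None:
        # Write the text and every adapter array into a single .npz archive.
        np.savez(file, pseudo_program=np.array(self.pseudo_program), **self.adapter)

    @classmethod
    def load(cls, file) -> "HybridProgram":
        with np.load(file) as data:
            adapter = {k: data[k] for k in data.files if k != "pseudo_program"}
            return cls(data["pseudo_program"].item(), adapter)


# Round-trip through an in-memory buffer (a file path works the same way).
program = HybridProgram(
    "Classify the log line as IMPORTANT or NOISE.",
    {"q_proj_A": np.zeros((4, 8), dtype=np.float32)},
)
buf = io.BytesIO()
program.save(buf)
buf.seek(0)
restored = HybridProgram.load(buf)
```

Because the artifact is an ordinary file, it can be checked into version control or shipped alongside application code, which is the property the paragraph above emphasizes.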

Methodology

Compiler–Interpreter Architecture

The PAW pipeline consists of three components:

  1. Pseudo compiler $C_p$: An off-the-shelf Qwen3-4B-Instruct model (never trained) prompted to produce a clean restatement of the spec plus representative input–output examples → $p_{\text{discrete}}$.
  2. PEFT compiler $C_{\text{PEFT}}$ (either prefix-tuning or LoRA): A second 4B Qwen3 model, trained on FuzzyBench, that reads the spec and pseudo-program and emits the continuous PEFT module from its hidden states.
  3. Frozen interpreter: A small LM (e.g., Qwen3-0.6B) that runs the program by prepending $p_{\text{discrete}}$ to user input and applying the PEFT module.
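
The data flow between the three components can be sketched as below. This is a wiring diagram in code, not the authors' implementation: the function names, the type aliases, the newline separator between pseudo-program and input, and the toy stand-ins for the models are all assumptions made for illustration.

```python
from typing import Any, Callable, Tuple

# All three components are passed in as plain callables (hypothetical stand-ins).
PseudoCompiler = Callable[[str], str]     # spec -> p_discrete
PeftCompiler = Callable[[str, str], Any]  # (spec, p_discrete) -> PEFT module
Interpreter = Callable[[str, Any], str]   # (prompt, PEFT module) -> output text
Program = Tuple[str, Any]                 # (p_discrete, p_continuous)


def compile_program(spec: str, pseudo: PseudoCompiler, peft: PeftCompiler) -> Program:
    """Run once per function: p = Compiler(s)."""
    p_discrete = pseudo(spec)
    p_continuous = peft(spec, p_discrete)
    return p_discrete, p_continuous


def run_program(program: Program, x: str, interpreter: Interpreter) -> str:
    """Run once per input: y_hat = Interpreter(p, x)."""
    p_discrete, p_continuous = program
    prompt = f"{p_discrete}\n{x}"  # assumption: newline-separated prepending
    return interpreter(prompt, p_continuous)


# Toy stand-ins so the sketch runs without any model.
toy_pseudo = lambda spec: " ".join(spec.split())   # only tidies whitespace
toy_peft = lambda spec, p_discrete: {"rank": 64}   # placeholder adapter object
toy_interpreter = lambda prompt, adapter: prompt.splitlines()[-1].upper()

program = compile_program("  upper-case   the input ", toy_pseudo, toy_peft)
result = run_program(program, "hello", toy_interpreter)
```

The split into `compile_program` and `run_program` mirrors the compile-once, call-many pattern: the expensive compilers appear only in the first function, while the second touches only the small interpreter.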

Text-to-LoRA (current best)

The LoRA compiler $C_L$ takes the concatenation $[s \mid p_{\text{discrete}} \mid \text{EOS} \mid \tau_1, \ldots, \tau_T]$ where $\tau_{1:T}$ are $T=64$ learned prefix tokens. It produces hidden states $H \in \mathbb{R}^{L \times T \times d_{\text{teacher}}}$ at $L$ depth-aligned layers. A LoRA mapper converts these into mixing coefficients that compose LoRA matrices from shared learnable bases:

$$A^{\text{ex}}_{l,m} = \sum_{n=1}^N \alpha^A_{l,m,n} A^{(m)}_n, \quad B^{\text{ex}}_{l,m} = \sum_{n=1}^N \alpha^B_{l,m,n} B^{(m)}_n. \tag{3}$$

Rank $r=64$, $N=64$ shared bases per module type (attention q/k/v/o, MLP gate/up/down), injecting ~38.5M LoRA parameters per function.

#### Prefix-tuning Precursor

An earlier instantiation uses a linear mapper to project hidden states into prefix KV-pairs prepended to interpreter attention. Both methods work; LoRA is stronger.

### Training Objective

Only the PEFT compiler is trained. The loss is the negative log-likelihood of target output $y$ under the frozen interpreter:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,x,y)} \left[ -\log P_\phi(y \mid p_{\text{discrete}}, p_{\text{LoRA}}(\theta; s, p_{\text{discrete}}), x) \right], \tag{4}$$

where $\theta$ are compiler parameters (gradients flow back through the interpreter, LoRA mapper, and compiler).

### FuzzyBench Dataset

FuzzyBench is a 10M-example dataset of triples $(s, x, y)$ (specification, input, target output), generated using `gpt-5.2` over 29 thematic versions covering >800 categories of fuzzy tasks:

| Task Family | Size | % |
|-------------|------|---|
| Core text processing & NLP | 2.95M | 30% |
| Search, matching & web intelligence | 1.80M | 18% |
| Custom classification & filtering | 1.50M | 15% |
| Code & natural-language commands | 1.25M | 12% |
| Safety, verification & domain knowledge | 1.25M | 12% |
| Agentic & tool use | 0.75M | 8% |
| Format repair & validation | 0.50M | 5% |

A verified test set requires agreement between `gpt-5-mini` and `gpt-5.2` to filter ambiguous targets. Empirical ceiling: `gpt-5.2` 96.09%, `gpt-5-mini` 91.87%.

## Empirical Validation / Results

### Main Results on FuzzyBench

PAW (Qwen3-0.6B interpreter) achieves **73.78% exact match**, outperforming direct prompting of Qwen3-32B (68.70%) while using ~50× less inference memory.

| Method | Self-contained | Interpreter Size | FuzzyBench Acc | YouTube Acc | SMS F1 | Yelp Acc | IMDB Acc |
|--------|----------------|------------------|----------------|-------------|--------|----------|----------|
| gpt-5.2 (API) | × | – | 96.09% | 95.20% | 97.06% | 98.55% | 95.60% |
| gpt-5-mini (API) | × | – | 91.87% | 93.60% | 91.03% | 98.13% | 94.96% |
| Local LM (Qwen3-32B) | ✓ | 32B | 68.70% | 93.60% | 89.04% | 98.11% | 94.64% |
| **PAW (Qwen3-0.6B)** | **✓** | **0.6B** | **73.78%** | **90.40%** | **80.77%** | **95.82%** | **90.64%** |
| PAW (GPT-2 124M) | ✓ | 124M | 54.39% | 93.60% | 77.50% | 93.16% | 82.12% |

### PEFT Instantiation Comparison

| Method | Accuracy |
|--------|----------|
| Prompting (no compiler) | 0.098 |
| Prefix Tuning | 0.504 |
| Text-to-LoRA, r=18 | 0.565 |
| **Text-to-LoRA, r=64 (default)** | **0.657** |

### Multimodal Generalization

Replacing the text compiler with a Qwen3-VL-4B compiler but keeping the same small text interpreter, PAW outperforms VLM baselines (up to 4B) on three CoSyn diagram understanding tasks at ~0.6B interpreter size.

| Method | Circuit | Chemical | Music |
|--------|---------|----------|-------|
| Qwen3-VL 4B-Instruct | 0.196 | 0.221 | 0.450 |
| AndesVL 0.6B | 0.183 | 0.214 | 0.448 |
| **PAW LoRA (Qwen3 0.6B)** | **0.274** | **0.414** | **0.552** |

### Ablations

**LoRA mapper variants:** The simplest design (mean-pool + shallow MLP + shared basis) outperforms more expressive alternatives:

| Mapper variant | Accuracy |
|----------------|----------|
| Default (r=64, N=64, shared bases) | **0.6223** |
| Per-position aggregation | 0.5598 |
| Per-layer bases | 0.6028 |
| LoRA + prefix-tuning (both) | 0.6033 |

**No compiler baselines:** PAW exceeds full fine-tuning by 15.4 points and fixed LoRA by 21.7 points.

| Method (0.6B base) | Accuracy |
|---------------------|----------|
| Fixed LoRA r=64 | 0.5210 |
| Full fine-tuning | 0.5840 |
| **PAW (Qwen3 0.6B)** | **0.7378** |

### Robustness to Noisy Specifications

PAW degrades only slightly under heavy noise (e.g., combined heavy: -3.7%). The pseudo-program mediates robustness:

| Interpreter input | Acc (clean) | Acc (heavy typos) |
|------------------|-------------|-------------------|
| Pseudo-program (default) | **0.6443** | **0.6108** |
| Raw spec | 0.6285 | 0.5662 |

### Quantization and Latency

A Q6_K base + Q4_0 LoRA is indistinguishable from bf16 (0.6575 vs 0.6580). A Q4_K_M base + Q4_0 adapter loses only 1.3 points but reduces total disk to ~507 MB. On a MacBook M3: 31.6 tokens/s, 0.48 s cold load.

## Theoretical and Practical Implications

- **Paradigm shift**: Foundation models are reframed from per-input problem solvers (called repeatedly at inference) to **tool builders** (invoked once at compile time). The heavy lifting is amortized, and runtime execution is cheap and local.
- **Software engineering**: PAW programs are first-class software artifacts: versioned, distributable, cacheable, and callable via a simple API. This enables self-contained applications that are not dependent on remote API availability, model versioning, or internet connectivity.
- **Small-model future**: The work supports the vision of "small models as the runtime" (Belcak et al., 2025): large models compile, small models execute. This has implications for on-device AI, edge computing, and reducing the environmental/economic cost of AI inference.
- **Modality generality**: The compiler–interpreter abstraction naturally extends to multimodal tasks (e.g., image-conditioned fuzzy functions) by swapping only the compiler. The small interpreter never sees pixels, yet the system performs competitively with larger VLMs.
- **Robustness via denoising**: The discrete pseudo-program acts as a noise filter, making the system robust to messy real-world specifications. This is a concrete architectural mechanism for handling natural-language input variation.

## Conclusion

Program-as-Weights introduces a new programming paradigm where fuzzy functions are compiled once into small neural binaries and executed locally on a fixed interpreter. Key contributions include:

- The PAW architecture (pseudo compiler + PEFT compiler + frozen interpreter).
- FuzzyBench, a 10M-example dataset of fuzzy function specifications.
- Empirical demonstration that a 0.6B PAW interpreter outperforms Qwen3-32B prompting on FuzzyBench (73.78% vs. 68.70% exact match) while using ~50× less memory and running at about 30 tok/s on a MacBook M3.

Future directions could include exploring more expressive PEFT methods, scaling the compiler and interpreter, extending to other modalities (audio, video), and developing tooling for the PAW program ecosystem (package managers, versioning, sharing). The work contributes to a future where large models compile and small models execute, shifting the role of foundation models from per-input problem solvers to per-function tool builders.
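
As a supplementary illustration of the basis-mixing construction in Eq. (3), here is a minimal NumPy sketch for a single module type. It is a sketch under stated assumptions, not the authors' code: the shapes follow the common LoRA convention ($A$ of shape $r \times d_{\text{in}}$, $B$ of shape $d_{\text{out}} \times r$), the random coefficients stand in for the LoRA mapper's output, and the function name `mix_bases` is hypothetical.

```python
import numpy as np


def mix_bases(alpha: np.ndarray, bases: np.ndarray) -> np.ndarray:
    """Eq. (3) for one module type: out[l] = sum_n alpha[l, n] * bases[n].

    alpha: (L, N) mixing coefficients, one row per interpreter layer.
    bases: (N, d0, d1) shared bases for this module type.
    """
    return np.einsum("ln,nij->lij", alpha, bases)


rng = np.random.default_rng(0)
L, N, r, d_in, d_out = 2, 4, 3, 5, 6  # toy sizes; the source uses r=64 and N=64

A_bases = rng.standard_normal((N, r, d_in))   # assumed shape convention for A
B_bases = rng.standard_normal((N, d_out, r))  # assumed shape convention for B
alpha_A = rng.standard_normal((L, N))         # stand-in for mapper output
alpha_B = rng.standard_normal((L, N))

A_ex = mix_bases(alpha_A, A_bases)  # (L, r, d_in)
B_ex = mix_bases(alpha_B, B_bases)  # (L, d_out, r)
delta_W = B_ex @ A_ex               # (L, d_out, d_in) low-rank update per layer
```

Sharing the bases across layers means that only the coefficients vary per function and per layer, so the compiler has to predict $L \times N$ numbers per matrix rather than a full $r \times d$ matrix.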
