Summary (Overview)
- Code2LoRA introduces a hypernetwork framework that generates repository-specific LoRA adapters for frozen code language models, eliminating inference-time token overhead by injecting repository knowledge directly into model parameters.
- Two usage scenarios are instantiated: Code2LoRA-Static (maps a single repository snapshot to an adapter) and Code2LoRA-Evo (maintains an adapter via a GRU hidden state updated per code diff to track software evolution).
- RepoPeftBench is a new benchmark of 604 Python repositories with static and evolution tracks for evaluating repository-level parameter-efficient fine-tuning. It includes a temporal out-of-distribution holdout (92 repositories created after the training cutoff).
- Code2LoRA-Static achieves 63.8% exact match (EM) on cross-repo (CR) evaluation, outperforming the strongest baseline (FFT+RAG) by +9.9 pp. Code2LoRA-Evo reaches 60.3% CR exact match on the evolution track, +5.2 pp over a single shared LoRA.
- On an out-of-distribution temporal holdout, Code2LoRA-Evo achieves the highest EM (74.1%), demonstrating strong generalization to unseen, post-cutoff repositories.
Introduction and Theoretical Foundation
Code language models require repository-level context (imports, APIs, conventions) to perform complex tasks like assertion completion. Existing approaches inject this knowledge either through long inputs (RAG, dependency analysis), which incur high token and retrieval costs, or through per-repository fine-tuning/LoRA, which is costly and brittle when codebases evolve.
Hypernetwork-generated LoRA adapters (e.g., Text2LoRA, Doc2LoRA) provide a promising alternative: a forward pass over a conditioning input produces task-specific weights for a frozen LLM. However, these methods are designed for short natural-language inputs or single documents, not the long, repository-scale context of code, and lack mechanisms for tracking software evolution.
Code2LoRA fills this gap by framing repository-level adaptation along two orthogonal axes:
- How knowledge enters parameters (via a hypernetwork conditioned on a repository embedding)
- When it is updated (static snapshot vs. sequential commit diffs)
The theoretical foundation rests on low-rank adaptation (LoRA) (Hu et al., 2022) and hypernetworks (Ha et al., 2017). For a frozen base model with weight $W_0 \in \mathbb{R}^{d \times k}$, a LoRA adapter injects an update $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Code2LoRA generates $A$ and $B$ from a sampled repository embedding using a trained hypernetwork, so no per-repository fine-tuning is needed.
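To make the low-rank update concrete, the following is a minimal NumPy sketch of a LoRA adapter applied to one frozen linear layer. The layer sizes, the rank, and the `alpha / r` scaling are illustrative assumptions (the scaling follows the common LoRA convention and is not a value taken from the paper).

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16.0):
    """Apply a frozen weight W0 plus a low-rank update (alpha / r) * B @ A.

    x:  (batch, k) inputs
    W0: (d, k) frozen base weight
    A:  (r, k) down-projection
    B:  (d, r) up-projection
    """
    r = A.shape[0]
    scale = alpha / r  # conventional LoRA scaling; an assumption here
    # Route the update through the rank-r bottleneck instead of forming a d x k matrix.
    return x @ W0.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4            # hypothetical layer sizes and rank
W0 = rng.normal(size=(d, k))   # frozen base weight
A = rng.normal(size=(r, k))
B = np.zeros((d, r))           # a zero B leaves the base model's output unchanged
x = rng.normal(size=(2, k))

y = lora_forward(x, W0, A, B)

# Parameters in a dense update versus the low-rank factors.
full_params, lora_params = d * k, r * (d + k)
```

With these toy sizes the adapter holds 448 values against 3,072 for a dense update, which is why a hypernetwork can plausibly emit the factors directly.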
Methodology
3.1 Repository Encoder
Repository context is compressed into a fixed-size vector in two steps using a frozen Qwen3-Embedding-0.6B model:
- File-level embedding: Each file (or diff) is chunked into 4096-token segments with 512-token overlap, embedded, and mean-pooled to produce a file vector $f_i$.
- Repository-level aggregation: Each file vector $f_i$ receives an importance weight $w_i$. The repository embedding $e$ is the concatenation of a weighted mean and an element-wise max pool: $e = \left[\, \sum_i w_i f_i \big/ \sum_i w_i \;;\; \max_i f_i \,\right]$.
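The two encoder steps can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the call to the embedding model is omitted, the importance weights are taken as given inputs (their formula is not stated here), and the function names are hypothetical.

```python
import numpy as np

def chunk_tokens(tokens, size=4096, overlap=512):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if not tokens:
        return []
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the file
    return chunks

def file_vector(chunk_embeddings):
    """Mean-pool per-chunk embeddings of shape (n_chunks, dim) into one file vector."""
    return np.mean(np.asarray(chunk_embeddings, dtype=float), axis=0)

def repo_embedding(file_vectors, weights):
    """Concatenate the weighted mean and the element-wise max over file vectors."""
    F = np.asarray(file_vectors, dtype=float)   # (n_files, dim)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                             # normalise so the weights form a mean
    weighted_mean = w @ F                       # (dim,)
    max_pool = F.max(axis=0)                    # (dim,)
    return np.concatenate([weighted_mean, max_pool])  # (2 * dim,)

# Toy usage: a 10,000-token file and two 2-dimensional file vectors.
chunk_lengths = [len(c) for c in chunk_tokens(list(range(10_000)))]
e = repo_embedding([[1.0, 0.0], [3.0, 4.0]], weights=[1.0, 3.0])
```

Concatenating the two pools doubles the vector size: the weighted mean summarises the repository as a whole, while the max pool preserves strong per-dimension signals from individual files.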
3.2 Code2LoRA-Static
A shared 2-layer MLP with GELU activation projects the embedding to a hidden state, which is then fed to dedicated output heads for each LoRA module type.
Learnable log-scales control adapter magnitudes (initialized to -3.5). LoRA matrices are shared across all layers and injected as low-rank updates at a fixed rank and scaling factor. The hypernetwork is itself large: the Limitations put the two variants at roughly 720M–745M parameters.
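A minimal NumPy sketch of this design is given below. Everything beyond the description above is an assumption made for illustration: the module names (`q_proj`, `v_proj`), all dimensions, the tanh approximation of GELU, the random initialisation, and the choice to apply one log-scale per module type to both factors.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class StaticHypernet:
    """Shared MLP trunk plus one pair of output heads per LoRA module type."""

    def __init__(self, embed_dim, hidden_dim, module_shapes, rank, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.02, size=shape)
        self.rank = rank
        self.module_shapes = module_shapes            # {name: (d_out, d_in)}
        self.W1, self.b1 = init(embed_dim, hidden_dim), np.zeros(hidden_dim)
        self.W2, self.b2 = init(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.heads, self.log_scale = {}, {}
        for name, (d_out, d_in) in module_shapes.items():
            # Each head emits a flattened LoRA factor for its module type.
            self.heads[name] = (init(hidden_dim, rank * d_in),
                                init(hidden_dim, d_out * rank))
            self.log_scale[name] = -3.5               # learnable in training; fixed here

    def generate(self, e):
        """Map a repository embedding e to {module: (A, B)} LoRA factors."""
        h = gelu(e @ self.W1 + self.b1) @ self.W2 + self.b2
        adapters = {}
        for name, (d_out, d_in) in self.module_shapes.items():
            head_A, head_B = self.heads[name]
            scale = np.exp(self.log_scale[name])      # exp(-3.5) is about 0.03
            A = scale * (h @ head_A).reshape(self.rank, d_in)
            B = scale * (h @ head_B).reshape(d_out, self.rank)
            adapters[name] = (A, B)
        return adapters

# Toy usage with hypothetical sizes.
hypernet = StaticHypernet(embed_dim=8, hidden_dim=16,
                          module_shapes={"q_proj": (12, 10), "v_proj": (6, 10)},
                          rank=2)
adapters = hypernet.generate(np.ones(8))
```

Because the same factors are reused at every layer, the heads only need to emit one `(A, B)` pair per module type, which keeps the output size independent of model depth.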
3.3 Code2LoRA-Evo
A GRU recurrent neural network aggregates a chronological stream of diff embeddings $d_1, \dots, d_T$: $h_t = \mathrm{GRU}(d_t, h_{t-1})$.
The initial state $h_0$ is computed from the initial repository embedding via a small linear projector. At each step $t$, the shared LoRA-generation head uses $h_t$ in place of its static input to produce the adapter. The GRU and projector add parameters on top of the static hypernetwork.
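The recurrence can be illustrated with a standard GRU cell written out in NumPy. This is a sketch under assumptions rather than the paper's code: the gate equations are the textbook GRU formulation, the dimensions and initialisation are hypothetical, and the repository and diff embeddings are given the same size for simplicity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(d_t, h_prev, p):
    """One standard GRU update of the hidden state from a diff embedding d_t."""
    z = sigmoid(d_t @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])        # update gate
    r = sigmoid(d_t @ p["Wr"] + h_prev @ p["Ur"] + p["br"])        # reset gate
    n = np.tanh(d_t @ p["Wn"] + r * (h_prev @ p["Un"]) + p["bn"])  # candidate state
    return (1.0 - z) * n + z * h_prev

def init_params(embed_dim, hidden_dim, seed=0):
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "zrn":
        p["W" + gate] = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
        p["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + gate] = np.zeros(hidden_dim)
    # Linear projector that maps the initial repository embedding to h_0.
    p["P"] = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
    return p

def track_repository(e0, diff_embeddings, p):
    """Return the hidden state after the initial snapshot and after every diff."""
    h = e0 @ p["P"]
    states = [h]
    for d_t in diff_embeddings:   # diffs must arrive in chronological order
        h = gru_step(d_t, h, p)
        states.append(h)          # each state could be fed to the LoRA head
    return states

params = init_params(embed_dim=4, hidden_dim=3)
states = track_repository(np.ones(4), [np.ones(4)] * 5, params)
```

Each commit costs one GRU step plus one pass through the generation head, so keeping the adapter current does not require re-encoding the whole repository.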
3.4 Training
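The hypernetwork is trained end-to-end by minimizing cross-entropy on assertion-completion pairs $(x, y)$ from the frozen base LLM: $\mathcal{L}(\phi) = -\sum_{j} \log p_{\theta_0,\, \Delta\theta_\phi(c)}\left(y_j \mid x, y_{1:j-1}\right)$,
where $\theta_0$ are the frozen base weights, $\phi$ are the hypernetwork parameters, and the conditioning vector $c$ is the repository embedding for Code2LoRA-Static and the GRU hidden state at the current step for Code2LoRA-Evo. For Code2LoRA-Evo, truncated BPTT is used, detaching the hidden state every fixed number of steps. Batches sample a repository first, then an input-output pair from it.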
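The repository-first batching rule can be sketched in a few lines of standard-library Python. The uniform choice over repositories and the toy task data are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import Counter

def sample_batch(tasks_by_repo, batch_size, rng):
    """Draw (repo, task) pairs by picking a repository, then one of its tasks."""
    repos = sorted(tasks_by_repo)
    batch = []
    for _ in range(batch_size):
        repo = rng.choice(repos)                 # uniform over repositories (assumption)
        task = rng.choice(tasks_by_repo[repo])   # then uniform over that repo's tasks
        batch.append((repo, task))
    return batch

# Hypothetical data: one repository has 500x more tasks than the other.
tasks_by_repo = {
    "big_repo": [(f"prefix_{i}", f"target_{i}") for i in range(1000)],
    "small_repo": [("prefix_a", "target_a"), ("prefix_b", "target_b")],
}
rng = random.Random(0)
batch = sample_batch(tasks_by_repo, 2000, rng)
counts = Counter(repo for repo, _ in batch)
```

Under this scheme both repositories appear in roughly half of the draws, so repositories with many tasks do not dominate the gradient signal the way they would under task-level sampling.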
Empirical Validation / Results
Benchmark: RepoPeftBench
- 604 Python repositories (512 in-distribution, 92 temporal OOD holdout after 2025-04-01)
- Task: assertion completion – given a test-file prefix, predict the expected value of an assertion
- Two tracks:
- Static: single snapshot per repository (39,612 train, 11,636 test tasks)
- Evolution: commit-derived tasks (215,129 train, 86,793 test tasks from commit history)
- Splits: Cross-Repo (CR) and In-Repo (IR)
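Results on this benchmark are reported as exact match (EM). A minimal sketch of such a metric is shown below; stripping surrounding whitespace before comparison is an assumption, since the benchmark's exact normalisation rules are not described here.

```python
def exact_match(prediction, reference):
    """True when the predicted assertion value equals the reference exactly."""
    # Surrounding whitespace is ignored; this normalisation is an assumption.
    return prediction.strip() == reference.strip()

def em_score(predictions, references):
    """Percentage of tasks whose prediction exactly matches its reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

# Toy usage: the third prediction differs by one space and does not count.
score = em_score(["42", " True ", "[1, 2]"], ["42", "True", "[1,2]"])
```

EM is strict: a semantically equivalent prediction with different formatting still counts as a miss.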
Table 1: Dataset statistics
| Split | Repos | Commits | Tasks | Tasks / repo |
|---|---|---|---|---|
| Static track | | | | |
| Train | 409 | 409 | 39,612 | 96.9 |
| CR Test | 52 | 52 | 6,414 | 123.3 |
| IR Test | 409 | 409 | 5,222 | 12.8 |
| Evolution track | | | | |
| Train (Evo) | 400 | 45,516 | 215,129 | 537.8 |
| CR Test | 51 | 6,618 | 44,732 | 877.1 |
| IR Test | 389 | 6,179 | 42,061 | 108.1 |
| OOD holdout | 92 | 1,950 | 14,813 | 161.0 |
Static Track Results (Table 2)
| Method | CR EM (%) | IR EM (%) |
|---|---|---|
| Pretrained | 45.7 | 46.8 |
| RAG (k=3) | 39.7 | 42.1 |
| Dep.-Resolved Context | 48.2 | 49.5 |
| FFT | 51.4 | 55.9 |
| Single LoRA | 47.4 | 50.4 |
| Per-repo LoRA | — | 64.0 |
| Text2LoRA (strengthened) | 45.8 | 46.7 |
| Code2LoRA-Static | 63.8 | 66.2 |
Code2LoRA-Static outperforms all baselines by large margins on CR (+9.9 pp over FFT+RAG, and +12.4 pp over FFT, the strongest baseline listed in Table 2) and exceeds the Per-repo LoRA upper bound on IR (66.2% vs. 64.0%) without per-repository training.
Evolution Track Results (Table 3)
| Method | CR EM (%) | IR EM (%) |
|---|---|---|
| Pretrained | 31.5 | 29.3 |
| Single LoRA | 55.1 | 61.3 |
| Per-repo LoRA | — | 64.2 |
| Text2LoRA | 41.7 | 43.5 |
| Code2LoRA-Static | 55.7 | 60.6 |
| Code2LoRA-Evo | 60.3 | 64.5 |
Commit-derived tasks are significantly harder (Pretrained drops to 31.5% CR). Code2LoRA-Evo gains +5.2 pp over Single LoRA on CR and exceeds the Per-repo LoRA bound on IR without per-repo training.
Out-of-Distribution Generalization (Table 4)
| Method | EM (%) |
|---|---|
| Pretrained | 44.6 |
| Single LoRA | 72.3 |
| Text2LoRA | 60.4 |
| Code2LoRA-Static | 72.2 |
| Code2LoRA-Evo | 74.1 |
Code2LoRA-Evo leads on OOD by ~1.8 pp over the next-best fine-tuned adapter, with consistent gains across EditSim and CodeBLEU. (Note: OOD targets are systematically shorter, inflating absolute scores, but within-table comparisons remain valid.)
Theoretical and Practical Implications
- Parametric injection of repository knowledge (via generated adapters) consistently outperforms context-injection methods (RAG, dependency resolution) across both static and evolution settings, suggesting that code models benefit from distilling repository context into parameters rather than extending input length.
- Recurrent aggregation over commit diffs is shown to be more effective than static snapshot adaptation when codebases evolve, providing a principled way to keep model knowledge current without full retraining.
- The hypernetwork approach adds zero inference-time token overhead and generalizes to unseen repositories without per-repository fine-tuning, making it practical for large-scale deployment across many codebases.
- RepoPeftBench provides a standardized evaluation framework for repository-level PEFT, including a temporal OOD split that challenges models to generalize to future codebases.
Conclusion
Code2LoRA is a hypernetwork framework that generates repository-specific LoRA adapters for code language models. Two instances address different usage scenarios:
- Code2LoRA-Static maps a single repository snapshot to an adapter, achieving 63.8% CR / 66.2% IR exact match on the static track.
- Code2LoRA-Evo uses a GRU to aggregate commit diffs, reaching 60.3% CR / 64.5% IR exact match on the evolution track, outperforming static and shared adapters.
The results demonstrate that repository knowledge is best injected parametrically and updated to track software evolution, rather than through long input context. Code2LoRA provides a building block for more context-aware, customizable, and cost-efficient AI code assistants.
Limitations include evaluation limited to Python and a single backbone (Qwen2.5-Coder-1.5B), potential inflationary effects on OOD metrics due to shorter target lengths (though within-table comparisons are valid), and the large size of the hypernetwork itself (~720M–745M parameters). Future work should extend to more languages, larger backbones, and additional downstream tasks.