Summary (Overview)
- Core thesis: Parameter-efficient fine-tuning (PEFT), especially LoRA, can serve as a persistent local adaptive state on top of strong shared foundation models, enabling millions of personal model instances rather than a single universal assistant.
- Three-axis framework: The paper introduces three coupled scaling problems and studies each empirically: Scale Up (a stronger shared prior makes small adapters more useful), Scale Down (how small and stable the adaptive state can become), and Scale Out (what becomes possible when many persistent adapted instances coexist).
- Trillion-scale evidence: Demonstrates feasible LoRA-based reinforcement learning on a 1T-parameter MoE model (Kimi K2), with stable training and ~10% of full-parameter RL GPU cost.
- Tiny-adapter reliability: OLoRA-tail initialization with minor-subspace singular vectors rescues rank-1 LoRA from collapse, achieving consistent gains across batch sizes and model scales (e.g., +48% relative on Qwen3-30B-A3B).
- Emergent collective intelligence: Majority voting across distinct LoRA variants shows accuracy scaling as \(\approx 0.386 + 0.0172 \ln(k)\) on AIME24, reaching 48.67% at \(k=198\), surpassing repeated sampling from a single model.
Introduction and Theoretical Foundation
The paper argues that frontier models (e.g., GPT-5, GLM-5) provide broad competence but are not inherently personal. A personal model needs persistent state that preserves continuity across interactions, including preferences, skills, and memory-like behavior. PEFT offers a compact unit of adaptive state that can be trained, served, and composed at population scale.
The biological analogy in Figure 1 frames the architecture: humans share >99.9% of their genome; the <0.1% variation drives individuality. Similarly, a shared foundation model (>99% of weights) can support millions of persistent personal models via small PEFT adapters (<1% of weights).
Three coupled axes:
- Scale Up: Stronger shared priors make small local updates higher-leverage (Section 3).
- Scale Down: A smaller, more stable adaptive state lowers the marginal cost of repeated learning (Section 4).
- Scale Out: Low marginal cost enables populations of persistent model instances, with diversity as a resource (Section 5).
The axes form a dependency chain: Scale Up without Scale Down → expensive adapters; Scale Down without Scale Up → weak adapters; Scale Out without both → many disposable variants rather than durable personal models.
Theoretical basis: RL is prior-limited — it amplifies behaviors already latent in the base model. A strong prior expands the trajectory support, making exploration and credit assignment more productive. LoRA changes the economics: it provides budgeted access to stronger priors, where the comparison is not full fine-tuning vs. adapter tuning, but "how much prior can be brought into the learning loop under a fixed adaptation budget."
Methodology
The paper employs a mix of theoretical analysis, controlled experiments, and systems infrastructure development across the three axes.
Scale Up methodology:
- Trillion-scale LoRA RL: Operates on a 1T-parameter MoE model (32.6B active parameters) using GRPO-style on-policy optimization. LoRA is applied to selected dense and expert layers. Hybrid parallelism (tensor, pipeline, expert, sequence) is co-designed with adapter placement. Rollout uses a serving-oriented inference engine; training uses a Megatron-style backend.
- Scale-induced failure modes: Two failure modes are identified and mitigated:
- Training–inference mismatch (TIM): Routing decisions differ between the rollout and training engines in MoE models. Mitigated with Router Replay (R3), which records rollout routing decisions and replays them during training.
- Sparse-architecture failures: Supporting GLM5/GLM5.1 requires full-stack alignment of attention (e.g., DeepSeek Sparse Attention, Multi-Head Latent Attention), adapter semantics, and checkpoint conversion.
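The Router Replay idea can be illustrated with a toy MoE layer. This is a minimal sketch under stated assumptions, not the paper's implementation: `moe_forward`, the layer shapes, and the small perturbation standing in for numeric differences between rollout and training engines are all hypothetical. It also assumes that only the expert selection is replayed while gate weights are recomputed on the training side, which the summary does not specify.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k, replay_idx=None):
    """Toy top-k MoE layer. If replay_idx is given, reuse it instead of re-routing."""
    logits = x @ router_w                               # [tokens, n_experts]
    if replay_idx is None:
        idx = np.argsort(-logits, axis=-1)[:, :k]       # fresh top-k routing
    else:
        idx = replay_idx                                # replay recorded routing
    # Gate weights: softmax over the selected experts' logits (assumption).
    sel = np.take_along_axis(logits, idx, axis=-1)      # [tokens, k]
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += gates[t, j] * (x[t] @ expert_ws[idx[t, j]])
    return out, idx

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 6))
expert_ws = rng.normal(size=(6, 8, 8))

# Rollout: route normally and record the chosen experts.
out_rollout, rollout_idx = moe_forward(x, router_w, expert_ws, k=2)

# Training: router weights differ slightly (hypothetical engine mismatch),
# but the recorded routing is replayed so the same experts are used.
train_router_w = router_w + 1e-3 * rng.normal(size=router_w.shape)
out_train, train_idx = moe_forward(
    x, train_router_w, expert_ws, k=2, replay_idx=rollout_idx
)
```

Without `replay_idx`, a token whose top-k logits are nearly tied could be sent to different experts in the two engines, which is the routing form of mismatch described above.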
Scale Down methodology:
- Rank-reduction sweep: 216 runs across 9 LoRA ranks (\(r=1\) to \(r=256\)), 4 batch sizes, and 6 seeds each, using Qwen3-8B with a fixed 500-step PPO run on a mixed mathematics corpus with verifiable rewards. Mean gain, best gain, token efficiency, and seed-level spread are analyzed.
- RL-native initialization: Comparison of standard LoRA (random Gaussian), PiSSA (principal SVD), MiLoRA (minor SVD with scaling), and OLoRA-tail (minor singular vectors without scaling). OLoRA-tail initializes \[ B_0 = U_{-r}, \quad A_0 = V_{-r}^\top, \] where \(U_{-r}\) and \(V_{-r}\) correspond to the \(r\) smallest singular values of the pretrained weight matrix \(W_0 = U\Sigma V^\top\).
- Hyperparameter transfer: Examines three alpha-scaling rules for the LoRA update \(\Delta W = \frac{\alpha}{r} BA\), tested on AG News (DistilBERT) and Qwen3-4B MATH:
- Fixed \(\alpha\): \(\alpha\) constant, early update magnitude \(\propto \eta \alpha^2 / r\)
- Fixed \(\alpha / r\): \(\alpha \propto r\), early update \(\propto \eta r\)
- \(\alpha \propto \sqrt{r}\): early update rank-invariant
- Stateful adapter: δ-mem maintains a low-dimensional state \(S_t \in \mathbb{R}^{r \times r}\) updated via the delta rule \[ S_t = \text{Diag}(\lambda_t)\, S_{t-1} + \text{Diag}(\beta_t)\, (v_t^m - S_{t-1} k_t^m) (k_t^m)^\top. \] It is trained with RL to produce history-conditioned low-rank corrections to frozen attention.
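The δ-mem delta-rule update above can be written as a single NumPy step. This is a minimal sketch: the function name `delta_mem_step` is hypothetical, `k` and `v` stand for the key and value vectors in the update, and the gates `lam` and `beta` are passed in as plain vectors because the summary does not say how they are produced.

```python
import numpy as np

def delta_mem_step(S_prev, k, v, lam, beta):
    """One delta-rule update of an r x r state.

    S_t = Diag(lam) S_prev + Diag(beta) (v - S_prev k) k^T
    """
    pred = S_prev @ k                  # what the state currently recalls for key k
    err = v - pred                     # prediction error to be written
    # Diag(.) on the left scales rows, hence the [:, None] broadcasting.
    return lam[:, None] * S_prev + beta[:, None] * np.outer(err, k)

r = 4
S = np.zeros((r, r))
k = np.array([1.0, 0.0, 0.0, 0.0])     # unit-norm key (illustrative)
v = np.array([0.5, -1.0, 2.0, 0.0])    # value to associate with the key
S = delta_mem_step(S, k, v, lam=np.ones(r), beta=np.ones(r))
recalled = S @ k
```

With no decay (`lam` all ones), full write strength (`beta` all ones), and a unit-norm key, a single step stores the association exactly, so `recalled` equals `v`. Values of `lam` below one decay older content.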
Scale Out methodology:
- LoRA memory capacity: DishNameBenchmark — slot-writing and querying tasks varying memory tokens, rank, and target modules. 263 runs across Qwen3-series models.
- Context Learning: In Context Distillation, a query-only rollout is scored using query+context, and an RL-style update is then applied to the query-only policy. The loop is repeated to internalize useful context signals into adapter parameters.
- User simulation: OASIS platform with per-user rank-4 LoRA adapters (trained on 80 historical tweets) vs. shared-base agents. Population sizes \(N \in \{128, 256, 512\}\). Structured metrics: polarization distance, stance dispersion, interaction communities, modularity.
- Diversity aggregation: 200 distinct LoRA variants trained on the same base (Qwen3-30B) with the same RL recipe, differing only by data permutation/masking. Majority voting over \(k\) models with random subset sampling. Control: repeated sampling from one model.
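The diversity-aggregation protocol (majority voting over random subsets of k models) can be sketched as follows. The function names, the toy prediction table, and the tie-breaking rule are assumptions for illustration; the summary does not state how ties are resolved or how many subsets are sampled.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Most common answer; ties go to the answer seen first in `answers`."""
    return Counter(answers).most_common(1)[0][0]

def vote_accuracy(predictions, gold, k, trials, seed=0):
    """Mean accuracy of majority voting over `trials` random size-k model subsets.

    predictions[m][q] is the answer of model m on question q.
    """
    rng = random.Random(seed)
    n_models = len(predictions)
    total = 0.0
    for _ in range(trials):
        subset = rng.sample(range(n_models), k)     # k distinct models
        correct = sum(
            majority_vote([predictions[m][q] for m in subset]) == gold[q]
            for q in range(len(gold))
        )
        total += correct / len(gold)
    return total / trials

# Toy data: 3 models, 2 questions (illustrative only).
predictions = [["a", "x"], ["a", "b"], ["c", "b"]]
gold = ["a", "b"]
acc_all = vote_accuracy(predictions, gold, k=3, trials=5)
```

In this toy table no single model is needed to be right on both questions for the vote to be right on both: each question has two correct votes out of three, so `acc_all` is 1.0. The single-model control would instead vote over repeated samples from one row.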
Empirical Validation / Results
Scale Up — Trillion-scale LoRA RL (Table 2): Under comparable RL budgets, larger base models with LoRA outperform smaller full-RL models despite fewer trainable parameters:
| Model and adaptation | Trainable parameters | AIME 2025 normalized gain | GPQA Diamond normalized gain |
|---|---|---|---|
| DS-Distill-Qwen-1.5B, full RL | 1.5B | 8.33% | 25.00% |
| DS-Distill-Qwen-7B, LoRA (r=64) | 0.16B | 11.31% | 27.23% |
| DS-Distill-Qwen-32B, LoRA (r=8) | 0.07B | 20.61% | 33.02% |
Router Replay (R3) reduces training–inference mismatch: KL divergence stays near 0.000026 vs. >0.01 for baselines, with sustained critic scores and validation accuracy.
Scale Down — Rank regimes (Figures 8–10):
- Three regimes: Ranks 16–32 are the deployment default (highest mean gain, low downside); ranks 1–4 are a research frontier (best runs match higher ranks, but mean reliability collapses); ranks ≥64 are cost-inflating (the footprint grows without raising the ceiling).
- OLoRA-tail at rank 1 (Figure 15): On Qwen3-8B, consistent ~+20% gain over base across all batch sizes, while standard LoRA degrades from +15% to -18% as batch size increases. On Qwen3-30B-A3B, OLoRA-tail achieves 35.5% vs. LoRA 24.0% (absolute +11.5pp, relative +48%).
- Hyperparameter transfer: Fixed \(\alpha\) is flattest on simple tasks; \(\alpha \propto \sqrt{r}\) preserves same-order learning-rate reuse and is most robust on harder reasoning (Qwen3-4B MATH).
- δ-mem (Table 3): On Qwen3-4B-Instruct, δ-mem improves average score from 46.79% to 51.66%, with strong gains on memory-intensive benchmarks (MemoryAgentBench: 29.54% → 38.85%).
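The OLoRA-tail initialization behind the rank-1 results above takes the singular vectors for the r smallest singular values of the pretrained weight, without scaling them. A minimal NumPy sketch follows; the function name `olora_tail_init` and the random stand-in for the pretrained weight are hypothetical. Because the product of these factors is non-zero at step 0, some handling of the initial offset is presumably needed, but the summary does not describe it, so it is left out here.

```python
import numpy as np

def olora_tail_init(W0, r):
    """Return (B0, A0) built from the minor singular subspace of W0.

    W0 has shape [d_out, d_in]; B0 is [d_out, r] and A0 is [r, d_in].
    """
    # np.linalg.svd returns singular values in descending order,
    # so the last r columns/rows belong to the r smallest ones.
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    B0 = U[:, -r:]        # U_{-r}
    A0 = Vt[-r:, :]       # V_{-r}^T
    return B0, A0

rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 12))     # stand-in for a pretrained weight matrix
B0, A0 = olora_tail_init(W0, r=1)
```

Both factors have orthonormal columns or rows, so the initial update direction has unit spectral norm regardless of how large the singular values of `W0` are. That is the sense in which the initialization is unscaled.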
Scale Out — LoRA memory capacity (Figure 21):
- Capacity efficiency: LoRA memory is usable at roughly \(10^{-3}\) to \(10^{-2}\) tokens per trainable parameter, with a sharp collapse beyond that range.
- Module ordering: MLP LoRA > Attention ≈ All ≫ Unembed.
- Context Learning: On ALFWorld, rank-32 LoRA trained with Skill-0/MinT recipe improves average from 0.646 to 0.845 (Figure 22).
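To make the capacity figure concrete, the reported ratio can be turned into a rough token budget for a single adapted layer. This is back-of-the-envelope arithmetic under assumptions: the layer shape (4096 by 4096) and rank are hypothetical, and the sketch reads the ratio as "memory tokens stored per trainable LoRA parameter".

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def usable_token_range(n_params, low=1e-3, high=1e-2):
    """Token budget implied by a tokens-per-parameter range."""
    return n_params * low, n_params * high

# Hypothetical single square projection at rank 32.
n_params = lora_trainable_params(d_in=4096, d_out=4096, rank=32)
low_tokens, high_tokens = usable_token_range(n_params)
print(n_params, low_tokens, high_tokens)
```

For this one layer the count is 262,144 trainable parameters, which under the stated range corresponds to a budget on the order of a few hundred to a few thousand memory tokens. A real adapter spans many layers, so the total scales accordingly.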
Scale Out — User simulation (Tables 6–7): Per-user LoRA produces richer social structure than shared-base agents. At \(N=512\), effective interaction communities grow from 9.21 to 14.85 and co-engagement modularity rises from 0.502 to 0.716, while within-community side-homophily decreases. LoRA agents also generate substantially more content (original posts, comments) and preserve higher stance dispersion (2.18–2.45× the supportive-stance standard deviation of the base).
Scale Out — Diversity voting (Figure 24): Accuracy as a function of model count \(k\) is approximately \[ \text{accuracy} \approx 0.386 + 0.0172 \ln(k), \quad R^2 \approx 0.888. \] The best observed accuracy is 48.67% at \(k=198\), against a 37.27% baseline. Collaboration (different models) significantly outperforms repetition (same model), with an advantage of about +5.33pp at large \(k\).
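The fitted curve is easy to evaluate directly, which helps when reading the diminishing-returns claim. The sketch below only plugs the reported coefficients into the logarithmic form; the function name is hypothetical.

```python
import math

def fitted_accuracy(k, a=0.386, b=0.0172):
    """Reported log fit: accuracy ~ a + b * ln(k)."""
    return a + b * math.log(k)

acc_1 = fitted_accuracy(1)                  # intercept: ln(1) = 0
acc_198 = fitted_accuracy(198)
gain_per_doubling = 0.0172 * math.log(2)    # extra accuracy each time k doubles
print(round(acc_1, 3), round(acc_198, 3), round(gain_per_doubling, 4))
```

The fit gives about 0.477 at k = 198, a little below the best observed 48.67%, and each doubling of the model count adds roughly 1.2 percentage points under this curve.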
Theoretical and Practical Implications
Theoretical implications:
- LoRA rank defines an operating regime, not a monotonic capacity curve. The low-rank failure is one of reliability, not expressivity: best runs match higher ranks, but the mean collapses. This shifts focus from capacity to optimization stability and initialization geometry.
- RL-native initialization must respect the KL leash: in on-policy RL, the first-order Taylor expansion of the importance weight breaks down if policy drifts too fast. SVD-scaled initializations (PiSSA, MiLoRA) consume the KL budget prematurely; OLoRA-tail avoids this by using unscaled minor singular vectors.
- The logarithmically diminishing returns of model-count scaling (\(\propto \ln(k)\)) suggest a new research object: accuracy as a function of adapter population size, distinct from throughput or single-model performance.
Practical implications:
- PEFT is not a cheaper substitute for full fine-tuning but a mechanism for persistent individuality. A personal model built on a strong shared prior can preserve continuity, serve as a stable user simulator, and contribute to collective intelligence through diversity-based aggregation.
- Infrastructure must manage the lifecycle of adapters as policy revisions (MinT framework): identity, provenance, mobility (adapter-only handoff vs. merged checkpoints), bounded residency (hot/warm/cold tiers), and readiness gates. This enables millions of adapters without requiring simultaneous GPU residency.
- Memory hierarchy: LoRA should store behavioral state (skills, habits, policies), not raw facts. Editable facts belong in retrieval; ephemeral state in context. Context Learning provides a write policy to internalize repeatedly useful signals.
- Hyperparameter transfer must be documented with alpha scaling rule; without it, recipes are not portable across ranks, models, or serving platforms.
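The portability point about alpha scaling can be made concrete with the three rules from the methodology. The sketch below assumes the stated proportionality (early update magnitude proportional to the learning rate times alpha squared over r) and a hypothetical reference recipe of alpha = 16 at rank 16; the rule names are illustrative labels, not identifiers from the paper.

```python
import math

def alpha_for_rank(rule, r, base_alpha=16.0, base_rank=16):
    """Alpha at rank r under one of three scaling rules (reference recipe assumed)."""
    if rule == "fixed_alpha":
        return base_alpha
    if rule == "fixed_alpha_over_r":            # alpha grows linearly with r
        return base_alpha * r / base_rank
    if rule == "sqrt_r":                        # alpha grows with sqrt(r)
        return base_alpha * math.sqrt(r / base_rank)
    raise ValueError(f"unknown rule: {rule}")

def relative_early_update(rule, r, base_alpha=16.0, base_rank=16):
    """Early update size at rank r relative to the reference rank,
    at a fixed learning rate, using the alpha^2 / r proportionality."""
    alpha = alpha_for_rank(rule, r, base_alpha, base_rank)
    return (alpha ** 2 / r) / (base_alpha ** 2 / base_rank)

for rule in ("fixed_alpha", "fixed_alpha_over_r", "sqrt_r"):
    print(rule, relative_early_update(rule, r=64))
```

Moving the same recipe from rank 16 to rank 64 shrinks the early update to a quarter under fixed alpha, quadruples it under fixed alpha over r, and leaves it unchanged under the square-root rule. This is why a learning rate is only reusable across ranks when the scaling rule is recorded with it.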
Conclusion
The paper argues that PEFT enables scaling from one shared foundation model to millions of persistent personal model instances through three coupled axes:
- Scale Up establishes that stronger shared priors make small adapters high-leverage. Trillion-scale LoRA RL is operationally feasible with proper routing-aware correction and full-stack alignment.
- Scale Down identifies the efficient operating regime: middle ranks (16–32) are practical; lower ranks (1–4) are a research frontier where better initialization (OLoRA-tail) and rank-stable hyperparameters (square-root alpha) can close the reliability gap. Stateful adapters like δ-mem extend scale-down from parameter reduction to writable local state.
- Scale Out demonstrates three sources of population-level value: (1) individual personalization via memory and skill internalization, (2) realistic user simulation with per-user adapters that preserve heterogeneity, and (3) collective intelligence through diversity-based aggregation, where accuracy scales logarithmically with model count.
Open problems remain: RL-native PEFT theory, tiny-adapter reliability, stateful adapter design, signal efficiency for Context Learning, and scaling population-level aggregation mechanisms.
Final vision: Not one universal assistant, but an ecology of persistent, partially specialized agents built on strong shared priors and lightweight adapters. PEFT makes adaptation efficient, and through that efficiency, persistent individuality becomes scalable.