MinT: Managed Infrastructure for Training and Serving Millions of LLMs - Summary
Summary (Overview)
- Core Concept: MinT (MindLab Toolkit) is a managed infrastructure system that uses Low-Rank Adaptation (LoRA) adapter revisions as the fundamental unit for post-training and online serving, instead of full model checkpoints. It keeps expensive base models resident while moving only compact adapters through the lifecycle.
- Scaling Axes: The system scales along three axes: Scale Up (supports LoRA RL on frontier dense and Mixture-of-Experts (MoE) models beyond 1T parameters), Scale Down (minimizes training-serving handoff by moving only adapters), and Scale Out (expands the addressable policy namespace to millions while bounding engine-local execution).
- Key Performance Gains: Adapter-only handoff reduces the handoff step by 18.3× on a 4B dense model and 2.85× on a 30B MoE model. Concurrent multi-policy training under the same base allocation shortens wall time by 1.77× and 1.45×, respectively. Packed MoE LoRA tensors improve live engine loading by 8.5–8.7×.
- System Design: Separates the adapter revision (executable payload) from the policy record (service state). Provides a Tinker-compatible service interface that hides distributed training, serving, scheduling, and data movement complexity.
- Validation: Empirically validated across dense and MoE models (up to 1T parameters), multiple training paradigms (SFT, DPO, GRPO), and policy-population serving with catalogs up to 100k entries, demonstrating feasibility for million-scale managed LoRA policies.
Introduction and Theoretical Foundation
The evolution of LLM post-training from a simple stage to a complex, continuous workload introduces significant infrastructure challenges. As models scale to trillions of parameters and move towards lifelong learning and agentic capabilities, traditional workflows that materialize a full fine-tuned checkpoint for each model variant become untenable due to resource management, scheduling, and version control complexities.
MinT addresses this by adopting LoRA adapters as the basic policy units. The theoretical foundation rests on the premise that trained behaviors (task variants, product branches, experimental versions) can be effectively represented as different adapters applied to a shared, resident base model, as established by LoRA (Hu et al., 2022). This shifts the infrastructure problem from managing many full-model deployments to managing a population of lightweight adapter revisions over a small number of base deployments.
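For reference, the LoRA parameterization behind this premise can be written as follows. The symbols ($W_0$, $A$, $B$, $r$, $\alpha$, $d$, $k$) follow Hu et al. (2022) and are not defined in this summary.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ are trained while $W_0$ stays frozen, so each adapted matrix adds $r(d + k)$ parameters against $dk$ frozen ones. This is why an adapter revision is small relative to the base it attaches to.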
The core innovation is changing what crosses the training-serving boundary: instead of a full or merged checkpoint, MinT moves only the exported LoRA adapter revision to an inference engine that already holds the compatible base model (see Figure 2 in the paper).
Methodology
MinT's methodology is built around a service-oriented architecture with a clear separation between the control plane and the compute plane.
1. Service Plane & Policy Lifecycle:
- Manages durable policy records that contain metadata: base version, LoRA rank/target modules, training checkpoints, rollout records, and exported adapter revisions.
- Provides operation visibility, policy record resolution, and worker admission/eviction.
- Defines the adapter lifecycle: Training updates produce adapter tensors and optimizer state. Export freezes the current state into a fixed adapter revision in serving tensor layout. Rollout, evaluation, and serving select a specific revision.
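The split between a mutable policy record and a frozen adapter revision can be sketched roughly as below. This is an illustrative sketch only: the class names, fields, and content-hash revision IDs are assumptions made for this example, not MinT's actual schema or API.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass(frozen=True)
class AdapterRevision:
    """Immutable executable payload (hypothetical stand-in for exported tensors)."""
    revision_id: str
    base_version: str
    payload: bytes


@dataclass
class PolicyRecord:
    """Mutable service state that tracks metadata and exported revisions."""
    policy_id: str
    base_version: str
    lora_rank: int
    target_modules: tuple
    revisions: list = field(default_factory=list)

    def export(self, training_state: bytes) -> AdapterRevision:
        # Freeze the current training state into a fixed revision.
        # Using a content hash as the ID is an assumption of this sketch.
        digest = hashlib.sha256(training_state).hexdigest()[:12]
        revision = AdapterRevision(digest, self.base_version, bytes(training_state))
        self.revisions.append(revision)
        return revision

    def resolve(self, revision_id: str) -> AdapterRevision:
        # Rollout, evaluation, and serving each select one specific revision.
        for revision in self.revisions:
            if revision.revision_id == revision_id:
                return revision
        raise KeyError(revision_id)


record = PolicyRecord("demo-policy", "base-v1", lora_rank=32,
                      target_modules=("q_proj", "v_proj"))
first = record.export(b"adapter tensors after step 100")
second = record.export(b"adapter tensors after step 200")
rollback = record.resolve(first.revision_id)  # earlier revision is still addressable
```

Because a revision is frozen at export, continued training changes only the record's list of revisions, and rolling back amounts to resolving an earlier ID.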
2. Compute Plane & Resident Workers:
- Trainers: Can be single-worker PEFT or distributed Megatron groups for model-parallel bases. They implement time-sliced multi-LoRA training, where one trainer swaps only the LoRA tensors and optimizer state of different policies while keeping the base model resident.
- Samplers/Servers: Use vLLM engines that hold a base model resident and attach exported LoRA adapters for inference.
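The time-sliced multi-LoRA training described for trainers can be illustrated with a toy loop. This is a sketch under simplifying assumptions: plain Python lists stand in for LoRA tensors and optimizer state, the update rule is SGD with momentum, and the class and method names are hypothetical rather than taken from MinT.

```python
from dataclasses import dataclass


@dataclass
class PolicyState:
    """Per-policy state that gets swapped; the base is never copied."""
    adapter: list      # stand-in for LoRA tensors
    momentum: list     # stand-in for optimizer state
    steps: int = 0


class TimeSlicedTrainer:
    def __init__(self, base_weights):
        # Loaded once and kept resident. A real trainer would use it in the
        # forward and backward passes, which this toy omits.
        self.base = tuple(base_weights)
        self.swaps = 0

    def step(self, state, grad, lr=0.1, beta=0.9):
        # SGD with momentum, applied to adapter values only.
        for i, g in enumerate(grad):
            state.momentum[i] = beta * state.momentum[i] + g
            state.adapter[i] -= lr * state.momentum[i]
        state.steps += 1

    def run(self, policies, grads, rounds):
        # Round-robin time slices: each slice activates one policy's
        # adapter and optimizer state while the base stays in place.
        for _ in range(rounds):
            for name, state in policies.items():
                self.swaps += 1
                self.step(state, grads[name])


policies = {
    "policy-a": PolicyState(adapter=[0.0, 0.0], momentum=[0.0, 0.0]),
    "policy-b": PolicyState(adapter=[0.0, 0.0], momentum=[0.0, 0.0]),
}
grads = {"policy-a": [1.0, 0.0], "policy-b": [0.0, -1.0]}
trainer = TimeSlicedTrainer(base_weights=[0.5, -0.25, 1.0])
trainer.run(policies, grads, rounds=3)
```

The point of the sketch is the ownership split: per-policy state is small and moves between slices, while the base is allocated once for all policies.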
3. Key Technical Mechanisms:
- Adapter Data Flow: Exports trained LoRAs in PEFT format, converting sharded training views to serving layout. For MoE models, this includes gathering tensor-parallel slices and deduplicating shared-expert tensors.
- Consistency Handling: For MoE models, uses R3 to record and replay expert routing IDs from rollout to avoid training-inference mismatch. For Dynamic Sparse Attention (DSA), uses IcePop-style rollout correction to zero the importance weight of tokens where the training/rollout probability ratio falls outside a trusted band.
- Serving Cache Tiers: Separates adapter state into three tiers (see Table 2):
- Addressable Catalog: Durable, all exported revisions (scale: –).
- CPU Adapter Cache: Local to a serving actor, hundreds of adapters.
- GPU Batch: Currently executing adapters, ≤ 64 distinct adapters.
- Cold Loading as Service Work: Treats loading an adapter not in the CPU cache as scheduled work with deduplication and backpressure control.
- Packed Representation: For MoE LoRA, packs thousands of small tensor objects into a compact serving representation to reduce fanout and accelerate loading.
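The cache tiers and cold-load handling above can be sketched as a small scheduler. This is a toy model under stated assumptions: the `ServingActor` class, the LRU eviction policy, and per-batch deduplication are illustrative choices, and a real service would also coalesce concurrent in-flight loads across batches.

```python
from collections import OrderedDict


class ServingActor:
    """Toy three-tier model: durable catalog, LRU CPU cache, bounded GPU batch."""

    def __init__(self, catalog, cpu_capacity, gpu_batch_limit=64):
        if cpu_capacity < gpu_batch_limit:
            raise ValueError("CPU cache must be able to hold one full GPU batch")
        self.catalog = catalog            # addressable: every exported revision
        self.cpu_cache = OrderedDict()    # actor-local, least recently used first
        self.cpu_capacity = cpu_capacity
        self.gpu_batch_limit = gpu_batch_limit
        self.cold_loads = 0

    def _ensure_cached(self, adapter_id):
        if adapter_id in self.cpu_cache:
            self.cpu_cache.move_to_end(adapter_id)   # cache hit: refresh recency
            return
        self.cold_loads += 1                          # cold load from the catalog
        self.cpu_cache[adapter_id] = self.catalog[adapter_id]
        if len(self.cpu_cache) > self.cpu_capacity:
            self.cpu_cache.popitem(last=False)        # evict least recently used

    def build_batch(self, requests):
        """Admit requests in order; defer any that would exceed the adapter cap."""
        admitted, deferred, active = [], [], set()
        for request_id, adapter_id in requests:
            if adapter_id not in active:
                if len(active) >= self.gpu_batch_limit:
                    deferred.append((request_id, adapter_id))  # backpressure
                    continue
                self._ensure_cached(adapter_id)  # at most one load per adapter per batch
                active.add(adapter_id)
            admitted.append((request_id, adapter_id))
        return admitted, deferred


catalog = {f"adapter-{i}": b"" for i in range(10)}
actor = ServingActor(catalog, cpu_capacity=3, gpu_batch_limit=2)
requests = [(0, "adapter-0"), (1, "adapter-0"), (2, "adapter-1"),
            (3, "adapter-2"), (4, "adapter-1")]
admitted, deferred = actor.build_batch(requests)
```

In this run, two requests share `adapter-0` and trigger a single cold load, while the request for a third distinct adapter is deferred to a later batch instead of exceeding the cap.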
Empirical Validation / Results
Experiments validate the three scaling axes using models like Qwen3-4B, Qwen3-30B, Qwen3-235B-A22B, and Kimi K2 (1.04T).
Scale Down: Adapter Handoff & Concurrent Training
- Adapter-Only Handoff: Compared to a merge-and-load path, moving only the adapter drastically reduces handoff time.

| Model | Path | Checkpoint | File Size | Materialization/Load Time | Sample Speed (total / warm) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B | Adapter | rank-32 | 252 MiB | 0.036 s | 15.568 / 15.567 tok/s |
| Qwen3-4B | Merge | full model | 8.061 GB | 71.820 s | 4.697 / 20.595 tok/s |
| Qwen3-30B | Adapter | rank-16 | 1.692 GB | 46.455 s | 1.874 / 5.700 tok/s |
| Qwen3-30B | Merge | full model | 61.084 GB | 402.245 s | 1.573 / 6.904 tok/s |

- Handoff step reduction: 18.3× (4B), 2.85× (30B).
- Concurrent Multi-Policy Training: Time-slicing policies on a resident base improves GPU utilization and reduces total wall time without increasing peak memory.

| Model | Schedule | Wall Time | Speedup | Peak Memory |
| --- | --- | --- | --- | --- |
| Qwen3-4B | Sequential | 3081.2 s | 1.00× | 65.6 GiB |
| Qwen3-4B | Concurrent MinT | 1736.1 s | 1.77× | 65.6 GiB |
| Qwen3-30B | Sequential | 10130.0 s | 1.00× | 68.0 GiB |
| Qwen3-30B | Concurrent MinT | 7008.4 s | 1.45× | 68.0 GiB |
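The reported speedups follow directly from the wall times, assuming speedup is defined as sequential time divided by concurrent time. The short check below recomputes them; the dictionary layout and function name are just for this example.

```python
# Wall times in seconds, copied from the concurrent-training results above.
wall_times = {
    "Qwen3-4B": {"sequential": 3081.2, "concurrent": 1736.1},
    "Qwen3-30B": {"sequential": 10130.0, "concurrent": 7008.4},
}


def speedup(sequential_s, concurrent_s):
    """Ratio of sequential to concurrent wall time."""
    return sequential_s / concurrent_s


speedups = {
    model: speedup(t["sequential"], t["concurrent"])
    for model, t in wall_times.items()
}

for model, value in speedups.items():
    saved = 1 - 1 / value  # fraction of wall time saved
    print(f"{model}: {value:.2f}x speedup, {saved:.1%} of wall time saved")
```

Peak memory is identical in both schedules, so the gain comes from filling idle time on the resident base rather than from using more memory.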
Scale Up: Learning Across Paradigms & Model Scales
- Dense Models: The same adapter lifecycle successfully carried SFT (e.g., FinEval accuracy 0.4226 → 0.7811), DPO (reward margin -0.03 → 30.88), and GRPO (AIME24 train accuracy 0.11 → 0.47) updates.
- MoE Models: Validated LoRA RL on large sparse models, including a Qwen3-235B-A22B run reaching 0.967 peak mean@1 on AIME24 and a Kimi K2 1.04T countdown-task run. MoE route replay (R3) kept token-level route mismatch very low (e.g., 0.0013% on Qwen3-30B).
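Alongside route replay, the Methodology section describes an IcePop-style correction that zeroes the importance weight of tokens whose training/rollout probability ratio leaves a trusted band. A minimal sketch of that masking rule follows. The band limits (`low=0.5`, `high=2.0`) are placeholder values chosen for illustration, and keeping the raw ratio as the weight inside the band is an assumption of this sketch.

```python
import math


def icepop_weights(train_logprobs, rollout_logprobs, low=0.5, high=2.0):
    """Per-token importance weights, zeroed outside the trusted band [low, high]."""
    weights = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)  # p_train / p_rollout
        weights.append(ratio if low <= ratio <= high else 0.0)
    return weights


train = [math.log(0.5), math.log(0.9), math.log(0.1)]
rollout = [math.log(0.5), math.log(0.1), math.log(0.9)]
weights = icepop_weights(train, rollout)
```

Here the first token agrees between trainer and rollout engine and keeps weight 1.0, while the other two have ratios of about 9 and 0.11 and are masked out.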
Scale Out: Policy-Population Serving
- Catalog Scale: Successfully performed single-engine sweeps through catalogs of 100k adapter entries.
- Cache Tiers: On one 4-GPU serving actor, the CPU cache held 369–550 adapters, while the GPU batch executed with up to 64 distinct adapters (see Table 6).
- Cold Load Performance: Packing MoE LoRA tensors drastically reduced cold load overhead.

| Metric | Original | Packed | Effect |
| --- | --- | --- | --- |
| Tensor Objects | 37,248 | 672 | 55.4× fewer |
| Live Engine Load (N=16) | 1.388 s | 0.164 s | 8.5× faster |

- AutoResearch: The cookbook utilities enabled efficient recipe search, screening candidates with proxy tasks before full evaluation (e.g., on LawBench, moving from a base score of 0.4628 to a maintained recipe score of 0.5079).
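As a rough illustration of the packing idea behind the cold-load numbers above, the sketch below concatenates many small tensors into a single buffer with an offset index, so a loader handles one object instead of thousands. The summary does not specify MinT's packed format. The float32 layout, tensor names, and function names here are all hypothetical.

```python
import struct


def pack_tensors(tensors):
    """Concatenate small float32 tensors into one buffer plus an offset index."""
    index, chunks, offset = {}, [], 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        index[name] = (offset, len(values))             # byte offset, element count
        chunks.append(raw)
        offset += len(raw)
    return b"".join(chunks), index


def unpack_tensor(buffer, index, name):
    """Read one tensor back out of the packed buffer."""
    offset, count = index[name]
    return list(struct.unpack_from(f"<{count}f", buffer, offset))


# Hypothetical per-expert LoRA tensors; values are exact in float32.
tensors = {
    "layers.0.experts.0.lora_A": [0.5, 1.0],
    "layers.0.experts.0.lora_B": [2.0, -0.25, 4.0],
    "layers.0.experts.1.lora_A": [1.5, 0.0],
}
buffer, index = pack_tensors(tensors)
restored = unpack_tensor(buffer, index, "layers.0.experts.0.lora_B")
```

Three named tensors become one 28-byte buffer and a small index. The same idea applied per layer or per module would cut the object count a loader must fan out over.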
Theoretical and Practical Implications
- Infrastructure Abstraction: MinT redefines the unit of management for post-training infrastructures from the "model checkpoint" to the "adapter revision." This provides a scalable abstraction for the emerging workload of maintaining large populations of continuously evolving policies over shared frontier bases.
- Resource Efficiency: By eliminating full-checkpoint materialization and enabling concurrent training over resident bases, MinT makes large-scale, multi-tenant LoRA RL services more cost-effective and practical to operate.
- Reproducibility and Control: The separation of adapter revisions and policy records, along with the service interface, improves reproducibility, precise rollback, and controlled rollout of policies. The integrated cookbook supports systematic recipe development (AutoResearch).
- Path to Personalization and Specialization: The ability to manage millions of addressable policies over massive base models opens a practical path towards large-scale organizational and personal policy customization without the overhead of deploying separate full models.
Conclusion
MinT demonstrates that exported LoRA adapter revisions can serve as the effective managed unit for scalable post-training infrastructure. By keeping base models resident and moving only adapters, it addresses the scaling challenges along three axes:
- Scale Up to trillion-parameter sparse models.
- Scale Down the training-serving handoff, achieving significant time and resource savings.
- Scale Out the policy namespace to millions, decoupling addressability from local resource bounds.
The system hides the complexity of distributed training, serving, and data movement behind a service interface, making large-scale LoRA-based reinforcement learning easier to run, reproduce, and deploy. This supports multi-tenant training services and the management of large populations of specialized policies over shared frontier base models.