MinT: Managed Infrastructure for Training and Serving Millions of LLMs - Summary

Summary (Overview)

Core Concept: MinT (MindLab Toolkit) is a managed infrastructure system that uses Low-Rank Adaptation (LoRA) adapter revisions as the fundamental unit for post-training and online serving, instead of full model checkpoints. It keeps expensive base models resident while moving only compact adapters through the lifecycle.
Scaling Axes: The system scales along three axes: Scale Up (supports LoRA RL on frontier dense and Mixture-of-Experts (MoE) models beyond 1T parameters), Scale Down (minimizes training-serving handoff by moving only adapters), and Scale Out (expands the addressable policy namespace to millions while bounding engine-local execution).
Key Performance Gains: Adapter-only handoff reduces the handoff step by 18.3× on a 4B dense model and 2.85× on a 30B MoE model. Concurrent multi-policy training under the same base allocation shortens wall time by 1.77× and 1.45×, respectively. Packed MoE LoRA tensors improve live engine loading by 8.5–8.7×.
System Design: Separates the adapter revision (executable payload) from the policy record (service state). Provides a Tinker-compatible service interface that hides distributed training, serving, scheduling, and data movement complexity.
Validation: Empirically validated across dense and MoE models (up to 1T parameters), multiple training paradigms (SFT, DPO, GRPO), and policy-population serving with catalogs up to 100k entries, demonstrating feasibility for million-scale managed LoRA policies.

Introduction and Theoretical Foundation

LLM post-training has grown from a single stage into a complex, continuous workload, which creates significant infrastructure challenges. As models scale to trillions of parameters and move toward lifelong learning and agentic capabilities, workflows that materialize a full fine-tuned checkpoint for each model variant become untenable: every variant adds resource-management, scheduling, and version-control overhead.

MinT addresses this by adopting LoRA adapters as the basic policy units. The theoretical foundation rests on the premise that trained behaviors (task variants, product branches, experimental versions) can be effectively represented as different adapters applied to a shared, resident base model, as established by LoRA (Hu et al., 2022). This shifts the infrastructure problem from managing many full-model deployments to managing a population of lightweight adapter revisions over a small number of base deployments.

The core innovation is changing what crosses the training-serving boundary: instead of a full or merged checkpoint, MinT moves only the exported LoRA adapter revision to an inference engine that already holds the compatible base model (see Figure 2 in the paper).
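
To make the size argument concrete, here is a minimal sketch of the standard LoRA parameter count: for a weight matrix of shape d_out × d_in, a rank-r adapter stores two factors with r × (d_in + d_out) parameters in total, versus d_in × d_out for the dense matrix. The layer shape and rank below are illustrative assumptions, not figures from the paper.

```python
def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank, d_in) and B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 32

full = dense_param_count(d_in, d_out)
adapter = lora_param_count(d_in, d_out, rank)
ratio = full // adapter

print(f"dense={full:,} adapter={adapter:,} ratio={ratio}x")
```

Under these assumed shapes the adapter is 64 times smaller than the matrix it modifies, which is why moving only the adapter across the training-serving boundary is cheap compared with moving a merged checkpoint.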

Methodology

MinT is built as a service-oriented architecture that separates a service plane, which owns policy records and worker admission, from a compute plane of resident training and serving workers.

1. Service Plane & Policy Lifecycle:

Manages durable policy records that contain metadata: base version, LoRA rank/target modules, training checkpoints, rollout records, and exported adapter revisions.
Provides operation visibility, policy record resolution, and worker admission/eviction.
Defines the adapter lifecycle: Training updates produce adapter tensors and optimizer state. Export freezes the current state into a fixed adapter revision in serving tensor layout. Rollout, evaluation, and serving select a specific revision.
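
The split between a mutable policy record and immutable adapter revisions can be sketched as two small data types. This is a minimal illustration under assumed names: `PolicyRecord`, `AdapterRevision`, the field names, and the `store://` URIs are hypothetical and do not come from MinT's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AdapterRevision:
    """Immutable exported payload (hypothetical schema)."""
    revision_id: int
    base_version: str
    rank: int
    target_modules: Tuple[str, ...]
    tensors_uri: str  # where the serving-layout tensors live (assumed)


@dataclass
class PolicyRecord:
    """Mutable service state that points at fixed revisions (hypothetical)."""
    policy_id: str
    base_version: str
    rank: int
    target_modules: Tuple[str, ...]
    revisions: List[AdapterRevision] = field(default_factory=list)
    serving_revision: Optional[int] = None

    def export(self, tensors_uri: str) -> AdapterRevision:
        """Freeze the current training state into a new numbered revision."""
        rev = AdapterRevision(
            revision_id=len(self.revisions) + 1,
            base_version=self.base_version,
            rank=self.rank,
            target_modules=self.target_modules,
            tensors_uri=tensors_uri,
        )
        self.revisions.append(rev)
        return rev

    def select(self, revision_id: int) -> AdapterRevision:
        """Choose a specific revision for rollout, evaluation, or serving."""
        for rev in self.revisions:
            if rev.revision_id == revision_id:
                self.serving_revision = revision_id
                return rev
        raise KeyError(f"unknown revision {revision_id}")


record = PolicyRecord("demo-policy", "base-v1", 32, ("q_proj", "v_proj"))
r1 = record.export("store://demo-policy/1")
r2 = record.export("store://demo-policy/2")
rolled_back = record.select(1)  # rollback is just re-selecting an older revision
```

Because a revision is frozen once exported, rolling back changes only the pointer in the record and never rewrites a payload.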

2. Compute Plane & Resident Workers:

Trainers: Can be single-worker PEFT or distributed Megatron groups for model-parallel bases. They implement time-sliced multi-LoRA training, where one trainer swaps only the LoRA tensors and optimizer state of different policies while keeping the base model resident.
Samplers/Servers: Use vLLM engines that hold a base model resident and attach exported LoRA adapters for inference.
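
Time-sliced multi-LoRA training can be pictured as a round-robin loop that swaps only per-policy state while the base stays loaded. The sketch below is a toy under stated assumptions: `ResidentTrainer`, `PolicyState`, the constant dummy gradient, and the momentum update are all hypothetical stand-ins, not MinT's trainer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PolicyState:
    """Per-policy state that is swapped in and out (hypothetical)."""
    lora: List[float]
    momentum: List[float]
    steps: int = 0


class ResidentTrainer:
    """Holds the base once; only the active policy state changes."""

    def __init__(self, base: Tuple[float, ...]) -> None:
        self.base = base  # loaded once and never swapped
        self.active: Optional[PolicyState] = None
        self.swaps = 0

    def swap_in(self, state: PolicyState) -> None:
        if self.active is not state:
            self.active = state
            self.swaps += 1

    def train_step(self, grad: List[float], lr: float = 0.1, beta: float = 0.9) -> None:
        state = self.active
        assert state is not None, "swap a policy in before training"
        for i, g in enumerate(grad):
            state.momentum[i] = beta * state.momentum[i] + g
            state.lora[i] -= lr * state.momentum[i]
        state.steps += 1


def run_time_sliced(trainer: ResidentTrainer, policies: Dict[str, PolicyState],
                    steps_per_slice: int, total_steps: int) -> None:
    """Give each unfinished policy a slice of steps in turn."""
    while any(p.steps < total_steps for p in policies.values()):
        for state in policies.values():
            if state.steps >= total_steps:
                continue
            trainer.swap_in(state)
            for _ in range(min(steps_per_slice, total_steps - state.steps)):
                # Dummy gradient; a real trainer would compute it from data.
                trainer.train_step([1.0] * len(state.lora))


base = (0.5, 0.5, 0.5)
trainer = ResidentTrainer(base)
policies = {
    "policy-a": PolicyState(lora=[0.0, 0.0], momentum=[0.0, 0.0]),
    "policy-b": PolicyState(lora=[1.0, 1.0], momentum=[0.0, 0.0]),
}
run_time_sliced(trainer, policies, steps_per_slice=2, total_steps=4)
```

Each policy keeps its own optimizer state across slices, so interleaving does not mix updates between policies.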

3. Key Technical Mechanisms:

Adapter Data Flow: Exports trained LoRAs in PEFT format, converting sharded training views to serving layout. For MoE models, this includes gathering tensor-parallel slices and deduplicating shared-expert tensors.
Consistency Handling: For MoE models, uses R3 to record and replay expert routing IDs from rollout to avoid training-inference mismatch. For Dynamic Sparse Attention (DSA), uses IcePop-style rollout correction to zero the importance weight of tokens where the training/rollout probability ratio falls outside a trusted band.
Serving Cache Tiers: Separates adapter state into three tiers (see Table 2):
1. Addressable Catalog: Durable, all exported revisions (scale: $10^3$–$10^6$).
2. CPU Adapter Cache: Local to a serving actor, hundreds of adapters.
3. GPU Batch: Currently executing adapters, ≤ 64 distinct adapters.
Cold Loading as Service Work: Treats loading an adapter not in the CPU cache as scheduled work with deduplication and backpressure control.
Packed Representation: For MoE LoRA, packs thousands of small tensor objects into a compact serving representation to reduce fanout and accelerate loading.
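
The three cache tiers and the treatment of cold loads as scheduled work can be sketched with a small synchronous model. Everything here is an assumption made for illustration: the `ServingActor` class, its method names, the LRU eviction policy, and the pending-load limit are hypothetical and not taken from MinT.

```python
from collections import OrderedDict
from typing import Dict, List


class ServingActor:
    """Toy model: durable catalog -> CPU LRU cache -> bounded GPU batch."""

    def __init__(self, catalog: Dict[int, str], cpu_capacity: int,
                 gpu_batch_limit: int = 64, max_pending_loads: int = 8) -> None:
        self.catalog = catalog                          # tier 1: addressable catalog
        self.cpu_cache: "OrderedDict[int, str]" = OrderedDict()  # tier 2: CPU cache
        self.cpu_capacity = cpu_capacity
        self.gpu_batch_limit = gpu_batch_limit          # tier 3: distinct adapters per batch
        self.max_pending_loads = max_pending_loads
        self.pending: Dict[int, List[str]] = {}         # adapter id -> waiting request ids
        self.cold_loads = 0

    def request(self, request_id: str, adapter_id: int) -> str:
        """Return 'hit', 'queued', or 'rejected' for one request."""
        if adapter_id not in self.catalog:
            raise KeyError(f"adapter {adapter_id} is not in the catalog")
        if adapter_id in self.cpu_cache:
            self.cpu_cache.move_to_end(adapter_id)      # refresh LRU position
            return "hit"
        if adapter_id in self.pending:
            self.pending[adapter_id].append(request_id)  # deduplicate the load
            return "queued"
        if len(self.pending) >= self.max_pending_loads:
            return "rejected"                           # backpressure
        self.pending[adapter_id] = [request_id]
        return "queued"

    def run_cold_loads(self) -> List[str]:
        """Perform each pending load once and release its waiting requests."""
        released: List[str] = []
        for adapter_id, waiters in list(self.pending.items()):
            self.cpu_cache[adapter_id] = self.catalog[adapter_id]
            self.cold_loads += 1
            while len(self.cpu_cache) > self.cpu_capacity:
                self.cpu_cache.popitem(last=False)      # evict least recently used
            released.extend(waiters)
            del self.pending[adapter_id]
        return released

    def form_gpu_batch(self, adapter_ids: List[int]) -> List[int]:
        """Pick distinct cached adapters, up to the GPU batch limit."""
        batch: List[int] = []
        for adapter_id in adapter_ids:
            if adapter_id in self.cpu_cache and adapter_id not in batch:
                batch.append(adapter_id)
            if len(batch) == self.gpu_batch_limit:
                break
        return batch


catalog = {i: f"adapter-{i}" for i in range(1000)}
actor = ServingActor(catalog, cpu_capacity=2, gpu_batch_limit=2, max_pending_loads=2)
outcomes = [actor.request("r1", 7), actor.request("r2", 7),
            actor.request("r3", 8), actor.request("r4", 9)]
released = actor.run_cold_loads()
```

In this run two requests for adapter 7 share a single load, and the request for adapter 9 is rejected because two loads are already pending. The catalog can grow independently of the cache and batch limits, which is the sense in which addressability is decoupled from engine-local capacity.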

Empirical Validation / Results

Experiments validate the three scaling axes using models like Qwen3-4B, Qwen3-30B, Qwen3-235B-A22B, and Kimi K2 (1.04T).

Scale Down: Adapter Handoff & Concurrent Training

Adapter-Only Handoff: Compared with a merge-and-load path, moving only the adapter shrinks both the artifact that must be materialized and the time needed to load it.

Model	Path	Checkpoint Size	Materialization/Load Time	Sample Speed (Total / Warm)
Qwen3-4B	Adapter rank-32	252 MiB	0.036 s	15.568/15.567 tok/s
Qwen3-4B	Merge full model	8.061 GB	71.820 s	4.697/20.595 tok/s
Qwen3-30B	Adapter rank-16	1.692 GB	46.455 s	1.874/5.700 tok/s
Qwen3-30B	Merge full model	61.084 GB	402.245 s	1.573/6.904 tok/s

Handoff step reduction: 18.3× (4B), 2.85× (30B).

Concurrent Multi-Policy Training: Time-slicing policies on a resident base improves GPU utilization and reduces total wall time without increasing peak memory.

Model	Schedule	Wall Time	Speedup	Peak Memory
Qwen3-4B	Sequential	3081.2 s	1.00×	65.6 GiB
Qwen3-4B	Concurrent MinT	1736.1 s	1.77×	65.6 GiB
Qwen3-30B	Sequential	10130.0 s	1.00×	68.0 GiB
Qwen3-30B	Concurrent MinT	7008.4 s	1.45×	68.0 GiB
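
The speedup column is the ratio of sequential to concurrent wall time. A quick check of that arithmetic, using only the wall times from the table above:

```python
# (sequential seconds, concurrent seconds), copied from the table.
wall_times = {
    "Qwen3-4B": (3081.2, 1736.1),
    "Qwen3-30B": (10130.0, 7008.4),
}

speedups = {
    model: round(sequential / concurrent, 2)
    for model, (sequential, concurrent) in wall_times.items()
}

for model, speedup in speedups.items():
    print(f"{model}: {speedup:.2f}x")
```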

Scale Up: Learning Across Paradigms & Model Scales

Dense Models: The same adapter lifecycle successfully carried SFT (e.g., FinEval accuracy 0.4226 → 0.7811), DPO (reward margin -0.03 → 30.88), and GRPO (AIME24 train accuracy 0.11 → 0.47) updates.
MoE Models: Validated LoRA RL on large sparse models, including a Qwen3-235B-A22B run reaching 0.967 peak mean@1 on AIME24 and a Kimi K2 1.04T countdown-task run. MoE route replay (R3) kept token-level route mismatch very low (e.g., 0.0013% on Qwen3-30B).

Scale Out: Policy-Population Serving

Catalog Scale: Successfully performed single-engine sweeps through catalogs of 100k adapter entries.
Cache Tiers: On one 4-GPU serving actor, the CPU cache held 369–550 adapters, while the GPU batch executed with up to 64 distinct adapters (see Table 6).
Cold Load Performance: Packing MoE LoRA tensors cut the tensor object count by 55.4× and the live engine load time by 8.5× at N=16.

Metric	Original	Packed	Effect
Tensor Objects	37,248	672	55.4× fewer
Live Engine Load (N=16)	1.388 s	0.164 s	8.5× faster

AutoResearch: The cookbook utilities enabled efficient recipe search, screening candidates with proxy tasks before full evaluation (e.g., on LawBench, moving from a base score of 0.4628 to a maintained recipe score of 0.5079).
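
The fanout reduction from packing MoE LoRA tensors, reported in the cold-load comparison above, can be illustrated with a toy packer that merges per-expert tensors into one buffer per layer and module. This is only a sketch of the general idea: the tensor naming scheme, the grouping key, and the byte-level layout are hypothetical and are not MinT's serving format.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical naming scheme: "layers.<L>.experts.<E>.<module>.<lora_A|lora_B>".
EXPERT = re.compile(r"\.experts\.(\d+)\.")

Packed = Dict[str, Tuple[bytes, Dict[int, Tuple[int, int]]]]


def pack_expert_tensors(tensors: Dict[str, bytes]) -> Packed:
    """Merge per-expert blobs into one buffer per group, with an offset index."""
    groups: Dict[str, List[Tuple[int, bytes]]] = defaultdict(list)
    for name, blob in tensors.items():
        match = EXPERT.search(name)
        if match:
            key = EXPERT.sub(".experts.", name)  # drop the expert index
            groups[key].append((int(match.group(1)), blob))
        else:
            groups[name].append((0, blob))       # non-expert tensors stay alone

    packed: Packed = {}
    for key, members in groups.items():
        members.sort(key=lambda item: item[0])
        index: Dict[int, Tuple[int, int]] = {}
        offset = 0
        for expert_id, blob in members:
            index[expert_id] = (offset, len(blob))
            offset += len(blob)
        packed[key] = (b"".join(blob for _, blob in members), index)
    return packed


def read_expert(packed: Packed, key: str, expert_id: int) -> bytes:
    """Slice one expert's blob back out of a packed buffer."""
    buffer, index = packed[key]
    offset, length = index[expert_id]
    return buffer[offset:offset + length]


# Toy input: 2 layers x 4 experts x 2 matrices = 16 small objects.
tensors = {
    f"layers.{layer}.experts.{expert}.up_proj.{matrix}": bytes([layer, expert]) * 4
    for layer in range(2)
    for expert in range(4)
    for matrix in ("lora_A", "lora_B")
}
packed = pack_expert_tensors(tensors)
print(f"{len(tensors)} objects -> {len(packed)} packed objects")
```

The loader then handles a few large objects instead of many small ones, while any single expert's tensor remains addressable through the offset index.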

Theoretical and Practical Implications

Infrastructure Abstraction: MinT redefines the unit of management for post-training infrastructures from the "model checkpoint" to the "adapter revision." This provides a scalable abstraction for the emerging workload of maintaining large populations of continuously evolving policies over shared frontier bases.
Resource Efficiency: By eliminating full-checkpoint materialization and enabling concurrent training over resident bases, MinT makes large-scale, multi-tenant LoRA RL services more cost-effective and practical to operate.
Reproducibility and Control: The separation of adapter revisions and policy records, along with the service interface, improves reproducibility, precise rollback, and controlled rollout of policies. The integrated cookbook supports systematic recipe development (AutoResearch).
Path to Personalization and Specialization: The ability to manage millions of addressable policies over massive base models opens a practical path towards large-scale organizational and personal policy customization without the overhead of deploying separate full models.

Conclusion

MinT demonstrates that exported LoRA adapter revisions can serve as the effective managed unit for scalable post-training infrastructure. By keeping base models resident and moving only adapters, it addresses the scaling challenges along three axes:

Scale Up to trillion-parameter sparse models.
Scale Down the training-serving handoff, achieving significant time and resource savings.
Scale Out the policy namespace to millions, decoupling addressability from local resource bounds.

The system hides the complexity of distributed training, serving, and data movement behind a service interface, making large-scale LoRA-based reinforcement learning easier to run, reproduce, and deploy. This supports multi-tenant training services that manage large populations of specialized policies over shared frontier base models.