# Redesign Mixture-of-Experts Routers with Manifold Power Iteration

> Manifold Power Iteration (MPI) aligns router rows with expert weight principal directions, boosting MoE model accuracy and convergence with minimal overhead.

- **Source:** [arXiv](https://arxiv.org/abs/2606.12397)
- **Published:** 2026-06-12
- **Permalink:** https://picx.dev/p/leCPAG
- **Whiteboard:** https://picx.dev/p/leCPAG/image

## Summary (Overview)

- Proposes **Manifold Power Iteration (MPI)**, a principled redesign of MoE routers that aligns each router row with the principal singular direction of its corresponding expert weight matrix.
- Introduces a **“Power-then-Retract”** paradigm: a single power iteration step on router weights followed by L₂ normalization to ensure stability and efficiency.
- Proves that MPI drives each router row toward the principal singular direction of its expert's weights, and shows that the update is equivalent to steepest ascent under a norm constraint.
- Demonstrates consistent improvements in convergence speed, downstream performance, and load balancing across MoE models from 1B to 11B parameters, using multiple optimizers (AdamW, Muon, AdamH, MuonH).
- Shows that MPI introduces negligible training overhead (≈0.2% slowdown) and zero inference overhead, making it practical for large-scale deployment.

## Introduction and Theoretical Foundation

Mixture-of-Experts (MoE) models scale LLM capacity by replacing FFNs with multiple expert modules, using a router to select a sparse subset of experts per token. The router is typically a linear matrix \(R \in \mathbb{R}^{N \times D}\), where each row \(R[i]\) serves as a proxy for the \(i\)-th expert. Ideally, \(R[i]\) should encode the most informative features of the expert’s weight matrix \(W_i^*\) to accurately reflect token–expert affinity. However, conventional router designs lack explicit constraints to enforce this encoding, leading to suboptimal routing.

The authors propose aligning each router row with the **principal singular direction** of the associated expert weight matrix. The choice is motivated by matrix theory: by the Eckart–Young theorem, the best rank-1 approximation of a matrix is built from its principal singular vectors, so this direction captures more of the matrix's energy than any other single direction. Mathematically, the alignment is equivalent to maximizing the squared projection:

\[
\max_{R[i]} \phi(W_i^*, R[i]) = \frac{\|R[i] W_i^*\|_2^2}{\|R[i]\|_2^2}
\tag{3}
\]

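where \(\phi(\cdot)\) is the Rayleigh quotient of \(W_i^* W_i^{*\top}\) evaluated at \(R[i]\). Its maximum is the largest squared singular value of \(W_i^*\), attained when \(R[i]\) points along the principal left singular vector. Running a full SVD at every training step is prohibitively expensive, so the authors use **power iteration** as a lightweight online approximation.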
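To make the link between Eq. 3 and power iteration concrete, here is a minimal NumPy sketch. The dimensions `D` and `H`, the random matrix `W`, and the helper name `phi` are illustrative assumptions, not values or names from the paper. It checks that repeated multiplication by \(W W^\top\) with renormalization reaches the same direction and the same objective value as a full SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 32                      # hypothetical router dim and expert hidden dim
W = rng.standard_normal((D, H))    # random stand-in for one expert weight matrix

def phi(r, W):
    """Squared projection ||r W||^2 / ||r||^2 from Eq. 3."""
    return np.sum((r @ W) ** 2) / np.sum(r ** 2)

# Reference answer from a full SVD (the cost power iteration avoids in training).
U, S, _ = np.linalg.svd(W, full_matrices=False)
u1, sigma1 = U[:, 0], S[0]

# Plain power iteration on W W^T, starting from a random row vector.
r = rng.standard_normal(D)
for _ in range(2000):
    r = (r @ W) @ W.T              # multiply by W W^T without forming it
    r /= np.linalg.norm(r)         # renormalize to avoid overflow

cosine = abs(r @ u1)               # sign of a singular vector is arbitrary
print(f"cosine to principal direction: {cosine:.6f}")
print(f"phi(r) = {phi(r, W):.4f}, sigma_1^2 = {sigma1 ** 2:.4f}")
```

Many iterations are used here only to show the fixed point; MPI itself applies a single step per update and relies on training to accumulate the effect.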

## Methodology

### Manifold Power Iteration (MPI)

The proposed method follows a **“Power-then-Retract”** paradigm:

1. **Power iteration step**: For each router row \(R[i]\), compute:
   \[
   \hat{R}[i] = R[i] W_g^i W_g^{i\top}
   \tag{4}
   \]
   where \(W_g^i\) is the gate projection matrix of expert \(i\), which plays the role of the expert weight matrix \(W_i^*\) in Eq. 3.

2. **Retraction step**: Normalize to a constant norm \(C\) to prevent divergence:
   \[
   R'[i] = C \cdot \frac{\hat{R}[i]}{\|\hat{R}[i]\|_2}
   \tag{5}
   \]

The final gating weights are computed using the updated router matrix \(R'\):
\[
w' = \text{Softmax}\left( \text{TopK}\left( x R'^\top \right) \right)
\tag{6}
\]
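The two steps and the gating rule above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names `mpi_router` and `gate`, the `(N, D, H)` stacking of gate projections, the `eps` guard, and all shapes are hypothetical, and the sketch says nothing about how gradients flow through the step during training.

```python
import numpy as np

def mpi_router(R, W_g, C_prime=4.0, eps=1e-12):
    """One Power-then-Retract step (Eqs. 4, 5 and 7).

    R:   (N, D) router matrix, one row per expert.
    W_g: (N, D, H) stacked gate projections, one (D, H) matrix per expert.
    """
    N = R.shape[0]
    C = C_prime / np.sqrt(N)                        # Eq. 7
    RW = np.einsum("nd,ndh->nh", R, W_g)            # R[i] W_g^i
    R_hat = np.einsum("nh,ndh->nd", RW, W_g)        # ... times W_g^{i T}, Eq. 4
    norms = np.linalg.norm(R_hat, axis=1, keepdims=True)
    return C * R_hat / np.maximum(norms, eps)       # Eq. 5

def gate(x, R_prime, k):
    """Eq. 6: softmax over each token's top-k logits; other experts get weight 0."""
    logits = x @ R_prime.T                                        # (T, N)
    top_idx = np.argpartition(-logits, k - 1, axis=1)[:, :k]      # top-k experts
    top = np.take_along_axis(logits, top_idx, axis=1)
    top = top - top.max(axis=1, keepdims=True)                    # stable softmax
    weights = np.exp(top)
    weights /= weights.sum(axis=1, keepdims=True)
    w = np.zeros_like(logits)
    np.put_along_axis(w, top_idx, weights, axis=1)
    return w

rng = np.random.default_rng(0)
N, D, H, T, k = 8, 16, 32, 4, 2                     # hypothetical sizes
R = rng.standard_normal((N, D))
W_g = rng.standard_normal((N, D, H))
R_prime = mpi_router(R, W_g)
w = gate(rng.standard_normal((T, D)), R_prime, k)
print(np.linalg.norm(R_prime, axis=1))              # every row has norm C'/sqrt(N)
print(w.round(3))
```

Using two `einsum` calls keeps the cost at two matrix-vector products per expert and never materializes the \(D \times D\) matrix \(W_g^i W_g^{i\top}\).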

### Design Principle for \(C\)

To bound routing logits at \(O(1)\), the authors derive:
\[
C := \frac{C'}{\sqrt{N}}, \quad C' \text{ a scale-invariant hyperparameter}
\tag{7}
\]
This decouples the scaling effect from the number of experts \(N\).
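
One way to see what Eq. 7 buys, offered as an illustration rather than the authors' derivation: after retraction every row of \(R'\) has norm \(C\), so

\[
\|R'\|_F^2 = \sum_{i=1}^{N} \|R'[i]\|_2^2 = N C^2 = C'^2,
\qquad
|x R'[i]^\top| \le \|x\|_2 \, \|R'[i]\|_2 = \frac{C'}{\sqrt{N}} \|x\|_2 .
\]

The total router norm therefore depends on \(C'\) alone, and the second relation (Cauchy–Schwarz) bounds every logit. For example, with \(N = 64\) experts and \(C' = 4\), each row has norm \(C = 4 / 8 = 0.5\) while \(\|R'\|_F\) stays at 4.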

### Connection to Optimization on the Sphere

The update induced by MPI is shown to be equivalent to **steepest ascent** on the spherical manifold under the maximum projection objective (Eq. 3). A gradient step of size \(\eta\) in the tangent space is:
\[
\Delta r_g = \eta \left( R'[i] M - R'[i] (R'[i] M R'[i]^\top) \right)
\tag{9}
\]
where \(M = W_g^i W_g^{i\top}\) for expert \(i\). The MPI update approximates this step with an adaptive step size:
\[
\Delta r_M \approx \frac{1}{R'[i] M R'[i]^\top} \left( R'[i] M - R'[i] (R'[i] M R'[i]^\top) \right)
\tag{10}
\]
Because the step size is the inverse of the current Rayleigh quotient \(R'[i] M R'[i]^\top\), updates shrink as alignment improves, and each update moves the router row toward the principal singular direction of the expert weights.
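The relation between Eqs. 9 and 10 can be checked numerically. The sketch below assumes unit-norm rows (\(C = 1\)) and a random stand-in for \(W_g\); the helper names `tangent_gradient` and `mpi_displacement` are hypothetical. It verifies that the bracketed term is tangent to the sphere, that the true power-then-retract displacement is close to Eq. 10 near the principal direction, and that the step increases the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 32                          # hypothetical dimensions
W_g = rng.standard_normal((D, H))      # stand-in for one expert's gate projection
M = W_g @ W_g.T                        # symmetric positive semidefinite

def tangent_gradient(r, M):
    """Bracketed term of Eqs. 9-10 for a unit-norm row r."""
    return r @ M - r * (r @ M @ r)

def mpi_displacement(r, M):
    """Actual change in r after one power step and retraction to unit norm."""
    r_hat = r @ M
    return r_hat / np.linalg.norm(r_hat) - r

# Start close to the principal direction, where the approximation is meant to hold.
u1 = np.linalg.eigh(M)[1][:, -1]       # eigh sorts eigenvalues in ascending order
r = u1 + 0.01 * rng.standard_normal(D)
r /= np.linalg.norm(r)

g = tangent_gradient(r, M)
step_eq10 = g / (r @ M @ r)            # Eq. 10: step size 1 / (r M r^T)
step_mpi = mpi_displacement(r, M)

tangency = abs(g @ r)                  # ~0: g is orthogonal to r
rel_err = np.linalg.norm(step_mpi - step_eq10) / np.linalg.norm(step_mpi)
r_next = r + step_mpi
gain = r_next @ M @ r_next - r @ M @ r # change in the Rayleigh quotient

print(f"tangency={tangency:.2e}  rel_err={rel_err:.4f}  gain={gain:.4f}")
```

The two displacements differ only in the normalizer, \(\|rM\|_2\) versus \(r M r^\top\), and these coincide exactly when \(r\) is an eigenvector of \(M\), which is why the gap closes as alignment improves.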

## Empirical Validation / Results

### Experimental Setup

- **Scales**: 1B, 3B, 11B parameters, trained on 100B–350B tokens (FineWeb-Edu).
- **Optimizers**: AdamW, Muon, AdamH, MuonH.
- **Evaluation**: 25 downstream benchmarks (ARC-C, MMLU, TriviaQA, NaturalQs, BBH, GSM8K, MBPP, etc.).

### Key Results

| Optimizer | MoE (Avg. Acc.) | MoE + MPI (Avg. Acc.) |
|-----------|------------------|------------------------|
| AdamW     | 42.26            | **43.56**              |
| AdamH     | 42.59            | **43.93**              |
| Muon      | 43.01            | **43.55**              |
| MuonH     | 42.78            | **43.98**              |

**Table 1**: Downstream performance (average accuracy across 25 benchmarks) at 1B scale. MPI improves every optimizer, with gains ranging from +0.54 (Muon) to +1.34 (AdamH) points.

**Convergence**: MoE with MPI achieves faster loss reduction (e.g., 0.013 lower loss for MuonH-1B, Figure 2). At 11B scale, the advantage persists throughout training (Figure 3a).

**Downstream Performance (3B and 11B)**:

| Task | MoE 3B | + MPI 3B | MoE 11B | + MPI 11B |
|------|--------|----------|---------|-----------|
| ARC-C | 55.91 | **58.96** | 61.54 | **62.24** |
| MMLU | 47.01 | **48.83** | 50.00 | **50.93** |
| TriviaQA | 45.78 | **46.52** | 55.41 | **56.89** |
| NaturalQs | 17.87 | **20.13** | 25.30 | **25.36** |
| BBH | 29.53 | **30.99** | 31.17 | **31.45** |
| GSM8K | 16.22 | **20.92** | 17.89 | **27.60** |
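| MBPP | 42.25 | **44.54** | **45.12** | 44.87 |
| **Average** | 36.37 | **38.70** | 40.92 | **42.76** |

**Table 3**: Challenging benchmark results (best result per scale in bold). MPI raises the average by 2.33 points at 3B and 1.84 points at 11B; the only regression is MBPP at 11B (44.87 vs. 45.12).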

**Load Balancing**: MPI reduces load imbalance. For 3B models:

| Metric | MoE | MoE + MPI |
|--------|-----|-----------|
| MaxVio Batch | 1.133 | **1.024** |
| MaxVio Global | 0.964 | **0.711** |

**Table 4**: MaxVio (lower is better). MPI achieves better load distribution.
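
The summary does not define MaxVio. The sketch below assumes a definition commonly used in load-balancing work, the relative excess of the busiest expert over the mean load, \((\max_i \text{Load}_i - \overline{\text{Load}}) / \overline{\text{Load}}\); both this formula and the function name `max_violation` are assumptions, so treat the code as an aid to reading Table 4 rather than the paper's exact metric.

```python
import numpy as np

def max_violation(expert_ids, num_experts):
    """Relative excess load of the busiest expert (assumed MaxVio definition).

    expert_ids: integer array of selected experts, any shape (e.g. tokens x k).
    """
    counts = np.bincount(np.ravel(expert_ids), minlength=num_experts)
    mean_load = counts.mean()
    return (counts.max() - mean_load) / mean_load

# Four tokens, top-2 routing over four experts (toy assignments).
balanced = np.array([[0, 1], [2, 3], [0, 1], [2, 3]])   # loads: 2, 2, 2, 2
skewed = np.array([[0, 1], [0, 1], [0, 2], [0, 3]])     # loads: 4, 2, 1, 1

print(max_violation(balanced, 4))   # perfectly even load
print(max_violation(skewed, 4))     # expert 0 carries twice the mean load
```

Under this reading, a value of 0 means perfectly even load, and a value of 1 means the busiest expert receives twice the average number of tokens.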

**Router-Expert Alignment**: The projection \(\lambda = \frac{\|R'[i] W_g^i\|_2}{\|R'[i]\|_2 \|W_g^i\|_2}\) is significantly higher for MPI (e.g., Layer 1: 0.67 vs 0.37, Table 5), confirming alignment with the principal singular direction.
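
A small NumPy sketch of this alignment score follows. It assumes \(\|W_g^i\|_2\) denotes the spectral norm (largest singular value), under which \(\lambda\) lies in \([0, 1]\) and equals 1 exactly at the principal left singular vector; the function name `alignment` and the random test matrix are hypothetical.

```python
import numpy as np

def alignment(r, W):
    """lambda = ||r W||_2 / (||r||_2 * ||W||_2), with ||W||_2 the spectral norm."""
    return np.linalg.norm(r @ W) / (np.linalg.norm(r) * np.linalg.norm(W, 2))

rng = np.random.default_rng(0)
D, H = 16, 32                          # hypothetical dimensions
W = rng.standard_normal((D, H))        # stand-in for one gate projection
r0 = rng.standard_normal(D)            # unaligned router row

# One power-then-retract step applied to r0.
r1 = (r0 @ W) @ W.T
r1 /= np.linalg.norm(r1)

u1 = np.linalg.svd(W)[0][:, 0]         # principal left singular vector

lam_random = alignment(r0, W)
lam_one_step = alignment(r1, W)
lam_principal = alignment(u1, W)
print(f"random: {lam_random:.3f}  after one step: {lam_one_step:.3f}  "
      f"principal: {lam_principal:.3f}")
```

Since \(\lambda^2\) is the Rayleigh quotient of Eq. 3 divided by the largest squared singular value, a single power step can only raise it, which matches the direction of the gap reported between MPI and the baseline.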

**Ablation**:
- Removing power iteration (only retraction) yields near-vanilla performance (Figure 5).
- Removing retraction causes training collapse for AdamW/Muon.
- Sensitivity to \(C'\) is low; optimal \(C' \approx 4\) (Table 6).

**Efficiency**: At 11B, MPI incurs only 0.2% training slowdown; zero inference overhead.

## Theoretical and Practical Implications

- **Theoretical**: The paper establishes a principled connection between router representation and expert parameters via principal singular direction alignment. The optimization perspective (Eqs. 9–10) explains why a single power iteration per step suffices and why updates become more conservative as alignment improves.
- **Practical**: MPI is optimizer-agnostic, compatible with auxiliary losses (load balancing, z-loss) and alternative activation functions (Sigmoid). It requires minimal hyperparameter tuning (\(C'\)) and transfers across scales. The negligible overhead makes it suitable for large-scale MoE training and deployment.
- **Load balancing**: The retraction step implicitly regularizes router norms, leading to more equitable expert utilization.

## Conclusion

The paper revisits MoE router design from a row-wise expert-proxy perspective and proposes **Manifold Power Iteration (MPI)**. MPI is an efficient, theoretically grounded alternative that aligns router rows with the principal singular directions of expert weights through a lightweight “Power-then-Retract” paradigm. Extensive experiments across scales (1B–11B) and optimizers validate that MPI accelerates convergence, improves downstream performance, and enhances load balancing with negligible overhead. The work opens avenues for mathematically principled router design and deeper understanding of representation geometry in MoEs. Future directions include exploring combinations of expert weight matrices and further theoretical analysis of the retraction’s balancing effect.

---

_Markdown view of https://picx.dev/p/leCPAG, served by PicX — AI-generated visual whiteboard summaries of research papers._
