Mixture-of-Experts Documentation

Mixture-of-Experts (MoE) models replace part of a dense neural network with a set of parallel expert modules and a router. In modern large language models, the MoE component is usually placed where a dense Transformer would otherwise use a feed-forward network (FFN).

Dense FFN Versus MoE FFN

In a dense Transformer block, every token passes through the same FFN weights:

token hidden state -> dense FFN -> output

In an MoE block, each token is routed to a subset of experts:

token hidden state -> router -> top-k experts -> weighted sum -> output

This makes compute sparse. The model may contain many experts, but each token uses only a few of them.
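The routing pipeline above can be sketched for a single token in plain Python. This is a minimal illustration, not a reference implementation: the names `moe_forward`, `router_weights`, and `experts` are hypothetical, the router is assumed to be a linear scoring layer, and the gate weights are assumed to be a softmax renormalized over the selected experts only. Real models differ on these details.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(hidden, router_weights, experts, k):
    """Route one token to its top-k experts and return (output, chosen experts)."""
    # Router score per expert: dot product of the hidden state with that expert's row.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    # Keep the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Assumption: gate weights are renormalized over the selected experts only.
    gates = softmax([logits[e] for e in top])
    out = [0.0] * len(hidden)
    for gate, e in zip(gates, top):
        y = experts[e](hidden)  # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

# Toy experts (hypothetical): each one scales its input by a constant.
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

output, chosen = moe_forward([0.5, 2.0], router_weights, experts, k=2)
print(chosen)
```

Only the two selected experts are evaluated here; the other two contribute no compute for this token, which is the sparsity the text describes.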

Core Terms

Expert : A neural submodule, often an FFN, that processes token hidden states.

Router or gate : A learned function that scores experts for each token and selects the active experts.

Top-k routing : A routing policy that selects the k highest-scoring experts for each token.

Shared expert : An expert that is active for every token, regardless of the router's choice. DeepSeekMoE uses shared experts to capture common knowledge and reduce redundancy among routed experts.

Routed expert : An expert selected conditionally by the router.

Expert hotness : The observed demand for an expert over a window of tokens or batches. In the simulator, hotness is the trace-derived load signal used by placement algorithms.

Expert parallelism : A distributed execution strategy where experts are sharded across devices. Tokens or hidden states are communicated to the devices that host their selected experts.
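As an illustration of the hotness definition above, the sketch below counts expert selections over the most recent tokens of a routing trace. The trace format (one list of selected expert ids per token) and the function name `expert_hotness` are assumptions made for this example; the simulator's actual trace format may differ.

```python
from collections import Counter

def expert_hotness(trace, num_experts, window):
    """Count how often each expert was selected in the last `window` tokens.

    `trace` is assumed to be a list with one entry per token, each entry
    being the list of expert ids chosen by the router for that token.
    """
    recent = trace[-window:]
    counts = Counter(e for token_experts in recent for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

# Hypothetical trace: 5 tokens, top-2 routing over 4 experts.
trace = [[0, 1], [0, 2], [0, 1], [3, 0], [1, 2]]
hotness = expert_hotness(trace, num_experts=4, window=4)
print(hotness)  # [3, 2, 2, 1]
```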

Why MoE Models Are Efficient

MoE allows total parameter count and active parameter count to diverge:

Total parameters determine model capacity.
Activated parameters determine per-token compute cost.

If a model has 256 routed experts and activates 8 per token, only 8 of the 256 routed experts (about 3%) run for that token, and the weights of the other 248 are idle. That sparsity is why MoE can scale model capacity without a proportional increase in per-token FLOPs.
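The gap between total and activated parameters can be checked with a few lines of arithmetic. The per-expert parameter count below is a made-up placeholder, and the sketch assumes all routed experts are the same size and ignores attention and embedding parameters.

```python
def moe_param_counts(num_routed, top_k, params_per_expert, shared_params=0):
    """Return (total, activated) expert parameters for one MoE layer.

    `shared_params` covers always-active shared experts, if any.
    """
    total = shared_params + num_routed * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# params_per_expert is a hypothetical value chosen only for illustration.
total, active = moe_param_counts(num_routed=256, top_k=8, params_per_expert=1_000_000)
fraction = active / total
print(total, active, fraction)  # 256000000 8000000 0.03125
```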

Why MoE Models Are Hard To Serve

Sparse activation creates irregular load:

Tokens do not choose experts uniformly.
Popular domains or token patterns can create hot experts.
Expert popularity changes over time.
Distributed MoE serving requires communication between token-owning devices and expert-owning devices.

This means the serving system must account for both algorithmic routing and hardware placement.
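To see how non-uniform routing turns into device imbalance, the sketch below sums per-expert demand onto the devices that host each expert. The function name `device_loads`, the hotness values, and the one-device-per-expert placement are all hypothetical.

```python
def device_loads(hotness, placement, num_devices):
    """Sum expert hotness per device; placement[e] is the device hosting expert e."""
    loads = [0] * num_devices
    for expert, load in enumerate(hotness):
        loads[placement[expert]] += load
    return loads

hotness = [90, 5, 3, 2]   # expert 0 is hot (made-up numbers)
placement = [0, 0, 1, 1]  # two experts per device, no replication
loads = device_loads(hotness, placement, num_devices=2)
print(loads)  # [95, 5]
```

Both devices host the same number of experts, yet one carries 95% of the work, because the load depends on which experts are popular rather than on how many experts a device holds.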

Load Balancing Mechanisms

Common MoE load-balancing tools include:

Auxiliary load-balancing loss during training.
Capacity limits that cap how many tokens an expert can receive.
Noisy routing to encourage exploration and avoid early expert collapse.
Expert grouping to limit cross-device or cross-node communication.
Expert replication to place extra copies of hot experts.
Dynamic placement to move replicas as hotness changes.

The competition focuses on the last two: replication and dynamic placement.
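One simple way to approach replication and placement is a greedy heuristic, sketched below. It is an illustrative baseline only, not the competition's method. It assumes that an expert's load splits evenly across its replicas, that devices have no slot limit, and that two replicas of the same expert may land on the same device if there are more replicas than devices. The function names are hypothetical.

```python
import heapq

def replicate_hot_experts(hotness, extra_replicas):
    """Give each extra replica to the expert with the highest per-replica load."""
    replicas = [1] * len(hotness)
    for _ in range(extra_replicas):
        hottest = max(range(len(hotness)), key=lambda e: hotness[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

def place_replicas(hotness, replicas, num_devices):
    """Assign replicas, heaviest first, to the currently least-loaded device."""
    items = sorted(
        ((hotness[e] / replicas[e], e)
         for e in range(len(hotness))
         for _ in range(replicas[e])),
        reverse=True,
    )
    heap = [(0.0, d) for d in range(num_devices)]  # (current load, device id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_devices)]
    for load, expert in items:
        dev_load, dev = heapq.heappop(heap)
        assignment[dev].append(expert)
        heapq.heappush(heap, (dev_load + load, dev))
    return assignment

hotness = [90, 5, 3, 2]  # made-up demand; expert 0 is hot
replicas = replicate_hot_experts(hotness, extra_replicas=2)
assignment = place_replicas(hotness, replicas, num_devices=3)
print(replicas)    # [3, 1, 1, 1]
print(assignment)  # [[0, 1], [0, 2], [0, 3]]
```

In this toy case the hot expert gets both extra replicas, and the three devices end up with loads of 35, 33, and 32. A dynamic policy would rerun something like this as hotness changes, which is where redeployment cost enters.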

Competition Metrics

The simulator reports two main metrics:

PAR : Peak-to-average ratio of device loads: the load of the most-loaded device divided by the mean device load. Lower is better. A PAR near 1 means the most-loaded device is close to the average device load.

Transmit amount : The amount of expert movement caused by redeployment. Lower is better. A policy that constantly moves experts may improve balance but incur too much redeployment cost.

Good submissions should reduce PAR without paying excessive transmit cost.
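The two metrics can be approximated as follows. This is a sketch under stated assumptions, and the simulator's exact definitions may differ: PAR is taken as maximum device load divided by mean device load, and transmit amount is taken as the number of expert copies a device must receive because it did not host that expert before the redeployment, with every expert counted as the same size.

```python
def peak_average_ratio(loads):
    """Max device load divided by mean device load."""
    mean = sum(loads) / len(loads)
    # Assumption: an idle system is treated as perfectly balanced.
    return max(loads) / mean if mean > 0 else 1.0

def transmit_amount(old_placement, new_placement):
    """Count expert copies newly placed on a device that did not already host them.

    Each placement is a list with one entry per device, listing the expert
    ids hosted there. Assumption: all experts cost the same to move.
    """
    moved = 0
    for old, new in zip(old_placement, new_placement):
        moved += len(set(new) - set(old))
    return moved

par_before = peak_average_ratio([95, 5])
par_after = peak_average_ratio([35, 33, 32])
cost = transmit_amount([[0, 1], [2, 3]], [[0, 2], [0, 3]])
print(par_before, par_after, cost)
```

Comparing `par_before` with `par_after` alongside `cost` shows the trade-off a submission has to manage: a redeployment is only worthwhile if the PAR improvement justifies the experts it moves.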