# Mixture-of-Experts Documentation

Mixture-of-Experts (MoE) models replace part of a dense neural network with a set of parallel
expert modules and a router. In modern large language models, the MoE component is usually placed
where a dense Transformer would otherwise use a feed-forward network (FFN).

## Dense FFN Versus MoE FFN

In a dense Transformer block, every token passes through the same FFN weights:

```text
token hidden state -> dense FFN -> output
```

In an MoE block, each token is routed to a subset of experts:

```text
token hidden state -> router -> top-k experts -> weighted sum -> output
```

This makes compute sparse: the model may contain many experts, but each token activates only the k
experts that the router selects.
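The routing pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names are hypothetical, the experts are toy callables, and renormalizing the softmax weights over the selected experts is one common convention that real models may not follow.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of router logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_id, weight)] for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts so they
    sum to 1. Ties go to the lower expert id because sorted() is stable.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

def moe_forward(hidden, router_logits, experts, k):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(hidden)
    for expert_id, weight in route_top_k(router_logits, k):
        expert_out = experts[expert_id](hidden)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Toy experts (hypothetical): each one just scales the hidden state.
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
output = moe_forward([1.0, -2.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
# Experts 1 and 3 tie for the top score, so each gets weight 0.5:
# output == [3.0, -6.0]
```

Only two of the four toy experts run for this token, which is the source of the sparsity.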

## Core Terms

**Expert**
: A neural submodule, often an FFN, that processes token hidden states.

**Router or gate**
: A learned function that scores experts for each token and selects the active experts.

**Top-k routing**
: A routing policy that selects the k highest-scoring experts for each token.

**Shared expert**
: An expert that is active for every token. DeepSeekMoE uses shared experts to capture common
knowledge so that routed experts do not each have to learn it redundantly.

**Routed expert**
: An expert selected conditionally by the router.

**Expert hotness**
: The observed demand for an expert over a window of tokens or batches. In the simulator, hotness
is the trace-derived load signal used by placement algorithms.

**Expert parallelism**
: A distributed execution strategy where experts are sharded across devices. Tokens or hidden
states are communicated to the devices that host their selected experts.
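To make the hotness and expert-parallelism terms concrete, here is a small sketch that derives per-expert hotness from a routing trace and maps it onto devices. The trace format (one list of selected expert ids per token) and the `placement` list are assumptions made for illustration; the simulator's actual data structures may differ.

```python
from collections import Counter

def expert_hotness(trace, num_experts):
    """Count how many token assignments each expert received in the window.

    Assumption: trace is a list with one entry per token, each entry being
    the list of expert ids the router selected for that token.
    """
    counts = Counter(e for token_experts in trace for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def device_loads(hotness, placement, num_devices):
    """Sum expert hotness per device.

    Assumption: placement[e] is the single device hosting expert e
    (no replication in this sketch).
    """
    loads = [0] * num_devices
    for expert_id, demand in enumerate(hotness):
        loads[placement[expert_id]] += demand
    return loads

# Four tokens, top-2 routing, four experts. Expert 0 is selected every time.
trace = [[0, 1], [0, 2], [0, 1], [3, 0]]
hotness = expert_hotness(trace, num_experts=4)             # [4, 2, 1, 1]
loads = device_loads(hotness, [0, 0, 1, 1], num_devices=2)  # [6, 2]
```

Placing the two hottest experts on the same device leaves that device with three times the load of the other, which is the kind of imbalance a placement algorithm tries to avoid.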

## Why MoE Models Are Efficient

MoE allows total parameter count and active parameter count to diverge:

- Total parameters determine model capacity.
- Activated parameters determine per-token compute cost.

If a model has 256 routed experts and activates 8 per token, the other 248 are idle for that
token. That sparsity is why MoE can scale model capacity without a proportional increase in
per-token FLOPs.
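The gap between total and activated parameters can be worked through numerically. The sketch below counts only the expert parameters of a single MoE layer and assumes every expert has the same size; `params_per_expert` is a made-up figure chosen for round numbers, not a value from any real model.

```python
def moe_param_counts(num_routed, num_active, params_per_expert, num_shared=0):
    """Return (total, active) expert parameters for one MoE layer.

    Assumptions: all experts are the same size, and shared experts
    (if any) are active for every token.
    """
    total = (num_routed + num_shared) * params_per_expert
    active = (num_active + num_shared) * params_per_expert
    return total, active

# 256 routed experts, 8 active per token, hypothetical 1M parameters each.
total, active = moe_param_counts(256, 8, params_per_expert=1_000_000)
active_fraction = active / total  # 8 / 256 = 0.03125
```

Under these assumptions each token touches about 3% of the layer's expert weights, while capacity scales with all 256 experts.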

## Why MoE Models Are Hard To Serve

Sparse activation creates irregular load:

- Tokens do not choose experts uniformly.
- Popular domains or token patterns can create hot experts.
- Expert popularity changes over time.
- Distributed MoE serving requires communication between token-owning devices and expert-owning
  devices.

This means the serving system must account for both algorithmic routing and hardware placement.

## Load Balancing Mechanisms

Common MoE load-balancing tools include:

- **Auxiliary load-balancing loss** during training.
- **Capacity limits** that cap how many tokens an expert can receive.
- **Noisy routing** to encourage exploration and avoid early expert collapse.
- **Expert grouping** to limit cross-device or cross-node communication.
- **Expert replication** to place extra copies of hot experts.
- **Dynamic placement** to move replicas as hotness changes.

The competition focuses on the last two: replication and dynamic placement.
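As an illustration of replication and placement, the sketch below gives extra replicas to the hottest experts and then places replicas greedily on the least-loaded device. It is a simple baseline heuristic, not the competition's reference algorithm. It assumes an expert's load splits evenly across its replicas, and all function names are hypothetical.

```python
def replicate_hot_experts(hotness, extra_replicas):
    """Hand out extra replicas one at a time to the expert whose
    per-replica load is currently highest."""
    replicas = [1] * len(hotness)
    for _ in range(extra_replicas):
        hottest = max(range(len(hotness)), key=lambda e: hotness[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

def place_replicas(hotness, replicas, num_devices):
    """Greedy placement: heaviest replica first, onto the least-loaded device
    that does not already host a copy of the same expert (when one exists).

    Assumption: each replica carries hotness[e] / replicas[e] of the load.
    """
    items = sorted(
        ((hotness[e] / replicas[e], e)
         for e in range(len(hotness)) for _ in range(replicas[e])),
        reverse=True,
    )
    loads = [0.0] * num_devices
    placement = [[] for _ in range(num_devices)]
    for load, expert_id in items:
        candidates = [d for d in range(num_devices) if expert_id not in placement[d]]
        if not candidates:  # every device already hosts this expert
            candidates = list(range(num_devices))
        device = min(candidates, key=lambda d: loads[d])
        placement[device].append(expert_id)
        loads[device] += load
    return placement, loads

hotness = [8, 2, 1, 1]
replicas = replicate_hot_experts(hotness, extra_replicas=1)   # [2, 1, 1, 1]
placement, loads = place_replicas(hotness, replicas, num_devices=2)
# loads == [6.0, 6.0]; without the extra replica the same greedy
# placement yields [8.0, 4.0].
```

Dynamic placement would rerun a procedure like this as hotness changes, at the price of moving expert weights between devices.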

## Competition Metrics

The simulator reports two main metrics:

**PAR**
: Peak-to-average ratio of device loads: the load on the most-loaded device divided by the mean
load across devices. Lower is better. A PAR near 1 means the most-loaded device is close to the
average device load.

**Transmit amount**
: The amount of expert movement caused by redeployment. Lower is better. A policy that constantly
moves experts may improve balance but incur too much redeployment cost.

Good submissions should reduce PAR without paying excessive transmit cost.
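Both metrics can be approximated in a few lines. The PAR function follows the definition above. The transmit function is only one plausible accounting, counting each expert copy that lands on a device that did not previously host it; the simulator's exact formula is not specified here, so treat the function names and the counting rule as assumptions.

```python
def peak_average_ratio(device_loads):
    """Maximum device load divided by mean device load."""
    mean = sum(device_loads) / len(device_loads)
    if mean == 0:
        return 1.0  # convention chosen for this sketch: no load counts as balanced
    return max(device_loads) / mean

def transmit_amount(old_placement, new_placement):
    """Count expert copies that must be sent to a device that lacked them.

    Assumption: placements are lists of expert-id lists, one per device,
    and every moved expert costs one unit regardless of its size.
    """
    moved = 0
    for old_experts, new_experts in zip(old_placement, new_placement):
        moved += len(set(new_experts) - set(old_experts))
    return moved

par_before = peak_average_ratio([8.0, 4.0])   # 8 / 6, about 1.33
par_after = peak_average_ratio([6.0, 6.0])    # 1.0
cost = transmit_amount([[0, 1], [2, 3]], [[0, 1], [0, 3, 2]])  # 1
```

Here adding one replica of expert 0 to the second device brings PAR down to 1.0 at a transmit cost of one expert, which is the trade-off a submission has to manage.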