# DeepSeek MoE Routing Algorithm

This page describes the DeepSeek-style MoE router in enough detail for load-balancing
experiments. The public inference code does not reproduce the full training implementation, but
together with the papers it exposes the main architecture and the inference-time routing path.

## Router Inputs And Outputs

For each token hidden state \(h_t\), the router produces:

- a score for every routed expert;
- a selected set of routed experts;
- routing weights used to combine selected expert outputs.

The MoE layer also includes shared experts. Shared experts do not need routing; they process every
token.
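
As a compact way to read the three router outputs together with the shared experts, the layer output for one token can be written as below. The symbols are assumptions introduced here for illustration: \(N_s\) is the number of shared experts, \(\mathcal{T}_t\) is the selected set of routed experts for token \(t\), and \(g_{j,t}\) is the routing weight of expert \(j\). The residual connection around the layer is omitted, matching the pseudocode on this page.

\[
y_t = \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(h_t) + \sum_{j \in \mathcal{T}_t} g_{j,t}\, \mathrm{FFN}^{(r)}_j(h_t)
\]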

## DeepSeekMoE Design

DeepSeekMoE differs from a conventional top-k MoE in two important ways.

### Fine-Grained Experts

Instead of keeping a small number of large experts, DeepSeekMoE segments experts more finely. If a
conventional MoE would activate \(K\) of \(N\) experts, DeepSeekMoE splits each expert into \(m\)
smaller ones, each with \(1/m\) of the intermediate width, and activates \(mK\) of the resulting
\(mN\) experts. Total parameters and per-token compute stay roughly constant.

The intended effect is more flexible composition: a token can combine several specialized smaller
experts instead of relying on fewer coarse experts.
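
The arithmetic behind this can be checked with a short sketch. The function name `segment` and the baseline sizes (16 experts, top-2 routing, intermediate width 4096, split factor 4) are illustrative assumptions, not values from a DeepSeek configuration.

```python
from math import comb


def segment(n_experts, k_active, expert_dim, m):
    """Split each expert into m smaller ones and activate m times as many."""
    if expert_dim % m != 0:
        raise ValueError("expert_dim must be divisible by m")
    return n_experts * m, k_active * m, expert_dim // m


# Hypothetical baseline: 16 experts, top-2 routing, FFN intermediate width 4096.
n, k, dim = 16, 2, 4096
fine_n, fine_k, fine_dim = segment(n, k, dim, m=4)

# Activated FFN width per token is unchanged: 2 * 4096 == 8 * 1024.
coarse_width = k * dim
fine_width = fine_k * fine_dim

# The number of distinct expert combinations a token can use grows sharply.
coarse_combos = comb(n, k)          # 120
fine_combos = comb(fine_n, fine_k)  # 4426165368
```

The activated width is the same in both cases, while the number of possible expert subsets grows from 120 to roughly 4.4 billion.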

### Shared Experts

DeepSeekMoE isolates some experts as shared experts. These are always active and are meant to
capture common knowledge. Routed experts can then focus more on specialized information because
they do not need to redundantly learn the common component for every token.

## DeepSeek-V3 Routing Path

The public DeepSeek-V3 inference code implements the gate as a learned linear scoring layer over
routed experts. The 671B configuration uses sigmoid scores, 256 routed experts, 8 active routed
experts per token, 8 expert groups, and 4 selected groups.

At inference time, the routing path is:

1. Compute expert affinities from the token hidden state and the gate weights.
2. Apply the configured score function. In V3, the large model uses sigmoid scores.
3. Add expert bias for top-k selection when using the auxiliary-loss-free routing method.
4. Partition experts into groups and compute one score per group from its members' selection scores (for example, the sum of the group's two highest scores).
5. Keep only the configured number of high-scoring groups.
6. Mask experts outside the selected groups.
7. Select the top-k routed experts from the remaining candidates.
8. Normalize the selected routing weights so they sum to 1 (with sigmoid scores).
9. Multiply by the route scaling factor.
10. Execute selected routed experts and add shared expert output.

The expert bias in step 3 affects expert selection. The final routing weights used to combine
expert outputs are still based on the original affinity scores, not the biased scores.
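
The split between selection and weighting can be written compactly. The notation is an assumption made here for illustration: \(e_i\) is the gate weight vector of expert \(i\), \(b_i\) its bias, \(K\) the number of selected routed experts, and \(\lambda\) the route scaling factor. The group restriction from steps 4 to 6 is left out for brevity.

\[
s_{i,t} = \sigma\!\left(h_t^{\top} e_i\right), \qquad
\mathcal{T}_t = \operatorname{TopK}_i\!\left(s_{i,t} + b_i,\; K\right), \qquad
g_{i,t} = \lambda \, \frac{s_{i,t}}{\sum_{j \in \mathcal{T}_t} s_{j,t}} \quad \text{for } i \in \mathcal{T}_t
\]

The bias \(b_i\) appears only inside the top-k selection; the weight \(g_{i,t}\) is built from the unbiased \(s_{i,t}\).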

## Pseudocode

```text
input: token hidden state h
parameters:
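  W_gate                    # one weight vector per routed expert
  expert_bias               # per-expert bias for aux-loss-free balancing; selection only
  n_groups                  # number of expert groups
  topk_groups               # groups kept per token
  topk_experts              # routed experts selected per token
  route_scale               # multiplier applied to the normalized weights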

scores = sigmoid(h @ W_gate.T)

selection_scores = scores + expert_bias
group_scores = score_groups(selection_scores, n_groups)
selected_groups = topk(group_scores, topk_groups)

masked_scores = mask_experts_outside_groups(selection_scores, selected_groups)
expert_ids = topk(masked_scores, topk_experts)

weights = gather(scores, expert_ids)      # unbiased scores, not selection_scores
weights = normalize(weights)              # divide by the sum of the selected scores
weights = route_scale * weights

output = shared_experts(h)
for expert_id, weight in zip(expert_ids, weights):
    output += weight * routed_expert[expert_id](h)
```
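
The pseudocode can be turned into a small runnable sketch for experiments. This is a simplified single-token version in plain Python, not the reference implementation: the function name `route_token` is hypothetical, it takes precomputed gate logits instead of `h` and `W_gate`, it assumes contiguous expert ids form a group, and it assumes a group's score is the sum of its two highest biased scores.

```python
import math


def route_token(logits, expert_bias, n_groups, topk_groups, topk_experts, route_scale):
    """Return (expert_ids, weights) for one token, given gate logits."""
    n_experts = len(logits)
    group_size = n_experts // n_groups  # assumption: groups are contiguous id ranges
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]   # sigmoid affinities
    biased = [s + b for s, b in zip(scores, expert_bias)]   # used for selection only

    # Assumption: a group's score is the sum of its two highest biased scores.
    group_scores = []
    for g in range(n_groups):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    ranked_groups = sorted(range(n_groups), key=lambda g: group_scores[g], reverse=True)
    kept = set(ranked_groups[:topk_groups])

    # Experts outside the kept groups are masked to -inf and cannot be chosen.
    masked = [b if i // group_size in kept else -math.inf for i, b in enumerate(biased)]
    ranked_experts = sorted(range(n_experts), key=lambda i: masked[i], reverse=True)
    expert_ids = ranked_experts[:topk_experts]

    # Mixture weights come from the unbiased scores: normalize, then scale.
    picked = [scores[i] for i in expert_ids]
    total = sum(picked)
    weights = [route_scale * p / total for p in picked]
    return expert_ids, weights


# Toy configuration: 8 experts in 4 groups, keep 2 groups, pick 2 experts.
logits = [2.0, 1.0, 0.0, 0.0, 1.6, 1.4, -1.0, -1.0]
ids, w = route_token(logits, [0.0] * 8, n_groups=4, topk_groups=2,
                     topk_experts=2, route_scale=1.0)
```

With zero bias, the toy call keeps groups 0 and 2 and selects experts 0 and 4. Raising the bias of expert 1 changes which experts are selected, but its mixture weight is still derived from its unbiased score.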

## Group-Limited Routing

Group-limited routing reduces the candidate expert set before top-k expert selection. Experts are
divided into groups. The router first chooses a limited number of groups, then selects experts only
inside those groups.

This matters in distributed deployments because expert groups can be aligned with physical
topology, for example one group per node. Restricting each token to a few groups caps the number
of destinations it is dispatched to, which can reduce communication pressure, although the final
load still depends on token distribution and expert popularity.
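
A minimal sketch of the topology argument, assuming contiguous expert ids form a group and each group lives on one node. The helper `groups_touched` and both selections are hypothetical; the sizes (256 routed experts, 8 groups) follow the 671B configuration described on this page.

```python
def groups_touched(expert_ids, n_experts, n_groups):
    """Return the sorted group indices that a token's selected experts fall into."""
    group_size = n_experts // n_groups  # assumption: groups are contiguous id ranges
    return sorted({e // group_size for e in expert_ids})


# 256 routed experts in 8 groups gives 32 experts per group.
# Hypothetical selection without a group limit: one expert in every group.
unrestricted = [3, 40, 70, 101, 140, 170, 200, 250]
# Hypothetical selection confined to 4 groups, as group-limited routing would enforce.
limited = [3, 10, 40, 41, 70, 75, 101, 120]

spread_unrestricted = groups_touched(unrestricted, n_experts=256, n_groups=8)
spread_limited = groups_touched(limited, n_experts=256, n_groups=8)
```

Both selections activate 8 experts, but the first dispatches the token to 8 groups and the second to 4. If groups map to nodes, that halves the number of destinations for this token.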

## Auxiliary-Loss-Free Load Balancing

Traditional MoE systems often add an auxiliary loss to encourage balanced expert usage. DeepSeek-V3
instead describes an auxiliary-loss-free balancing strategy based on expert bias terms.

During training, each expert's bias is nudged up when the expert is underused and down when it is
overused, steering selection toward underused experts and away from overused ones. The important
distinction is:

- biased scores guide top-k selection;
- original affinity scores determine the mixture weights.

This separation aims to improve load balance while reducing the training-quality tradeoff that can
come from forcing balance directly through an auxiliary objective.
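
The bias update can be sketched as a fixed-step rule. This is an illustration under stated assumptions, not the training implementation, which the public inference code does not include: the function `update_bias` is hypothetical, "overloaded" is taken to mean above the mean load of the batch, and `gamma` stands for the update step size.

```python
def update_bias(bias, expert_load, gamma):
    """Nudge each expert's bias by a fixed step toward balanced load."""
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load > mean_load:
            new_bias.append(b - gamma)  # overloaded: make selection less likely
        elif load < mean_load:
            new_bias.append(b + gamma)  # underloaded: make selection more likely
        else:
            new_bias.append(b)          # exactly balanced: leave unchanged
    return new_bias


bias = [0.0, 0.0, 0.0, 0.0]
load = [70, 10, 10, 10]  # hypothetical tokens routed to each expert in one batch
bias = update_bias(bias, load, gamma=0.001)
```

Because the bias only enters top-k selection, repeated updates shift which experts are chosen without changing how the chosen experts are weighted.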

## Relation To The Competition Simulator

The simulator does not expose token-level routing to submissions. Instead, participants receive
traces that summarize expert hotness over simulator timesteps. A hot expert is one that the router
selected frequently or heavily in the aggregated trace data.
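
To make the link between routing and hotness concrete, the sketch below aggregates per-token expert selections into per-timestep counts. The function `expert_hotness` and the input layout are hypothetical. The simulator's actual trace format is not specified on this page, and this version counts selections only, ignoring routing weights.

```python
from collections import Counter


def expert_hotness(selections_per_timestep, n_experts):
    """Count how often each expert was selected within each timestep."""
    trace = []
    for token_selections in selections_per_timestep:
        counts = Counter(e for expert_ids in token_selections for e in expert_ids)
        trace.append([counts.get(e, 0) for e in range(n_experts)])
    return trace


# Hypothetical routing output: 2 timesteps, 3 tokens each, top-2 of 4 experts.
selections = [
    [[0, 1], [0, 2], [0, 3]],
    [[1, 2], [1, 2], [0, 1]],
]
trace = expert_hotness(selections, n_experts=4)
```

In this toy trace expert 0 is hot in the first timestep and expert 1 in the second, which is the kind of shift a placement policy has to react to.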

Given these hotness traces, a placement policy must decide:

- which experts deserve redundant replicas;
- where replicas should be placed across devices;
- how quickly to redeploy when hotness changes;
- how to trade lower PAR against higher transmit amount.
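
The effect of a replica on load balance can be sketched as follows. This assumes that PAR means the peak-to-average ratio of per-device load, which this page does not define, and that an expert's load splits evenly across its replicas. The functions, the placement layout, and the hotness values are all hypothetical, and the transmit cost of creating the replica is not modeled.

```python
def device_loads(hotness, placement, n_devices):
    """Sum expert hotness per device; placement[e] lists devices holding expert e."""
    loads = [0.0] * n_devices
    for expert, heat in enumerate(hotness):
        replicas = placement[expert]
        for device in replicas:
            loads[device] += heat / len(replicas)  # assumption: even split
    return loads


def peak_to_average(loads):
    """Assumed PAR definition: maximum device load divided by mean device load."""
    return max(loads) / (sum(loads) / len(loads))


hotness = [6, 2, 2, 2]                  # hypothetical single-timestep trace
no_replica = [[0], [1], [2], [3]]       # one expert per device
with_replica = [[0, 1], [1], [2], [3]]  # hot expert 0 replicated onto device 1

par_before = peak_to_average(device_loads(hotness, no_replica, n_devices=4))
par_after = peak_to_average(device_loads(hotness, with_replica, n_devices=4))
```

Under these assumptions the replica lowers the ratio from 2.0 to about 1.67, at the price of transferring one expert's weights to another device.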

DeepSeek routing explains why those traces are non-uniform. The competition asks participants to
solve the downstream systems problem created by that non-uniform routing.