# Simulator And Reference Submissions

This page connects the competition API to the reference code in the simulator repository. The
simulator evaluates a participant function named `rebalance` on recent expert-hotness traces and
then measures the resulting load balance and redeployment cost.

## Repository Map

The simulator repository is organized around a small set of entry points:

```text
dynamic_lb_simulator.py       # Original simulator loop and metrics
eplb_algorithms/deepseek.py   # DeepSeek EPLB implementation copied into the repo
experiments/                  # Reproducible sweeps and result tables
submissions/                  # Participant-style reference submissions
trace/                        # Small committed sample traces
```

The full competition traces live on the Codabench worker. The simulator repository includes only
small `LmSys.npy` sample traces for local tests.

## Submission API

Every submission exposes this function:

```python
def rebalance(hotness, n_device, n_red_expert):
    ...
```

The inputs are:

- `hotness`: recent trace window with shape `(collection_window, n_layers, n_experts)`;
- `n_device`: number of expert-parallel devices;
- `n_red_expert`: number of redundant physical expert slots.

Each `hotness[t]` is one aggregated simulator timestep with shape `(n_layers, n_experts)`.
It is not an epoch and is not guaranteed to correspond to exactly one request. The API does not
expose how many tokens, requests, sequence positions, or raw routing events contributed to a
timestep.

The window contains only the most recent `collection_window` timesteps. If a submission needs
longer history, it may keep bounded module-level state in `submission.py`. That state can persist
across `rebalance` calls within the evaluator process, so key or reset it by model shape and
expert-parallel setting instead of assuming a fresh process for every dataset, model, or EP case.

The return value is:

```python
(change, layers_priority, deployment_table, aux)
```

`deployment_table` has shape:

```text
(n_layers, n_device, (n_experts + n_red_expert) // n_device)
```

It maps every layer, device, and physical expert slot to a logical expert id.

`layers_priority` selects which layer rows from `deployment_table` are applied, and in what order.
For each selected layer, `deployment_table[layer]` is a full replacement placement for all physical
expert slots in that layer, not only the redundant replicas. Every logical expert must appear at
least once in each redeployed layer; repeated logical expert ids are replicas, while omitted ids
make the placement invalid.
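These rules can be checked locally before submitting. The following is a minimal sketch that assumes `deployment_table` is a NumPy integer array. `replica_counts` and `check_update` are hypothetical helper names, not simulator functions.

```python
import numpy as np


def replica_counts(row, n_experts):
    """Replica count per logical expert for one layer's placement.

    row is deployment_table[layer] with shape (n_device, slots_per_device).
    Raises if any slot holds an id outside [0, n_experts).
    """
    ids = np.asarray(row, dtype=np.int64).ravel()
    if ids.size and (ids.min() < 0 or ids.max() >= n_experts):
        raise ValueError("logical expert id out of range")
    return np.bincount(ids, minlength=n_experts)


def check_update(deployment_table, layers_priority, n_experts):
    """True if every selected layer places each logical expert at least once."""
    return all(
        (replica_counts(deployment_table[layer], n_experts) >= 1).all()
        for layer in layers_priority
    )
```

Only layers listed in `layers_priority` are checked, because only those rows are applied.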

Redeployment cost is counted slot by slot, so reordering experts without changing the replica
counts can still increase the transmit amount. Preserve existing slot placements when possible.
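To see why reordering matters, the sketch below counts changed slots. It assumes one unit of cost per slot whose logical expert id changes. The simulator's transmit metric may weight slots differently, and `slots_moved` is a hypothetical helper name.

```python
import numpy as np


def slots_moved(current, proposed, layers_priority):
    """Count physical slots whose logical expert id changes in the selected layers.

    Assumes one unit of cost per changed slot; the simulator's own transmit
    metric may weight slots differently.
    """
    layers = np.asarray(list(layers_priority), dtype=np.int64)
    if layers.size == 0:
        return 0
    return int((np.asarray(current)[layers] != np.asarray(proposed)[layers]).sum())


current = np.array([[[0, 1, 2], [3, 0, 0]]])   # 1 layer, 2 devices, 3 slots
swapped = current[:, ::-1]                     # same replica counts, device rows swapped
print(slots_moved(current, swapped, [0]))      # 6
```

Swapping the two device rows leaves every replica count unchanged, yet all six slots change.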

## Smoke Submission

The smoke submission is the simplest valid API implementation. It builds the default placement and
returns `change=False`, so the simulator keeps the current deployment.

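```python
import numpy as np


def rebalance(hotness, n_device, n_red_expert):
    # hotness has shape (collection_window, n_layers, n_experts).
    n_layers = hotness.shape[1]
    n_experts = hotness.shape[2]
    n_exp_per_dev = (n_experts + n_red_expert) // n_device

    deployment = np.zeros((n_layers, n_device, n_exp_per_dev), dtype=np.int64)

    for layer in range(n_layers):
        for device in range(n_device):
            # Round-robin logical ids into every slot except the last one.
            for slot in range(n_exp_per_dev - 1):
                deployment[layer, device, slot] = (
                    device * (n_exp_per_dev - 1) + slot
                ) % n_experts
            # The last slot is the redundant one: copy the last base slot.
            deployment[layer, device, -1] = deployment[layer, device, -2]

    # change=False: no redeployment is scheduled, so the current deployment is kept.
    return False, [], deployment, None
```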

Walkthrough:

1. Read the model shape from `hotness`.
2. Allocate one deployment table for all layers.
3. Fill each device with a deterministic round-robin logical expert assignment.
4. Duplicate the last base slot into the redundant slot.
5. Return `False` so no redeployment is scheduled.

This is useful for checking packaging and API compatibility, but it is not intended to be
competitive.
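For local packaging checks, a small harness can call a submission on synthetic hotness and verify the return contract. The sketch below is minimal: the sizes, the random trace, and the names `check_return_contract` and `keep_current` are hypothetical. It checks only types and shapes, not placement validity or score.

```python
import numpy as np


def check_return_contract(rebalance, n_layers=4, n_experts=8, n_device=4,
                          n_red_expert=4, collection_window=16, seed=0):
    """Call `rebalance` on synthetic hotness and check the returned tuple.

    The sizes are arbitrary; real traces come from the evaluator.
    """
    rng = np.random.default_rng(seed)
    hotness = rng.integers(0, 100, size=(collection_window, n_layers, n_experts))
    change, layers_priority, deployment, aux = rebalance(hotness, n_device, n_red_expert)

    slots = (n_experts + n_red_expert) // n_device
    assert isinstance(change, (bool, np.bool_)), "change must be a boolean"
    assert np.shape(deployment) == (n_layers, n_device, slots), np.shape(deployment)
    for layer in layers_priority:
        assert 0 <= int(layer) < n_layers, f"layer {layer} out of range"
    return change, list(layers_priority), np.asarray(deployment)


# Hypothetical stand-in submission, used only to exercise the harness.
def keep_current(hotness, n_device, n_red_expert):
    _, n_layers, n_experts = hotness.shape
    slots = (n_experts + n_red_expert) // n_device
    return False, [], np.zeros((n_layers, n_device, slots), dtype=np.int64), None


change, priority, table = check_return_contract(keep_current)
```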

## Hot-Expert Baseline Submission

The hot-expert baseline uses the collection window to identify each layer's hottest experts and
places those experts into the redundant slots.

```python
import numpy as np


def rebalance(hotness, n_device, n_red_expert):
    load = hotness.sum(axis=0)  # (n_layers, n_experts): total demand over the window
    n_layers, n_experts = load.shape
    n_exp_per_dev = (n_experts + n_red_expert) // n_device

    deployment = np.zeros((n_layers, n_device, n_exp_per_dev), dtype=np.int64)
    base_slots = n_exp_per_dev - 1  # the last slot on each device holds a replica

    for layer in range(n_layers):
        # Round-robin base placement. It covers every logical expert only when
        # n_device * base_slots >= n_experts.
        for device in range(n_device):
            for slot in range(base_slots):
                deployment[layer, device, slot] = (
                    device * base_slots + slot
                ) % n_experts

        # Experts in descending load order; device d replicates the d-th hottest.
        hottest = np.argsort(load[layer])[::-1]
        for device in range(n_device):
            deployment[layer, device, -1] = hottest[device % len(hottest)]

    layers_priority = np.arange(n_layers, dtype=np.int64)  # redeploy all layers, in order
    return True, layers_priority, deployment, None
```

Walkthrough:

1. Sum the trace window over time to estimate per-layer expert demand.
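2. Fill the base slots with a deterministic round-robin placement. This covers every logical expert
   when each device has at least one redundant slot (`n_red_expert >= n_device`) and `n_device`
   divides `n_experts + n_red_expert`.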
3. Sort experts by load in each layer.
4. Use the redundant slot on each device for one of the hottest experts.
5. Request redeployment for every layer in layer order.

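This baseline can reduce PAR (the peak-to-average ratio of device load) when hot experts are
persistent, but it can also move many slots. It always returns `change=True` for every layer and
assigns redundant slots by load rank, so even two hot experts swapping rank moves slots without
changing any replica count.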
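To compare placements locally, PAR can be estimated per layer. The sketch below makes two assumptions: PAR is the peak-to-average ratio of per-device load, and an expert's load splits evenly across its replicas. This is the same even-split assumption behind `weight / logcnt` in DeepSeek EPLB. The simulator's own metric may aggregate differently, and `layer_par` is a hypothetical helper.

```python
import numpy as np


def layer_par(load, row):
    """Peak-to-average device load for one layer.

    load: (n_experts,) demand per logical expert.
    row:  (n_device, slots_per_device) logical expert id in each slot.
    Assumes each expert's load splits evenly across its replicas.
    """
    load = np.asarray(load, dtype=np.float64)
    row = np.asarray(row, dtype=np.int64)
    counts = np.bincount(row.ravel(), minlength=load.shape[0])
    per_slot = load[row] / counts[row]   # share of load carried by each physical slot
    device_load = per_slot.sum(axis=1)
    return float(device_load.max() / device_load.mean())


load = np.array([6.0, 2.0, 1.0, 1.0])
print(layer_par(load, [[0, 1, 2], [3, 0, 0]]))  # 1.0: hot expert 0 has three replicas
print(layer_par(load, [[0, 1, 2], [3, 1, 2]]))  # 1.5: cold experts replicated instead
```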

## DeepSeek EPLB Walkthrough

DeepSeek EPLB is a placement algorithm for replicated experts. The simulator copy exposes the
entry point:

```python
phy2log, log2phy, logcnt = rebalance_experts(
    weight,
    num_replicas,
    num_groups,
    num_nodes,
    num_gpus,
    enable_hierarchical,
)
```

The key internal stages are:

1. Convert recent token or hotness statistics into per-layer expert weights.
2. Optionally group logical experts and pack those groups across nodes.
3. Replicate hot logical experts by repeatedly assigning extra physical slots to the current
   largest `weight / replica_count`.
4. Pack the resulting physical experts onto GPUs so each GPU receives the same number of experts
   and similar estimated load.
5. Return physical-to-logical and logical-to-physical maps, plus the replica count per logical
   expert.
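A submission built on this entry point has to convert `phy2log` into the `deployment_table` layout from the Submission API. The sketch below rests on three assumptions. First, `phy2log` has already been converted to NumPy (for example with `phy2log.cpu().numpy()`). Second, it was produced with `num_replicas = n_experts + n_red_expert` and `num_gpus = n_device`. Third, GPU `g` owns the contiguous block of physical ids starting at `g * slots_per_device`. `phy2log_to_deployment` is a hypothetical helper name.

```python
import numpy as np


def phy2log_to_deployment(phy2log, n_device):
    """Reshape a (n_layers, num_replicas) physical-to-logical map into the
    (n_layers, n_device, slots_per_device) deployment_table layout.

    Assumes GPU g owns physical ids g * slots_per_device .. (g + 1) * slots_per_device - 1.
    """
    phy2log = np.asarray(phy2log, dtype=np.int64)
    n_layers, num_replicas = phy2log.shape
    if num_replicas % n_device:
        raise ValueError("num_replicas must be divisible by n_device")
    return phy2log.reshape(n_layers, n_device, num_replicas // n_device)


phy2log = np.array([[0, 1, 2, 3, 0, 0]])                    # 1 layer, 4 experts, 2 replicas
print(phy2log_to_deployment(phy2log, n_device=2).tolist())  # [[[0, 1, 2], [3, 0, 0]]]
```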

The core replication step is:

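```python
# Excerpt from the replication routine. weight: (n, num_log) per-layer loads;
# phy2log: (n, num_phy); logcnt starts at 1 for every logical expert;
# arangen = torch.arange(n) indexes the layers.
for i in range(num_log, num_phy):
    # Per layer, the logical expert with the highest load per current replica.
    redundant_indices = (weight / logcnt).max(dim=-1).indices
    phy2log[:, i] = redundant_indices                # physical slot i replicates it
    rank[:, i] = logcnt[arangen, redundant_indices]  # replica index of the new copy
    logcnt[arangen, redundant_indices] += 1
```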

The expression `weight / logcnt` is each logical expert's load divided by its current replica
count, which is the load each replica carries if traffic splits evenly. Giving the next replica to
the largest value greedily lowers the highest per-replica load.
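For experimenting without torch, the same greedy rule can be sketched in NumPy. This is an illustrative re-implementation, not the simulator's code. `replicate_greedy` is a hypothetical name, and ties may be broken differently from `torch.max`.

```python
import numpy as np


def replicate_greedy(weight, num_phy):
    """Give each extra physical slot to the logical expert with the largest
    weight / replica_count, independently per layer.

    weight: (n_layers, num_log). Returns phy2log (n_layers, num_phy) and logcnt.
    """
    weight = np.asarray(weight, dtype=np.float64)
    n_layers, num_log = weight.shape
    if num_phy < num_log:
        raise ValueError("need at least one physical slot per logical expert")
    # Slots 0..num_log-1 hold one copy of each expert; later slots are overwritten.
    phy2log = np.tile(np.arange(num_phy, dtype=np.int64), (n_layers, 1))
    logcnt = np.ones((n_layers, num_log), dtype=np.int64)
    rows = np.arange(n_layers)
    for i in range(num_log, num_phy):
        hottest = np.argmax(weight / logcnt, axis=-1)  # one expert per layer
        phy2log[:, i] = hottest
        logcnt[rows, hottest] += 1
    return phy2log, logcnt


phy2log, logcnt = replicate_greedy([[6.0, 2.0, 1.0, 1.0]], num_phy=6)
print(phy2log.tolist(), logcnt.tolist())  # [[0, 1, 2, 3, 0, 0]] [[3, 1, 1, 1]]
```

Expert 0 wins both extra slots: its per-replica load drops from 6 to 3, which is still above expert 1's load of 2.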

The balanced packing step then sorts objects by weight and repeatedly places the next heaviest
object into the least-loaded pack that still has capacity:

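```python
# Inside the per-layer loop: `i` is the layer index, `indices[i]` lists that
# layer's objects by descending weight, and pack_weights / pack_items start at 0.
for group in indices[i]:
    # Least-loaded pack that still has room for another object.
    pack = min(
        (p for p in range(num_packs) if pack_items[p] < groups_per_pack),
        key=pack_weights.__getitem__,
    )
    pack_index[i, group] = pack
    rank_in_pack[i, group] = pack_items[pack]  # position of the object within its pack
    pack_weights[pack] += weight[i, group]
    pack_items[pack] += 1
```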
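A self-contained, single-layer version of this packing step is sketched below for experimentation. It is an illustrative NumPy re-implementation, not the simulator's code. `balanced_packing_1d` is a hypothetical name, and it requires the item count to divide evenly into packs.

```python
import numpy as np


def balanced_packing_1d(weight, num_packs):
    """Greedy balanced packing for one layer: heaviest item first, into the
    least-loaded pack that still has room. Every pack gets the same item count.
    """
    weight = np.asarray(weight, dtype=np.float64)
    n_items = weight.shape[0]
    if n_items % num_packs:
        raise ValueError("items must divide evenly into packs")
    per_pack = n_items // num_packs
    pack_index = np.full(n_items, -1, dtype=np.int64)
    rank_in_pack = np.full(n_items, -1, dtype=np.int64)
    pack_weights = [0.0] * num_packs
    pack_items = [0] * num_packs
    for item in np.argsort(-weight, kind="stable"):  # descending weight, stable ties
        pack = min(
            (p for p in range(num_packs) if pack_items[p] < per_pack),
            key=pack_weights.__getitem__,
        )
        pack_index[item] = pack
        rank_in_pack[item] = pack_items[pack]
        pack_weights[pack] += weight[item]
        pack_items[pack] += 1
    return pack_index, rank_in_pack, [float(w) for w in pack_weights]


pack_index, rank_in_pack, weights = balanced_packing_1d([5, 4, 3, 2, 1, 1], num_packs=2)
print(pack_index.tolist(), weights)  # [0, 1, 1, 0, 0, 1] [8.0, 8.0]
```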

In competition terms, DeepSeek EPLB is the baseline to beat: a strong submission should improve
modeled runtime by lowering PAR enough to justify any additional expert movement.