Simulator And Reference Submissions

This page connects the competition API to the reference code in the simulator repository. The simulator evaluates a participant function named rebalance on recent expert-hotness traces and then measures the resulting load balance and redeployment cost.

Repository Map

The simulator repository is organized around a small set of entry points:

dynamic_lb_simulator.py       # Original simulator loop and metrics
eplb_algorithms/deepseek.py   # DeepSeek EPLB implementation copied into the repo
experiments/                  # Reproducible sweeps and result tables
submissions/                  # Participant-style reference submissions
trace/                        # Small committed sample traces

The full competition traces live on the Codabench worker. The simulator repository includes only small LmSys.npy sample traces for local tests.

Submission API

Every submission exposes this function:

def rebalance(hotness, n_device, n_red_expert):
    ...

The inputs are:

hotness: recent trace window with shape (collection_window, n_layers, n_experts);
n_device: number of expert-parallel devices;
n_red_expert: number of redundant physical expert slots.

Each hotness[t] is one aggregated simulator timestep with shape (n_layers, n_experts). It is not an epoch and is not guaranteed to correspond to exactly one request. The API does not expose how many tokens, requests, sequence positions, or raw routing events contributed to a timestep.

The window contains only the most recent collection_window timesteps. If a submission needs longer history, it may keep bounded module-level state in submission.py. That state can persist across rebalance calls within the evaluator process, so key or reset it by model shape and expert-parallel setting instead of assuming a fresh process for every dataset, model, or EP case.
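The state-keeping advice above can be sketched as follows. This is a minimal illustration, not part of the reference submissions: the names remember_window, _history, and MAX_WINDOWS are hypothetical, and the cap of 8 windows is an arbitrary choice.

```python
from collections import deque

import numpy as np

# Hypothetical cap on how many past windows to keep per configuration.
MAX_WINDOWS = 8

# Keyed by (n_layers, n_experts, n_device, n_red_expert) so a new model or
# EP setting starts with an empty history instead of inheriting stale data.
_history = {}


def remember_window(hotness, n_device, n_red_expert):
    """Store a time-summed summary of this window and return the history."""
    _, n_layers, n_experts = hotness.shape
    key = (n_layers, n_experts, n_device, n_red_expert)
    windows = _history.setdefault(key, deque(maxlen=MAX_WINDOWS))
    # Keep only the per-layer load, shape (n_layers, n_experts), so memory
    # stays bounded no matter how long the evaluator process runs.
    windows.append(np.asarray(hotness).sum(axis=0))
    return windows
```

Under this sketch, two datasets that share a model shape and EP setting also share a key, so a submission that needs to tell them apart would need an additional reset rule.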

The return value is:

(change, layers_priority, deployment_table, aux)

deployment_table has shape:

(n_layers, n_device, (n_experts + n_red_expert) // n_device)

It maps every layer, device, and physical expert slot to a logical expert id.

layers_priority selects which layer rows from deployment_table are applied, and in what order. For each selected layer, deployment_table[layer] is a full replacement placement for all physical expert slots in that layer, not only the redundant replicas. Every logical expert must appear at least once in each redeployed layer; repeated logical expert ids are replicas, while omitted ids make the placement invalid.
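A local check for the shape and coverage rules described above might look like the following sketch. The helper name check_deployment is hypothetical, and the simulator may apply additional checks that are not shown here.

```python
import numpy as np


def check_deployment(deployment_table, n_experts, n_device, n_red_expert):
    """Return a list of problems; an empty list means no problem was found."""
    table = np.asarray(deployment_table)
    expected_slots = (n_experts + n_red_expert) // n_device
    if table.ndim != 3 or table.shape[1:] != (n_device, expected_slots):
        return [f"unexpected shape {table.shape}"]
    problems = []
    if table.min() < 0 or table.max() >= n_experts:
        problems.append("expert id out of range")
    for layer in range(table.shape[0]):
        # Every logical expert id must occur at least once in the layer.
        missing = np.setdiff1d(np.arange(n_experts), table[layer].ravel())
        if missing.size:
            problems.append(f"layer {layer} omits experts {missing.tolist()}")
    return problems
```

Calling it on the table a submission is about to return, before the return statement, catches omitted experts locally rather than on the evaluation worker.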

Redeployment cost is counted slot by slot. Reordering experts without changing the replica counts can still increase transmit amount, so preserve existing placements when possible.
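To illustrate why slot order matters, the sketch below counts the slots whose logical expert id differs between two placements. It is only an approximation under the slot-by-slot description above; the helper name changed_slots is hypothetical, and the simulator's actual cost may weight changes differently.

```python
import numpy as np


def changed_slots(old_table, new_table):
    """Count, per layer, the physical slots whose logical expert id differs."""
    old = np.asarray(old_table)
    new = np.asarray(new_table)
    return (old != new).reshape(old.shape[0], -1).sum(axis=1)


# One layer, two devices, three slots each.
old = np.array([[[0, 1, 2], [3, 0, 1]]])
# Same experts on each device, only the slot order differs.
reordered = np.array([[[2, 1, 0], [1, 0, 3]]])
print(changed_slots(old, reordered))  # [4]
```

Both placements have identical replica counts, yet four of the six slots count as changed.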

Smoke Submission

The smoke submission is the simplest valid API implementation. It builds the default placement and returns change=False, so the simulator keeps the current deployment.

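import numpy as np


def rebalance(hotness, n_device, n_red_expert):
    n_layers = hotness.shape[1]
    n_experts = hotness.shape[2]
    # Assumes the physical slots divide evenly across devices and that each
    # device has exactly one redundant slot (n_red_expert == n_device).
    n_exp_per_dev = (n_experts + n_red_expert) // n_device

    deployment = np.zeros((n_layers, n_device, n_exp_per_dev), dtype=np.int64)

    for layer in range(n_layers):
        for device in range(n_device):
            # Base slots: a contiguous block of logical experts per device.
            for slot in range(n_exp_per_dev - 1):
                deployment[layer, device, slot] = (
                    device * (n_exp_per_dev - 1) + slot
                ) % n_experts
            # Redundant slot: repeat the last base expert on this device.
            deployment[layer, device, -1] = deployment[layer, device, -2]

    # change=False: the simulator keeps its current deployment.
    return False, [], deployment, None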

Walkthrough:

Read the model shape from hotness.
Allocate one deployment table for all layers.
Fill each device's base slots with a deterministic, contiguous block of logical experts.
Duplicate the last base slot into the redundant slot.
Return False so no redeployment is scheduled.

This is useful for checking packaging and API compatibility, but it is not intended to be competitive.

Hot-Expert Baseline Submission

The hot-expert baseline uses the collection window to identify each layer’s hottest experts and places those experts into the redundant slots.

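import numpy as np


def rebalance(hotness, n_device, n_red_expert):
    # Total load per (layer, expert) over the collection window.
    load = hotness.sum(axis=0)
    n_layers, n_experts = load.shape
    # Assumes the physical slots divide evenly across devices and that each
    # device has exactly one redundant slot (n_red_expert == n_device).
    n_exp_per_dev = (n_experts + n_red_expert) // n_device

    deployment = np.zeros((n_layers, n_device, n_exp_per_dev), dtype=np.int64)
    base_slots = n_exp_per_dev - 1

    for layer in range(n_layers):
        # Base slots: a contiguous block of logical experts per device.
        for device in range(n_device):
            for slot in range(base_slots):
                deployment[layer, device, slot] = (
                    device * base_slots + slot
                ) % n_experts

        # Expert ids sorted from hottest to coldest for this layer.
        hottest = np.argsort(load[layer])[::-1]
        # Device d replicates the d-th hottest expert in its last slot.
        for device in range(n_device):
            deployment[layer, device, -1] = hottest[device % len(hottest)]

    layers_priority = np.arange(n_layers, dtype=np.int64)
    return True, layers_priority, deployment, None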

Walkthrough:

Sum the trace window over time to estimate per-layer expert demand.
Fill the base slots with a deterministic placement so every logical expert is covered.
Sort experts by load in each layer.
Use the redundant slot on each device for one of the hottest experts.
Request redeployment for every layer in layer order.

This baseline can reduce PAR when hot experts are persistent, but it can also move many slots because it always returns change=True.

DeepSeek EPLB Walkthrough

DeepSeek EPLB is a placement algorithm for replicated experts. The simulator copy exposes the entry point:

phy2log, log2phy, logcnt = rebalance_experts(
    weight,
    num_replicas,
    num_groups,
    num_nodes,
    num_gpus,
    enable_hierarchical,
)

The key internal stages are:

Convert recent token or hotness statistics into per-layer expert weights.
Optionally group logical experts and pack those groups across nodes.
Replicate hot logical experts by repeatedly assigning extra physical slots to the current largest weight / replica_count.
Pack the resulting physical experts onto GPUs so each GPU receives the same number of experts and similar estimated load.
Return physical-to-logical and logical-to-physical maps, plus the replica count per logical expert.
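To use such a result in a submission, the physical-to-logical map has to be turned into a deployment_table. The sketch below assumes that phy2log has shape (n_layers, num_replicas) and that physical slots are laid out device by device, so a plain reshape is enough. Both points are assumptions about the simulator copy, and the helper name phy2log_to_table is hypothetical.

```python
import numpy as np


def phy2log_to_table(phy2log, n_device):
    """Reshape a (n_layers, num_replicas) map into (n_layers, n_device, slots)."""
    phy2log = np.asarray(phy2log)
    n_layers, num_replicas = phy2log.shape
    if num_replicas % n_device != 0:
        raise ValueError("num_replicas must be a multiple of n_device")
    # Assumption: physical slot p sits on device p // slots_per_device.
    slots_per_device = num_replicas // n_device
    return phy2log.reshape(n_layers, n_device, slots_per_device).astype(np.int64)
```

If the map is a torch tensor, converting it first with phy2log.cpu().numpy() keeps the helper NumPy-only.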

The core replication step is:

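# Slots 0..num_log-1 already hold one copy of each logical expert.
for i in range(num_log, num_phy):
    # Per row of weight, the logical expert with the highest load per replica.
    redundant_indices = (weight / logcnt).max(dim=-1).indices
    phy2log[:, i] = redundant_indices
    # arangen indexes the rows of weight, so each row updates its own counts.
    rank[:, i] = logcnt[arangen, redundant_indices]
    logcnt[arangen, redundant_indices] += 1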

The expression weight / logcnt estimates the load each replica would carry. Adding the next replica to the largest value greedily reduces the highest per-replica pressure.
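A small worked example makes the greedy rule concrete. The following is a NumPy re-sketch of the same idea, not the repository's torch code; the function name replicate_greedy is hypothetical, and tie-breaking between equal ratios may differ from the original.

```python
import numpy as np


def replicate_greedy(weight, num_phy):
    """Greedy replica assignment for weight of shape (n_layers, n_logical)."""
    weight = np.asarray(weight, dtype=np.float64)
    n_layers, num_log = weight.shape
    # Start with one physical slot per logical expert.
    phy2log = np.tile(np.arange(num_phy), (n_layers, 1))
    logcnt = np.ones((n_layers, num_log), dtype=np.int64)
    rows = np.arange(n_layers)
    for i in range(num_log, num_phy):
        # Per layer, pick the expert with the highest load per replica.
        hottest = (weight / logcnt).argmax(axis=-1)
        phy2log[:, i] = hottest
        logcnt[rows, hottest] += 1
    return phy2log, logcnt


phy2log, logcnt = replicate_greedy([[8.0, 3.0, 2.0, 1.0]], num_phy=7)
print(phy2log)  # [[0 1 2 3 0 0 1]]
print(logcnt)   # [[3 2 1 1]]
```

Expert 0 receives the first two extra slots because 8/1 and then 8/2 are the largest ratios. Once its per-replica load falls to 8/3, expert 1 at 3/1 becomes the largest and takes the third extra slot.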

The balanced packing step then sorts objects by weight and repeatedly places the next heaviest object into the least-loaded pack that still has capacity:

for group in indices[i]:
    # Least-loaded pack among those that still have a free slot.
    pack = min(
        (p for p in range(num_packs) if pack_items[p] < groups_per_pack),
        key=pack_weights.__getitem__,
    )
    pack_index[i, group] = pack
    rank_in_pack[i, group] = pack_items[pack]
    pack_weights[pack] += weight[i, group]
    pack_items[pack] += 1
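The same packing rule can be tried on a single list of weights. This is a pure-Python sketch for one layer, with the hypothetical name balanced_pack; it is not the repository function and omits the per-layer loop and the rank-in-pack output.

```python
def balanced_pack(weights, num_packs):
    """Heaviest item first, into the least-loaded pack that still has room."""
    assert len(weights) % num_packs == 0, "items must divide evenly into packs"
    items_per_pack = len(weights) // num_packs
    pack_weights = [0.0] * num_packs
    pack_items = [0] * num_packs
    pack_index = [None] * len(weights)
    # Item indices sorted from heaviest to lightest.
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    for item in order:
        candidates = [p for p in range(num_packs) if pack_items[p] < items_per_pack]
        pack = min(candidates, key=pack_weights.__getitem__)
        pack_index[item] = pack
        pack_weights[pack] += weights[item]
        pack_items[pack] += 1
    return pack_index, pack_weights


pack_index, pack_weights = balanced_pack([9, 7, 6, 5, 4, 1], num_packs=2)
print(pack_index)    # [0, 1, 1, 0, 1, 0]
print(pack_weights)  # [15.0, 17.0]
```

The capacity limit forces the last item into pack 0 even though the rule is otherwise driven by load, which is why each pack ends up with exactly three items.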

In competition terms, DeepSeek EPLB is the baseline to beat: a strong submission should improve modeled runtime by lowering PAR enough to justify any additional expert movement.