Introduction To Transformers
Transformers are the neural-network architecture behind modern large language models. They process sequences of tokens by repeatedly applying attention and feed-forward transformations to a hidden state for each token.
Self-Attention
Self-attention lets each token read information from other tokens in the same sequence. Each token's hidden state is projected into three vectors:
query: what this token is looking for;
key: what this token offers to other tokens;
value: the information this token contributes.
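Attention compares each query with the keys of the tokens it may attend to (typically with a scaled dot product), converts the resulting scores into weights with a softmax, and uses those weights to mix the value vectors. This gives the model a content-dependent way to move information across the sequence.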
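The compare, weight, and mix steps can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any particular model's implementation: the names `self_attention`, `w_q`, `w_k`, `w_v`, the toy shapes, the scaling by the square root of the head size, and the causal mask are all assumptions taken from the common scaled dot-product formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, w_q, w_k, w_v, causal=True):
    """Single-head self-attention over h of shape (seq_len, d_model)."""
    q = h @ w_q  # queries: what each token is looking for
    k = h @ w_k  # keys: what each token offers to other tokens
    v = h @ w_v  # values: the information each token contributes
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (seq_len, seq_len)
    if causal:
        # Decoder-style mask: position i may only read positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # toy sizes, chosen arbitrarily
h = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(h, w_q, w_k, w_v)
print(out.shape)             # (4, 8)
print(weights.sum(axis=-1))  # one weight total per token
```

With the causal mask enabled, the first token can attend only to itself, so its output row equals its own value vector.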
Multi-Head Attention
Transformers usually use multiple attention heads in parallel. Each head can learn a different pattern, such as local syntax, long-range references, or delimiter matching. The head outputs are combined and projected back into the model hidden dimension.
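As a rough sketch of the split, attend, combine, and project pattern described above, the following NumPy code runs several heads in parallel by reshaping the projected vectors. The function name, the square weight matrices, and the toy sizes are illustrative assumptions; masking is left out to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(h, w_q, w_k, w_v, w_o, n_heads):
    """h: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = h.shape
    d_head = d_model // n_heads

    def split_heads(x):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(h @ w_q)
    k = split_heads(h @ w_k)
    v = split_heads(h @ w_v)
    # Each head computes its own attention pattern independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                               # (n_heads, seq, d_head)
    # Concatenate the heads along the feature axis, then project to d_model.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4  # toy sizes, chosen arbitrarily
h = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(h, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # (5, 16)
```

Because each head works on its own slice of width `d_model // n_heads`, adding heads in this layout changes how the projections are partitioned rather than how many parameters they hold.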
DeepSeek-V2 and DeepSeek-V3 use Multi-head Latent Attention (MLA), a variant designed to reduce key-value cache size during inference.
Feed-Forward Networks
After attention, a standard Transformer block applies a feed-forward network (FFN) independently to each token:
attention output -> FFN -> layer output
In dense models, every token uses the same FFN parameters. In MoE models, this FFN is replaced by a router plus multiple expert FFNs, so different tokens can activate different experts.
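The difference between the dense and MoE cases can be sketched as follows. This is a simplified illustration under stated assumptions: `ffn` uses a ReLU activation, and `moe_ffn` uses a plain top-k router with softmax gates over the chosen experts. Both are common textbook choices and are not meant to reproduce the router of any specific model; all names and sizes are hypothetical.

```python
import numpy as np

def ffn(x, w_in, w_out):
    """Position-wise FFN: the same weights are applied to every token row."""
    return np.maximum(x @ w_in, 0.0) @ w_out  # ReLU chosen for simplicity

def moe_ffn(x, w_router, experts, top_k=2):
    """x: (n_tokens, d_model); experts: list of (w_in, w_out) pairs."""
    logits = x @ w_router                             # (n_tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        picked = logits[t, chosen[t]]
        gates = np.exp(picked - picked.max())
        gates /= gates.sum()                          # softmax over chosen experts
        for gate, e in zip(gates, chosen[t]):
            w_in, w_out = experts[e]
            out[t] += gate * ffn(x[t], w_in, w_out)   # only top_k experts run
    return out, chosen

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 6, 8, 32, 4      # toy sizes
x = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]

out, chosen = moe_ffn(x, w_router, experts, top_k=2)
print(out.shape)  # (6, 8)
print(chosen)     # expert ids selected for each token
```

The per-token loop is written for readability. Practical implementations usually group tokens by expert so that each expert processes its tokens as one batch.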
Residual Connections And Normalization
Transformer layers use residual connections so each sublayer learns an update rather than a full replacement:
h = h + attention(norm(h))
h = h + feed_forward(norm(h))
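Normalization (typically LayerNorm or RMSNorm) stabilizes training and keeps hidden-state scales manageable. The form shown above, which normalizes the input to each sublayer rather than its output, is known as pre-norm.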
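The two update lines above translate almost directly into code. In this sketch, `rms_norm` is one possible choice of normalization (with the learned gain omitted), and the attention and feed-forward sublayers are replaced by hypothetical stand-in functions so that only the residual wiring is on display.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    # Rescale each token vector to unit root-mean-square (learned gain omitted).
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def transformer_block(h, attention, feed_forward, norm=rms_norm):
    h = h + attention(norm(h))     # sublayer 1 adds an update to the hidden state
    h = h + feed_forward(norm(h))  # sublayer 2 does the same
    return h

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8)) * 50.0  # deliberately large hidden state
w_a = rng.normal(size=(8, 8)) * 0.1
w_f = rng.normal(size=(8, 8)) * 0.1

# Stand-in sublayers (not real attention or FFN), only to exercise the wiring.
out = transformer_block(h, lambda x: x @ w_a, lambda x: np.tanh(x @ w_f))
print(out.shape)  # (4, 8)

# Whatever the scale of h, each sublayer sees inputs with RMS close to 1.
print(np.sqrt(np.mean(rms_norm(h) ** 2, axis=-1)))
```

If both sublayers return zeros, the block returns `h` unchanged, which is the sense in which each sublayer learns an update rather than a replacement.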
Autoregressive Inference
Decoder-only language models generate text one token at a time. At each step:
the current token sequence is processed by the Transformer;
the model predicts a probability distribution over the next token;
a token is selected;
the new token is appended and the process repeats.
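The loop above can be sketched with a toy stand-in for the model. Here `toy_model` is a hypothetical function that simply favours the token after the last one, and selection is greedy (always the most probable token). Real systems call a trained Transformer and often sample from the distribution instead.

```python
import math

VOCAB_SIZE = 5  # toy vocabulary of token ids 0..4

def toy_model(tokens):
    """Hypothetical stand-in for a Transformer: returns next-token logits.

    It favours (last token + 1) mod VOCAB_SIZE and ignores earlier context.
    """
    logits = [0.0] * VOCAB_SIZE
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 3.0
    return logits

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(model, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(model(tokens))  # distribution over the next token
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        tokens.append(next_token)       # append, then repeat
    return tokens

print(generate(toy_model, [0], max_new_tokens=6))  # [0, 1, 2, 3, 4, 0, 1]
```

Swapping the greedy pick for a draw such as `random.choices(range(VOCAB_SIZE), weights=probs)` would turn this into sampling without changing the rest of the loop.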
Serving efficiency depends heavily on management of the attention key-value (KV) cache and on FFN compute. MoE models reduce FFN compute per token relative to a dense model with the same total parameter count, but they introduce routing and expert-placement challenges.
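To make the role of the attention cache concrete, the sketch below stores each token's key and value once and reuses them at later steps, then checks that the result matches recomputing causal attention over the whole sequence. It assumes a single layer and a single head with the hidden states given up front; the names `attend_with_cache` and `cache` are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_cache(h_new, w_q, w_k, w_v, cache):
    """Process one new token; cache holds keys and values of earlier tokens."""
    cache["k"].append(h_new @ w_k)  # computed once, reused at every later step
    cache["v"].append(h_new @ w_v)
    keys = np.stack(cache["k"])     # (tokens_so_far, d_head)
    values = np.stack(cache["v"])
    q = h_new @ w_q                 # only the new token needs a query
    weights = softmax(keys @ q / np.sqrt(q.shape[-1]))
    return weights @ values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8             # toy sizes
h = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

cache = {"k": [], "v": []}
stepwise = np.stack(
    [attend_with_cache(h[t], w_q, w_k, w_v, cache) for t in range(seq_len)]
)

# Reference: causal attention recomputed over the full sequence at once.
q, k, v = h @ w_q, h @ w_k, h @ w_v
scores = q @ k.T / np.sqrt(d_model)
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)
weights_full = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights_full /= weights_full.sum(axis=-1, keepdims=True)
full = weights_full @ v

print(np.allclose(stepwise, full))  # True
print(len(cache["k"]))              # 6: one cached key per processed token
```

The cache grows by one key and one value per token in every layer, which is why its size becomes a central concern for long sequences and large batches.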
Why Transformers Lead To MoE
In large dense Transformers, FFN layers contain a large fraction of the parameters and compute. MoE replaces dense FFNs with sparse expert FFNs. This increases total model capacity while keeping only a small subset of expert parameters active for each token.
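A small back-of-the-envelope calculation shows how total and active parameters diverge. All sizes below are hypothetical and chosen only to make the arithmetic clean; they do not describe any particular model, and biases and the attention parameters are ignored.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (up-projection and down-projection); biases ignored.
    return 2 * d_model * d_ff

# Hypothetical sizes, for illustration only.
d_model = 4096
dense_d_ff = 16384
n_experts, top_k, expert_d_ff = 64, 8, 2048

dense = ffn_params(d_model, dense_d_ff)
per_expert = ffn_params(d_model, expert_d_ff)
moe_total = n_experts * per_expert  # parameters stored in the layer
moe_active = top_k * per_expert     # parameters used by any one token
router = d_model * n_experts        # one routing score per expert

print(f"dense FFN:        {dense:>13,}")       #   134,217,728
print(f"MoE total:        {moe_total:>13,}")   # 1,073,741,824
print(f"MoE active/token: {moe_active:>13,}")  #   134,217,728
print(f"router:           {router:>13,}")      #       262,144
```

In this configuration the MoE layer stores eight times the FFN parameters of the dense layer while each token still touches the same number of them, plus a comparatively tiny router.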
That is the architectural step from dense Transformers to DeepSeek-style MoE models.