2025 Lecture 4 - MoEs
MIXTURES OF EXPERTS
CS336
Tatsu Hashimoto
Mixture of experts
GPT-4 (?)
What’s a MoE?
[Fedus et al 2022]
Replace the single big feedforward network with many feedforward networks (experts) and a selector (router) layer
[Fedus et al 2022]
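A minimal PyTorch sketch of this idea (layer sizes, names, and the looped dispatch are illustrative, not from any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative MoE layer: a router selects top-k expert FFNs per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # the selector layer
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        topk_p, topk_i = probs.topk(self.k, dim=-1)        # keep k best experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_i[:, slot] == e                # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```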
Why are MoEs getting popular?
MoEs are faster to train – for the same training compute, they reach lower loss than dense models
[OLMoE]
Why are MoEs getting popular?
Most of the highest-performance open models are MoEs, and they are quite fast at inference
Earlier MoE results from Chinese groups – Qwen
Chinese LLM companies are also doing quite a bit of MoE work on the smaller end
Earlier MoE results from Chinese groups - DeepSeek
There’s also some good recent ablation work showing that MoEs generally outperform compute-matched dense models
Recent MoE results – DeepSeek v3
Why haven’t MoEs been more popular?
Infrastructure is complex, and the advantages mostly show up in multi-node training
[Fedus et al 2022]
Training objectives are somewhat heuristic (and sometimes unstable)
[Zoph et al 2022]
What MoEs generally look like
Typical: replace the MLP with an MoE layer
Less common: MoE for attention heads
[ModuleFormer, JetMoE]
MoE – what varies?
❖ Routing function
❖ Expert sizes
❖ Training objectives
Routing function - overview
Many of the routing algorithms boil down to ‘choose top k’
[Fedus et al 2022]
Other routing methods
[Fedus et al 2022]
Top-k routing in detail
Most papers use the classic top-k routing. How does it work?
Gating
This is the DeepSeek (v1-2) router, which softmaxes before the TopK (Grok and Qwen do this too).
Mixtral, DBRX, and DeepSeek v3 softmax after the TopK.
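A small sketch contrasting the two orderings above (tensor shapes and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 8)   # (n_tokens, n_experts) router logits, illustrative sizes
k = 2

# DeepSeek v1-2 / Grok / Qwen style: softmax over all experts, then keep the top-k.
probs = F.softmax(logits, dim=-1)
gate_a, idx_a = probs.topk(k, dim=-1)       # gates need not sum to 1

# Mixtral / DBRX / DeepSeek v3 style: take the top-k logits, then softmax over just those k.
top_logits, idx_b = logits.topk(k, dim=-1)
gate_b = F.softmax(top_logits, dim=-1)      # gates sum to 1 over the selected experts
```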
Smaller experts, but a larger number of them, plus a few shared experts that are always on.
Solutions?
From Shazeer et al 2017 – routing decisions are made stochastic with Gaussian perturbations.
Stochastic jitter in Fedus et al 2022 applies a uniform multiplicative perturbation with the same goal of getting less brittle experts. This was later removed in Zoph et al 2022.
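A hedged sketch of both perturbations (weight shapes and the eps value are illustrative):

```python
import torch
import torch.nn.functional as F

def noisy_router_logits(x, w_gate, w_noise):
    """Shazeer et al. (2017)-style noisy gating: add Gaussian noise scaled by a learned term."""
    clean = x @ w_gate                               # (n_tokens, n_experts)
    noise_std = F.softplus(x @ w_noise)              # learned, input-dependent noise scale
    return clean + torch.randn_like(clean) * noise_std

def jittered_input(x, eps=1e-2):
    """Fedus et al. (2022)-style jitter: multiplicative uniform noise on the router input."""
    return x * torch.empty_like(x).uniform_(1.0 - eps, 1.0 + eps)
```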
Heuristic balancing losses
Another key issue – systems efficiency requires that we use experts evenly.
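One common form is the Switch-Transformer-style auxiliary loss; a sketch (the alpha coefficient and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_index, n_experts, alpha=1e-2):
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i f_i * P_i,
    where f_i is the fraction of tokens dispatched to expert i and
    P_i is the mean router probability assigned to expert i."""
    probs = F.softmax(router_logits, dim=-1)                     # (n_tokens, n_experts)
    f = F.one_hot(expert_index, n_experts).float().mean(dim=0)   # dispatch fractions per expert
    P = probs.mean(dim=0)                                        # mean router probability per expert
    return alpha * n_experts * (f * P).sum()
```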
Modern libraries like MegaBlocks (used in many open MoEs) use smarter block-sparse matrix multiplies so experts don't all need equal-sized batches
Fun side issue – stochasticity of MoE models
There was speculation that GPT-4's stochasticity was due to MoE.
Token dropping from routing happens at the batch level – this means that other people's queries can drop your token!
Issues with MoEs - stability
[Zoph 2022]
Solution: use float32 just for the expert router (sometimes with an auxiliary z-loss)
Z-loss stability for the router
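A sketch of the router z-loss from Zoph et al 2022 (the coefficient value is illustrative):

```python
import torch

def router_z_loss(router_logits, coef=1e-3):
    """ST-MoE-style router z-loss: penalize the squared log-sum-exp of the router
    logits so they stay small and the router stays numerically stable
    (often computed in float32)."""
    z = torch.logsumexp(router_logits.float(), dim=-1)   # (n_tokens,)
    return coef * (z ** 2).mean()
```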
Zoph et al solution – fine-tune the non-MoE MLPs
DeepSeek solution – use lots of data (1.4M SFT examples)
Other training methods - upcycling
A simple upcycled MoE shows gains over the base model with ~520B tokens of training
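A minimal sketch of the upcycling initialization, assuming the sparse-upcycling recipe of copying the dense FFN into every expert (the function name is mine):

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, n_experts: int) -> nn.ModuleList:
    """Sparse-upcycling-style init: every expert starts as a copy of the dense FFN.
    The router is initialized fresh; all non-FFN weights are copied unchanged."""
    return nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)])
```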
Upcycling example – Qwen MoE
Qwen MoE – initialized from the Qwen 1.8B model; top-k = 4, 60 experts with 4 shared.
Similar architecture / setup to DeepSeekMoE, but one of the first (confirmed) upcycling successes
DeepSeek MoE v1-v2-v3
To wrap up, we’ll walk through the DeepSeek MoE architecture.
V1 (16B total – 2.8B active):
Shared (2) + Fine-grained (64/4) experts
New things:
Top-M device routing
Communication balancing loss – balancing both communication in and out
DeepSeek MoE v3
V3 (671B total – 37B active):
Shared (1) + Fine-grained (256) experts, 8 active
New things:
Benefits: when KV-caching, we only need to store c_t^{KV}, which can be much smaller.
W^{UK} can be merged into the Q projection
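A hedged sketch of why this merge works, in the MLA notation of the DeepSeek-V2 paper (RoPE terms omitted):

```latex
% With k_s = W^{UK} c_s^{KV} and q_t = W^{UQ} c_t^{Q}, the attention score is
q_t^\top k_s
  = \left(W^{UQ} c_t^{Q}\right)^\top \left(W^{UK} c_s^{KV}\right)
  = c_t^{Q\top} \underbrace{\left(W^{UQ\top} W^{UK}\right)}_{\text{precomputed merged projection}} c_s^{KV},
% so scores can be computed directly from the cached latent c_s^{KV},
% and W^{UK} never needs to be applied at decode time.
```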
❖ MoEs take advantage of sparsity – not all inputs need the full model
❖ Lots of empirical evidence now that MoEs work, and are cost-effective