2025 Lecture 4 - MoEs

Mixture of Experts (MoE) models replace large feedforward networks with multiple smaller networks and a selector layer, allowing for increased parameters without additional computational cost. They are gaining popularity due to their faster training times, competitive performance, and ability to parallelize across devices, although their complex infrastructure and heuristic training objectives pose challenges. Recent advancements in MoE architectures, particularly from DeepSeek and other groups, demonstrate their effectiveness and potential for optimization through various routing methods and training strategies.


Lecture 4

MIXTURES OF EXPERTS

CS336
Tatsu H
Mixture of experts

GPT4 (?)
What’s a MoE?

[Fedus et al 2022]

Replace the big feedforward network with many big feedforward networks (experts) and a selector layer

You can increase the number of experts without affecting FLOPs per token
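To make the structure concrete, here is a minimal sketch of an MoE layer in PyTorch. This is an illustrative toy (module and variable names are mine, not from any of the papers cited here): a linear router scores experts per token, only the top-k experts run on each token, and their outputs are combined using the router's gate values.

```python
# Minimal MoE layer sketch (illustrative only, not any specific paper's implementation).
# Each token runs through only its top-k experts, so FLOPs per token stay roughly
# constant as the number of experts grows.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # the "selector layer"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        gate, idx = probs.topk(self.k, dim=-1)             # top-k gates and expert ids
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if token_ids.numel() > 0:
                out[token_ids] += gate[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```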


Why are MoEs getting popular?
Same FLOPs, more parameters does better

[Fedus et al 2022]
Why are MoEs getting popular?
MoEs are faster to train

[OlMoE]
Why are MoEs getting popular?

Highly competitive vs dense equivalents


Why are MoEs getting popular?
Parallelizable to many devices
Some MoE results – from the west

MoEs make up most of the highest-performance open models, and they are quite fast
Earlier MoE results from Chinese groups – Qwen
Chinese LLM companies are also doing quite a bit of MoE work on the smaller end
Earlier MoE results from Chinese groups - DeepSeek
There’s also some good recent ablation work on MoEs showing they’re generally good
Recent MoE results – DeepSeek v3
Why haven’t MoEs been more popular?
Infrastructure is complex / the advantages mainly show up multi-node

[Fedus et al 2022]
Training objectives are somewhat heuristic (and sometimes unstable)

[Zoph et al 2022]
What MoEs generally look like

Typical: replace the MLP with an MoE layer
Less common: MoE for attention heads

[ModuleFormer, JetMoE]
MoE – what varies?

❖ Routing function

❖ Expert sizes

❖ Training objectives
Routing function - overview
Many of the routing algorithms boil down to ‘choose top k’

Token chooses expert
Expert chooses token
Global routing via optimization
[Fedus et al 2022]
Routing type
Almost all MoEs do a standard 'token chooses expert' top-k routing; some recent ablations compare the alternatives
Common routing variants in detail
Top-k – used in most MoEs: Switch Transformer (k=1), GShard (k=2), Grok (2), Mixtral (2), Qwen (4), DBRX (4), DeepSeek (7)

Hashing – common baseline

[Fedus et al 2022]
Other routing methods

RL to learn routes – used in some of the earliest work (Bengio 2013), not common now

Solve a matching problem – linear assignment for routing, used in various papers like Clark '22
[Fedus et al 2022]
Top-K routing in detail.
Most papers do the old and classic top-k routing. How does this work?

Gating: the gate values are selected by a logistic regressor (softmax over the router scores), and the top-k are kept.
This is the DeepSeek (v1-2) router; Grok and Qwen do this too.
Mixtral, DBRX, and DeepSeek v3 instead apply the softmax after the top-k.

[Dai et al 2024]
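As a concrete illustration of the two orderings just described, here is a hedged sketch (function names are mine): one style softmaxes over all experts and then keeps the top-k probabilities as gates (DeepSeek v1-2, Grok, Qwen), while the other selects the top-k logits first and normalizes only over the selected experts (Mixtral, DBRX; DeepSeek v3 does a sigmoid-score variant of this).

```python
# Two common orderings for top-k gating (names are mine, not official APIs).
import torch
import torch.nn.functional as F

def gate_softmax_then_topk(logits, k):
    # DeepSeek v1/v2-style (also Grok, Qwen): softmax over all experts,
    # then keep the top-k probabilities as gate values.
    probs = F.softmax(logits, dim=-1)
    gates, idx = probs.topk(k, dim=-1)
    return gates, idx

def gate_topk_then_softmax(logits, k):
    # Mixtral/DBRX-style (DeepSeek v3 uses sigmoid scores instead):
    # select top-k logits first, then normalize only over the selected experts.
    top_logits, idx = logits.topk(k, dim=-1)
    gates = F.softmax(top_logits, dim=-1)
    return gates, idx
```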


Recent variations from DeepSeek and other Chinese LMs

Smaller experts, in larger numbers, plus a few shared experts that are always on.

(Used in DeepSeek / Qwen, originally from DeepSpeed MoE)


Various ablations from the DeepSeek paper

More (fine-grained) experts and shared experts both seem to generally help


Ablations from OlMoE
Gains from fine-grained experts, none from shared experts.
Expert routing setups for recent MoEs
Model              | Routed experts | Active experts | Shared experts | Fine-grained ratio
GShard             | 2048           | 2              | 0              | –
Switch Transformer | 64             | 1              | 0              | –
ST-MoE             | 64             | 2              | 0              | –
Mixtral            | 8              | 2              | 0              | –
DBRX               | 16             | 4              | 0              | –
Grok               | 8              | 2              | 0              | –
DeepSeek v1        | 64             | 6              | 2              | 1/4
Qwen 1.5           | 60             | 4              | 4              | 1/8
DeepSeek v3        | 256            | 8              | 1              | 1/14
OlMoE              | 64             | 8              | 0              | 1/8
MiniMax            | 32             | 2              | 0              | ~1/4
Llama 4 (Maverick) | 128            | 1              | 1              | 1/2
How do we train MoEs?
Major challenge: we need sparsity for training-time efficiency…
But sparse gating decisions are not differentiable!

Solutions?

1. Reinforcement learning to optimize gating policies


2. Stochastic perturbations
3. Heuristic ‘balancing’ losses.

Guess which one people use in practice?


RL for MoEs
RL via REINFORCE does work, but not so much better that it’s a clear win

(REINFORCE baseline approach, Clark et al 2020)

RL is the 'right solution', but gradient variance and complexity mean it's not widely used
Stochastic approximations

From Shazeer et al 2017 – routing decisions are stochastic, with Gaussian perturbations.

1. This naturally leads to experts that are a bit more robust.


2. The softmax means that the model learns how to rank K experts
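A hedged sketch of this style of noisy gating, in the spirit of Shazeer et al 2017 (simplified; module names and details are my assumptions): routing logits get a learned, input-dependent Gaussian perturbation at training time before the top-k is taken.

```python
# Noisy top-k gating sketch (simplified reading of Shazeer et al. 2017, names are mine).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)
        self.w_noise = nn.Linear(d_model, n_experts, bias=False)  # predicts per-expert noise scale

    def forward(self, x):
        clean = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        # Gaussian perturbation at training time makes routing stochastic and experts less brittle.
        noisy = clean + torch.randn_like(clean) * noise_std if self.training else clean
        top_logits, idx = noisy.topk(self.k, dim=-1)
        gates = F.softmax(top_logits, dim=-1)   # normalize over the selected k experts
        return gates, idx
```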
Stochastic approximations

Stochastic jitter in Fedus et al 2022. This does a uniform multiplicative perturbation for the
same goal of getting less brittle experts. This was later removed in Zoph et al 2022
Heuristic balancing losses
Another key issue – systems efficiency requires that we use experts evenly.

From the Switch Transformer [Fedus et al 2022]


The auxiliary loss is $\mathcal{L}_{\text{aux}} = \alpha N \sum_{i=1}^{N} f_i P_i$, where $f_i = \frac{1}{T}\sum_{x} \mathbf{1}\{\operatorname{argmax}\, p(x) = i\}$ is the fraction of tokens routed to expert $i$ and $P_i = \frac{1}{T}\sum_{x} p_i(x)$ is its mean router probability.
The derivative with respect to $p_i(x)$ is $\frac{\alpha N}{T^2} \sum_{x} \mathbf{1}\{\operatorname{argmax}\, p(x) = i\}$,
so more frequent use = stronger downweighting
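A sketch of that balancing loss under the definitions above (a simplified reading of Fedus et al 2022, with k=1 routing and a coefficient of my choosing):

```python
# Switch-Transformer-style auxiliary load-balancing loss:
# loss = alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched
# to expert i and P_i is the mean router probability assigned to expert i.
import torch
import torch.nn.functional as F

def switch_load_balancing_loss(router_logits, expert_idx, n_experts, alpha=0.01):
    # router_logits: (T, N) pre-softmax router scores; expert_idx: (T,) chosen expert per token (k=1)
    probs = F.softmax(router_logits, dim=-1)                                   # (T, N)
    P = probs.mean(dim=0)                                                      # mean router prob per expert
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    return alpha * n_experts * torch.sum(f * P)
```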
Example from DeepSeek (v1-2)
Per-expert balancing – same as the Switch Transformer

Per-device balancing – the objective above, but aggregated by device.


DeepSeek v3 variation – per-expert biases
Set up a per-expert bias (raising or lowering how likely each expert is to get tokens) and update it with online learning

They call this ‘auxiliary loss free balancing’

(but the approach is not fully aux-loss-free.)
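A hedged sketch of the bias-based balancing idea (my own simplified reading, not DeepSeek's code; names and the step size are illustrative): the bias affects only which experts are selected, not the gate values, and is nudged online against each expert's observed load.

```python
# Bias-based ("aux-loss-free") balancing sketch.
import torch

def route_with_bias(scores, bias, k):
    # scores: (T, N) router scores (e.g. sigmoid outputs); bias: (N,) per-expert balancing bias
    _, idx = (scores + bias).topk(k, dim=-1)    # bias influences *selection* only
    gates = torch.gather(scores, -1, idx)       # gate values come from the unbiased scores
    return gates, idx

def update_bias(bias, idx, n_experts, gamma=1e-3):
    # Online update: push bias down for overloaded experts, up for underloaded ones.
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```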


What happens when removing load balancing losses?
Training MoEs – the systems side

MoEs parallelize nicely – each expert FFN can fit on a single device
This enables additional kinds of parallelism
Training MoEs – the systems side
MoE routing allows for parallelism, but also some complexities

Modern libraries like MegaBlocks (used in many open MoEs) use smarter sparse MMs
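For intuition on what those kernels optimize, here is the naive reference pattern (a toy sketch with k=1 and invented names): gather each expert's tokens, run one dense matmul per expert, and scatter the results back. Libraries like MegaBlocks express this as block-sparse matrix multiplications instead of a per-expert loop.

```python
# Naive per-expert grouped computation (reference pattern only; real systems use
# block-sparse or grouped GEMM kernels instead of this Python loop).
import torch

def grouped_expert_ffn(x, expert_idx, W_in, W_out):
    # x: (T, d); expert_idx: (T,) expert id per token (k=1 for simplicity)
    # W_in: (N, d, d_ff); W_out: (N, d_ff, d)
    out = torch.empty_like(x)
    for e in range(W_in.shape[0]):
        mask = expert_idx == e
        if mask.any():
            h = torch.relu(x[mask] @ W_in[e])    # gather this expert's tokens + expert matmul
            out[mask] = h @ W_out[e]             # scatter results back to token order
    return out
```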
Fun side issue – stochasticity of MoE models
There was speculation that GPT-4's stochasticity was due to MoE.

Why would a MoE have additional randomness?

Token dropping from routing happens at a batch level – this means that
other people’s queries can drop your token!
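A toy sketch of why this happens (the capacity formula and names are illustrative, not any specific system's): each expert processes at most a fixed number of tokens per batch, so which tokens overflow and get dropped depends on everything else in the batch.

```python
# Capacity-based token dropping sketch: overflow tokens are typically passed through
# the residual connection unchanged, so results depend on the rest of the batch.
import torch

def drop_overflow_tokens(expert_idx, n_experts, capacity_factor=1.25, k=1):
    # expert_idx: (T,) chosen expert per token (k=1 for simplicity)
    T = expert_idx.shape[0]
    capacity = int(capacity_factor * T * k / n_experts)   # per-expert token budget
    keep = torch.zeros_like(expert_idx, dtype=torch.bool)
    for e in range(n_experts):
        pos = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[pos[:capacity]] = True          # first `capacity` tokens kept, the rest dropped
    return keep                              # (T,) mask: False = token dropped at this layer
```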
Issues with MoEs - stability

[Zoph 2022]

Solution: use float32 just for the expert router (sometimes with an auxiliary z-loss)
Z-loss stability for the router

What happens when we remove the z-loss?
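A hedged sketch of the router z-loss (following the standard formulation from ST-MoE, with a coefficient of my choosing): it penalizes the squared log-normalizer of the router logits so they cannot drift to extreme values that destabilize the softmax.

```python
# Router z-loss sketch: keep router logits small so the softmax stays numerically stable.
import torch

def router_z_loss(router_logits, coef=1e-3):
    # router_logits: (T, N) raw router scores (cast to float32 for stability)
    z = torch.logsumexp(router_logits.float(), dim=-1)   # log of the softmax normalizer per token
    return coef * (z ** 2).mean()
```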


Issues with MoEs – fine-tuning
Sparse MoEs can overfit on smaller fine-tuning data

Zoph et al solution – fine-tune the non-MoE MLPs
DeepSeek solution – use lots of data (1.4M SFT examples)
Other training methods - upcycling

Can we use a pre-trained LM to initialize a MoE?


Upcycling example - MiniCPM
Uses the MiniCPM model (top-k=2, 8 experts, ~4B active params).

Simple MoE, shows gains from the base model with ~ 520B tokens for training
Upcycling example – Qwen MoE
Qwen MoE – initialized from the Qwen 1.8B model; top-k=4, 60 experts w/ 4 shared.

Similar architecture / setup to DeepSeekMoE, but one of the first (confirmed) upcycling successes
DeepSeek MoE v1-v2-v3
To wrap up, we’ll walk through the DeepSeek MoE architecture.
V1 (16B total – 2.8B active):
Shared (2) + Fine-grained (64/4) experts

Standard top-k routing
Standard aux-loss balancing (expert + device)


DeepSeek MoE v2
V2 (236B total – 21B active):
Shared (2) + Fine-grained (160/10) experts, 6 active

New things:

Top-M device routing
Communication balancing loss – balancing both communication in and out
DeepSeek MoE v3
V3 (671B total – 37B active):
Shared (1) + Fine-grained (256) experts, 8 active

New things

Sigmoid + softmax top-k routing, plus top-M device routing
Aux-loss-free balancing + a sequence-wise aux loss


Bonus: What else do you need to make DeepSeek MoE v3?
MLA: Multi-head latent attention

Basic idea: express the Q, K, V as functions of a lower-dim, ‘latent’ activation


What else do you need to make DeepSeek MoE v3?
Basic idea: express the Q, K, V as functions of a lower-dim, ‘latent’ activation

Benefits: when KV-caching, we only need to store $c_t^{KV}$, which can be much smaller.
$W^{UK}$ can be merged into the Q projection.

(they also compress queries, for memory savings during training)

Complexity: RoPE conflicts with MLA-style caching.


Without RoPE – $\langle Q, K \rangle = \langle h W^Q,\, W^{UK} c_t^{KV} \rangle = \langle h W^Q W^{UK},\, c_t^{KV} \rangle$
With RoPE – $\langle Q R_q,\, R_k K \rangle = \langle h W^Q R_q,\, R_k W^{UK} c_t^{KV} \rangle = \langle h W^Q R_q R_k W^{UK},\, c_t^{KV} \rangle$, and $R_q R_k$ depends on the token positions, so the matrices can no longer be merged ahead of time.
The solution – Have a few non-latent key dimensions that can be rotated
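A toy sketch of the latent-KV idea (shapes, names, and the single-head simplification are mine; causal masking and the extra RoPE dimensions are omitted): only the small latent c_t^{KV} is cached, and keys/values are re-expanded from it (or the key up-projection is folded into the query projection).

```python
# Latent KV cache sketch (single head, no masking/RoPE; shapes are illustrative).
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, d_model=1024, d_latent=128, d_head=64):
        super().__init__()
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)  # down-projection: h -> c_kv (this is what gets cached)
        self.w_uk = nn.Linear(d_latent, d_head, bias=False)    # up-projection: c_kv -> K (can be folded into w_q)
        self.w_uv = nn.Linear(d_latent, d_head, bias=False)    # up-projection: c_kv -> V
        self.w_q = nn.Linear(d_model, d_head, bias=False)

    def forward(self, h, cache):
        # h: (T, d_model); cache: Python list of past latents, each (T_i, d_latent)
        c_kv = self.w_dkv(h)                  # only this small latent is stored in the KV cache
        cache.append(c_kv)
        K = self.w_uk(torch.cat(cache))       # keys/values re-expanded from the cached latents
        V = self.w_uv(torch.cat(cache))
        Q = self.w_q(h)
        scores = Q @ K.T / K.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ V
```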
What else do you need to make DeepSeek MoE v3?
MTP: Have small, lightweight models that predict multiple steps ahead

(But they only do MTP with one token ahead)


[DeepSeek v3] [EAGLE]

(See paper for ablations)
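A hedged sketch of a one-token-ahead MTP head in this spirit (heavily simplified; the module names, the concatenation-based fusion, and the stock transformer block are my assumptions, not DeepSeek's exact design): a small extra block combines the main model's hidden state with the embedding of the next token and predicts the token after that, reusing the shared embedding and output head.

```python
# One-token-ahead MTP head sketch (simplified; causal masking omitted for brevity).
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    def __init__(self, d_model, shared_embed: nn.Embedding, shared_out: nn.Linear):
        super().__init__()
        self.embed = shared_embed                      # shared with the main model
        self.proj = nn.Linear(2 * d_model, d_model)    # fuse hidden state + next-token embedding
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)  # assumes d_model % 8 == 0
        self.out = shared_out                          # shared output head

    def forward(self, hidden, next_tokens):
        # hidden: (B, T, d) main-model states at positions t; next_tokens: (B, T) tokens at t+1
        fused = self.proj(torch.cat([hidden, self.embed(next_tokens)], dim=-1))
        return self.out(self.block(fused))             # logits predicting tokens at position t+2
```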


MoE summary

❖ MoEs take advantage of sparsity – not all inputs need the full model

❖ Discrete routing is hard, but top-k heuristics seem to work

❖ Lots of empirical evidence now that MoEs work, and are cost-effective
