MoE-I²: Compressing Mixture of Experts Models Through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
Cheng Yang¹*, Yang Sui¹,²*, Jinqi Xiao¹, Lingyi Huang¹, Yu Gong¹, Yuanlin Duan¹, Wenqi Jia³, Miao Yin³, Yu Cheng⁴, Bo Yuan¹
¹Rutgers University, ²Rice University, ³The University of Texas at Arlington, ⁴The Chinese University of Hong Kong
[email protected], [email protected], [email protected]
pactful combinations of experts on a more global scale.

Layer-wise Genetic Search. To avoid the extreme time consumption caused by brute-force search (Lu et al., 2024), we leverage genetic search to select the M candidate combinations in each layer. For the i-th layer, we define all possible pruning combinations as C_{P_i}. Here, P_i represents the number of experts to be pruned in the i-th layer. Given that there are M_i experts in the i-th layer, C_{P_i} denotes the number of combinations for selecting P_i experts to prune from the total of M_i experts.

In the initial stage of Genetic Search, we first initialize a population {C_{P_i,1}, C_{P_i,2}, ..., C_{P_i,N}}, where the population size is N = 100. We then calculate the loss for each combination in the population:

L_i^n = \sum_{X \in B} \left\| F_i(X) - F_i\big(X, \{E_i\} \setminus C_{P_i,n}\big) \right\|_F \quad (2)

where F_i represents the output of layer i of the MoE model and \| \cdot \|_F denotes the Frobenius norm.
We select the combinations with the smallest loss from {C_{P_i,n}} as parents. Using union and random sampling, we generate offspring combinations. Each individual in the offspring population then undergoes mutation, in which a few of the experts to be pruned are randomly replaced. This process is repeated for 50 iterations, after which we obtain the optimal few combinations of expert pruning as candidate combinations in the i-th layer.
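To make this procedure concrete, here is a minimal sketch of the layer-wise genetic search described above. The `layer_output(x, pruned)` callable is an assumed helper that runs the layer's forward pass with the listed experts removed; the population size of 100, the 50 iterations, and the mutation rate mirror the description but the sketch is an illustrative reconstruction, not the authors' released implementation.

```python
import random
import torch

def layer_genetic_search(layer_output, num_experts, num_prune, calib_batches,
                         pop_size=100, iters=50, keep=3):
    """Search one MoE layer for low-loss expert-pruning combinations.

    layer_output(x, pruned) is an assumed helper that runs the layer with the
    experts listed in `pruned` removed. Returns the `keep` lowest-loss
    combinations according to the Eq. 2 objective.
    """
    reference = [layer_output(x, pruned=()) for x in calib_batches]

    def loss(comb):
        # Frobenius-norm output change summed over the calibration batches (Eq. 2).
        return sum(torch.linalg.norm(ref - layer_output(x, pruned=comb)).item()
                   for x, ref in zip(calib_batches, reference))

    # Initial population: random subsets of experts to prune.
    population = [tuple(sorted(random.sample(range(num_experts), num_prune)))
                  for _ in range(pop_size)]

    for _ in range(iters):
        ranked = sorted(population, key=loss)
        parents = ranked[: pop_size // 4]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            pool = list(set(a) | set(b))            # union of two parents
            child = random.sample(pool, num_prune)  # random sampling
            if random.random() < 0.2:               # mutation: swap one expert
                spare = [e for e in range(num_experts) if e not in child]
                child[random.randrange(num_prune)] = random.choice(spare)
            children.append(tuple(sorted(child)))
        population = parents + children

    return sorted(set(population), key=loss)[:keep]
```

Caching per-combination losses across iterations would avoid redundant forward passes; the sketch omits this for brevity.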
Figure 2: Importance analysis of Mixtral-8×7B (left), Qwen1.5-MoE-A2.7B (middle), and DeepSeek-V2-Lite
(right) models. A larger loss indicates greater importance. For Mixtral-8×7B and Qwen1.5-MoE-A2.7B, the
importance of the different layers is relatively consistent, but for DeepSeek-V2-Lite, the importance increases as
one approaches the output layer.
Block-wise KT-Reception Field. After obtaining the n candidate combinations, we keep only the K combinations with the smallest loss in each layer as the candidates to be used for the block-level optimization. We aim to select one of the K combinations from each layer such that together they minimize the output loss. During this selection process, instead of only considering the importance of experts in the current layer (Lu et al., 2024), we extend the scope of candidate selection from one layer to T layers, achieving a block-wise combination. Specifically, we partition all layers into blocks of T consecutive layers. Within each block, we select the combination in a brute-force scheme: given K candidates in each layer and T layers in one block, we traverse all possible combinations by selecting one combination from each of the T layers, yielding a total of K^T options. Subsequently, we calculate the output loss and select the optimal combinations for pruning. The pipeline is shown in Figure 3.
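As a rough illustration of this block-level step, the sketch below enumerates the K^T options with a brute-force loop. The `block_loss(choice)` callable, which should return the output loss of the T-layer block when `choice[i]` is pruned in its i-th layer, is an assumed helper rather than part of the paper.

```python
import itertools

def select_block_combinations(block_loss, candidates_per_layer):
    """Brute-force block-wise selection over K ** T options.

    candidates_per_layer: list of length T; element i holds the K candidate
    pruning combinations kept for layer i of the block.
    block_loss(choice): output loss of the T-layer block when choice[i]
    is pruned in its i-th layer (assumed helper).
    """
    best_choice, best_loss = None, float("inf")
    # Enumerate one candidate per layer: K ** T total options.
    for choice in itertools.product(*candidates_per_layer):
        cur = block_loss(choice)
        if cur < best_loss:
            best_choice, best_loss = choice, cur
    return best_choice

# Example: K = 3 candidates per layer and T = 3 layers => 27 evaluated options.
```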
Figure 3: The process of the KT-Receptive Field in Mixtral-8×7B to satisfy a 25% pruning ratio. In this case, the number of candidate combinations per layer is K = 3, and the number of layers per block is T = 3. For each layer, we select K optimal candidates using the Layer-wise Genetic Search (top-left). Within a consecutive sequence of T layers, we employ the Block-wise KT-Reception Field to identify the best-performing combination within that block (T layers).

Expert Pruning. Given the to-be-pruned experts, we conduct the expert pruning operation by removing each entire expert in a structured manner.

3.2 Intra-Expert Decomposition

In this stage, we propose to further compress the remaining experts in a fine-grained way by performing low-rank decomposition on the parameters within each expert.

3.2.1 Expert Importance Analysis

As mentioned in (Chi et al., 2022), each expert has varying levels of importance. To achieve better compression performance, instead of applying a uniform compression ratio, we aim to retain more parameters in the important experts and fewer in the less important ones. This leads us to assign higher ranks to the more important experts and lower ranks to the less important ones. Therefore, to calculate the varying ranks, we analyze the relative importance of each expert. Based on the previous analysis in Sec. 3.1.1, we adopt the same importance metric, I_{i,j} in Eq. 1, as the expert importance.

To determine the varying ranks of each expert, we begin by calculating basic uniform rank values. Given the overall compression ratio of the second stage, and considering that the structure of all experts is entirely consistent, we directly calculate the target average rank for each expert after decomposition, denoted as R_a. By considering the importance score of each expert, we calculate the rank value for expert e_{i,j} as:

R_{ij} = \frac{(I_{ij} + \epsilon)^{\alpha}}{\sum_{j=1}^{M'_i} (I_{ij} + \epsilon)^{\alpha}} \cdot R_a \cdot M'_i \quad (3)

Here, M'_i represents the number of experts remaining in layer i of the model obtained after Inter-Expert Pruning, and α denotes a smoothing factor used to avoid an overly linear distribution of rank values, set to 0.15. ϵ is set to 1 × 10⁻⁶ to avoid numerical issues.
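Eq. 3 amounts to a proportional allocation of the rank budget across the remaining experts of a layer. A minimal sketch follows, assuming the per-expert importance scores I_{ij} from Eq. 1 are available as a plain list; the rounding policy is an illustrative choice, not specified by the paper.

```python
def allocate_ranks(importance, avg_rank, alpha=0.15, eps=1e-6):
    """Distribute the target average rank R_a across the remaining experts
    of one layer in proportion to (I_ij + eps) ** alpha  (Eq. 3)."""
    weights = [(score + eps) ** alpha for score in importance]
    total = sum(weights)
    n_experts = len(importance)
    # Ranks are rounded to integers; their mean stays close to avg_rank.
    return [round(w / total * avg_rank * n_experts) for w in weights]

# Example with hypothetical importance scores and R_a = 512:
# allocate_ranks([0.9, 0.5, 0.1, 0.05], avg_rank=512)
```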
3.2.2 Intra-Expert Decomposition Strategy

Singular Value Decomposition (SVD) is a general technique to reduce parameter size by decomposing a large dense matrix into two smaller low-rank matrices. Compared to vanilla SVD, which only focuses on the initial weight matrix, SVD-LLM (Wang et al., 2024) generates activations by truncation-aware data whitening and provides hierarchical closed-form updates for model compression. Inspired by SVD-LLM (Wang et al., 2024), which works on dense models, we extend it to MoE models by integrating the non-uniform ranks R_{i,j} from Sec. 3.2.1.
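For intuition, the sketch below applies plain truncated SVD to a single expert weight matrix using its allocated rank. The truncation-aware data whitening and closed-form updates of SVD-LLM are deliberately omitted, so this is a simplified stand-in rather than the full method used in the paper.

```python
import torch

def decompose_expert_weight(weight: torch.Tensor, rank: int):
    """Factor a (d_out x d_in) expert weight into W ~= A @ B with
    A: (d_out x rank) and B: (rank x d_in).

    Plain truncated SVD only; SVD-LLM's truncation-aware whitening of
    calibration activations is omitted in this sketch.
    """
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb the singular values into A
    B = Vh[:rank, :]
    return A, B

# Replacing the original linear layer with the pair (A, B) cuts the parameter
# count from d_out * d_in to rank * (d_out + d_in).
```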
3.3 Efficient Fine-tuning

To mitigate the performance degradation caused by the two-stage compression, we fine-tune the MoE model by updating its weights. Instead of adjusting all weights, we integrate LoRA (Hu et al., 2021), a low-rank approximation technique, into the post-training of the pruned model. The overall algorithm is illustrated in Alg. 1.
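As a rough sketch of this fine-tuning step, the snippet below attaches LoRA adapters with the Hugging Face peft library; the adapter rank, dropout, and target module names are illustrative assumptions, since the paper does not list them here.

```python
from peft import LoraConfig, get_peft_model

def attach_lora(compressed_model, rank=8, alpha=16):
    """Wrap the compressed MoE model so that only low-rank adapters are trained."""
    config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=0.05,
        # Module names depend on the model family; these are placeholders.
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(compressed_model, config)
    model.print_trainable_parameters()  # only the LoRA weights are trainable
    return model
```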
Algorithm 1 The Algorithm of MoE-I²
Inputs: Initial Model M, Target Pruning Ratio P_S, Expert Decomposition Rate D, Calibration Samples S_c, Finetuning Samples S_f.
Outputs: Compressed MoE-I² model M_f
1: for each layer l_i in M do
2:   I_i ← Layer Importance Analysis with S_c via Sec. 3.1.1;
3: end for
4: M_p ← Inter-Expert Pruning(M, S_c, P_S, I) via Sec. 3.1.2;
5: for each layer l_i in M_p do
6:   R_{i,j} ← Expert Importance Analysis via Sec. 3.2.1;
7: end for
8: M_c ← Intra-Expert Decomposition(M_p, S_c, D, R) via Sec. 3.2.2;
9: M_f ← Low-Rank Finetune(M_c, S_f) via Sec. 3.3;

4 Experiments

4.1 Experimental Settings

Model Settings. To demonstrate the effectiveness of our method, we conducted experiments on three MoE models: Qwen1.5-MoE-A2.7B (14.3B), DeepSeek-V2-Lite (16B), and Mixtral-8×7B (47B). Mixtral-8×7B has a larger number of parameters but relatively few experts (8 experts per layer across 32 layers). In contrast, Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite have fewer parameters but a greater number of experts (60 and 64 experts per layer across 24 and 26 layers, respectively).

Evaluation and Datasets. To evaluate performance in a task-agnostic setting, we mainly adopt the LLM-Pruner (Ma et al., 2023) evaluation methodology, conducting zero-shot task classification across common sense reasoning datasets such as BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). Meanwhile, our models are evaluated on multiple-choice tasks or by generating answers in open-ended generation tasks (Gao et al., 2021). Furthermore, we supplement our evaluation with a zero-shot perplexity (PPL) analysis on WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1993).

Implementation Details. During the expert pruning phase, we use the same data as (Lu et al., 2024): 2048 randomly sampled examples from the C4 (Raffel et al., 2020) dataset as calibration data. In the expert decomposition phase, we also use 2048 randomly sampled examples from Alpaca (Taori et al., 2023) as calibration data to conduct the importance analysis. For the finetuning phase, similar to LLM-Pruner (Ma et al., 2023), we use Alpaca as the finetuning training set, totaling approximately 50k samples. The batch size is set to 64 and learning rates range from 3e-4 to 5e-4. The experiments are conducted on 4 A100-80G GPUs.
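For reference, calibration examples of the kind described above could be drawn as follows; the Hugging Face dataset identifiers and the text-field handling are assumptions for illustration, not details taken from the paper.

```python
import random
from datasets import load_dataset

def sample_calibration_texts(n_samples=2048, seed=0):
    """Draw calibration texts from C4 (pruning phase) and Alpaca
    (decomposition phase)."""
    random.seed(seed)
    # Streaming avoids downloading the full C4 corpus; dataset IDs are assumed.
    # For simplicity this takes the first n_samples from the stream; the paper
    # samples randomly.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    c4_texts = [ex["text"] for _, ex in zip(range(n_samples), c4)]

    alpaca = load_dataset("tatsu-lab/alpaca", split="train")
    idx = random.sample(range(len(alpaca)), n_samples)
    alpaca_texts = [alpaca[i]["instruction"] + "\n" + alpaca[i]["output"]
                    for i in idx]
    return c4_texts, alpaca_texts
```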
4.2 Main Results

MoE-I² Results. Table 1 presents the zero-shot performance of the models after applying the MoE-I² framework. It is evident that pruning 25% of the expert parameters results in only a slight performance loss. However, after finetuning the compressed model for only 2 epochs, the performance can even surpass that of the original model, with an improvement of over 2% on the DeepSeek-V2-Lite model in particular. This observation suggests that pruning 25% of the experts in the first step is lossless. In the second step, we further compress the pruned model with an approximately 40% compression ratio via low-rank decomposition. Finally, we perform the finetuning stage. As a result, while ensuring a reduction of more than 50% in expert parameters, the model's performance is largely preserved.

Zero-shot Performance Comparisons with Existing Methods. Table 2 shows the zero-shot performance of the pruned model, comparing Wanda (Sun et al., 2023), EEP (Lu et al., 2024), and our Inter-Expert Pruning method under the same sparsity rate. Our method demonstrates significant advantages over Wanda and EEP.

PPL Comparisons with Existing Methods. Table 3 shows the zero-shot perplexity (PPL) of the pruned model, comparing EEP and our Inter-Expert Pruning method under the same sparsity rate. Our method demonstrates significant advantages over EEP.

Inference Speedup with Existing Methods. Table 4 shows the speedup of the three models, comparing Wanda (Sun et al., 2023), EEP (Lu et al., 2024), and our MoE-I² method.

4.3 Ablation Studies

Comparison of MoE-I² and its Components. Table 5 demonstrates the necessity of the components within the MoE-I² framework. It shows that MoE-I² has a significant advantage when compared to applying only Inter-Expert Pruning or only Intra-Expert Decomposition individually.
Table 1: Zero-shot performance of three models under our MoE-I2 Framework. The average is calculated among
seven classification datasets. “P” denotes the Inter-Expert Pruning operation, “D” represents the Intra-Expert
Decomposition operation, and “F” indicates the “Fine-tuning” operation based on LoRA. “Params” represents
the percentage reduction in the number of expert parameters. In the Inter-Expert Pruning stage, we prune 25%
of the experts. During the Intra-Expert Decomposition stage, for the Mixtral-8×7B model, we decompose the
remaining experts with an average rank of 2048, further reducing the parameters by approximately 37.5%. For the
Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite models, we perform decomposition with an average rank of 512,
further reducing the parameters by approximately 38.6%.
Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8×7B baseline 0 57.17 84.01 85.35 64.88 35.00 70.40 75.93 67.53
8×7B P 25% 51.79 81.36 84.07 61.99 32.80 71.12 75.85 65.57
8×7B P+F 25% 56.23 82.49 86.42 64.48 36.00 72.92 74.98 67.65
8×7B P+D 51.79% 40.70 71.51 67.83 45.34 26.00 61.37 67.56 54.33
8×7B MoE-I2 51.79% 52.20 78.22 82.62 61.07 34.00 72.20 71.50 64.55
Qwen baseline 0 41.89 73.11 79.76 57.90 30.40 70.04 68.67 60.25
Qwen P 25% 38.57 70.37 73.30 55.84 29.80 64.98 67.25 57.16
Qwen P+F 25% 45.14 75.93 78.01 57.83 32.80 71.12 68.51 61.33
Qwen P+D 53.98% 37.71 65.91 71.41 49.34 29.40 64.26 67.88 55.13
Qwen MoE-I2 53.98% 41.13 71.68 75.08 53.08 30.80 66.43 66.54 57.82
DeepSeek baseline 0 46.93 78.37 79.82 58.70 34.60 60.65 71.35 61.49
DeepSeek P 25% 45.31 74.62 67.95 57.38 33.20 59.93 70.01 58.34
DeepSeek P+F 25% 47.44 78.16 79.79 60.32 35.40 74.56 71.35 63.86
DeepSeek P+D 53.98% 38.48 71.42 70.09 48.15 27.80 60.65 65.98 54.65
DeepSeek MoE-I2 53.98% 42.58 71.80 76.79 55.16 32.60 70.76 67.64 59.62
Table 2: Zero-shot performance comparison with EEP (Lu et al., 2024) and Wanda (Sun et al., 2023)
Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8×7B EEP 25% 51.62 81.94 83.64 61.60 33.00 67.87 75.37 65.01
8×7B P 25% 51.79 81.36 84.07 61.99 32.80 71.12 75.85 65.57
8×7B Wanda 50% 42.06 74.16 76.64 53.16 27.00 63.90 70.96 58.27
8×7B EEP 50% 48.89 78.16 81.35 57.66 29.00 61.37 72.85 61.33
8×7B P 50% 48.38 78.66 81.41 58.35 27.00 64.62 74.19 61.80
Table 3: Zero-shot perplexity (PPL) results of the comparison with EEP and Wanda. "↓" indicates that lower values are better.

Model Method Params↓ WikiText2↓ PTB↓
8×7B baseline 0% 6.24 107.24
8×7B EEP 25% 8.16 141.1
8×7B P 25% 8.01 133.38
8×7B EEP 50% 11.02 207.4
8×7B P 50% 10.1 185.2

Table 4: Inference speedup performance comparison with EEP (Lu et al., 2024) and Wanda (Sun et al., 2023) at a compression rate of 50%. "↓" indicates that lower values are better.

Model Method Mem (GB)↓ Speedup Average
8×7B baseline 87.7 1.0× 67.53
8×7B EEP 45.78 1.20× 61.33
8×7B Wanda 50.01 0.91× 58.27
8×7B MoE-I² 43.49 1.28× 64.55
Qwen baseline 26.67 1.0× 60.25
Qwen MoE-I² 14.14 1.12× 57.82
DeepSeek baseline 29.26 1.0× 61.49
DeepSeek MoE-I² 15.03 1.13× 59.62
Impact of Genetic Search. For the Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite models, which have 60 and 64 experts per layer respectively, we iterate only 50 times for Genetic Search. As shown in Figure 4, the loss has converged in the majority of layers. Using EEP (Lu et al., 2024) for combinatorial search would result in prohibitive time complexity: for instance, when pruning 25% of the experts, EEP would require searching C_{60}^{15} and C_{64}^{16} combinations per layer, respectively. Table 6 presents the performance of the pruned models obtained through our Inter-Expert Pruning compared to the Random and TopLoss methods, in terms of average zero-shot performance (among seven classification datasets) and perplexity. TopLoss denotes individually selecting the P_i least important experts in the current layer to prune, instead of considering the expert combinations used in Genetic Search. As observed, Genetic Search has a significant advantage over the other methods with a similarly low time cost on the seven classification tasks and PPL.
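To put the brute-force search space mentioned above in perspective, the counts can be checked directly (a quick illustrative calculation, not a figure from the paper):

```python
import math

# Ways to choose which experts to prune in a single layer at a 25% ratio.
print(math.comb(60, 15))  # 53,194,089,192,720  (~5.3e13 options per layer)
print(math.comb(64, 16))  # 488,526,937,079,580 (~4.9e14 options per layer)
```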
Impact of KT-Receptive Field. As shown in Figure 5, we also observe that a larger KT-Receptive Field is not always better during calibration. This is partially because we only use a small amount of data for calibration (2048 samples selected from the C4 dataset). Additionally, there is a significant difference between the C4 dataset and the seven datasets used for zero-shot validation. Simply in-
Table 5: Comparison of zero-shot performance of the MoE-I² framework and its components. To keep the compression ratio as consistent as possible and for ease of computation, when performing "D+F" we set the average rank value of the experts to 1/4 of the expert dimension, which is 352.
Model Method Params ↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8x7B P+F 50% 50.43 78.79 82.42 59.12 32.00 70.40 74.03 63.88
8x7B D+F 51.35% 46.08 75.34 81.41 54.02 27.80 72.20 68.27 60.73
8×7B MoE-I2 51.79% 52.20 78.22 82.62 61.07 34.00 72.20 71.50 64.55
Qwen P+F 50% 41.89 69.15 75.20 53.97 30.20 64.98 62.43 56.83
Qwen D+F 57.81% 36.69 69.01 74.56 47.29 29.40 72.92 68.27 56.88
Qwen MoE-I2 53.98% 41.13 71.68 75.08 53.08 30.80 66.43 66.54 57.82
DeepSeek P+F 50% 39.51 70.16 68.17 53.37 26.40 64.98 63.14 55.11
DeepSeek D+F 57.81% 69.68 70.33 74.19 51.98 29.20 71.12 67.01 57.64
DeepSeek MoE-I2 53.98% 42.58 71.80 76.79 55.16 32.60 70.76 67.64 59.62
Figure 4: The left and right figures represent the loss convergence for each layer of Qwen1.5-MoE-A2.7B and
DeepSeek-V2-Lite during the Genetic Search process, respectively. As shown in the figures, after 50 iterations,
nearly all layers have converged.
Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
DeepSeek Ba. 25% 44.20 73.91 68.26 57.07 32.00 57.76 69.93 57.59
DeepSeek Imba. 25% 45.31 74.62 67.95 57.38 33.20 59.93 70.01 58.34
DeepSeek Ba. 50% 31.74 60.19 61.28 45.34 22.40 50.90 60.62 47.50
DeepSeek Imba. 50% 31.74 61.87 61.74 44.79 23.60 54.87 56.67 47.90
Table 8: Zero-shot performance of Intra-Expert Decomposition with imbalanced (Imba.) versus balanced (Ba.) ranks in the same layer, for three models.
Model Rank(avg) Type ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8x7B 2048 Ba. 43.66 73.45 74.03 54.31 27.40 67.92 69.55 58.62
8x7B 2048 Imba. 43.94 73.95 74.56 55.91 27.80 68.23 69.85 59.18
8x7B 1550 Ba. 33.70 63.43 62.57 47.29 22.00 62.45 62.98 50.63
8x7B 1550 Imba. 34.59 63.67 62.59 47.68 22.00 63.05 63.15 50.96
Qwen 704 Ba. 40.19 72.94 77.95 54.50 30.40 68.95 69.06 59.14
Qwen 704 Imba. 40.44 73.40 77.74 54.54 31.60 68.95 69.30 59.43
Qwen 352 Ba. 35.92 67.55 73.64 44.09 26.40 70.04 67.17 54.97
Qwen 352 Imba. 36.26 67.89 73.15 44.34 27.20 72.20 66.69 55.39
DeepSeek 704 Ba. 43.60 76.94 77.77 53.98 30.40 62.82 69.22 59.25
DeepSeek 704 Imba. 44.11 77.19 78.50 54.20 30.40 63.54 69.30 59.61
DeepSeek 352 Ba. 33.45 65.11 63.05 39.07 25.20 61.75 64.88 50.35
DeepSeek 352 Imba. 34.04 65.95 63.76 39.53 25.80 60.29 65.19 50.65
Table 9: Pruning performance on Mixtral-8×7B, comparing our Genetic Search with C-MoE (He et al., 2024). "P" denotes our Inter-Expert Pruning operation (Genetic Search). "E[n/m]" denotes dropping n out of m experts per MoE layer on average. "L[n/m]" and "B[n/m]" denote dropping n out of m corresponding modules with Layer Drop and Block Drop, respectively. These three methods are described in (He et al., 2024).
Model Method Mem(GB) ARC-c BoolQ HellaSwag OBQA RTE WinoGrande Average ∆↓
8×7B baseline(Ours/EEP) 87.7 59.81 84.92 83.97 47.00 71.12 76.32 70.52 -
8×7B P 66.7 56.66 83.46 81.72 46.40 71.12 75.85 69.02 ↓ 1.32
8×7B baseline (He et al., 2024) 87.7 59.4 84.2 84.00 46.80 70.40 75.60 70.07 -
8×7B E2/8 66.7 53.20 77.70 80.50 46.20 55.60 76.80 65.00 ↓ 5.07
8×7B L8/32 66.6 47.70 85.30 75.20 40.40 69.70 74.60 65.42 ↓ 4.65
8×7B B5/32 74.1 51.30 85.30 78.70 42.00 69.70 74.30 66.88 ↓ 3.19
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.

Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, and Jingjing Liu. 2020. Contrastive distillation on intermediate representations for language model compression. arXiv preprint arXiv:2009.14167.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. 2021. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010.