
MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition

Cheng Yang1*, Yang Sui1,2*, Jinqi Xiao1, Lingyi Huang1, Yu Gong1, Yuanlin Duan1, Wenqi Jia3, Miao Yin3, Yu Cheng4, Bo Yuan1
1Rutgers University, 2Rice University, 3The University of Texas at Arlington, 4The Chinese University of Hong Kong
[email protected], [email protected], [email protected]

Abstract

The emergence of Mixture of Experts (MoE) LLMs has significantly advanced the development of language models. MoE LLMs outperform traditional LLMs by achieving higher performance with considerably fewer activated parameters. Despite this efficiency, their enormous parameter size still leads to high deployment costs. In this paper, we introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost. First, in the inter-expert pruning stage, we analyze the importance of each layer and propose the Layer-wise Genetic Search and Block-wise KT-Reception Field with a non-uniform pruning ratio to prune individual experts. Second, in the intra-expert decomposition stage, we apply low-rank decomposition to further compress the parameters within the remaining experts. Extensive experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8×7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency while maintaining performance on various zero-shot tasks. The code will be available at https://github.com/xiaochengsky/MoEI-2.git

1 Introduction

Large Language Models (LLMs) have recently demonstrated remarkable language understanding and generation proficiency, excelling in complex tasks (Achiam et al., 2023; Touvron et al., 2023a; Wu et al., 2020). However, deploying these models presents substantial challenges due to their significant storage and computational demands. To overcome these issues, the Mixture-of-Experts (MoE) LLM has been proposed (Jiang et al., 2024), which activates only a subset of its parameters during training and inference. For instance, with a smaller model size, the Mixtral-8×7B model with a total of 47B parameters surpasses the performance of dense Transformer models like LLaMA-2-70B (Touvron et al., 2023b). Additionally, Qwen1.5-MoE-A2.7B (Bai et al., 2023) demonstrates highly competitive performance compared to other 7B models, and the recently introduced DeepSeek-V2 MoE (DeepSeek-AI, 2024) achieves performance levels comparable to GPT-4, demonstrating the powerful capabilities of MoE models.

MoE models have garnered significant attention recently due to their ability to dynamically select subsets of parameters for each input, enabling efficient handling of diverse tasks. Despite their potential, a notable challenge with MoE models is that they are still burdened by substantial parameter size and computation cost. For example, Mixtral-8×7B (Jiang et al., 2024) not only has 47B parameters but also activates 13B parameters during inference. While this architecture allows for scalability and flexibility, it also introduces complexity and a large memory footprint in deployment and inference, particularly under resource constraints. Consequently, compressing these large-scale models while maintaining their performance remains a critical area of research.

Model compression techniques, such as pruning, knowledge distillation, and quantization, have been used to slim down model size. (Lu et al., 2024) proposed to reduce the parameter count of MoE models by expert pruning, but it does not efficiently reduce the parameters used during inference. (Li et al., 2024) merges several experts into one and applies low-rank decomposition to further reduce the model size. Although this approach achieves a good compression ratio and performance, it requires calibration and fine-tuning for each downstream task individually, which is not suitable for large-scale LLMs and incurs very high time costs.

* Equal Contribution.
Several works (Zhou et al., 2021; Sun et al., 2023; Frantar and Alistarh, 2023) focus on unstructured sparsity to decrease model parameters while maintaining high performance. However, unstructured pruning struggles to achieve practical acceleration, reduce inference cost, or save storage without dedicated hardware and library support.

To solve these problems, we start by analyzing parameter redundancy in the MoE model at multiple levels. First, since identifying redundant experts using brute-force search (Lu et al., 2024) is infeasible in practice, it is necessary to design efficient methods to reduce the time complexity. Second, we aim to compress as many experts as possible while ensuring that the model maintains its zero-shot performance, rather than being limited to handling a single downstream task (Li et al., 2024). Finally, our method should adapt to any MoE model, particularly those with a large number of experts and diverse structures, and automatically identify a suitable compression strategy for each type of MoE model without manual settings.

In this paper, we propose a novel end-to-end framework, MoE-I2, for the task-agnostic compression of MoE models. To our knowledge, MoE-I2 is the first end-to-end framework designed for task-agnostic structured compression of MoE LLMs. Our contributions are summarized as follows:

• We introduce a two-stage MoE compression framework for expert slimming that considers both inter-expert and intra-expert relationships.

• In the inter-expert pruning stage, we analyze the importance of each MoE layer and propose a non-uniform pruning ratio for each layer. We then find that previous MoE pruning methods suffer from high time complexity and local optima. To address these issues, we introduce a layer-wise genetic search to reduce time complexity and a block-wise combination strategy to better approximate a global optimum.

• In the intra-expert decomposition stage, we measure the importance of each expert and assign non-uniform ranks accordingly. Subsequently, we apply a low-rank decomposition to further compress the parameters within each expert in a fine-grained manner.

• We conduct extensive experiments with MoE models, including Qwen1.5-MoE-A2.7B (14.3B), DeepSeek-V2-Lite (16B), and Mixtral-8×7B (47B), across nine datasets to assess both generation quality and zero-shot classification performance, demonstrating the effectiveness of our proposed MoE-I2 framework.

2 Related Works

2.1 Mixture-of-Experts LLMs

MoE LLMs have gained significant attention in recent years due to their ability to scale efficiently while maintaining high performance. MoE models divide the network into several experts and dynamically select a subset of these experts for each input, which reduces computational overhead and enhances scalability. (Shazeer et al., 2017) introduced the MoE model in their work on the Sparsely-Gated Mixture-of-Experts Layer, and (Lepikhin et al., 2020) further advanced the MoE architecture by demonstrating its scalability to trillions of parameters while retaining manageable computation costs by distributing the experts across multiple devices. With the recent advancements in decoder-only architectures (Touvron et al., 2023a), MoE models built on this structure have become increasingly popular (Jiang et al., 2024). In this paper, we focus on building an end-to-end framework for post-training expert pruning and decomposition of MoE LLMs to decrease computation and storage.

2.2 Compression on MoE LLMs

Recent advancements in large language models have underscored the need to reduce parameter sizes and latency (Ma et al., 2023). Compression techniques for language models include network pruning (Xu et al., 2021), knowledge distillation (Sun et al., 2019, 2020), quantization (Yao et al., 2022), decomposition (Hsu et al., 2022; Yuan et al., 2023; Wang et al., 2024), and methods like early exit (Xin et al., 2020). Building on these techniques, pruning and sparsity are crucial for MoE models, which often have up to 95% of their parameters dedicated to experts. Pruning MoE models involves removing less important experts or neurons to reduce the number of active parameters during inference. For example, (Kim et al., 2021) retains the most activated experts to enhance machine translation MoE models, while (Koishekenov et al., 2022) introduces gate-statistics-based pruning during decoding.
Although effective, these methods are mostly confined to linguistic models in machine translation. The dropping-while-training approach of (Chen et al., 2022) progressively removes non-essential experts for specific tasks, tested on Switch Transformers (Fedus et al., 2022). The merge-compression method (Li et al., 2024) and the EEP approach (Lu et al., 2024), which is similar to ours, consider pruning and skipping in MoE models but face challenges in reducing computational costs. Given a pruned or sparse model, finetuning aims to restore performance on the original tasks. Recent studies on LLMs (Sun et al., 2023; Ma et al., 2023) focus on pruning linear layers, but these methods often fail to reduce computing costs without specialized hardware or libraries. Efficient post-finetuning expert pruning and sparsity methods for task-agnostic MoE LLMs remain underexplored. This gap highlights the need for advanced techniques that effectively balance pruning and sparsity while maintaining or enhancing performance across various tasks.

3 Method

In this section, we introduce the details of our proposed framework, MoE-I2, which consists of three stages: the Inter-Expert Pruning stage (Sec. 3.1), the Intra-Expert Decomposition stage (Sec. 3.2), and the fine-tuning stage (Sec. 3.3). The overall pipeline is shown in Figure 1.

3.1 Inter-Expert Pruning

In this stage, our goal is to prune individual unimportant experts to reduce the parameter size and computational cost. This raises two crucial questions: (1) Given an overall pruning ratio, how many experts should be pruned in each layer? (2) How do we determine which experts to prune?

3.1.1 Layer Importance Analysis

To answer the first question, we start by analyzing the importance of each layer. The layer importance of the i-th layer, denoted by I_i, is defined as the average loss degradation caused by removing individual experts within this layer. Specifically, to calculate I_i for the i-th layer, we first calculate the expert importance. We consecutively prune the j-th expert in the i-th layer, denoted by e_{i,j}, where j = 1, 2, ..., M_i, and M_i represents the total number of experts in the i-th layer. Next, each pruned model predicts the next token on the calibration samples. The expert importance of e_{i,j} is calculated as:

    I_{i,j} = \sum_{B} \mathcal{L}(\mathcal{X}, \{E_i\} \setminus \{e_{i,j}\})    (1)

where \{E_i\} = \{e_{i,1}, e_{i,2}, ..., e_{i,M_i}\} denotes the set of all experts in the i-th layer, \mathcal{X} represents the calibration dataset, and B denotes the batch size. \mathcal{L} denotes the output loss of the MoE model under the condition that the j-th expert in the i-th layer is removed. Once we have determined the importance score of the j-th expert in the i-th layer, the overall importance score of the i-th layer is defined as I_i = \sum_{j=1}^{M_i} I_{i,j}. Given the overall pruning rate, we normalize the layer importance to obtain the pruning rate for each layer.

Following this paradigm, we demonstrate the layer importance for Mixtral-8×7B (Jiang et al., 2024), Qwen1.5-MoE-A2.7B (Bai et al., 2023), and DeepSeek-V2-Lite (DeepSeek-AI, 2024), as shown in Figure 2. Note that the previous work (Lu et al., 2024) overlooks the varying importance of layers and simply applies a uniform pruning ratio to each layer, leading to a suboptimal solution. In contrast, our analysis shows that some models perform in ways that largely diverge from this strategy. For example, the analysis of DeepSeek-V2-Lite (Figure 2) reveals that layer importance rapidly increases with depth, indicating that deeper layers are more sensitive than shallower ones.
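As a concrete illustration of this procedure, the sketch below shows one way Eq. 1 and the layer-level pruning budget could be computed. It is a minimal sketch, not the released implementation: the `loss_fn` interface (running the MoE on a calibration batch with a chosen expert masked out) is model-specific and assumed here, and the inverse-importance split of the overall pruning budget is our reading of the normalization step rather than the authors' exact formula.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Assumed interface (hypothetical): loss_fn(batch, dropped) returns the next-token loss of the
# MoE on one calibration batch while every (layer_idx, expert_idx) pair in `dropped` is removed.
LossFn = Callable[[dict, Set[Tuple[int, int]]], float]


def expert_importance(loss_fn: LossFn, calib_batches: Iterable[dict],
                      experts_per_layer: List[int]) -> Dict[Tuple[int, int], float]:
    """Eq. 1: importance of expert (i, j) = summed calibration loss when only e_{i,j} is removed."""
    batches = list(calib_batches)
    scores: Dict[Tuple[int, int], float] = {}
    for i, num_experts in enumerate(experts_per_layer):
        for j in range(num_experts):
            scores[(i, j)] = sum(loss_fn(batch, {(i, j)}) for batch in batches)
    return scores


def layer_pruning_budget(scores: Dict[Tuple[int, int], float], experts_per_layer: List[int],
                         overall_ratio: float) -> List[int]:
    """Turn I_i = sum_j I_{i,j} into a per-layer expert budget: less important layers lose more."""
    layer_importance = [sum(scores[(i, j)] for j in range(m))
                        for i, m in enumerate(experts_per_layer)]
    inv = [1.0 / max(v, 1e-12) for v in layer_importance]   # inverse-importance allocation (assumed)
    total = overall_ratio * sum(experts_per_layer)
    return [min(round(total * w / sum(inv)), m - 1)          # never prune a layer completely
            for w, m in zip(inv, experts_per_layer)]
```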
3.1.2 Inter-Expert Pruning Strategy

To answer the second question, we need to identify a combination of N experts that has the least impact on the prediction loss. Previous work (Lu et al., 2024) utilizes brute-force search to find the least impactful combination of N experts within each layer. However, this method presents two significant drawbacks. First, the brute-force search has high time complexity, making it extremely time-consuming, especially when pruning an MoE with a large number of experts. For example, Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite have 60 and 64 experts per layer, respectively. If 25% of the experts need to be pruned, (Lu et al., 2024) needs to traverse C_{15}^{60} and C_{16}^{64} combinations for each layer respectively, which is unacceptable in terms of time consumption. Second, it restricts the search space to the current layer, only achieving a local optimum and potentially missing a more globally optimal solution.

To mitigate these challenges, we leverage Genetic Search (Grefenstette, 1993; Alam et al., 2020) with a KT-Receptive Field method to enhance search efficiency and concurrently identify the least impactful combinations of experts on a more global scale.
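To put the brute-force baseline in perspective, the per-layer combination counts quoted above can be checked directly; the following snippet is only a sanity check of those numbers.

```python
import math

# Number of ways to choose the experts to prune in a single layer when removing 25% of the
# experts (Qwen1.5-MoE-A2.7B: 15 of 60, DeepSeek-V2-Lite: 16 of 64).
qwen_combos = math.comb(60, 15)
deepseek_combos = math.comb(64, 16)

print(f"C(60, 15) = {qwen_combos:.2e}")   # both counts exceed 10^13 candidate sets per layer
print(f"C(64, 16) = {deepseek_combos:.2e}")
```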
Figure 1: The three-stage pipeline of MoE-I2. The first stage (left) represents Inter-Expert Pruning, where MoE-I2 conducts the Layer Importance Analysis on the target MoE model and, using a predefined overall pruning rate, determines varying pruning ratios for different layers. The unimportant experts are then determined by the Layer-wise Genetic Search and Block-wise KT-Reception Field, and the MoE is pruned accordingly. The second stage (middle) represents Intra-Expert Decomposition. Similarly, MoE-I2 automatically performs Expert Importance Analysis on the pruned model and, using a predefined overall decomposition rate, applies varying ranks and low-rank decomposition to different experts, resulting in the final compressed model. The third stage (right) shows that we fine-tune the compressed MoE model to recover performance.

Layer-wise Genetic Search. To avoid the extreme time consumption caused by brute-force search (Lu et al., 2024), we leverage genetic search to select the candidate pruning combinations in each layer. For the i-th layer, we define all possible pruning combinations as C_{P_i}. Here, P_i represents the number of experts to be pruned in the i-th layer. Given that there are M_i experts in the i-th layer, C_{P_i} denotes the number of combinations for selecting P_i experts to prune from the total of M_i experts.

In the initial stage of Genetic Search, we first initialize a population {C_{P_i,1}, C_{P_i,2}, ..., C_{P_i,N}}, where the population size N = 100. We then calculate the loss for each combination in the population:

    L_i^n = \sum_{B} \| F_i(\mathcal{X}) - F_i(\mathcal{X}, \{E_i\} \setminus C_{P_i,n}) \|_F    (2)

where F_i represents the output of layer i of the MoE model, and \| \cdot \|_F denotes the Frobenius norm. We select the combinations with the smallest loss from {C_{P_i,n}} as parents. Using union and random sampling, we generate offspring combinations. Each individual in the offspring population undergoes mutation, in which a few of the experts to be pruned are randomly replaced. This process is repeated iteratively for 50 steps, after which we obtain the optimal few combinations of expert pruning as candidate combinations in the i-th layer.

Block-wise KT-Reception Field. After obtaining the candidate combinations, we keep only the K combinations with the smallest loss in each layer as the candidates to be used for block-level optimization. We aim to select one of the K combinations from each layer such that they jointly minimize the output loss. During this selection process, instead of only considering the importance of experts in the current layer (Lu et al., 2024), we extend the scope of candidate selection from one layer to T layers, achieving a block-wise combination. Specifically, we partition all layers into blocks of T consecutive layers. Within each block, we select the combination in a brute-force scheme. Given K candidates in each layer, and considering that there are T layers in one block, we traverse all possible combinations by selecting one combination from each of the T layers, yielding a total of K^T options. Subsequently, we calculate the output loss and select the optimal combinations for pruning. The pipeline is shown in Figure 3.

Expert Pruning. Given the to-be-pruned experts, we conduct the expert pruning operation by removing each entire expert in a structured manner.
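A minimal sketch of these two search steps is given below. The `layer_loss` callable (evaluating Eq. 2 for one layer and one candidate set on the calibration data) and the `block_loss` callable (the output error of a block when a choice of combinations is applied to all of its T layers) are assumed, model-specific interfaces; the population size, 50 iterations, union-plus-random-sampling crossover, and random-replacement mutation follow the description above, while the remaining selection details are our assumptions.

```python
import random
from itertools import product
from typing import Callable, Dict, FrozenSet, List, Sequence

# Assumed interfaces (hypothetical, model-specific):
#   layer_loss(layer_idx, dropped)  -> Eq. 2 accumulated over the calibration batches.
#   block_loss(block_layers, picks) -> output error of the block when, for each layer in
#                                      `block_layers`, the matching combination in `picks` is pruned.
LayerLossFn = Callable[[int, FrozenSet[int]], float]
BlockLossFn = Callable[[Sequence[int], Sequence[FrozenSet[int]]], float]


def genetic_search_layer(layer_idx: int, num_experts: int, num_prune: int,
                         layer_loss: LayerLossFn, population_size: int = 100,
                         iterations: int = 50, keep_top_k: int = 3,
                         mutation_rate: float = 0.1, seed: int = 0) -> List[FrozenSet[int]]:
    """Layer-wise Genetic Search: return the K lowest-loss pruning combinations for one layer."""
    rng = random.Random(seed)
    experts = list(range(num_experts))

    def mutate(combo: FrozenSet[int]) -> FrozenSet[int]:
        kept = set(combo)
        for e in list(kept):
            if rng.random() < mutation_rate:                     # randomly replace a pruned expert
                kept.remove(e)
                kept.add(rng.choice([x for x in experts if x not in kept]))
        return frozenset(kept)

    def crossover(a: FrozenSet[int], b: FrozenSet[int]) -> FrozenSet[int]:
        return frozenset(rng.sample(sorted(a | b), num_prune))   # union + random sampling

    population = {frozenset(rng.sample(experts, num_prune)) for _ in range(population_size)}
    for _ in range(iterations):
        ranked = sorted(population, key=lambda c: layer_loss(layer_idx, c))
        parents = ranked[:max(2, population_size // 2)]
        children = set(parents)
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.add(mutate(crossover(a, b)))
        population = children
    return sorted(population, key=lambda c: layer_loss(layer_idx, c))[:keep_top_k]


def blockwise_kt_selection(candidates: Dict[int, List[FrozenSet[int]]],
                           block_loss: BlockLossFn, block_size: int = 3) -> Dict[int, FrozenSet[int]]:
    """Block-wise K^T selection: per block of T layers, brute-force the K^T candidate picks."""
    layers = sorted(candidates)
    chosen: Dict[int, FrozenSet[int]] = {}
    for start in range(0, len(layers), block_size):
        block = layers[start:start + block_size]
        best = min(product(*(candidates[l] for l in block)),
                   key=lambda pick: block_loss(block, pick))
        chosen.update(dict(zip(block, best)))
    return chosen
```

With the K = 3, T = 3 setting found best in Sec. 4.3, each block only scores 3^3 = 27 configurations, versus the C_{P_i}^{M_i} evaluations of an exhaustive per-layer search.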
Figure 2: Importance analysis of Mixtral-8×7B (left), Qwen1.5-MoE-A2.7B (middle), and DeepSeek-V2-Lite
(right) models. A larger loss indicates greater importance. For Mixtral-8×7B and Qwen1.5-MoE-A2.7B, the
importance of the different layers is relatively consistent, but for DeepSeek-V2-Lite, the importance increases as
one approaches the output layer.

Figure 3: The process of the KT-Receptive Field in Mixtral-8×7B to satisfy a 25% pruning ratio. In this case, the number of candidate combinations per layer is K = 3, and the number of layers per block is T = 3. For each layer, we select K optimal candidates using the Layer-wise Genetic Search (top-left). Within a consecutive sequence of T layers, we employ the Block-wise KT-Reception Field to identify the best-performing combination within that block (T layers).

3.2 Intra-Expert Decomposition

In this stage, we propose to further compress the remaining experts in a fine-grained way by performing low-rank decomposition on the parameters within each expert.

3.2.1 Expert Importance Analysis

As mentioned in (Chi et al., 2022), each expert has varying levels of importance. To achieve better compression performance, instead of applying a uniform compression ratio, we aim to retain more parameters in the important experts and fewer in the less important ones. This leads us to assign higher ranks to the more important experts and lower ranks to the less important ones. Therefore, to calculate the varying ranks, we analyze the relative importance of each expert. Based upon the previous analysis in Sec. 3.1.1, we adopt the same importance metric, I_{i,j} in Eq. 1, as the expert importance.

To determine the varying ranks of each expert, we begin by calculating a basic uniform rank value. Given the overall compression ratio in the second stage, and considering that the structure of all experts is entirely consistent, we directly calculate the target average rank for each expert after decomposition, denoted as R_a. By considering the importance score of each expert, we calculate the rank value for expert e_{i,j} as:

    R_{i,j} = \frac{(I_{i,j} + \epsilon)^{\alpha}}{\sum_{j=1}^{M_i'} (I_{i,j} + \epsilon)^{\alpha}} \cdot R_a \cdot M_i'    (3)

Here, M_i' represents the number of experts remaining in layer i of the model obtained after Inter-Expert Pruning, and \alpha denotes the smoothing factor used to avoid overly linearizing the distribution of rank values, set as 0.15. \epsilon is set to 1 × 10^{-6} to avoid numerical issues.
bination within that block (T layers). 3.2.2 Intra-Expert Decomposition Strategy
Singular Value Decomposition (SVD) is a general
technique to reduce parameter size by decomposing
ing the low-rank decomposition on the parameters a large dense matrix into two smaller low-rank ma-
within each intra-expert. trices. Compared to the Vanilla SVD, which only
focuses on the initial weight matrix, (Wang et al.,
3.2.1 Expert Importance Analysis 2024) generates activation by truncation-aware data
As mentioned in (Chi et al., 2022), each expert whitening and provides hierarchical closed-form
has varying levels of importance. To achieve better updates for model compression. Inspired by SVD-
compression performance, instead of applying a LLM (Wang et al., 2024) working on dense models,
uniform compression ratio, we aim to retain more we extend SVD-LLM to MoE models by integrat-
parameters in the important experts and fewer in ing the non-uniform ranks Ri,j in Sec. 3.2.1.
the less important ones. That leads us to assign
higher ranks to the more important experts and 3.3 Efficient Fine-tuning
lower ranks to the less important ones. Therefore, To mitigate performance degradation caused by
to calculate the varying ranks, we analyze the rel- the two-stage compression, we fine-tune the MoE
ative importance of each expert. Based upon the by updating the weights. Instead of adjusting all
previous analysis in Sec. 3.1.1, we adopt the same weights, we integrate LoRA (Hu et al., 2021), a
importance metric, Ii,j in Eq. 1, as the expert im- low-rank approximation technique, into the post-
portance. training of the pruned model. The overall algorithm
To determine the varying ranks of each expert, is illustrated in Alg. 1.
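For the recovery step, a LoRA fine-tune of the compressed model might be wired up roughly as follows with the Hugging Face peft library. The `target_modules` listed are placeholders, since the actual module names depend on the compressed MoE architecture; this is a sketch of the setup, not the released training script.

```python
from peft import LoraConfig, get_peft_model


def attach_lora_adapters(compressed_model,
                         rank: int = 8,
                         lora_alpha: int = 16,
                         lora_dropout: float = 0.05,
                         target_modules=("q_proj", "v_proj")):
    """Freeze the compressed MoE weights and add trainable low-rank adapters (Hu et al., 2021)."""
    config = LoraConfig(
        r=rank,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=list(target_modules),   # assumed module names; set to the modules to adapt
        task_type="CAUSAL_LM",
    )
    return get_peft_model(compressed_model, config)
```

Training then proceeds as a standard causal-LM fine-tune on the Alpaca data described in Sec. 4.1.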
Algorithm 1 The Algorithm of MoE-I2
Inputs: Initial model M, target pruning ratio P_S, expert decomposition rate D, calibration samples S_c, finetuning samples S_f.
Outputs: Compressed MoE-I2 model M_f.
1: for each layer l_i in M do
2:     I_i <- Layer Importance Analysis with S_c via Sec. 3.1.1;
3: end for
4: M_p <- Inter-Expert Pruning(M, S_c, P_S, I) via Sec. 3.1.2;
5: for each layer l_i in M_p do
6:     R_{i,j} <- Expert Importance Analysis via Sec. 3.2.1;
7: end for
8: M_c <- Intra-Expert Decomposition(M_p, S_c, D, R) via Sec. 3.2.2;
9: M_f <- Low-Rank Finetune(M_c, S_f) via Sec. 3.3;
pressed mode with only 2 epochs, the performance
4 Experiments

4.1 Experimental Settings

Model Settings. To demonstrate the effectiveness of our method, we conducted experiments on three MoE models: Qwen1.5-MoE-A2.7B (14.3B), DeepSeek-V2-Lite (16B), and Mixtral-8×7B (47B). Mixtral-8×7B has a larger number of parameters and relatively few experts (8 experts per layer over 32 layers in total). On the other hand, Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite have fewer parameters but a greater number of experts (60 and 64 experts per layer over 24 and 26 layers, respectively).

Evaluation and Datasets. To evaluate the performance in a task-agnostic setting, we mainly adopt the evaluation methodology of LLM-Pruner (Ma et al., 2023), conducting zero-shot task classification on common sense reasoning datasets such as BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). Meanwhile, our models are evaluated on multiple-choice tasks or by generating answers in open-ended generation tasks (Gao et al., 2021). Furthermore, we supplement our evaluation with a zero-shot perplexity (PPL) analysis on WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1993).
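The zero-shot perplexity used here is the usual exponentiated average next-token loss over the test corpus. A simple, non-overlapping-window version of that computation with the Hugging Face transformers API is sketched below as a reference for how the PPL numbers can be reproduced (the exact windowing used in our evaluation harness may differ).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def perplexity(model_name: str, text: str, window: int = 2048) -> float:
    """Exponentiated mean next-token loss over non-overlapping windows of the test text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16,
                                                 device_map="auto")
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)

    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1) - 1, window):
        chunk = ids[:, start:start + window + 1]   # +1 so the window predicts `window` tokens
        out = model(chunk, labels=chunk)           # loss = mean NLL over chunk.size(1) - 1 targets
        n = chunk.size(1) - 1
        total_nll += out.loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```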
Implementation Details. During the expert pruning phase, we use the same data as (Lu et al., 2024), i.e., 2048 randomly sampled examples from the C4 (Raffel et al., 2020) dataset as calibration data. In the expert decomposition phase, we also use 2048 randomly sampled examples from Alpaca (Taori et al., 2023) as calibration data to conduct the importance analysis. For the finetuning phase, similar to LLM-Pruner (Ma et al., 2023), we use Alpaca as the finetuning training set, totaling approximately 50k samples. The batch size is set to 64 and the learning rates range from 3e-4 to 5e-4. The experiments are conducted on 4 A100-80G GPUs.

4.2 Main Results

MoE-I2 Results. Table 1 presents the zero-shot performance of the models after applying the MoE-I2 framework. It is evident that pruning 25% of the expert parameters results in only a slight performance loss. However, after finetuning the compressed model for only 2 epochs, the performance can even surpass that of the original model, especially with an improvement of over 2% on the DeepSeek-V2-Lite model. This observation suggests that pruning 25% of the experts in the first step is lossless. In the second step, we further compress the pruned model with an approximately 40% compression ratio via low-rank decomposition. Finally, we perform the finetuning stage. As a result, while ensuring a reduction of more than 50% in expert parameters, the model's performance is largely preserved.

Zero-shot Performance Comparisons with Existing Methods. Table 2 shows the zero-shot performance of the pruned model, comparing Wanda (Sun et al., 2023), EEP (Lu et al., 2024), and our Inter-Expert Pruning method under the same sparsity rate. Our method demonstrates significant advantages over Wanda and EEP.

PPL Comparisons with Existing Methods. Table 3 shows the zero-shot perplexity (PPL) of the pruned model, comparing EEP and our Inter-Expert Pruning method under the same sparsity rate. Our method demonstrates significant advantages over EEP.

Inference Speedup with Existing Methods. Table 4 shows the speedup of the three models, comparing Wanda (Sun et al., 2023), EEP (Lu et al., 2024), and our MoE-I2 method.

4.3 Ablation Studies

Comparison of MoE-I2 and its Components. Table 5 demonstrates the necessity of the components within the MoE-I2 framework. It shows that MoE-I2 has a significant advantage when compared to applying only Inter-Expert Pruning or Intra-Expert Decomposition individually.
Table 1: Zero-shot performance of three models under our MoE-I2 Framework. The average is calculated among
seven classification datasets. “P” denotes the Inter-Expert Pruning operation, “D” represents the Intra-Expert
Decomposition operation, and “F” indicates the “Fine-tuning” operation based on LoRA. “Params” represents
the percentage reduction in the number of expert parameters. In the Inter-Expert Pruning stage, we prune 25%
of the experts. During the Intra-Expert Decomposition stage, for the Mixtral-8×7B model, we decompose the
remaining experts with an average rank of 2048, further reducing the parameters by approximately 37.5%. For the
Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite models, we perform decomposition with an average rank of 512,
further reducing the parameters by approximately 38.6%.

Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8×7B baseline 0 57.17 84.01 85.35 64.88 35.00 70.40 75.93 67.53
8×7B P 25% 51.79 81.36 84.07 61.99 32.80 71.12 75.85 65.57
8×7B P+F 25% 56.23 82.49 86.42 64.48 36.00 72.92 74.98 67.65
8×7B P+D 51.79% 40.70 71.51 67.83 45.34 26.00 61.37 67.56 54.33
8×7B MoE-I2 51.79% 52.20 78.22 82.62 61.07 34.00 72.20 71.50 64.55
Qwen baseline 0 41.89 73.11 79.76 57.90 30.40 70.04 68.67 60.25
Qwen P 25% 38.57 70.37 73.30 55.84 29.80 64.98 67.25 57.16
Qwen P+F 25% 45.14 75.93 78.01 57.83 32.80 71.12 68.51 61.33
Qwen P+D 53.98% 37.71 65.91 71.41 49.34 29.40 64.26 67.88 55.13
Qwen MoE-I2 53.98% 41.13 71.68 75.08 53.08 30.80 66.43 66.54 57.82
DeepSeek baseline 0 46.93 78.37 79.82 58.70 34.60 60.65 71.35 61.49
DeepSeek P 25% 45.31 74.62 67.95 57.38 33.20 59.93 70.01 58.34
DeepSeek P+F 25% 47.44 78.16 79.79 60.32 35.40 74.56 71.35 63.86
DeepSeek P+D 53.98% 38.48 71.42 70.09 48.15 27.80 60.65 65.98 54.65
DeepSeek MoE-I2 53.98% 42.58 71.80 76.79 55.16 32.60 70.76 67.64 59.62

Table 2: Zero-shot performance comparison with EEP (Lu et al., 2024) and Wanda (Sun et al., 2023)

Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8×7B EEP 25% 51.62 81.94 83.64 61.60 33.00 67.87 75.37 65.01
8×7B P 25% 51.79 81.36 84.07 61.99 32.80 71.12 75.85 65.57
8×7B Wanda 50% 42.06 74.16 76.64 53.16 27.00 63.90 70.96 58.27
8×7B EEP 50% 48.89 78.16 81.35 57.66 29.00 61.37 72.85 61.33
8×7B P 50% 48.38 78.66 81.41 58.35 27.00 64.62 74.19 61.80

Table 3: Zero-shot perplexity comparison with EEP (Lu et al., 2024). "↓" indicates that lower values are better.

Model Method Params↓ WikiText2↓ PTB↓
8×7B baseline 0% 6.24 107.24
8×7B EEP 25% 8.16 141.1
8×7B P 25% 8.01 133.38
8×7B EEP 50% 11.02 207.4
8×7B P 50% 10.1 185.2

Table 4: Inference speedup performance comparison with EEP (Lu et al., 2024) and Wanda (Sun et al., 2023) at a compression rate of 50%. "↓" indicates that lower values are better.

Model Method Mem (GB)↓ Speedup Average
8×7B baseline 87.7 1.0× 67.53
8×7B EEP 45.78 1.20× 61.33
8×7B Wanda 50.01 0.91× 58.27
8×7B MoE-I² 43.49 1.28× 64.55
Qwen baseline 26.67 1.0× 60.25
Qwen MoE-I² 14.14 1.12× 57.82
DeepSeek baseline 29.26 1.0× 61.49
DeepSeek MoE-I² 15.03 1.13× 59.62

Impact of Genetic Search. For the Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite models, which have 60 and 64 experts per layer respectively, we iterate only 50 times for Genetic Search. As shown in Figure 4, the loss has converged in the majority of layers. Using EEP (Lu et al., 2024) for combinatorial search would result in prohibitive time complexity: if pruning 25% of the experts, EEP would require searching C_{15}^{60} and C_{16}^{64} combinations for each layer, respectively. Table 6 presents the performance of the pruned models obtained through our Inter-Expert Pruning compared to the Random and TopLoss methods in terms of average zero-shot performance (among seven classification datasets) and perplexity. TopLoss denotes individually selecting the P_i least important experts in the current layer to prune, instead of considering the expert combination used in Genetic Search. As observed, Genetic Search has a significant advantage over the other methods at similarly low time cost on the seven classification tasks and PPL.

Impact of KT-Receptive Field. As shown in Figure 5, we also observe that a large KT-Receptive Field is not always best during calibration. This is partially because we only use a small amount of data for calibration (2048 samples selected from the C4 dataset). Additionally, there is a significant difference between the C4 dataset and the seven datasets used for zero-shot validation. Simply increasing the values of K and T can lead to overfitting on the calibration dataset. Empirically, K = 3 and T = 3 achieve the best performance.
Table 5: Comparison of zero-shot performance of the MoE-I2 framework and its components. To ensure the same compression ratio and ease of computation as much as possible, when performing "D+F" we set the average rank value of the experts to 1/4 of the expert dimension, which is 352.

Model Method Params ↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8x7B P+F 50% 50.43 78.79 82.42 59.12 32.00 70.40 74.03 63.88
8x7B D+F 51.35% 46.08 75.34 81.41 54.02 27.80 72.20 68.27 60.73
8×7B MoE-I2 51.79% 52.20 78.22 82.62 61.07 34.00 72.20 71.50 64.55
Qwen P+F 50% 41.89 69.15 75.20 53.97 30.20 64.98 62.43 56.83
Qwen D+F 57.81% 36.69 69.01 74.56 47.29 29.40 72.92 68.27 56.88
Qwen MoE-I2 53.98% 41.13 71.68 75.08 53.08 30.80 66.43 66.54 57.82
DeepSeek P+F 50% 39.51 70.16 68.17 53.37 26.40 64.98 63.14 55.11
DeepSeek D+F 57.81% 39.68 70.33 74.19 51.98 29.20 71.12 67.01 57.64
DeepSeek MoE-I2 53.98% 42.58 71.80 76.79 55.16 32.60 70.76 67.64 59.62

Figure 4: The left and right figures represent the loss convergence for each layer of Qwen1.5-MoE-A2.7B and
DeepSeek-V2-Lite during the Genetic Search process, respectively. As shown in the figures, after 50 iterations,
nearly all layers have converged.

Table 6: Average zero-shot performance and perplexity of our Inter-Expert Pruning ("P") compared with Random and TopLoss.
Model Method Params↓ Average WikiText2↓ PTB↓
Qwen baseline 0% 60.25 7.06 13.51
Qwen Random 25% 55.34 9.38 16.73
Qwen TopLoss 25% 56.51 8.06 15.39
Qwen P 25% 57.16 8.01 15.17
DeepSeek baseline 0% 61.49 10.22 46.43
DeepSeek Random 25% 43.93 48.05 628.97
DeepSeek TopLoss 25% 57.00 11.34 67.67
DeepSeek P 25% 58.34 11.49 65.80

Figure 5: The impact of K and T on the performance of the Mixtral-8×7B, Qwen1.5-MoE-A2.7B, and DeepSeek-V2-Lite models.

Impact of Non-uniform Pruning Ratio. We can observe in Figure 2 that the importance of different layers in the DeepSeek-V2-Lite model varies significantly. Table 7 demonstrates that this distinction in layer importance is effective: compared to the balanced pruning ratio used for Mixtral-8×7B and Qwen1.5-MoE-A2.7B, the imbalanced pruning ratio applied to DeepSeek-V2-Lite results in better model performance.

Impact of Different Ranks. Table 8 shows that an imbalanced rank assignment across the experts within the same layer yields better performance. This phenomenon highlights the differences among experts and indicates that different ranks should be assigned to different experts.

Impact of Expert Pruning, Layer Pruning, and Block Pruning. Table 9 shows that our expert pruning method (Genetic Search) demonstrates significant advantages over concurrent approaches such as Layer Pruning and Block Pruning (He et al., 2024). Our Genetic Search retains more performance (1.32% vs. 3.19% performance drop) while maintaining a higher pruning rate (23.95% vs. 15.51% pruning ratio). Note that since (He et al., 2024) presents normalized zero-shot accuracy results, we have also normalized our results for fairness.
Table 7: Zero-shot performance of Inter-Expert Pruning with imbalanced (Imba.) and balanced (Ba.) pruning ratios on DeepSeek-V2-Lite.

Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
DeepSeek Ba. 25% 44.20 73.91 68.26 57.07 32.00 57.76 69.93 57.59
DeepSeek Imba. 25% 45.31 74.62 67.95 57.38 33.20 59.93 70.01 58.34
DeepSeek Ba. 50% 31.74 60.19 61.28 45.34 22.40 50.90 60.62 47.50
DeepSeek Imba. 50% 31.74 61.87 61.74 44.79 23.60 54.87 56.67 47.90

Table 8: Zero-shot performance of Intra-Expert Decomposition with imbalanced (Imba.) and balanced (Ba.) ranks within the same layer for the three models.

Model Rank(avg) Type ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8x7B 2048 Ba. 43.66 73.45 74.03 54.31 27.40 67.92 69.55 58.62
8x7B 2048 Imba. 43.94 73.95 74.56 55.91 27.80 68.23 69.85 59.18
8x7B 1550 Ba. 33.70 63.43 62.57 47.29 22.00 62.45 62.98 50.63
8x7B 1550 Imba. 34.59 63.67 62.59 47.68 22.00 63.05 63.15 50.96
Qwen 704 Ba. 40.19 72.94 77.95 54.50 30.40 68.95 69.06 59.14
Qwen 704 Imba. 40.44 73.40 77.74 54.54 31.60 68.95 69.30 59.43
Qwen 352 Ba. 35.92 67.55 73.64 44.09 26.40 70.04 67.17 54.97
Qwen 352 Imba. 36.26 67.89 73.15 44.34 27.20 72.20 66.69 55.39
DeepSeek 704 Ba. 43.60 76.94 77.77 53.98 30.40 62.82 69.22 59.25
DeepSeek 704 Imba. 44.11 77.19 78.50 54.20 30.40 63.54 69.30 59.61
DeepSeek 352 Ba. 33.45 65.11 63.05 39.07 25.20 61.75 64.88 50.35
DeepSeek 352 Imba. 34.04 65.95 63.76 39.53 25.80 60.29 65.19 50.65

Table 9: Performance of pruning on Mixtral-8×7B for our Genetic Search versus C-MoE (He et al., 2024). "P" denotes our Inter-Expert Pruning operation (Genetic Search). "E[n/m]" denotes dropping n out of m experts per MoE layer on average. "L[n/m]" and "B[n/m]" denote dropping n out of m corresponding modules with Layer Drop and Block Drop, respectively. These three methods are described in (He et al., 2024).
Model Method Mem(GB) ARC-c BoolQ HellaSwag OBQA RTE WinoGrande Average ∆↓
8×7B baseline(Ours/EEP) 87.7 59.81 84.92 83.97 47.00 71.12 76.32 70.52 -
8×7B P 66.7 56.66 83.46 81.72 46.40 71.12 75.85 69.02 ↓ 1.32
8×7B baseline (He et al., 2024) 87.7 59.4 84.2 84.00 46.80 70.40 75.60 70.07 -
8×7B E2/8 66.7 53.20 77.70 80.50 46.20 55.60 76.80 65.00 ↓ 5.07
8×7B L8/32 66.6 47.70 85.30 75.20 40.40 69.70 74.60 65.42 ↓ 4.65
8×7B B5/32 74.1 51.30 85.30 78.70 42.00 69.70 74.30 66.88 ↓ 3.19

5 Conclusion

In this paper, we explore the efficiency of current large-scale MoE models and propose a general end-to-end compression framework, MoE-I2, that addresses the issue of parameter redundancy in MoE models. In our approach, we first conduct the layer importance analysis and Inter-Expert Pruning for different MoE models. Subsequently, we perform the expert importance analysis on the pruned model, ensuring appropriate target ranks for each expert when performing the Intra-Expert Decomposition. Our MoE-I2 framework significantly reduces the parameters of MoE models while maintaining high performance. In the future, we aim to support a wider variety of MoE models with larger parameter counts, enhancing their deployability.

Limitations

Our proposed framework, MoE-I2, can perform end-to-end compression on any MoE model and adaptively find suitable pruning and decomposition strategies for the target MoE model. By compressing the model at multiple levels of granularity, we ensure good compression while maintaining model performance, making it more suitable for deployment. Despite these advantages, due to computational limitations, we have not yet tested our framework on larger MoE models such as Mixtral-8×22B (141B) and DeepSeek-V2 (236B). We aim to gradually test these larger MoE models in future work.

Ethics Statement

Our research focuses on developing an end-to-end framework for the compression of Mixture-of-Experts (MoE) large language models (LLMs). By enhancing model compression techniques, we aim to significantly reduce the model size and improve inference efficiency, ensuring these improvements do not come at the cost of performance. While our work contributes to the advancement of deploying sophisticated LLMs more effectively, we recognize the ethical considerations inherent in this field. These include the need to address potential biases in the models, ensure the responsible and fair use of LLMs, and safeguard privacy. We are committed to transparency by making our compression framework publicly available. We urge the community to apply our work ethically, with careful attention to the broader societal impacts of deploying compressed LLMs.
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Tanweer Alam, Shamimul Qamar, Amit Dixit, and Mohamed Benaida. 2020. Genetic algorithm: Reviews, implementations, and applications. arXiv preprint arXiv:2007.12673.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. 2022. Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277.
Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. 2022. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35:34600-34613.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
DeepSeek-AI. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1-39.
Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323-10337. PMLR.
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0.0.1, Sept., page 8.
John J Grefenstette. 1993. Genetic algorithms and machine learning. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 3-4.
Shwai He, Daize Dong, Liang Ding, and Ang Li. 2024. Demystifying the compression of mixture-of-experts through a unified framework. arXiv preprint arXiv:2406.02500.
Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. 2022. Language model compression with weighted low-rank factorization. arXiv preprint arXiv:2207.00112.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.
Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. 2021. Scalable and efficient MoE training for multitask multilingual models. arXiv preprint arXiv:2109.10465.
Yeskendir Koishekenov, Alexandre Berard, and Vassilina Nikoulina. 2022. Memory-efficient NLLB-200: Language-specific expert pruning of a massively multilingual machine translation model. arXiv preprint arXiv:2212.09811.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. 2024. Merge, then compress: Demystify efficient SMoE with hints from its routing policy. In The Twelfth International Conference on Learning Representations.
Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. 2024. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800.
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. In Advances in Neural Information Processing Systems.
Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.
Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, and Jingjing Liu. 2020. Contrastive distillation on intermediate representations for language model compression. arXiv preprint arXiv:2009.14167.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. 2024. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378.
Yiquan Wu, Kun Kuang, Yating Zhang, Xiaozhong Liu, Changlong Sun, Jun Xiao, Yueting Zhuang, Luo Si, and Fei Wu. 2020. De-biased court's view generation with causality. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 763-780, Online. Association for Computational Linguistics.
Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993.
Dongkuan Xu, Ian EH Yen, Jinxi Zhao, and Zhibin Xiao. 2021. Rethinking network pruning under the pre-train and fine-tune paradigm. arXiv preprint arXiv:2104.08682.
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168-27183.
Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. 2023. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. 2021. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010.
