MoE-I²: Compressing Mixture of Experts Models Through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
Cheng Yang¹*, Yang Sui¹,²*, Jinqi Xiao¹, Lingyi Huang¹, Yu Gong¹, Yuanlin Duan¹, Wenqi Jia³, Miao Yin³, Yu Cheng⁴, Bo Yuan¹
¹Rutgers University, ²Rice University, ³The University of Texas at Arlington, ⁴The Chinese University of Hong Kong
[email protected], [email protected], [email protected]
pactful combinations of experts on a more global scale.

Layer-wise Genetic Search. To avoid the extreme time consumption caused by brute-force search (Lu et al., 2024), we leverage genetic search to select the M candidate combinations in each layer. For the i-th layer, we define all possible pruning combinations as C_{P_i}. Here, P_i represents the number of experts to be pruned in the i-th layer. Given that there are M_i experts in the i-th layer, C_{P_i} denotes the number of combinations for selecting P_i experts to prune from the total of M_i experts.

In the initial stage of Genetic Search, we first initialize a population {C_{P_i,1}, C_{P_i,2}, ..., C_{P_i,N}}, where the population size is N = 100. We then calculate the loss for each combination in the population:

L_i^n = \sum_{X \in B} \left\| F_i(X) - F_i\big(X, \{E_i\} \setminus C_{P_i,n}\big) \right\|_F \quad (2)

where F_i represents the output of layer i of the MoE model and \| \cdot \|_F denotes the Frobenius norm.
We select the combinations with the smallest loss from {C_{P_i,n}} as parents. Using union and random sampling, we generate offspring combinations. Each individual in the offspring population then undergoes mutation, in which a few of the experts to be pruned are randomly replaced. This process is repeated for 50 iterations, after which we obtain the optimal few combinations of expert pruning as candidate combinations in the i-th layer.
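To make this procedure concrete, here is a minimal sketch of the layer-wise genetic search described above. The `layer_output(x, pruned)` callable is an assumed helper that runs the layer's forward pass with the listed experts removed; the population size of 100, the 50 iterations, and the mutation rate mirror the description but the sketch is an illustrative reconstruction, not the authors' released implementation.

```python
import random
import torch

def layer_genetic_search(layer_output, num_experts, num_prune, calib_batches,
                         pop_size=100, iters=50, keep=3):
    """Search one MoE layer for low-loss expert-pruning combinations.

    layer_output(x, pruned) is an assumed helper that runs the layer with the
    experts listed in `pruned` removed. Returns the `keep` lowest-loss
    combinations according to the Eq. 2 objective.
    """
    reference = [layer_output(x, pruned=()) for x in calib_batches]

    def loss(comb):
        # Frobenius-norm output change summed over the calibration batches (Eq. 2).
        return sum(torch.linalg.norm(ref - layer_output(x, pruned=comb)).item()
                   for x, ref in zip(calib_batches, reference))

    # Initial population: random subsets of experts to prune.
    population = [tuple(sorted(random.sample(range(num_experts), num_prune)))
                  for _ in range(pop_size)]

    for _ in range(iters):
        ranked = sorted(population, key=loss)
        parents = ranked[: pop_size // 4]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            pool = list(set(a) | set(b))            # union of two parents
            child = random.sample(pool, num_prune)  # random sampling
            if random.random() < 0.2:               # mutation: swap one expert
                spare = [e for e in range(num_experts) if e not in child]
                child[random.randrange(num_prune)] = random.choice(spare)
            children.append(tuple(sorted(child)))
        population = parents + children

    return sorted(set(population), key=loss)[:keep]
```

Caching per-combination losses across iterations would avoid redundant forward passes; the sketch omits this for brevity.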
Figure 2: Importance analysis of Mixtral-8×7B (left), Qwen1.5-MoE-A2.7B (middle), and DeepSeek-V2-Lite
(right) models. A larger loss indicates greater importance. For Mixtral-8×7B and Qwen1.5-MoE-A2.7B, the
importance of the different layers is relatively consistent, but for DeepSeek-V2-Lite, the importance increases as
one approaches the output layer.
Block-wise KT-Reception Field. After obtaining the n candidate combinations, we keep only the K combinations with the smallest loss in each layer as the candidates to be used for the block-level optimization. We aim to select one of the K combinations from each layer such that together they minimize the output loss. During this selection process, instead of only considering the importance of experts in the current layer (Lu et al., 2024), we extend the scope of candidate selection from one layer to T layers, achieving a block-wise combination. Specifically, we partition all layers into blocks of T consecutive layers. Within each block, we select the combination in a brute-force scheme: given K candidates in each layer and T layers in one block, we traverse all possible combinations by selecting one combination from each of the T layers, yielding a total of K^T options. Subsequently, we calculate the output loss and select the optimal combinations for pruning. The pipeline is shown in Figure 3.
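As a rough illustration of this block-level step, the sketch below enumerates the K^T options with a brute-force loop. The `block_loss(choice)` callable, which should return the output loss of the T-layer block when `choice[i]` is pruned in its i-th layer, is an assumed helper rather than part of the paper.

```python
import itertools

def select_block_combinations(block_loss, candidates_per_layer):
    """Brute-force block-wise selection over K ** T options.

    candidates_per_layer: list of length T; element i holds the K candidate
    pruning combinations kept for layer i of the block.
    block_loss(choice): output loss of the T-layer block when choice[i]
    is pruned in its i-th layer (assumed helper).
    """
    best_choice, best_loss = None, float("inf")
    # Enumerate one candidate per layer: K ** T total options.
    for choice in itertools.product(*candidates_per_layer):
        cur = block_loss(choice)
        if cur < best_loss:
            best_choice, best_loss = choice, cur
    return best_choice

# Example: K = 3 candidates per layer and T = 3 layers => 27 evaluated options.
```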
Figure 3: The process of the KT-Receptive Field in Mixtral-8×7B to satisfy a 25% pruning ratio. In this case, the number of candidate combinations per layer is K = 3, and the number of layers per block is T = 3. For each layer, we select K optimal candidates using the Layer-wise Genetic Search (top-left). Within a consecutive sequence of T layers, we employ the Block-wise KT-Reception Field to identify the best-performing combination within that block (T layers).

Expert Pruning. Given the to-be-pruned experts, we conduct the expert pruning operation by removing each entire expert in a structured manner.

3.2 Intra-Expert Decomposition

In this stage, we propose to further compress the remaining experts in a fine-grained way by performing low-rank decomposition on the parameters within each expert.

3.2.1 Expert Importance Analysis

As mentioned in (Chi et al., 2022), each expert has varying levels of importance. To achieve better compression performance, instead of applying a uniform compression ratio, we aim to retain more parameters in the important experts and fewer in the less important ones. This leads us to assign higher ranks to the more important experts and lower ranks to the less important ones. Therefore, to calculate the varying ranks, we analyze the relative importance of each expert. Based on the previous analysis in Sec. 3.1.1, we adopt the same importance metric, I_{i,j} in Eq. 1, as the expert importance.

To determine the varying ranks of each expert, we begin by calculating basic uniform rank values. Given the overall compression ratio of the second stage, and considering that the structure of all experts is entirely consistent, we directly calculate the target average rank for each expert after decomposition, denoted as R_a. By considering the importance score of each expert, we calculate the rank value for expert e_{i,j} as:

R_{ij} = \frac{(I_{ij} + \epsilon)^{\alpha}}{\sum_{j=1}^{M'_i} (I_{ij} + \epsilon)^{\alpha}} \cdot R_a \cdot M'_i \quad (3)

Here, M'_i represents the number of experts remaining in layer i of the model obtained after Inter-Expert Pruning, and α denotes a smoothing factor used to avoid an overly linear distribution of rank values, set to 0.15. ϵ is set to 1 × 10⁻⁶ to avoid numerical issues.
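Eq. 3 amounts to a proportional allocation of the rank budget across the remaining experts of a layer. A minimal sketch follows, assuming the per-expert importance scores I_{ij} from Eq. 1 are available as a plain list; the rounding policy is an illustrative choice, not specified by the paper.

```python
def allocate_ranks(importance, avg_rank, alpha=0.15, eps=1e-6):
    """Distribute the target average rank R_a across the remaining experts
    of one layer in proportion to (I_ij + eps) ** alpha  (Eq. 3)."""
    weights = [(score + eps) ** alpha for score in importance]
    total = sum(weights)
    n_experts = len(importance)
    # Ranks are rounded to integers; their mean stays close to avg_rank.
    return [round(w / total * avg_rank * n_experts) for w in weights]

# Example with hypothetical importance scores and R_a = 512:
# allocate_ranks([0.9, 0.5, 0.1, 0.05], avg_rank=512)
```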
3.2.2 Intra-Expert Decomposition Strategy

Singular Value Decomposition (SVD) is a general technique to reduce parameter size by decomposing a large dense matrix into two smaller low-rank matrices. Compared to vanilla SVD, which only focuses on the initial weight matrix, SVD-LLM (Wang et al., 2024) generates activations by truncation-aware data whitening and provides hierarchical closed-form updates for model compression. Inspired by SVD-LLM (Wang et al., 2024), which works on dense models, we extend it to MoE models by integrating the non-uniform ranks R_{i,j} from Sec. 3.2.1.
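For intuition, the sketch below applies plain truncated SVD to a single expert weight matrix using its allocated rank. The truncation-aware data whitening and closed-form updates of SVD-LLM are deliberately omitted, so this is a simplified stand-in rather than the full method used in the paper.

```python
import torch

def decompose_expert_weight(weight: torch.Tensor, rank: int):
    """Factor a (d_out x d_in) expert weight into W ~= A @ B with
    A: (d_out x rank) and B: (rank x d_in).

    Plain truncated SVD only; SVD-LLM's truncation-aware whitening of
    calibration activations is omitted in this sketch.
    """
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb the singular values into A
    B = Vh[:rank, :]
    return A, B

# Replacing the original linear layer with the pair (A, B) cuts the parameter
# count from d_out * d_in to rank * (d_out + d_in).
```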
3.3 Efficient Fine-tuning

To mitigate the performance degradation caused by the two-stage compression, we fine-tune the MoE model by updating its weights. Instead of adjusting all weights, we integrate LoRA (Hu et al., 2021), a low-rank approximation technique, into the post-training of the pruned model. The overall algorithm is illustrated in Alg. 1.
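As a rough sketch of this fine-tuning step, the snippet below attaches LoRA adapters with the Hugging Face peft library; the adapter rank, dropout, and target module names are illustrative assumptions, since the paper does not list them here.

```python
from peft import LoraConfig, get_peft_model

def attach_lora(compressed_model, rank=8, alpha=16):
    """Wrap the compressed MoE model so that only low-rank adapters are trained."""
    config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=0.05,
        # Module names depend on the model family; these are placeholders.
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(compressed_model, config)
    model.print_trainable_parameters()  # only the LoRA weights are trainable
    return model
```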
Algorithm 1 The Algorithm of MoE-I²
Inputs: Initial Model M, Target Pruning Ratio P_S, Expert Decomposition Rate D, Calibration Samples S_c, Finetuning Samples S_f.
Outputs: Compressed MoE-I² model M_f
1: for each layer l_i in M do
2:   I_i ← Layer Importance Analysis with S_c via Sec. 3.1.1;
3: end for
4: M_p ← Inter-Expert Pruning(M, S_c, P_S, I) via Sec. 3.1.2;
5: for each layer l_i in M_p do
6:   R_{i,j} ← Expert Importance Analysis via Sec. 3.2.1;
7: end for
8: M_c ← Intra-Expert Decomposition(M_p, S_c, D, R) via Sec. 3.2.2;
9: M_f ← Low-Rank Finetune(M_c, S_f) via Sec. 3.3;

4 Experiments

4.1 Experimental Settings

Model Settings. To demonstrate the effectiveness of our method, we conducted experiments on three MoE models: Qwen1.5-MoE-A2.7B (14.3B), DeepSeek-V2-Lite (16B), and Mixtral-8×7B (47B). Mixtral-8×7B has a larger number of parameters but relatively few experts (8 experts per layer across 32 layers). In contrast, Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite have fewer parameters but a greater number of experts (60 and 64 experts per layer across 24 and 26 layers, respectively).

Evaluation and Datasets. To evaluate performance in a task-agnostic setting, we mainly adopt the LLM-Pruner (Ma et al., 2023) evaluation methodology, conducting zero-shot task classification across common sense reasoning datasets such as BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). Meanwhile, our models are evaluated on multiple-choice tasks or by generating answers in open-ended generation tasks (Gao et al., 2021). Furthermore, we supplement our evaluation with a zero-shot perplexity (PPL) analysis on WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1993).

Implementation Details. During the expert pruning phase, we use the same data as (Lu et al., 2024): 2048 randomly sampled examples from the C4 (Raffel et al., 2020) dataset as calibration data. In the expert decomposition phase, we also use 2048 randomly sampled examples from Alpaca (Taori et al., 2023) as calibration data to conduct the importance analysis. For the finetuning phase, similar to LLM-Pruner (Ma et al., 2023), we use Alpaca as the finetuning training set, totaling approximately 50k samples. The batch size is set to 64 and learning rates range from 3e-4 to 5e-4. The experiments are conducted on 4 A100-80G GPUs.
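For reference, calibration examples of the kind described above could be drawn as follows; the Hugging Face dataset identifiers and the text-field handling are assumptions for illustration, not details taken from the paper.

```python
import random
from datasets import load_dataset

def sample_calibration_texts(n_samples=2048, seed=0):
    """Draw calibration texts from C4 (pruning phase) and Alpaca
    (decomposition phase)."""
    random.seed(seed)
    # Streaming avoids downloading the full C4 corpus; dataset IDs are assumed.
    # For simplicity this takes the first n_samples from the stream; the paper
    # samples randomly.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    c4_texts = [ex["text"] for _, ex in zip(range(n_samples), c4)]

    alpaca = load_dataset("tatsu-lab/alpaca", split="train")
    idx = random.sample(range(len(alpaca)), n_samples)
    alpaca_texts = [alpaca[i]["instruction"] + "\n" + alpaca[i]["output"]
                    for i in idx]
    return c4_texts, alpaca_texts
```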
4.2 Main Results

MoE-I² Results. Table 1 presents the zero-shot performance of the models after applying the MoE-I² framework. It is evident that pruning 25% of the expert parameters results in only a slight performance loss. However, after finetuning the compressed model for only 2 epochs, the performance can even surpass that of the original model, with an improvement of over 2% on the DeepSeek-V2-Lite model in particular. This observation suggests that pruning 25% of the experts in the first step is lossless. In the second step, we further compress the pruned model with an approximately 40% compression ratio via low-rank decomposition. Finally, we perform the finetuning stage. As a result, while ensuring a reduction of more than 50% in expert parameters, the model's performance is largely preserved.

Zero-shot Performance Comparisons with Existing Methods. Table 2 shows the zero-shot performance of the pruned model, comparing Wanda (Sun et al., 2023), EEP (Lu et al., 2024), and our Inter-Expert Pruning method under the same sparsity rate. Our method demonstrates significant advantages over Wanda and EEP.

PPL Comparisons with Existing Methods. Table 3 shows the zero-shot perplexity (PPL) of the pruned model, comparing EEP and our Inter-Expert Pruning method under the same sparsity rate. Our method demonstrates significant advantages over EEP.

Inference Speedup with Existing Methods. Table 4 shows the speedup of the three models, comparing Wanda (Sun et al., 2023), EEP (Lu et al., 2024), and our MoE-I² method.

4.3 Ablation Studies

Comparison of MoE-I² and its Components. Table 5 demonstrates the necessity of the components within the MoE-I² framework. It shows that MoE-I² has a significant advantage when compared to applying only Inter-Expert Pruning or only Intra-Expert Decomposition individually.
Table 1: Zero-shot performance of three models under our MoE-I2 Framework. The average is calculated among
seven classification datasets. “P” denotes the Inter-Expert Pruning operation, “D” represents the Intra-Expert
Decomposition operation, and “F” indicates the “Fine-tuning” operation based on LoRA. “Params” represents
the percentage reduction in the number of expert parameters. In the Inter-Expert Pruning stage, we prune 25%
of the experts. During the Intra-Expert Decomposition stage, for the Mixtral-8×7B model, we decompose the
remaining experts with an average rank of 2048, further reducing the parameters by approximately 37.5%. For the
Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite models, we perform decomposition with an average rank of 512,
further reducing the parameters by approximately 38.6%.
Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8×7B baseline 0 57.17 84.01 85.35 64.88 35.00 70.40 75.93 67.53
8×7B P 25% 51.79 81.36 84.07 61.99 32.80 71.12 75.85 65.57
8×7B P+F 25% 56.23 82.49 86.42 64.48 36.00 72.92 74.98 67.65
8×7B P+D 51.79% 40.70 71.51 67.83 45.34 26.00 61.37 67.56 54.33
8×7B MoE-I2 51.79% 52.20 78.22 82.62 61.07 34.00 72.20 71.50 64.55
Qwen baseline 0 41.89 73.11 79.76 57.90 30.40 70.04 68.67 60.25
Qwen P 25% 38.57 70.37 73.30 55.84 29.80 64.98 67.25 57.16
Qwen P+F 25% 45.14 75.93 78.01 57.83 32.80 71.12 68.51 61.33
Qwen P+D 53.98% 37.71 65.91 71.41 49.34 29.40 64.26 67.88 55.13
Qwen MoE-I2 53.98% 41.13 71.68 75.08 53.08 30.80 66.43 66.54 57.82
DeepSeek baseline 0 46.93 78.37 79.82 58.70 34.60 60.65 71.35 61.49
DeepSeek P 25% 45.31 74.62 67.95 57.38 33.20 59.93 70.01 58.34
DeepSeek P+F 25% 47.44 78.16 79.79 60.32 35.40 74.56 71.35 63.86
DeepSeek P+D 53.98% 38.48 71.42 70.09 48.15 27.80 60.65 65.98 54.65
DeepSeek MoE-I2 53.98% 42.58 71.80 76.79 55.16 32.60 70.76 67.64 59.62
Table 2: Zero-shot performance comparison with EEP (Lu et al., 2024) and Wanda (Sun et al., 2023)
Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8×7B EEP 25% 51.62 81.94 83.64 61.60 33.00 67.87 75.37 65.01
8×7B P 25% 51.79 81.36 84.07 61.99 32.80 71.12 75.85 65.57
8×7B Wanda 50% 42.06 74.16 76.64 53.16 27.00 63.90 70.96 58.27
8×7B EEP 50% 48.89 78.16 81.35 57.66 29.00 61.37 72.85 61.33
8×7B P 50% 48.38 78.66 81.41 58.35 27.00 64.62 74.19 61.80
Table 3: Zero-shot perplexity (PPL) results of the comparison with EEP and Wanda. "↓" indicates that lower values are better.

Model Method Params↓ WikiText2↓ PTB↓
8×7B baseline 0% 6.24 107.24
8×7B EEP 25% 8.16 141.1
8×7B P 25% 8.01 133.38
8×7B EEP 50% 11.02 207.4
8×7B P 50% 10.1 185.2

Table 4: Inference speedup performance comparison with EEP (Lu et al., 2024) and Wanda (Sun et al., 2023) at a compression rate of 50%. "↓" indicates that lower values are better.

Model Method Mem (GB)↓ Speedup Average
8×7B baseline 87.7 1.0× 67.53
8×7B EEP 45.78 1.20× 61.33
8×7B Wanda 50.01 0.91× 58.27
8×7B MoE-I² 43.49 1.28× 64.55
Qwen baseline 26.67 1.0× 60.25
Qwen MoE-I² 14.14 1.12× 57.82
DeepSeek baseline 29.26 1.0× 61.49
DeepSeek MoE-I² 15.03 1.13× 59.62
Impact of Genetic Search. For the Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite models, which have 60 and 64 experts per layer respectively, we iterate only 50 times for Genetic Search. As shown in Figure 4, the loss has converged in the majority of layers. Using EEP (Lu et al., 2024) for combinatorial search would result in prohibitive time complexity: for instance, when pruning 25% of the experts, EEP would require searching C_{60}^{15} and C_{64}^{16} combinations per layer, respectively. Table 6 presents the performance of the pruned models obtained through our Inter-Expert Pruning compared to the Random and TopLoss methods, in terms of average zero-shot performance (among seven classification datasets) and perplexity. TopLoss denotes individually selecting the P_i least important experts in the current layer to prune, instead of considering the expert combinations used in Genetic Search. As observed, Genetic Search has a significant advantage over the other methods with a similarly low time cost on the seven classification tasks and PPL.
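To put the brute-force search space mentioned above in perspective, the counts can be checked directly (a quick illustrative calculation, not a figure from the paper):

```python
import math

# Ways to choose which experts to prune in a single layer at a 25% ratio.
print(math.comb(60, 15))  # 53,194,089,192,720  (~5.3e13 options per layer)
print(math.comb(64, 16))  # 488,526,937,079,580 (~4.9e14 options per layer)
```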
Impact of KT-Receptive Field. As shown in Figure 5, we also observe that a larger KT-Receptive Field is not always better during calibration. This is partially because we only use a small amount of data for calibration (2048 samples selected from the C4 dataset). Additionally, there is a significant difference between the C4 dataset and the seven datasets used for zero-shot validation. Simply in-
Table 5: Comparison of zero-shot performance of the MoE-I² framework and its components. To keep the compression ratio as consistent as possible and for ease of computation, when performing "D+F" we set the average rank value of the experts to 1/4 of the expert dimension, which is 352.
Model Method Params ↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8x7B P+F 50% 50.43 78.79 82.42 59.12 32.00 70.40 74.03 63.88
8x7B D+F 51.35% 46.08 75.34 81.41 54.02 27.80 72.20 68.27 60.73
8×7B MoE-I2 51.79% 52.20 78.22 82.62 61.07 34.00 72.20 71.50 64.55
Qwen P+F 50% 41.89 69.15 75.20 53.97 30.20 64.98 62.43 56.83
Qwen D+F 57.81% 36.69 69.01 74.56 47.29 29.40 72.92 68.27 56.88
Qwen MoE-I2 53.98% 41.13 71.68 75.08 53.08 30.80 66.43 66.54 57.82
DeepSeek P+F 50% 39.51 70.16 68.17 53.37 26.40 64.98 63.14 55.11
DeepSeek D+F 57.81% 69.68 70.33 74.19 51.98 29.20 71.12 67.01 57.64
DeepSeek MoE-I2 53.98% 42.58 71.80 76.79 55.16 32.60 70.76 67.64 59.62
Figure 4: The left and right figures represent the loss convergence for each layer of Qwen1.5-MoE-A2.7B and
DeepSeek-V2-Lite during the Genetic Search process, respectively. As shown in the figures, after 50 iterations,
nearly all layers have converged.
Model Method Params↓ ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
DeepSeek Ba. 25% 44.20 73.91 68.26 57.07 32.00 57.76 69.93 57.59
DeepSeek Imba. 25% 45.31 74.62 67.95 57.38 33.20 59.93 70.01 58.34
DeepSeek Ba. 50% 31.74 60.19 61.28 45.34 22.40 50.90 60.62 47.50
DeepSeek Imba. 50% 31.74 61.87 61.74 44.79 23.60 54.87 56.67 47.90
Table 8: Zero-shot performance of Intra-Expert Decomposition with imbalanced (Imba.) versus balanced (Ba.) ranks in the same layer, for three models.
Model Rank(avg) Type ARC-c ARC-e BoolQ HellaSwag OBQA RTE WinoGrande Average
8x7B 2048 Ba. 43.66 73.45 74.03 54.31 27.40 67.92 69.55 58.62
8x7B 2048 Imba. 43.94 73.95 74.56 55.91 27.80 68.23 69.85 59.18
8x7B 1550 Ba. 33.70 63.43 62.57 47.29 22.00 62.45 62.98 50.63
8x7B 1550 Imba. 34.59 63.67 62.59 47.68 22.00 63.05 63.15 50.96
Qwen 704 Ba. 40.19 72.94 77.95 54.50 30.40 68.95 69.06 59.14
Qwen 704 Imba. 40.44 73.40 77.74 54.54 31.60 68.95 69.30 59.43
Qwen 352 Ba. 35.92 67.55 73.64 44.09 26.40 70.04 67.17 54.97
Qwen 352 Imba. 36.26 67.89 73.15 44.34 27.20 72.20 66.69 55.39
DeepSeek 704 Ba. 43.60 76.94 77.77 53.98 30.40 62.82 69.22 59.25
DeepSeek 704 Imba. 44.11 77.19 78.50 54.20 30.40 63.54 69.30 59.61
DeepSeek 352 Ba. 33.45 65.11 63.05 39.07 25.20 61.75 64.88 50.35
DeepSeek 352 Imba. 34.04 65.95 63.76 39.53 25.80 60.29 65.19 50.65
Table 9: Pruning performance on Mixtral-8×7B, comparing our Genetic Search with C-MoE (He et al., 2024). "P" denotes our Inter-Expert Pruning operation (Genetic Search). "E[n/m]" denotes dropping n out of m experts per MoE layer on average. "L[n/m]" and "B[n/m]" denote dropping n out of m corresponding modules with Layer Drop and Block Drop, respectively. These three methods are described in (He et al., 2024).
Model Method Mem(GB) ARC-c BoolQ HellaSwag OBQA RTE WinoGrande Average ∆↓
8×7B baseline(Ours/EEP) 87.7 59.81 84.92 83.97 47.00 71.12 76.32 70.52 -
8×7B P 66.7 56.66 83.46 81.72 46.40 71.12 75.85 69.02 ↓ 1.32
8×7B baseline (He et al., 2024) 87.7 59.4 84.2 84.00 46.80 70.40 75.60 70.07 -
8×7B E2/8 66.7 53.20 77.70 80.50 46.20 55.60 76.80 65.00 ↓ 5.07
8×7B L8/32 66.6 47.70 85.30 75.20 40.40 69.70 74.60 65.42 ↓ 4.65
8×7B B5/32 74.1 51.30 85.30 78.70 42.00 69.70 74.30 66.88 ↓ 3.19
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.

Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, and Jingjing Liu. 2020. Contrastive distillation on intermediate representations for language model compression. arXiv preprint arXiv:2009.14167.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. 2021. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010.