AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts
ICCV 2023
Tianlong Chen1∗†, Xuxi Chen1∗, Xianzhi Du2†, Abdullah Rashwan3, Fan Yang3,
Huizhong Chen3, Zhangyang Wang1, Yeqing Li3
1 University of Texas at Austin, 2 Apple, 3 Google
{tianlong.chen, xxchen, atlaswang}@utexas.edu, [email protected], {arashwan, fyangf, huizhongc, yeqing}@google.com
Abstract

[Figure 1: multi-task performance comparison across ViT backbones; legend includes ViT-Small∗.]
activated MoE experts (e.g., multilayer perceptrons (MLPs)). Then, task-dependent routing policies are enforced to select a subset of task-relevant experts. Impressive results are demonstrated with this MTL MoE [46].

Despite these preliminary investigations, key challenges still persist in building an effective MTL system: How to determine an appropriate network capacity for each task in MTL? Treating it as a hyperparameter and manually tuning it for each task is laborious and infeasible due to the entanglement between tasks. Thus, a fixed model size across all tasks is the conventional setup of existing MTL approaches (e.g., always using 4 experts in [46]). However, this rigid and sub-optimal design potentially sacrifices the learning of certain tasks, since excessive or insufficient network capacity leads to over-fitting or under-fitting in simple or complex scenarios, respectively [72]. The disadvantages are further amplified when optimizing multiple tasks with substantial variation in task complexity. Take image classification and object detection as examples. First, the common benchmarks for classification have a lower input resolution, like 32 × 32 for CIFAR [40] and 224 × 224 for ImageNet [24], while object detection is normally evaluated on the COCO [47] dataset with a higher resolution of 640 × 640 or 892 × 892. Second, to obtain satisfying performance, the routine network for detection [9] is usually larger than the one for classification [69], such as ResNet-101 [34] versus ResNet-50. Third, as for the task objectives, object detection contains both object localization and recognition, and is thus more complicated than classification, which can essentially be regarded as its sub-task. As discovered in [18, 33], their mismatched learning goals emphasize different feature properties (i.e., location-invariant [8] versus location-sensitive). Given such heterogeneity of task complexity, these two tasks are notoriously difficult to learn together with a shared feature extractor and a unified model size. An adaptive mechanism is therefore demanded.

In this paper, we propose AdaMV-MoE to address the aforementioned key barriers by seamlessly customizing the current state-of-the-art (SOTA) MTL MoE [46]. To be specific, an adaptive expert selection mechanism is proposed to automatically determine the number of experts (or model capacity) in use for different vision tasks. We monitor the validation loss to adaptively activate more or fewer experts, preventing under-fitting or over-fitting. Our contributions are summarized below:

⋆ We target the problem of multi-task vision recognition and tackle the key challenge of choosing a suitable network capacity for distinctive tasks. According to training dynamics, our algorithm controls the task-specific model size in an adaptive and automatic manner.

⋆ We introduce a customized MoE to resolve image classification, object detection, and instance segmentation simultaneously, which used to be a troublesome combination for MTL. Visualization of our learned task-specific routing decisions is provided and exposes specialization patterns, particularly for image contents.

⋆ Extensive experiments are conducted to reveal the effectiveness of AdaMV-MoE in MTL, as shown in Figure 1. For example, our approaches surpass the vanilla MTL ViT with a shared feature extractor by a significant performance margin of {6.66% ∼ 7.39% accuracy, 0.87% ∼ 1.13% AP, 0.84% ∼ 0.89% APmask} for {image classification, object detection, instance segmentation} on the ImageNet and COCO datasets with UViT-Base backbones [16].

2. Related Works

Multi-Task Learning (MTL). MTL resolves multiple objectives and produces corresponding predictions for input samples. It has been investigated over a long history, and numerous solutions have been proposed, ranging from classic learning algorithms [78, 36, 89, 4, 80, 43, 23, 41] to modern deep neural networks. Deep learning methods generate shared feature representations to model the common information across tasks, and can be categorized into two groups, i.e., encoder- and decoder-focused pipelines. The former [52, 60, 28, 48] allows task interactions in the encoder and attaches task-specific heads on top of it as independent decoders. For example, [52] and [48] advocate a linear combination and an attention mechanism, respectively, to learn shared encoder representations among tasks. The latter [77, 87, 86, 70] first creates initial task-dependent features from decoders and then aggregates them to form the final per-task prediction. Such pipelines consume heavy computation since they need to execute all tasks at least once for the initial decoder features, which limits their practical usage in resource-constrained scenarios. In this paper, we mainly study encoder-focused architectures.

A conventional encoder architecture is a convolutional neural network (CNN) [48, 63, 84, 85]. As ViTs emerge, IPT [11] leveraged transformer-based models to solve multiple low-level vision tasks. [54] and [61] adopt similar architectures for the tasks of {object detection, semantic segmentation} and {scene and action understanding, score prediction} in videos, respectively. [7] further involves vision tasks from 3D domains. Our work considers jointly learning classification, object detection, and instance segmentation with ViT-based models. Note that this is highly non-trivial, since classification and detection & segmentation emphasize location-invariant [8] and location-sensitive features respectively, which potentially contradict each other. Besides, another theme in MTL investigates how to share and separate parameter spaces for learning task-agnostic and task-specific knowledge, respectively [66, 71, 55, 6, 46].
[Figure 2: Overview of AdaMV-MoE. Input token embeddings of Task I and Task II pass through shared transformer blocks (multi-head attention, normalization), and task-dependent routers select among shared MLP experts.]
Mixture-of-Experts (MoE). MoE duplicates some network components into a series of copies (named experts) and embraces conditional computation in an input-dependent way [37, 39, 12, 82]. The earliest variants of MoE densely activate all experts for each input and are therefore computation-intensive [26]. Later on, [62, 44, 27] advocate a sparsely activated style of utilizing experts, called sparse MoE (SMoE). It greatly reduces the cost at both the training and inference stages, which grants impressive scalability and even allows enormous language models with trillions of parameters [27]. The effectiveness of SMoE has been widely proved in various NLP [62, 44, 91, 88, 93, 38] and vision [58, 26, 2, 30, 74, 79, 1, 57] tasks. Particularly, the pioneering work [58] offers the first vision transformer-based SMoE for the image recognition task.

With further investigation, several downsides of SMoE have been revealed, including: i) Training instability. [92] conducts a trade-off study of SMoE between training stability and quality, showing that many classic tricks like gradient clipping stabilize training but sacrifice performance, while the router z-loss [92] seems to bring a win-win case. ii) Poor specialization. The ideal outcome of SMoE is to divide and conquer certain tasks by tackling each sub-problem with selected experts [3, 32, 51, 53, 15]. Yet this is hard to reach unless specialization is explicitly enforced and the redundancy among experts is trimmed down [13], e.g., by pre-defining a diverse expert assignment [22] or involving multiple routing policies [32]. iii) Representation collapse. A naïvely trained SMoE is prone to load imbalance, e.g., only a few experts are frequently used while the others are scarcely activated. To alleviate this issue, [62] adds Gaussian noise to router networks; [44, 27] propose an auxiliary loss as regularization; [45] formulates and solves a balanced linear assignment problem; [91] distributes the top-k relevant inputs to each expert; [59, 93] adopt deterministic hashing and stochastic routing; and [14] promotes diversity during training. In this paper, we not only examine the aforementioned bottlenecks but also investigate new properties of routers such as policy convergence.

Several recent studies also explore the possibility of SMoE in MTL scenarios. To be specific, [51, 3, 32, 31, 90] use task-dependent router networks to select relevant parts of a fixed-size model for each task. They show positive results in small-scale applications like classification for medical signals [3], digit images (MNIST) [32], and recommendation systems [51]. [46] works on efficient on-device MTL with a model-accelerator co-designed SMoE.

3. Methodology

3.1. Revisiting Sparse Mixture of Experts

SMoE [62] is proposed to scale up the model capacity while maintaining low per-inference costs. In this work, we consider SMoE for ViTs [25, 58], which inserts SMoE layers into every other transformer block. An SMoE layer contains a router network R and several experts f_1, f_2, ..., f_E, where E is the number of experts. The expert module can be a few fully connected [62, 58] or convolutional layers [73]; we duplicate multi-layer perceptrons (MLPs) as expert networks, as shown in Figure 2. Note that the MLPs in ViTs contain around 2/3 of the total parameter count, and [29, 20] demonstrate their significance as memory networks that store substantial knowledge.
The other key component in an SMoE layer, i.e., R, activates the top-k expert networks with the largest scores R(x)_i associated with an input embedding x, where i is the expert index. Normally, the number of selected experts k is fixed and much smaller than the total number of experts E, which reflects the sparsely activated fashion of SMoE. The expert distribution can be formally depicted as:

\[
y = \sum_{i=1}^{k} \mathcal{R}(x)_i \cdot f_i(x), \qquad \mathcal{R}(x) = \mathrm{TopK}(\mathrm{softmax}(g(x)), k),
\]
\[
\mathrm{TopK}(v, k) = \begin{cases} v & \text{if } v \text{ is in the top } k \\ 0 & \text{otherwise} \end{cases}
\]

where f_i(x) stands for the feature representation produced by expert f_i, which is weighted by R(x)_i to form the final output y. The network g is the learnable part within a router R, and it is usually an MLP with one or a few layers [62, 27]. TopK is a function that discards the small elements ranked after the k-th. To reduce the negative effects of imbalanced loading (or representation collapse [19]), we introduce regularization terms to balance the expert assignments, following the design and default hyperparameters in [58].
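To make the formulation above concrete, the following is a minimal sketch of such an SMoE layer (ours for illustration, not the paper's implementation; PyTorch is assumed and all names are ours). It computes the softmax gating scores, selects the top-k experts, and returns the gate-weighted sum of the expert MLP outputs:

```python
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    """Sketch of an SMoE layer: y = sum_i R(x)_i * f_i(x) with top-k routing."""

    def __init__(self, dim: int, mlp_dim: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        # g(x): a one-layer MLP producing one gating logit per expert.
        self.gate = nn.Linear(dim, num_experts)
        # Experts f_1..f_E: duplicated MLP blocks, as in the paper.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim) token embeddings.
        scores = torch.softmax(self.gate(x), dim=-1)          # R(x) before TopK
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep the k largest
        y = torch.zeros_like(x)
        # Loop over experts for clarity; real systems batch-dispatch tokens.
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)                # tokens routed to e
            if mask.any():
                w = topk_scores[mask][topk_idx[mask] == e]    # gate weight R(x)_e
                y[mask] += w.unsqueeze(-1) * expert(x[mask])
        return y
```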
3.2. AdaMV-MoE: Adaptive Multi-Task Vision Recognition with Mixture-of-Experts

Overview of AdaMV-MoE. Our proposed framework, i.e., AdaMV-MoE, consists of task-dependent router networks and an adaptive expert selection (AES) mechanism. As described in Figure 2, input token embeddings are fed into the corresponding router networks based on their task types. The task-dependent routers then choose the most relevant experts and aggregate their features for different tasks. The number of selected experts is dynamically decided according to the in-time training dynamics with AES.
Task-dependent Routing Policies. Let R^j denote the router for task j; all expert networks {f_i}_{i=1}^{E} are shared across tasks. The SMoE equipped with task-dependent router networks is defined as:

\[
y^j = \sum_{i=1}^{k_j} \mathcal{R}^j(x)_i \cdot f_i(x), \qquad \mathcal{R}^j(x) = \mathrm{TopK}(\mathrm{softmax}(g^j(x)), k_j),
\]

where k_j and y^j are the task-specific number of activated experts and the task-specific output, respectively. As supported by Section 4, the discrepancy among different routing policies brings a disentanglement of parameter spaces, resulting in mitigated gradient conflicts in MTL and enhanced performance.
gradient conflicts of MTL and enhanced performance. Table 2. Detailed model sizes of (Dense) ViT variants.
Backbones # Transformer Layers # Attention Heads Hidden Dimension MLP Dimension
Adaptive Expert Selection (AES). The optimal network ViT-Small∗ 6 6 384 1536
ViT-Small 12 6 384 1536
size for various vision recognition tasks may alter signifi- ViT-Base 12 6 768 3072
UViT-Base 18 6 384 1536
cantly, due to the difference in task complexities. It is hard
to conclude manually without laborious trial and error. We The backbone first takes input images from the classifi-
instead adopt an automatic algorithm AES to determine the cation and detection datasets and then extracts features that
kj in a data-driven way. As shown in Algorithm 1, it first will further be processed by task-specific modules. A lin-
computes the task-specific objective Ljval on the validation ear classification layer and detection & segmentation heads
set. If Ljval does not decay in the next ∆n iterations, then we from Cascade Mask-RCNN [9] are chosen in our experi-
expand the activated model size by updating kj = kj + 1. ments. Following [58], ViT and SMoE layers are arranged
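In code, AES reduces to a control loop around ordinary training. A hedged Python sketch of Algorithm 1 for one task (train_steps and eval_loss stand in for the actual training and validation routines, and the validation loss is checked after every iteration for simplicity):

```python
def adaptive_expert_selection(layer, task, train_steps, eval_loss,
                              delta_n=2000, total_iters=200_000):
    """Sketch of Algorithm 1 for one task j. `train_steps(n)` runs n training
    iterations and `eval_loss()` returns L_val^j; both are assumed helpers."""
    k, best, improved, it = 1, float("inf"), True, 0
    layer.k_per_task[task] = k
    while True:
        stall = 0
        while stall < delta_n:                  # train until L_val^j stalls
            train_steps(1)
            it, stall = it + 1, stall + 1
            loss = eval_loss()
            if loss < best:                     # lines 15-17: record improvement
                best, improved, stall = loss, True, 0
        if not improved:                        # lines 8-9: stalled twice in a row
            break
        k, improved = k + 1, False              # line 11: grow capacity
        layer.k_per_task[task] = k
    layer.k_per_task[task] = k - 1              # line 19: last growth did not help
    train_steps(max(0, total_iters - it))       # line 20: finish with fixed k_j
```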
4. Experiment

4.1. Implementation Details

Network Backbone. Our experiments focus on ViT-based backbones, including ViT [25] and its advanced variant, UViT [16]. Varying the model size, we establish four ViTs, {ViT-Small∗, ViT-Small, ViT-Base, UViT-Base}, whose details are exhibited in Table 2.

Table 2. Detailed model sizes of (Dense) ViT variants.

| Backbone | # Transformer Layers | # Attention Heads | Hidden Dimension | MLP Dimension |
|---|---|---|---|---|
| ViT-Small∗ | 6 | 6 | 384 | 1536 |
| ViT-Small | 12 | 6 | 384 | 1536 |
| ViT-Base | 12 | 6 | 768 | 3072 |
| UViT-Base | 18 | 6 | 384 | 1536 |

The backbone first takes input images from the classification and detection datasets and then extracts features that are further processed by task-specific modules. A linear classification layer and the detection & segmentation heads from Cascade Mask-RCNN [9] are chosen in our experiments. Following [58], ViT and SMoE layers are arranged alternately. More details are in Section A1.
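For reference, Table 2 maps to a small configuration lookup (a sketch; the dictionary and tuple layout are ours):

```python
# (transformer layers, attention heads, hidden dim, MLP dim), transcribed from Table 2.
VIT_CONFIGS = {
    "ViT-Small*": (6, 6, 384, 1536),
    "ViT-Small":  (12, 6, 384, 1536),
    "ViT-Base":   (12, 6, 768, 3072),
    "UViT-Base":  (18, 6, 384, 1536),
}
```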
Table 1. Multi-task vision recognition performance of our proposed AdaMV-MoE. {Accuracy (%)}, {AP (%), AP50 (%), AP75 (%)}, and {APmask (%)} are reported for classification (CLS) on ImageNet-1k, and for object detection (OD) and instance segmentation (IS) on COCO, respectively. # Parameters (M) indicates the adaptively allocated network capacity; for fully shared models, the same count applies to both tasks. ViT-Small∗/Small/Base [25] and UViT-Base [16] backbones are adopted, whose details are recorded in Table 2. ViT-Small∗ is a reduced variant of ViT-Small with half the transformer layers. Comparisons are conducted with the baseline MTL-ViT and a recent state-of-the-art MTL approach, TAPS [71]. The total number of experts E in our AdaMV-MoE is 8. {Dense and Large Dense, Sparse} means that the {entire, partial} network is used for each task at the training and inference stages, respectively. N.A. denotes "Not Applicable".

| Backbone | Method | Accuracy (%) | AP (%) | AP50 (%) | AP75 (%) | APmask (%) | # Params CLS (M) | # Params OD & IS (M) |
|---|---|---|---|---|---|---|---|---|
| ViT-Small∗ | Dense ViT for CLS | 73.00 | N.A. | N.A. | N.A. | N.A. | 11.10 | N.A. |
| ViT-Small∗ | Dense ViT for OD & IS | N.A. | 39.75 | 61.71 | 42.77 | 36.10 | N.A. | 13.22 |
| ViT-Small∗ | Dense MTL-ViT | 68.30 | 36.35 | 58.79 | 38.86 | 34.01 | 13.67 | 13.67 |
| ViT-Small∗ | Large Dense MTL-ViT | 70.32 | 37.74 | 60.27 | 40.58 | 34.97 | 20.41 | 20.41 |
| ViT-Small∗ | Dense TAPS | 69.32 | 36.66 | 58.97 | 38.55 | 34.94 | 16.62 | 18.25 |
| ViT-Small∗ | Sparse AdaMV-MoE (Ours) | 72.99 | 39.04 | 61.16 | 42.43 | 35.76 | 16.33 | 19.00 |
| ViT-Small | Dense MTL-ViT | 69.34 | 41.43 | 63.45 | 45.13 | 37.25 | 24.32 | 24.32 |
| ViT-Small | Large Dense MTL-ViT | 71.87 | 42.07 | 64.48 | 45.66 | 38.07 | 37.03 | 37.03 |
| ViT-Small | Dense TAPS | 74.63 | 37.38 | 60.15 | 39.89 | 34.74 | 27.86 | 30.22 |
| ViT-Small | Sparse AdaMV-MoE (Ours) | 78.41 | 42.16 | 64.33 | 45.73 | 38.12 | 29.65 | 34.97 |
| ViT-Base | Dense MTL-ViT | 74.18 | 42.63 | 64.31 | 46.53 | 38.30 | 91.10 | 91.10 |
| ViT-Base | Large Dense MTL-ViT | 74.41 | 42.47 | 64.19 | 46.12 | 38.23 | 123.87 | 123.87 |
| ViT-Base | Dense TAPS | 78.45 | 42.51 | 65.28 | 45.87 | 38.32 | 105.26 | 108.40 |
| ViT-Base | Sparse AdaMV-MoE (Ours) | 78.59 | 42.70 | 65.12 | 46.05 | 38.49 | 112.37 | 123.00 |
| UViT-Base | Dense MTL-UViT | 72.26 | 43.01 | 64.94 | 46.92 | 38.67 | 34.96 | 34.96 |
| UViT-Base | Large Dense MTL-UViT | 72.99 | 43.27 | 64.79 | 47.21 | 38.62 | 53.66 | 53.66 |
| UViT-Base | Dense TAPS | 77.23 | 40.58 | 63.41 | 43.72 | 36.94 | 39.68 | 41.45 |
| UViT-Base | Sparse AdaMV-MoE (Ours) | 79.65 | 44.14 | 65.54 | 48.17 | 39.51 | 42.95 | 50.94 |
Dataset and Task. We examine our methods on the ImageNet [24] and MS COCO 2017 [47] datasets, for classification and detection & segmentation tasks, respectively. ImageNet contains 1.28M training images and 50K testing images of 1,000 classes, while MS COCO 2017 has 118K training images and 5K validation images. The input resolution is 224 × 224 for classification and 640 × 640 for object detection & instance segmentation.

Baselines. To support the effectiveness of our proposals, we consider three groups of comparison baselines: (1) Dense ViTs for single-task learning (STL), i.e., ViT for CLS and ViT for OD & IS. (2) Dense ViTs for multi-task learning, i.e., MTL-ViT, which shares the full feature extractor with task-specific heads attached. Large Dense denotes a strengthened baseline with a larger hidden dimension and more parameters, as shown in Appendix A1. (3) TAPS [71], a recent state-of-the-art multi-task approach that advocates task-adaptive parameter sharing.

Training and Evaluation Details. The single-task learning baselines are trained with batch sizes of 1,024 and 256 for classification and object detection & instance segmentation, respectively. For MTL training, the batch sizes for the two tasks are 1,024 and 128, respectively. During training, data augmentations are applied for both tasks. For classification, we use CutMix [83] and MixUp [65]. As for detection and segmentation, random scaling augmentations are employed to enhance the input samples.

Our ViTs are optimized with AdamW [50], weight decays of {6 × 10−3, 1 × 10−4, 5 × 10−4}, an initial learning rate (LR) of 3 × 10−3, {20, 2, 10}K iterations of warm-up, and a cosine LR decay schedule for {CLS, OD & IS, MTL}, respectively. Multiple loss functions are involved in model training, e.g., a cross-entropy loss for classification, as well as the {class, box, mask} losses from Mask-RCNN for detection and segmentation. The default hyperparameters of [16] are inherited in our cases. As for AdaMV-MoE, we add two auxiliary loss terms, the importance and loading regularizations [58], for router network learning. The coefficients of these two losses are set to 5 × 10−3 [58]. The value of Δn is set to 2000 when applying AES, and 1% of the training samples are randomly held out to construct a validation set Dval for AES. For TAPS and AdaMV-MoE, we first train the network to solve the classification task for 300K steps, and then simultaneously train on the two tasks (CLS and OD & IS) for an additional 200K steps. {Dense, Large Dense} models are trained for 200K iterations. Ablation studies on the training steps are in Appendix A2. Each experiment uses 16 ∼ 64 TPU-v3 for training and 8 for inference.

To evaluate the performance of trained ViTs, we report the test accuracy for classification, the validation {AP, AP50, AP75} for object detection [16], and the validation APmask for instance segmentation [16]. Additionally, the number of activated parameters (in millions) is calculated to imply the model capacity used for each task.
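For illustration, a minimal sketch of the importance regularization introduced in [62] and adopted by [58]: the squared coefficient of variation of the per-expert gate mass (the loading term is computed analogously from selection probabilities). This is our paraphrase, not the exact implementation:

```python
import torch

def importance_loss(gate_probs: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """gate_probs: (num_tokens, num_experts) softmax outputs of a router.
    Squared coefficient of variation of the per-expert probability mass;
    small when all experts receive similar total gate weight."""
    importance = gate_probs.sum(dim=0)                      # mass per expert
    return importance.var() / (importance.mean() ** 2 + eps)

# Sketched usage: loss = task_loss + 5e-3 * (importance_loss(p) + loading_loss)
```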
Figure 3. The routing specialization of AdaMV-MoE at the fine-grained patch level. Upper shows the routing decisions of classification with ImageNet samples; Bottom presents the routing decisions of object detection and instance segmentation with COCO samples. Here we only visualize the top-2 selected experts, whose indexes are indicated by the color of the patch's boundary and content, respectively.
4.2. Superior Multi-Task Vision Recognition Performance of AdaMV-MoE

Comparisons with STL and MTL Approaches. We choose the ViT-Small∗/Small/Base and UViT-Base network architectures, considering their vanilla (Dense), widened (Large Dense), and SMoE (Sparse) variants. All methods are examined on the benchmark of ImageNet classification and COCO object detection & instance segmentation. The comparison results are collected in Table 1, where the following observations can be drawn: ❶ Our AdaMV-MoE demonstrates great advantages, with a clear performance margin compared to the MTL baselines with a shared ViT feature extractor, i.e., (Dense, Large Dense) MTL-ViT. In detail, AdaMV-MoE obtains {(4.69%, 2.67%), (2.69%, 1.30%), (1.75%, 0.79%)}, {(9.06%, 6.54%), (0.73%, 0.09%), (0.87%, 0.05%)}, {(4.41%, 4.18%), (0.07%, 0.23%), (0.19%, 0.26%)}, and {(7.39%, 6.66%), (1.13%, 0.87%), (0.84%, 0.89%)} of {Accuracy (%), AP (%), APmask (%)} improvements for ViT-Small∗/Small/Base and UViT-Base, respectively. This validates the effectiveness of our proposals. ❷ AdaMV-MoE adaptively allocates adequate network capacity to resolve the classification, detection, and segmentation tasks by activating different amounts of model parameters. For instance, our proposal spends a smaller parameter count on CLS and a larger parameter budget on the challenging OD & IS tasks, e.g., 29.65M versus 34.97M in the case of ViT-Small, which aligns with our intuition. ❸ In addition, AdaMV-MoE consistently surpasses a recent SOTA MTL approach, i.e., TAPS [71], by {0.14% ∼ 3.78% Accuracy, 0.29% ∼ 4.78% AP, 0.17% ∼ 3.38% APmask} on the ImageNet and COCO datasets across four ViT backbones. Meanwhile, with ViT-Small∗, it reaches competitive results compared to the single-task learning baselines, further showing the superiority of our algorithms.
Ablation Study of AdaMV-MoE. To investigate the contribution of each component in AdaMV-MoE, comprehensive experiments are conducted with ViT-Small∗ on multi-task vision recognition. As shown in Table 3 and Table 4, we conduct ablations on the router design, the need for adaptive network capacity during MTL, and the number of experts when employing AdaMV-MoE.

Table 3. Ablation studies on AdaMV-MoE of i) router selection, i.e., task-agnostic R vs. task-dependent R; ii) # used experts, i.e., activating a fixed vs. adaptive number of experts. "Ours w. task-dependent R" and "Ours w. AES" present the same variant, which is also the one used to produce the main results in Table 1.

| Settings | Classification Accuracy (%) | Detection AP (%) | Segmentation APmask (%) |
|---|---|---|---|
| MTL-ViT | 68.30 | 36.35 | 34.01 |
| MTL-MoE [46] | 72.07 | 38.53 | 35.24 |
| Ours w. task-agnostic R | 72.56 | 37.54 | 34.71 |
| Ours w. task-dependent R | 72.99 | 39.04 | 35.76 |
| Ours w.o. AES | 72.04 | 38.61 | 35.23 |
| Ours w. AES | 72.99 | 39.04 | 35.76 |

Table 4. Ablation studies on the # total experts (E) of our proposed AdaMV-MoE. MTL-ViT is the baseline that takes a ViT as a shared backbone with multiple heads for different tasks. The backbone size of MTL-ViT is equal to that of AdaMV-MoE with E = 1.

| Settings | Classification Accuracy (%) | Detection AP (%) | Segmentation APmask (%) |
|---|---|---|---|
| MTL-ViT | 68.30 | 36.35 | 34.01 |
| AdaMV-MoE w. E = 4 | 71.74 | 36.35 | 34.01 |
| AdaMV-MoE w. E = 8 | 72.99 | 39.04 | 35.76 |
| AdaMV-MoE w. E = 16 | 72.69 | 36.99 | 34.05 |
| AdaMV-MoE w. E = 32 | 72.66 | 36.30 | 33.37 |

▷ Task-agnostic versus task-dependent routers R. The results in Table 3 show that task-dependent routing policies bring larger benefits than their task-agnostic counterpart, enlarging the performance gains over the MTL-ViT baseline.

▷ With or without adaptive expert selection (AES). Equipped with AES, the activated model size is optimized for different tasks, which significantly boosts MTL performance. To be specific, ours w. AES outperforms its variant w.o. AES by {0.95% Accuracy, 0.43% AP, 0.53% APmask}, and the recently proposed MTL-MoE [46] by {0.92% Accuracy, 0.51% AP, 0.52% APmask}, which evidences the necessity of a customized network capacity for each task. Note that both ours w.o. AES and MTL-MoE [46] adopt a fixed and unified model size (or # selected experts) across all vision tasks, which potentially incurs inferior results.

▷ The number of experts. Being one of the most important hyperparameters in an SMoE design, E roughly reflects the size of the overall parameter space that allows exploration via dynamic routing. Table 4 reports AdaMV-MoE's results with {1, 4, 8, 16, 32} experts, where AdaMV-MoE degrades into MTL-ViT when setting E = 1. We find that the performance of AdaMV-MoE saturates with more than 8 experts, and E = 8 seems to be a "sweet point" in our multi-task vision recognition benchmark.

4.3. In-Depth Dissection of AdaMV-MoE

Given the superiority of our AdaMV-MoE, we further offer an in-depth dissection by studying its i) specialization, ii) routing quality, iii) adequate positions to introduce SMoE layers, and iv) mitigation effects on gradient conflicts from multiple training objectives.

Figure 4. Analysis of the routing specialization at the task (Left) and class (Right) levels. The frequency of expert usage and the class-wise usage of classification are recorded in the Left and Right figures, respectively. Visualizations are produced by AdaMV-MoE with ViT-Small. More qualitative results are in Appendix A2.

Q1: Is the expert selection specialized to different tasks, classes, and image contents? Yes. One key advantage of AdaMV-MoE is that it optimizes how many (i.e., adaptive network capacity) and which (i.e., dynamic routing) experts to activate for each task and input sample during MTL. We examine three levels of routing specialization, from coarse to fine-grained: the task, class, and patch levels.

▷ Task-level specialization. From Figure 4 (Left), we find that ❶ there is an overall balanced loading across experts, suggesting a sufficient utilization of all model parameters; and ❷ relatively, CLS prefers experts 1 & 6, while OD & IS use experts 1, 3, & 7 more, according to the frequency.

▷ Class-level specialization. Based on Figure 4 (Right), which presents the class-wise expert usage of the last SMoE layer in AdaMV-MoE for classification, we observe that expert 6 is preferred by most classes and that the other expert selections seem to correlate with class types, which coincides with the findings in [58]. Similar observations also exist for OD & IS, as shown in Appendix A2.

▷ Patch-level specialization. In Figure 3, we visualize the expert assignments of AdaMV-MoE w. ViT-Small for each input patch. Specifically, the top-1 and top-2 activated experts are implied by the color of the patch's boundary and content, respectively, where different colors represent different experts. ❶ For classification, most patches from the background are assigned to two specific experts, associated with the black patch boundary and red patch content, while varied experts are leveraged to deal with the main object in the foreground.
❷ As for object detection and instance segmentation, a clear patch-wise specialization is presented. For example, different image contents, like the object boundary, the main body of objects, and the background, are processed by distinctive and particular subsets of experts. In this way, the task is divided and conquered.
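As an aside, the usage statistics behind these analyses (Figure 4) can be gathered with a simple counter over routing decisions; a hypothetical sketch:

```python
import torch

@torch.no_grad()
def expert_usage(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) from one SMoE layer.
    Returns the fraction of top-k routing slots each expert receives;
    group tokens by task (or by class label) before calling to reproduce
    the task- and class-level statistics."""
    num_experts = router_logits.shape[-1]
    topk_idx = router_logits.topk(k, dim=-1).indices        # (num_tokens, k)
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts)
    return counts.float() / counts.sum()
```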
Figure 5. Analysis of representation collapse for the hidden states from router networks. The diversity of these hidden states is calculated with Gaussian kernel density estimation and then visualized as circle heatmaps. Darker areas have more concentrated features. A more uniformly distributed circle heatmap means more balanced expert usage and a lower risk of representation collapse. Results are produced by AdaMV-MoE with ViT-Small.

Q3: Where should SMoE layers be introduced? We study at which positions to replace the original ViT layer with an SMoE layer. We compare different design choices, such as adopting SMoE in the Early, Middle, Later, and Every Two layers, where each AdaMV-MoE variant has half ViT and half SMoE layers. The results in Table 5 reveal that enforcing SMoE only in the early layers incurs inferior MTL performance. A possible reason is that early layers are usually responsible for learning common features like basic shapes or colors, which should be shared across classes during vision recognition tasks.

Table 5. Ablation studies on the positions of the introduced SMoE layers. Results are produced by AdaMV-MoE with ViT-Small∗.

| Settings of AdaMV-MoE | Classification Accuracy (%) | Detection AP (%) | Segmentation APmask (%) |
|---|---|---|---|
| Early Layers | 69.38 | 37.76 | 34.67 |
| Middle Layers | 72.67 | 38.49 | 35.04 |
| Later Layers | 73.19 | 38.00 | 34.92 |
| Every Two Layers | 72.99 | 39.04 | 35.76 |

Q4: Does AdaMV-MoE alleviate the issue of gradient conflicts from diverse tasks? Yes. First, AdaMV-MoE naturally disentangles the parameter spaces for different tasks thanks to its sparse and conditional computing manner. Second, as shown in Figure 7, for the parameters common to all tasks, the gradient conflicts are generally reduced by our proposals, e.g., less negative and more positive cosine distances between the training gradients from CLS and OD.
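The conflict measurement in Figure 7 can be probed, in spirit, by comparing the per-task gradients on the shared parameters; a sketch with illustrative names:

```python
import torch
import torch.nn.functional as F

def task_gradient_cosine(loss_cls, loss_od, shared_params):
    """Cosine similarity between the CLS and OD gradients on parameters
    shared by all tasks; negative values indicate conflicting gradients."""
    g_cls = torch.autograd.grad(loss_cls, shared_params,
                                retain_graph=True, allow_unused=True)
    g_od = torch.autograd.grad(loss_od, shared_params, allow_unused=True)
    pairs = [(a.flatten(), b.flatten())
             for a, b in zip(g_cls, g_od) if a is not None and b is not None]
    v_cls = torch.cat([a for a, _ in pairs])
    v_od = torch.cat([b for _, b in pairs])
    return F.cosine_similarity(v_cls, v_od, dim=0)
```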
[Figure 7: cosine distance between CLS and OD training gradients at Layer 1 and Layer 5, Ours vs. Baseline.]
References

[1] Alhabib Abbas and Yiannis Andreopoulos. Biased mixtures of experts: Enabling computer vision inference under data transfer limitations. IEEE Transactions on Image Processing, 29:7656–7667, 2020.

[2] Karim Ahmed, Mohammad Haris Baig, and Lorenzo Torresani. Network of experts for large-scale image categorization. In European Conference on Computer Vision, pages 516–532. Springer, 2016.

[3] Raquel Aoki, Frederick Tung, and Gabriel L Oliveira. Heterogeneous multi-task learning with expert diversity. arXiv preprint arXiv:2106.10595, 2021.

[4] BJ Bakker and TM Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 2003.

[5] Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, and Dmitry Kislyuk. Billion-scale pretraining with vision transformers for multi-task visual representations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 564–573, 2022.

[6] Rodrigo Berriel, Stephane Lathuillere, Moin Nabi, Tassilo Klein, Thiago Oliveira-Santos, Nicu Sebe, and Elisa Ricci. Budget-aware adapters for multi-domain learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 382–391, 2019.

[7] Deblina Bhattacharjee, Tong Zhang, Sabine Süsstrunk, and Mathieu Salzmann. MulT: An end-to-end multitask learning transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12031–12041, 2022.

[8] Valerio Biscione and Jeffrey Bowers. Learning translation invariance in CNNs. arXiv preprint arXiv:2011.11757, 2020.

[9] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1483–1498, 2019.

[10] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.

[11] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.

[12] Ke Chen, Lei Xu, and Huisheng Chi. Improved learning algorithms for mixture of experts in multiclass classification. Neural Networks, 12(9):1229–1252, 1999.

[13] Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277, 2022.

[14] Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, and Zhangyang Wang. The principle of diversity: Training stronger vision transformers calls for reducing all levels of redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12020–12030, 2022.

[15] Tianlong Chen, Zhenyu Zhang, Ajay Kumar Jaiswal, Shiwei Liu, and Zhangyang Wang. Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers. In The Eleventh International Conference on Learning Representations, 2023.

[16] Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, et al. A simple single-scale vision transformer for object localization and instance segmentation. arXiv preprint arXiv:2112.09747, 2021.

[17] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR, 2018.

[18] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, and Thomas Huang. Revisiting RCNN: On awakening the classification power of Faster RCNN. In Proceedings of the European Conference on Computer Vision (ECCV), pages 453–468, 2018.

[19] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, and Furu Wei. On the representation collapse of sparse mixture of experts. arXiv preprint arXiv:2204.09179, 2022.

[20] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.

[21] Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. arXiv preprint arXiv:2204.08396, 2022.

[22] Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, and Shuming Shi. One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code. arXiv preprint arXiv:2205.06126, 2022.

[23] Hal Daumé III. Bayesian multitask learning with latent hierarchies. arXiv preprint arXiv:0907.0783, 2009.

[24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[25] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[26] David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.

[27] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
[28] Yuan Gao, Jiayi Ma, Mingbo Zhao, Wei Liu, and Alan L Yuille. NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3205–3214, 2019.

[29] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.

[30] Sam Gross, Marc'Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6865–6873, 2017.

[31] Shashank Gupta, Subhabrata Mukherjee, Krishan Subudhi, Eduardo Gonzalez, Damien Jose, Ahmed H Awadallah, and Jianfeng Gao. Sparsely activated mixture-of-experts are robust multi-task learners. arXiv preprint arXiv:2204.07689, 2022.

[32] Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder, Lichan Hong, and Ed Chi. DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. Advances in Neural Information Processing Systems, 34, 2021.

[33] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4918–4927, 2019.

[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[35] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

[36] Laurent Jacob, Jean-Philippe Vert, and Francis Bach. Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems, 21, 2008.

[37] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

[38] Hao Jiang, Ke Zhan, Jianwei Qu, Yongkang Wu, Zhaoye Fei, Xinyu Zhang, Lei Chen, Zhicheng Dou, Xipeng Qiu, Zikai Guo, et al. Towards more effective and economic sparsely-activated model. arXiv preprint arXiv:2110.07431, 2021.

[39] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

[40] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[41] Abhishek Kumar and Hal Daume III. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.

[42] Dong-Gyu Lee. Fast drivable areas estimation with multi-task learning for real-time autonomous driving assistant. Applied Sciences, 11(22):10713, 2021.

[43] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning, pages 489–496, 2007.

[44] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

[45] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, pages 6265–6274. PMLR, 2021.

[46] Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Yu Cheng, Cong Hao, and Zhangyang Wang. M³ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. In 36th Annual Conference on Neural Information Processing Systems, 2022.

[47] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[48] Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1871–1880, 2019.

[49] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[50] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.

[51] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1930–1939, 2018.

[52] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.

[53] Sarthak Mittal, Yoshua Bengio, and Guillaume Lajoie. Is a modular architecture enough? arXiv preprint arXiv:2206.02713, 2022.

[54] Eslam Mohamed and Ahmad El Sallab. Spatio-temporal multi-task learning transformer for joint moving object detection and segmentation. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 1470–1475. IEEE, 2021.

[55] Pedro Morgado and Nuno Vasconcelos. NetTailor: Tuning the architecture, not just the weights. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3044–3054, 2019.
[56] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.

[57] Svetlana Pavlitskaya, Christian Hubschneider, Michael Weber, Ruby Moritz, Fabian Huger, Peter Schlicht, and Marius Zollner. Using mixture of expert models to gain insights into semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 342–343, 2020.

[58] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34, 2021.

[59] Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason E Weston. Hash layers for large sparse models. In Advances in Neural Information Processing Systems, 2021.

[60] Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Latent multi-task architecture learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4822–4829, 2019.

[61] Hongje Seong, Junhyuk Hyun, and Euntai Kim. Video multitask transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

[62] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[63] Trevor Standley, Amir Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Which tasks should be learned together in multi-task learning? In International Conference on Machine Learning, pages 9120–9132. PMLR, 2020.

[64] Charles Sun, Jedrzej Orbik, Coline Manon Devin, Brian H. Yang, Abhishek Gupta, Glen Berseth, and Sergey Levine. Fully autonomous real-world reinforcement learning with applications to mobile manipulation. In CoRL, volume 164 of Proceedings of Machine Learning Research, pages 308–319. PMLR, 2021.

[65] Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip S Yu, and Lifang He. Mixup-transformer: Dynamic data augmentation for NLP tasks. arXiv preprint arXiv:2010.02394, 2020.

[66] Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. AdaShare: Learning what to share for efficient deep multi-task learning. Advances in Neural Information Processing Systems, 33:8728–8740, 2020.

[67] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. Advances in Neural Information Processing Systems, 26, 2013.

[68] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pages 10347–10357, July 2021.

[69] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.

[70] Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. In European Conference on Computer Vision, pages 527–543. Springer, 2020.

[71] Matthew Wallingford, Hao Li, Alessandro Achille, Avinash Ravichandran, Charless Fowlkes, Rahul Bhotika, and Stefano Soatto. Task adaptive parameter sharing for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7561–7570, 2022.

[72] Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12695–12705, 2020.

[73] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X Yu. Long-tailed recognition by routing diverse distribution-aware experts. arXiv preprint arXiv:2010.01809, 2020.

[74] Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E Gonzalez. Deep mixture of experts via shallow embedding. In Uncertainty in Artificial Intelligence, pages 552–562. PMLR, 2020.

[75] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22–31, 2021.

[76] Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures. Advances in Neural Information Processing Systems, 32, 2019.

[77] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 675–684, 2018.

[78] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with dirichlet process priors. Journal of Machine Learning Research, 8(1), 2007.

[79] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems, 32, 2019.

[80] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning, pages 1012–1019, 2005.

[81] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
[82] Seniha Esen Yuksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012.

[83] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.

[84] Amir R Zamir, Alexander Sax, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, and Leonidas J Guibas. Robust learning through cross-task consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11197–11206, 2020.

[85] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.

[86] Zhenyu Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, and Jian Yang. Joint task-recursive learning for semantic segmentation and depth estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 235–251, 2018.

[87] Zhenyu Zhang, Zhen Cui, Chunyan Xu, Yan Yan, Nicu Sebe, and Jian Yang. Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4106–4115, 2019.

[88] Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. MoEfication: Conditional computation of transformer models for efficient inference. arXiv preprint arXiv:2110.01786, 2021.

[89] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems, 24, 2011.

[90] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Ran Yi, Shouhong Ding, and Lizhuang Ma. Adaptive mixture of experts learning for generalizable face anti-spoofing. In Proceedings of the 30th ACM International Conference on Multimedia, pages 6009–6018, 2022.

[91] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and James Laudon. Mixture-of-experts with expert choice routing. arXiv preprint arXiv:2202.09368, 2022.

[92] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Designing effective sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

[93] Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, and Tuo Zhao. Taming sparsely activated transformer with stochastic experts. In International Conference on Learning Representations, 2022.