arXiv:2408.10174v2 [cs.LG] 26 Aug 2024

SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction

A Preprint

Anke Tang1, Li Shen2, Yong Luo1, Shuai Xie3, Han Hu4, Lefei Zhang1, Bo Du1, Dacheng Tao5
1 Wuhan University, 2 Sun Yat-sen University, 3 JD Explore Academy,
4 Beijing Institute of Technology, 5 Nanyang Technological University
1 {anketang,luoyong,zhanglefei,dubo}@whu.edu.cn, 2 mathshenli@gmail.com,
3 xieshuai@jd.com, 4 hhu@bit.edu.cn, 5 dacheng.tao@ntu.edu.sg
Abstract
Deep model training on extensive datasets is increasingly becoming cost-prohibitive, prompting the widespread adoption of deep model fusion techniques to leverage knowledge from pre-existing models. From simple weight averaging to more sophisticated methods like AdaMerging, model fusion effectively improves model performance and accelerates the development of new models. However, potential interference between the parameters of individual models and the lack of interpretability in the fusion process remain significant challenges. Existing methods often try to resolve the parameter interference issue by evaluating attributes of the parameters, such as their magnitude or sign, or by parameter pruning. In this study, we begin by examining the fine-tuning of linear layers through the lens of subspace analysis and explicitly define parameter interference as an optimization problem to shed light on this subject. Subsequently, we introduce an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which allows for the upscaling of source models into an MoE model without extra data or further training. Our approach relies on the observation that fine-tuning mostly preserves the important parts of the pre-training and instead uses less significant or previously unused dimensions to adapt to new tasks. Moreover, the parameter interference problem, which is intrinsically intractable in the original parameter space, can be managed by expanding the dimensionality. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning, and we apply our method to models ranging from CLIP and Flan-T5 to Mistral-7B large language models, highlighting the adaptability and scalability of SMILE. For fully fine-tuned models, about 50% additional parameters achieve around 98-99% of the performance of eight individual fine-tuned ViT models, while for LoRA fine-tuned Flan-T5 models, 99% of the individual performance is maintained with only 2% extra parameters. Code is available at https://github.com/tanganke/fusion_bench.
Keywords Mixture of Experts · Model Fusion · Subspace Decomposition · Large Language Model
1 Introduction
In recent years, the field of deep learning has witnessed an exponential growth in model sizes and dataset scales,
making the training of large-scale deep models on extensive datasets increasingly cost-prohibitive, both in terms of
financial resources and environmental impact [Minaee et al., 2024, Hadi et al., 2023]. Deep model fusion techniques
have emerged as a promising solution, allowing the integration of knowledge from pre-existing models without the
need for extensive retraining [Li et al., 2023, Zheng et al., 2023, Yang et al., 2024a]. This approach not only reduces
computational costs but also enables the creation of more robust and versatile models by combining the strengths of
multiple models.
Following the categorization in Tang et al. [2024a], we classify model fusion methods into three main categories: model
ensemble methods, model merging methods, and model mixing methods. Model ensemble techniques aggregate the
predictions from several models to enhance performance [Sagi and Rokach, 2018]. While resource-intensive in terms of memory and computation, ensembling can also improve knowledge distillation training [Wan et al., 2024a,b]. Model merging methods,
on the other hand, combine the parameters of multiple models into a single model, often through weighted averaging
or parameter alignment [Matena and Raffel, 2022, Jin et al., 2022]. Model mixing methods involve the integration of
multiple models through gating mechanisms or depth concatenation, allowing for more flexible and adaptive fusion
strategies [Komatsuzaki et al., 2022, Kim et al., 2023]. These methods are particularly effective in multi-task learning
scenarios, where the merged model can simultaneously perform multiple tasks.
However, despite the promising advancements in model fusion, several critical challenges persist, hindering the full
realization of its potential. A primary concern is the potential interference between parameters of different models,
which leads to suboptimal performance. Additionally, the lack of interpretability in the fusion process remains a
significant hurdle, as current insights are largely confined to heuristic observations or simplified assumptions, such
as linear mode connectivity, parameter signs or importance [Ainsworth et al., 2022, Stoica et al., 2023, Yadav et al.,
2023, Yu et al., 2024]. Understanding how parameters are merged is crucial for building trust in the merged models and
for further improving fusion techniques. These challenges are particularly pronounced in complex, high-dimensional,
non-linear model architectures, where the interactions between parameters can be extremely intricate and non-intuitive.
Instead of relying on heuristic methods or simplified assumptions, we propose a novel subspace perspective on understanding and addressing the parameter interference problem in this study. We first examine the fine-tuning process in linear layers through the lens of subspace analysis using matrix decomposition in Section 2. This allows us to decompose the prediction of a fine-tuned model into distinct components, encompassing the pre-trained knowledge and the task-specific adaptation, and provides insights into how models adapt to downstream tasks while preserving pre-trained knowledge. Drawing on these experimental observations to build a more comprehensive understanding of fine-tuning, we further formulate parameter interference as an optimization problem in Section 3, providing a more rigorous and measurable perspective.

Based on our insights, we introduce an innovative approach called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which upscales existing source models into a more versatile MoE model. The zero-shot aspect of our approach is particularly noteworthy, as it facilitates the immediate deployment of fused models in new environments or tasks, drastically minimizing the time and resources typically required for model adaptation.

Figure 1: Multi-task model fusion experiment on eight image classification tasks using CLIP-ViT-B/32 models. Here we set kgate = 16 and k is varied from 4 to 128 to investigate the trade-off between performance and model size. The plot shows average accuracy versus the normalized number of parameters for the pre-trained model, Simple Average, Task Arithmetic, Ties-Merging, Fisher Merging, RegMean, Layer-Wise AdaMerging, Weight-Ensembling MoE, the individual models, and ours.
The effectiveness of our proposed method is rooted in two key obser-
vations derived from our subspace analysis. Firstly, we found that the fine-tuning largely preserves the most important
pre-trained weights and primarily utilizes less significant or previously unused dimensions of the parameter space to
adapt to new tasks. This preservation ensures that the critical pre-training knowledge encoded in the original models is
not lost during fine-tuning and implies that the parameter subspace required to accommodate new knowledge may vary
from task to task. Secondly, we found that while parameter interference is inherently difficult to address in the original
parameter space, it becomes more manageable when we increase the model’s dimensionality. This expansion creates
additional ‘room’ for task-specific parameter updates to coexist without mutual interference.
We conducted extensive experiments across various tasks and models in both the vision and language domains, utilizing
traditional full fine-tuning as well as Low-Rank Adaptation (LoRA) [Hu et al., 2021]. The results show that for models
that undergo full fine-tuning, adding approximately 50% more parameters allows us to achieve around 98-99% of
the performance of eight individual fine-tuned models. In the case of LoRA fine-tuned models, maintaining 99% of
the individual performance requires only a 2% increase in parameters. This method also offers trade-offs between
performance and model size, as illustrated in Figure 1, where we vary the rank k of local experts.
To summarize, our contributions in this study are as follows:
• We provide a novel subspace perspective on the fine-tuning process, shedding light on how models adapt
to new tasks while preserving pre-trained knowledge. In addition, we formulate the parameter interference
problem as an optimization problem, providing a more rigorous and measurable perspective on this issue.
• We introduce a zero-shot Sparse Mixture of Low-Rank Experts (SMILE) construction approach, enabling the
fusion of existing models into more unified and versatile SMILE models. We also discuss the complexity of our
method, highlighting its potential for broader applications in deep learning research and practice.
• We demonstrate the effectiveness of our method through extensive experiments on a variety of tasks and setups,
showcasing its superior performance and efficiency compared to existing model fusion techniques.
Proof 1 For simplicity, let $x_{ij} = u_i v_j^T$. The Frobenius inner product of two matrices $x_{ab}$ and $x_{cd}$ is defined as
$$\langle x_{ab}, x_{cd} \rangle = \mathrm{tr}\left( u_a v_b^T (u_c v_d^T)^T \right) = \mathrm{tr}\left( u_a v_b^T v_d u_c^T \right) \in \mathbb{R}. \tag{1}$$
Since this holds for any $a$ and $b$, we conclude that all $\alpha_{ij} = 0$. This contradicts the assumption that $\alpha$ is nonzero. Therefore, the set $\{x_{ij}\}_{i,j=1}^{r}$ is linearly independent, which is the necessary and sufficient condition for a set of elements to form a basis for a vector space of dimension $pq$.
We start by decomposing the weight matrix $W$ using the reduced SVD as $W = U_r \Sigma_r V_r^T$, where $U_r \in \mathbb{R}^{m \times r}$ and $V_r \in \mathbb{R}^{n \times r}$ are matrices with orthonormal columns containing the left and right singular vectors, respectively, $\Sigma_r \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing the singular values sorted in descending order, and $r$ is the rank of the matrix $W$ [Olver and Shakiban, 2018]. In the case of the full SVD, the matrices are $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, and $V \in \mathbb{R}^{n \times n}$, which preserve all information about the matrix $W$, including its kernel (null space) and cokernel (left null space), as shown in Figure 2a.
Remark 1 According to Theorem 1, the set of matrices $\{u_i v_j^T \mid i \in [m], j \in [n]\}$ forms an orthonormal basis for a subspace of $\mathbb{R}^{m \times n}$ with dimension $mn$. In other words, for any real matrix $A \in \mathbb{R}^{m \times n}$, we can express it as a weighted sum of the elements in the basis, i.e. $A = \sum_{i=1}^{m} \sum_{j=1}^{n} \langle A, u_i v_j^T \rangle u_i v_j^T \in \mathrm{span}(\{u_i v_j^T\}_{i,j}^{m,n})$.
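As a quick numerical check of Remark 1, the following minimal sketch (assuming PyTorch; the dimensions are arbitrary illustrative values) reconstructs an arbitrary matrix A from its projection coefficients onto the rank-one basis {u_i v_j^T} built from the full SVD of W:

```python
import torch

m, n = 6, 4
W = torch.randn(m, n)
A = torch.randn(m, n)  # arbitrary matrix to expand in the basis {u_i v_j^T}

# Full SVD of W gives orthonormal bases U (for R^m) and V (for R^n).
U, S, Vh = torch.linalg.svd(W, full_matrices=True)  # U: m x m, Vh: n x n
V = Vh.T

# Projection coefficients <A, u_i v_j^T> for all i, j at once: U^T A V.
coeffs = U.T @ A @ V  # m x n

# Reconstruct A as the weighted sum of the rank-one basis elements u_i v_j^T.
A_rec = U @ coeffs @ V.T
assert torch.allclose(A, A_rec, atol=1e-5)
```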
Figure 2: Here we show the SVD decomposition and subspace partition of the singular value matrix $\Sigma$, and the accuracy comparison of different subspace projection strategies discussed in Section 2. (a) Full SVD and subspace partition (zones I, II, III; kernel and cokernel). (b) Accuracy comparison across different subspace projection strategies on SUN397, Cars, RESISC45, EuroSAT, SVHN, GTSRB, MNIST, and DTD.

To gain insights into how fine-tuning modifies the pre-trained weights to adapt them to a specific task, we assume the fine-tuned linear layer accepts an input $x \in \mathbb{R}^n$ and outputs $y = W_{ft} x + b_{ft}$. Because the row space $\{v_i\}_{i=1}^{n}$ is an orthonormal basis for $\mathbb{R}^n$, we can decompose $x$ as $x = \sum_{i=1}^{n} \langle x, v_i \rangle v_i$, where $\langle \cdot, \cdot \rangle$ denotes the vector inner product. On the other hand, $W_{ft}$ and $b_{ft}$ are updated from the pre-trained weights $W$ and $b$. We can derive the following equation:
$$y = W_{ft} x + b_{ft} = (W + \Delta W)x + b + \Delta b = \underbrace{Wx + b}_{\text{pre-trained part}} + \underbrace{\Delta W x + \Delta b}_{\text{fine-tuned part}}. \tag{4}$$
Now we expand the pre-trained part and fine-tuned part in Eq. (4) separately as follows:
$$\text{pre-trained part} = W \sum_{i=1}^{n} \langle x, v_i \rangle v_i + b = \sum_{i=1}^{n} \sum_{j=1}^{r} \sigma_j u_j v_j^\top \langle x, v_i \rangle v_i + b \tag{5}$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{r} \sigma_j \langle x, v_i \rangle u_j v_j^\top v_i + b = \sum_{j=1}^{r} \sigma_j \langle x, v_j \rangle u_j + b, \tag{6}$$
$$\text{fine-tuned part} = (W_{ft} - W) \sum_{i=1}^{n} \langle x, v_i \rangle v_i + \Delta b = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk} u_j v_k^\top \langle x, v_i \rangle v_i + \Delta b \tag{7}$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk} \langle x, v_i \rangle u_j v_k^\top v_i + \Delta b = \sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk} \langle x, v_k \rangle u_j + \Delta b. \tag{8}$$
where $\delta_{jk} = \langle \Delta W, u_j v_k^\top \rangle = (U^\top \Delta W V)_{jk}$ is the Frobenius inner product between the fine-tuned weight update $\Delta W$ and the rank-one matrix $u_j v_k^\top$. It also quantifies how much the weight update aligns with the direction specified by $u_j v_k^\top$ and indicates which input-output transformation is enhanced or suppressed (or enhanced in the reverse direction) during fine-tuning, based on its sign and magnitude. For example, a large positive $\delta_{jk}$ suggests that the connection between the $k$-th input direction ($v_k$) and the $j$-th output direction ($u_j$) is strengthened for the downstream task. This decomposition shows how differently the pre-trained and fine-tuned parts contribute to the output. So far, we only understand that the fine-tuned update $\Delta W$ potentially uses all $mn$ dimensions, while the pre-trained part only uses $r$ dimensions.
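The decomposition in Eqs. (4)-(8) can be verified numerically. The sketch below (assuming PyTorch, with a random stand-in for the fine-tuning update ∆W) checks that the fine-tuned part of the output equals the double sum of δ_{jk}⟨x, v_k⟩u_j with δ = U^⊤ ∆W V:

```python
import torch

m, n = 8, 5
W = torch.randn(m, n)           # pre-trained weight
b = torch.randn(m)              # pre-trained bias
dW = torch.randn(m, n) * 0.05   # stand-in for the fine-tuning update Delta W
db = torch.randn(m) * 0.05      # stand-in for Delta b
x = torch.randn(n)

U, S, Vh = torch.linalg.svd(W, full_matrices=True)
V = Vh.T

# Projection coefficients delta_{jk} = <Delta W, u_j v_k^T> = (U^T Delta W V)_{jk}.
delta = U.T @ dW @ V  # m x n

# Eq. (4): the output splits into a pre-trained part and a fine-tuned part.
pretrained_part = W @ x + b
finetuned_part = dW @ x + db

# Eq. (8): the fine-tuned part expressed in the singular-vector basis of W.
finetuned_part_subspace = U @ (delta @ (V.T @ x)) + db
assert torch.allclose(finetuned_part, finetuned_part_subspace, atol=1e-5)
assert torch.allclose((W + dW) @ x + b + db, pretrained_part + finetuned_part, atol=1e-5)
```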
We further split the left/right singular vectors into three distinct subsets based on the distribution of the singular values of $W$, and design an ablation study corresponding to different zones in the projection coefficient matrix $U^\top \Delta W V$: (1) The top-left zone I contains the most significant singular values that cumulatively account for 50% of the total sum of the singular values; we denote the number of singular values in this zone as $r_{half}$, i.e. $\sum_{i=1}^{r_{half}} \sigma_i \approx \frac{1}{2} \sum_{i=1}^{r} \sigma_i$. This zone is crucial for preserving pre-training information. (2) The middle zone II encompasses the singular values that make up the remaining half of the cumulative sum. These values are still important but less so than those in zone I. (3) Zone III contains no information about the pre-trained weights, as its range is beyond $\mathrm{rank}(W)$. This zone partition is illustrated as the $\Sigma$ in Figure 2a. We fine-tune the pre-trained CLIP-ViT-B/32 model on eight downstream tasks from different domains, including hand-written digit images, satellite images, regular patterns, car images, and natural images. Then we project $\Delta W$ onto the different subspaces, including (1) subspace I $\mathrm{span}(\{u_i v_j^T \mid i \in [r_{half}], j \in [r_{half}]\})$, (2) subspace II $\mathrm{span}(\{u_i v_j^T \mid i \in [r_{half}, r], j \in [r_{half}, r]\})$, and (3) subspace II & III $\mathrm{span}(\{u_i v_j^T \mid i \in [r_{half}, m], j \in [r_{half}, n]\})$. We compare the performance of the pre-trained model, the fine-tuned models, and modified fine-tuned models with different subspace projection strategies applied to $\Delta W$ in Figure 2b. For more details, please refer to Appendix A.
Figure 2b demonstrates that projecting the fine-tuned updates onto subspace I results in only a slight improvement in performance on downstream tasks compared to the pre-trained model, sometimes showing no improvement at all. Projection onto subspace II leads to moderate improvement, while projection onto subspace II & III results in significant performance gains, nearly reaching the level of the fine-tuned model. Based on these observations, we draw the following conclusion:
Fine-tuning largely maintains the most important pre-trained features, but leverages less significant dimensions for task-specific learning and activates or repurposes previously unused dimensions in the weight space.
Here the pre-trained term is shared and remains constant during fine-tuning across all tasks. In the context of model merging, these models are merged to construct a unified multi-task model that can perform all tasks simultaneously. A common approach is to use a weighted average of the fine-tuned weights, i.e. $W_{merged} = \sum_{l=1}^{T} \lambda_l W_{ft}^{(l)}$ and $b_{merged} = \sum_{l=1}^{T} \lambda_l b_{ft}^{(l)}$. This is equivalent to merging the fine-tuned parts of the models, while the pre-trained parts are shared across all tasks. Therefore, we express the output of the merged model as:
$$y_{merged} = \text{pre-trained part} + \sum_{j=1}^{m} \sum_{k=1}^{n} \left( \sum_{l=1}^{T} \lambda_l \delta_{jk}^{(l)} \right) \langle x, v_k \rangle u_j + \sum_{l=1}^{T} \lambda_l \Delta b^{(l)}. \tag{10}$$
Substituting the input $x$ with $x^{(i)}$ from the $i$-th task (domain), we aim to minimize the discrepancy between the output of the merged model and the output of the $i$-th fine-tuned model. We formulate the optimization problem as follows:
$$\min_{\lambda_l} \left\| y_{merged} - y^{(i)} \right\|_2^2 = \min_{\lambda_l} \left\| \sum_{j=1}^{m} \sum_{k=1}^{n} \left[ \left( \sum_{l=1}^{T} \lambda_l \delta_{jk}^{(l)} \right) - \delta_{jk}^{(i)} \right] \langle x^{(i)}, v_k \rangle u_j + \left( \sum_{l=1}^{T} \lambda_l \Delta b^{(l)} - \Delta b^{(i)} \right) \right\|_2^2. \tag{11}$$
Using the triangle inequality, we can decompose the error into two parts and assert an upper bound:
$$\left\| y_{merged} - y^{(i)} \right\|_2 \le \underbrace{\left\| \sum_{j=1}^{m} \sum_{k=1}^{n} \left[ \left( \sum_{l=1}^{T} \lambda_l \delta_{jk}^{(l)} \right) - \delta_{jk}^{(i)} \right] \langle x^{(i)}, v_k \rangle u_j \right\|_2}_{\text{weight term}} + \underbrace{\left\| \sum_{l=1}^{T} \lambda_l \Delta b^{(l)} - \Delta b^{(i)} \right\|_2}_{\text{bias term}}. \tag{12}$$
For the bias term, we have a closed-form least-squares solution $\lambda = (\Delta B^\top \Delta B)^{-1} \Delta B^\top \Delta b^{(i)}$, where $\Delta B$ is the matrix whose $l$-th column is $\Delta b^{(l)}$. Notice that this solution varies with the task (domain) index of the incoming input $x^{(i)}$, so a more straightforward solution is to fix the bias term during the fine-tuning process, so that $\Delta b^{(i)} = 0$ for all $i$. As for the weight term, parameter interference does not occur on the subspace that is orthogonal to the input, i.e. $\mathrm{span}(\{u_i v_j^T \mid i \in [m], j \in [n] \text{ and } \langle x^{(i)}, v_j \rangle = 0\})$; thus a possible strategy is to enlarge the input size to increase the dimension of the orthogonal subspace. This explains why model merging methods perform better on larger models with more dimension redundancy. When the input activates certain dimensions, i.e. for $k$ such that $\langle x^{(i)}, v_k \rangle \neq 0$, interference is inevitable unless the domain gap between different tasks is large enough to make the activated dimensions disjoint. Note that we can reach the same conclusion within the original model parameter space by simply replacing the basis vectors $\{u_i\}_i$ and $\{v_i\}_i$ in this section with the standard Euclidean basis vectors $\{e_i\}_i$.
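For the bias term in isolation, the least-squares solution above can be computed directly; a minimal sketch (assuming PyTorch and hypothetical per-task bias updates) is:

```python
import torch

T, m = 4, 16                      # number of tasks, output dimension
dB = torch.randn(m, T)            # Delta B: l-th column is the bias update of task l
db_i = dB[:, 2]                   # bias update of the task the input comes from

# Closed-form least-squares solution lambda = (dB^T dB)^{-1} dB^T db_i.
lam = torch.linalg.solve(dB.T @ dB, dB.T @ db_i)

# Equivalent, and numerically preferable, via a least-squares solver.
lam_lstsq = torch.linalg.lstsq(dB, db_i.unsqueeze(-1)).solution.squeeze(-1)
assert torch.allclose(lam, lam_lstsq, atol=1e-4)

# Here the optimum simply selects the matching task: lambda is (0, 0, 1, 0).
print(lam)
```

As the comment notes, the optimal coefficients depend on which task the input comes from, which is exactly why the paper suggests fixing the bias during fine-tuning instead.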
Figure 3: The architecture of the proposed Sparse MIxture of Low-rank Experts (SMILE) model.
where $r^{(i)}$ is the rank of the fine-tuned weight update $\Delta W^{(i)}$, and $k$ is the rank of the low-rank approximation, which is determined as a hyperparameter. $U_k^{(i)}$ and $V_k^{(i)}$ contain the first $k$ columns of $U^{(i)}$ and $V^{(i)}$, respectively. Here we drop the terms with indices $j > k$ in the summation, which correspond to the less significant dimensions. Let $A^{(i)} = \Sigma_k^{(i)} V_k^{(i)\top}$ and $B^{(i)} = U_k^{(i)}$; we can then express the approximation similarly to a LoRA adapter: $\Delta W^{(i)} x \approx B^{(i)} A^{(i)} x$. The following theorem states the optimality of this low-rank approximation.
Theorem 2 Given a matrix $W \in \mathbb{R}^{m \times n}$, its rank-$k$ approximation $W_k = U_k \Sigma_k V_k^\top$ minimizes the Frobenius norm of the difference between $W$ and $W_k$, i.e. $W_k = \arg\min_{\mathrm{rank}(W') = k} \| W - W' \|_F$.
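In practice, each low-rank expert can be extracted from the truncated SVD of the corresponding task's weight update, as in the following sketch (assuming PyTorch; `extract_expert` and `dW_i` are illustrative names, with `dW_i` standing for the update of the i-th fine-tuned model):

```python
import torch

def extract_expert(dW_i: torch.Tensor, k: int, k_gate: int):
    """Build a LoRA-style low-rank expert (B, A) and routing vectors from Delta W^(i)."""
    U, S, Vh = torch.linalg.svd(dW_i, full_matrices=False)
    # Rank-k expert: Delta W^(i) x ~= B A x with A = Sigma_k V_k^T and B = U_k.
    A = torch.diag(S[:k]) @ Vh[:k, :]   # k x n
    B = U[:, :k]                        # m x k
    # Top k_gate right singular vectors, used by the router (Eq. 15).
    V_gate = Vh[:k_gate, :].T           # n x k_gate
    return B, A, V_gate

m, n = 64, 48
dW_i = torch.randn(m, 8) @ torch.randn(8, n)     # a low-rank-ish update for illustration
B, A, V_gate = extract_expert(dW_i, k=8, k_gate=4)
x = torch.randn(n)
# By Theorem 2 the truncation is the best rank-k approximation in Frobenius norm;
# since this toy update has rank 8, the rank-8 expert reproduces it exactly.
assert torch.allclose(dW_i @ x, B @ (A @ x), atol=1e-3)
```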
Another key component of the SMILE model is the router, which determines the routing weights. The routing weights
should reflect the importance of each expert for a given input, and we hypothesize that the most important dimensions
of the parameter updates have a larger probability of aligning with the input vector. We provide a rationale for this
hypothesis by examining the gradient flow throughout fine-tuning in Appendix B. Therefore, we design the routing
logits as the L2 norm of the projections of the input onto the low-rank matrices. Mathematically, we can express the
routing weights as:
$$r^{(i)} = \left\| \sum_{j=1}^{k_{gate}} \langle x, v_j^{(i)} \rangle\, v_j^{(i)} \right\|_2 = \left\| V_{k_{gate}}^{(i)\top} x \right\|_2. \tag{15}$$
Here $k_{gate}$ is the number of dimensions used for routing, which is a hyperparameter. $k_{gate}$ should not be excessively large, which could diminish the distinctiveness of the routing weights. In the extreme case where $k_{gate} = n$, $r^{(i)} = \|x\|_2$, which is equivalent to a uniform distribution over all experts. In our hyperparameter analysis, we find that $k_{gate} = 4$ or $8$ is a good choice for most tasks. To summarize, the output of the SMILE module can be expressed as:
$$y = (W x + b) + \sum_{i=1}^{T} \frac{\lambda_i}{\sum_{j=1}^{T} \lambda_j} \left( U_k^{(i)} \Sigma_k^{(i)} V_k^{(i)\top} x + \Delta b^{(i)} \right), \tag{16}$$
$$\lambda_i = \begin{cases} p_i, & p_i \in \mathrm{TopK}(\{p_j\}_{j=1}^{T}, K) \\ 0, & \text{otherwise,} \end{cases} \tag{17}$$
$$p_i = \mathrm{softmax}_i\left(r^{(i)}\right) = \mathrm{softmax}_i\left( \left\| V_{k_{gate}}^{(i)\top} x \right\|_2 \right). \tag{18}$$
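Putting Eqs. (16)-(18) together, a single SMILE linear module can be sketched as follows (a minimal, single-token sketch assuming PyTorch; `SMILELinear` is an illustrative name, not the reference implementation in fusion_bench):

```python
import torch


class SMILELinear(torch.nn.Module):
    """Zero-shot upscaled linear layer: shared pre-trained part + T low-rank experts."""

    def __init__(self, W, b, dWs, dbs, k=32, k_gate=16, top_k=1):
        super().__init__()
        self.W, self.b, self.top_k = W, b, top_k
        self.experts, self.gates, self.dbs = [], [], dbs
        for dW in dWs:
            U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
            # Expert i keeps the top-k singular triplets of Delta W^(i).
            self.experts.append((U[:, :k], S[:k], Vh[:k, :]))
            # The router uses the top k_gate right singular vectors (Eq. 15).
            self.gates.append(Vh[:k_gate, :])

    def forward(self, x):
        # Routing logits r^(i) = || V_{k_gate}^{(i)T} x ||_2, then softmax (Eq. 18).
        logits = torch.stack([torch.linalg.norm(Vg @ x) for Vg in self.gates])
        p = torch.softmax(logits, dim=0)
        # Keep only the top-K experts and renormalize their weights (Eqs. 16-17).
        top_p, top_idx = torch.topk(p, self.top_k)
        y = self.W @ x + self.b
        for weight, i in zip(top_p / top_p.sum(), top_idx.tolist()):
            U_k, S_k, Vh_k = self.experts[i]
            y = y + weight * (U_k @ (S_k * (Vh_k @ x)) + self.dbs[i])
        return y


# Usage with hypothetical shapes: T = 3 task-specific fine-tuned layers.
m, n, T = 32, 24, 3
W, b = torch.randn(m, n), torch.randn(m)
dWs = [torch.randn(m, n) * 0.02 for _ in range(T)]
dbs = [torch.randn(m) * 0.02 for _ in range(T)]
layer = SMILELinear(W, b, dWs, dbs, k=8, k_gate=4, top_k=1)
print(layer(torch.randn(n)).shape)  # torch.Size([32])
```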
Complexity Analysis: The original linear layer has $m(n+1)$ parameters. The upscaled SMILE module has $m(n+1) + T(mk + nk + m) + nTk_{gate}$ parameters, so the additional parameters have a space complexity of $O(T(mk + n(k + k_{gate})))$. For every input token, an additional $nTk_{gate} + K(mk + nk + m)$ parameters are activated, with $K$ denoting the top-$K$ hyperparameter. For instance, with $T = 8$, $k_{gate} = 4$, $k = 32$, $K = 1$, and $m = n = 1024$, the SMILE module has 565K additional parameters, which is about 53.9% of the original parameter count; 99K additional parameters are activated for each input token, which is only about 9.4% of the original parameter count.
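The parameter counts in this example can be reproduced with a few lines of arithmetic (a quick check of the figures above, not part of the method itself):

```python
# Parameter accounting for one upscaled linear layer (T experts, top-K routing).
T, k_gate, k, K, m, n = 8, 4, 32, 1, 1024, 1024

original = m * (n + 1)                                    # dense layer: weights + bias
extra_total = T * (m * k + n * k + m) + n * T * k_gate    # experts + expert biases + router
extra_active = n * T * k_gate + K * (m * k + n * k + m)   # router always runs; K experts fire

print(extra_total, extra_total / original)    # 565248 (~565K), ~53.9%
print(extra_active, extra_active / original)  # 99328 (~99K), ~9.4%
```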
Extending to Parameter-efficient fine-tuned (PEFT) models: It is straightforward to extend SMILE upscaling to PEFT models, such as LoRA fine-tuned models. We can still decompose the fine-tuning updates $\Delta W_{LoRA} = B_{LoRA} A_{LoRA}$ using SVD. Note that for parameter-efficient fine-tuned models such as LoRA, $k_{gate}$ should be set to a value smaller than the LoRA rank hyperparameter $r_{LoRA}$, and $k \le r_{LoRA}$.
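For LoRA fine-tuned source models the update is already factored, so the expert and routing vectors can be obtained from an SVD of the product of the LoRA factors; a sketch under the same assumptions as above (for large layers one could decompose the small factors instead, but the direct product keeps the example short):

```python
import torch

m, n, r_lora = 64, 48, 16
B_lora = torch.randn(m, r_lora) * 0.05   # LoRA up-projection factor
A_lora = torch.randn(r_lora, n) * 0.05   # LoRA down-projection factor

# Delta W_LoRA = B_LoRA A_LoRA has rank at most r_LoRA, so choose k <= r_LoRA
# and k_gate smaller still (e.g. k = 4, k_gate = 2 as in Table 5).
dW = B_lora @ A_lora
U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
k, k_gate = 4, 2
expert_B, expert_A = U[:, :k], torch.diag(S[:k]) @ Vh[:k, :]
router_V = Vh[:k_gate, :]   # rows are the k_gate routing directions
```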
5 Experiments
In this section, we evaluate the effectiveness of the proposed SMILE on a variety of setups, including image classification and text generation tasks, as well as full fine-tuning and LoRA fine-tuning. Detailed information about the fine-tuned models is in Appendix C. We compare our method with several SOTA model fusion techniques, including Simple Averaging [Wolf et al., 2019b], Fisher merging [Matena and Raffel, 2022], RegMean [Jin et al., 2022], Task Arithmetic [Ilharco et al., 2022], Ties-Merging [Yadav et al., 2023], AdaMerging [Yang et al., 2024c], and WEMoE [Tang et al., 2024c]. To further demonstrate the scalability of SMILE upscaling, we also conduct experiments using off-the-shelf large language models fine-tuned from Mistral-7B-v0.1. Our code implementation is based on FusionBench [Tang et al., 2024a].

Table 1: Requirements of different model fusion methods.
Method            Validation Set    Test-Time Adaptation
Weight Averaging
Fisher-Merging    ✓
RegMean           ✓
Task Arithmetic   ✓
Ties-Merging      ✓
AdaMerging                          ✓
WEMoE                               ✓
SMILE (Ours)
Table 2: Multi-task model fusion methods on eight image classification tasks using CLIP-ViT-B/32 models. Here we
show two different hyperparameter settings for our method: (1) kgate = 16, k = 32 and the normalized parameter
count is 1.61; (2) kgate = 16, k = 128 and the normalized parameter count is 3.07.
Method SUN397 Cars RESISC45 EuroSAT SVHN GTSRB MNIST DTD Avg.
Individual 75.0 78.3 95.2 99.0 97.3 98.9 99.6 79.7 90.3 (100%)
Simple Averaging 65.4 62.6 70.8 76.9 64.5 54.9 86.3 50.9 66.5 (73.6%)
Fisher Merging 66.7 64.0 72.2 91.6 69.0 64.3 83.5 53.7 70.6 (78.2%)
RegMean 67.8 68.9 82.5 94.4 90.6 79.2 97.6 63.2 80.5 (89.1%)
Task Arithmetic 57.1 55.7 64.9 76.7 77.9 68.5 96.1 47.2 68.0 (75.3%)
Ties-Merging 67.1 64.2 74.1 76.8 77.7 69.4 94.1 54.0 72.2 (80.0%)
AdaMerging 67.9 71.3 83.5 92.7 87.4 92.9 98.2 67.0 82.6 (91.5%)
WEMoE (× 6.27) 73.7 76.8 93.4 98.2 96.8 98.2 99.6 76.6 89.2 (98.8%)
SMILE (1, ×1.61) 73.6 74.4 89.5 98.1 95.4 97.3 99.5 76.3 87.7 (97.1%)
SMILE (2, ×3.07) 73.6 77.8 92.0 98.3 96.9 98.1 99.6 78.1 89.3 (98.9%)
Table 3: Multi-task model fusion methods on eight image classification tasks using CLIP-ViT-L/14 models. Here we
show two different hyperparameter settings for our method: (1) kgate = 16, k = 32 and the normalized parameter
count is 1.47; (2) kgate = 16, k = 128 and the normalized parameter count is 2.56.
Method SUN397 Cars RESISC45 EuroSAT SVHN GTSRB MNIST DTD Avg.
Individual 82.8 92.9 97.4 99.2 97.9 99.2 99.8 85.5 94.3 (100%)
Simple Averaging 72.5 81.5 82.2 90.0 81.6 74.0 96.6 61.8 80.0 (84.8%)
Fisher Merging 70.6 79.4 84.1 98.1 74.7 85.0 89.5 61.0 80.3 (85.2%)
RegMean 75.3 88.4 90.0 97.1 95.9 92.4 98.5 72.6 88.8 (94.2%)
Task Arithmetic 72.0 79.0 80.5 86.0 87.5 83.5 98.0 58.8 80.7 (85.6%)
Ties-Merging 74.7 83.3 86.4 91.3 89.7 85.2 97.8 63.9 84.0 (89.1%)
AdaMerging 78.1 90.7 90.8 96.5 94.8 97.5 98.6 81.3 91.0 (96.5%)
WEMoE (×6.40) 81.5 92.3 96.5 98.8 97.6 99.4 99.6 84.5 93.8 (99.5%)
SMILE (1, ×1.47) 79.9 91.0 94.3 99.0 97.9 98.6 99.7 82.2 92.8 (98.4%)
SMILE (2, ×2.56) 81.9 92.3 95.5 99.1 98.0 98.9 99.7 83.6 93.6 (99.3%)
SMILE is a training-free model fusion method that does not require additional labeled samples or test-time adaptation.

Figures 1 and 4 illustrate the average accuracy of the merged model across different methods; for SMILE, kgate is set to 16 and k is varied from 4 to 128. These two figures demonstrate the effectiveness of SMILE and the trade-off between performance and model size.

In Tables 2 and 3, we compare the performance of various model fusion methods on CLIP-ViT-B/32 and CLIP-ViT-L/14 models, respectively. Our SMILE method achieves competitive results across all tasks. For instance, with CLIP-ViT-L/14 models, SMILE (2: kgate = 16, k = 128) achieves 99.3% of the individual model performance while using only 2.56 times the parameters of a single model, compared to eight individual fine-tuned models and Weight-Ensembling MoE, which requires 6.40 times the parameters.

Figure 4: Multi-task model fusion experiment on eight image classification tasks using CLIP-ViT-L/14 models (kgate = 16). Average accuracy versus the normalized number of parameters for Task Arithmetic, Ties-Merging, Fisher Merging, RegMean, Layer-Wise AdaMerging, Weight-Ensembling MoE, the individual models, and SMILE.
1 https://huggingface.co/openai/clip-vit-base-patch32
2 https://huggingface.co/openai/clip-vit-large-patch14
Table 4: Multi-task performance when merging Flan-T5-base (full fine-tuned) models on all eight tasks. Here we show
two different hyperparameter settings for our method: (1) kgate = 4, k = 16 and the normalized parameter count is
1.26; (2) kgate = 8, k = 32 and the normalized parameter count is 1.52.
Method CoLA MNLI MRPC QNLI QQP RTE SST2 STSB Avg.
Individual 75.0 83.4 87.5 91.5 85.4 85.9 93.6 88.7 86.4 (100%)
Weight Averaging 69.1 62.6 79.4 89.8 83.9 81.2 91.7 73.2 78.9 (91.3%)
Task Arithmetic 70.5 57.8 78.4 90.2 83.6 80.5 92.3 77.8 78.9 (91.3%)
Ties-Merging 70.3 65.0 78.9 90.2 83.5 81.6 91.7 78.3 79.9 (92.5%)
SMILE (1, ×1.26) 72.0 84.2 84.3 91.3 84.7 84.1 93.3 87.0 85.1 (98.5%)
SMILE (2, ×1.52) 73.2 84.2 85.0 91.3 84.9 84.8 93.5 87.3 85.5 (99.0%)
Table 5: Multi-task performance when merging Flan-T5-base (LoRA fine-tuned) models on all eight tasks. We choose
kgate = 2, k = 4 and the normalized parameter count is 1.02.
Method CoLA MNLI MRPC QNLI QQP RTE SST2 STSB Avg.
Individual 69.1 82.7 85.5 90.9 84.0 84.4 92.9 87.4 84.6 (100%)
Weight Averaging 69.7 59.7 78.9 90.1 83.8 80.5 91.2 72.0 78.2 (92.4%)
Task Arithmetic 68.8 55.2 78.7 89.8 83.7 79.1 91.5 72.4 77.4 (91.5%)
Ties-Merging 68.3 56.3 79.4 89.8 83.7 79.4 91.6 71.2 77.5 (91.6%)
SMILE (×1.02) 69.3 82.9 83.8 90.6 83.9 83.4 93.1 85.1 84.0 (99.3%)
We further evaluate SMILE on text generation tasks using Flan-T5-base models3, which are fine-tuned on eight tasks from the GLUE benchmark [Wang et al., 2018]. We use two different fine-tuning strategies: full fine-tuning and LoRA fine-tuning with rLoRA = 16. We present the results in Tables 4 and 5 for full fine-tuned models and LoRA fine-tuned models, respectively. For fully fine-tuned models, SMILE consistently outperforms other fusion methods across all eight tasks: with just 1.52 times the parameters of a single model, SMILE (2: kgate = 8, k = 32) achieves 99.0% of the individual model performance. In the LoRA fine-tuned scenario, SMILE maintains strong performance with a minimal parameter increase (1.02 times), achieving 99.3% of the individual model performance and significantly surpassing other multi-task model fusion methods.
To better understand SMILE, we further conduct ablation studies using CLIP and Flan-T5 models.
Ablations on the low-rank approximation rank k and routing dimension kgate (hyperparameter analysis). Our
hyperparameter analysis demonstrates the flexibility and robustness of SMILE across different model architectures and
tasks. Figure 5 illustrates the impact of hyperparameters k and kgate on performance and parameter count for Flan-
T5-Base models in both full and LoRA fine-tuned scenarios. For CLIP-ViT models, Figures 8 and 9 provide detailed
heatmaps and line plots showing the relationship between hyperparameters, average accuracy, and parameter count.
Across all models, we observe a consistent trend: increasing k and kgate generally leads to improved performance,
but with diminishing returns as parameter count grows. Notably, SMILE achieves near-optimal performance with
relatively small values of k and kgate . This analysis highlights the effectiveness of SMILE in balancing performance and
efficiency, allowing users to fine-tune the trade-off based on their specific requirements. The stability of performance
across a range of hyperparameter values also underscores the robustness of our method, making it adaptable to various
multi-task fusion scenarios. For more details, please refer to Appendix D.
Ablations on Top-K routing (routing analysis). Here we compare different values of K in the top-K routing mechanism. The plots in Figure 6 illustrate the impact of varying K on both the average accuracy across all tasks (Figure 6a) and the accuracy of each individual task (Figure 6b) when using the CLIP-ViT-B/32 model across eight image classification tasks. We observe that the average accuracy decreases slightly as K increases. This suggests that larger values of K, which allow more experts to be used for each input, are not necessary for multi-task model fusion where each expert is specialized for a specific task. In general, the performance of individual tasks is relatively stable across different values of K, indicating the robustness of the routing mechanism of SMILE (Equation (15)).

3 https://huggingface.co/google/flan-t5-base

Figure 5: Hyperparameter analysis of the Flan-T5-Base models on eight tasks from the GLUE benchmark. We show how different values of the hyperparameters k and kgate affect the average performance and the normalized number of parameters in the upscaled model. Subfigures (a) and (b) show the results of the full fine-tuned models, while subfigures (c) and (d) show the results of the LoRA fine-tuned models with rLoRA = 16. (a) Average performance. (b) Normalized parameter count. (c) Average performance. (d) Normalized parameter count.

Figure 6: Ablations on the Top-K routing for CLIP-ViT-B/32 models on eight image classification tasks (kgate = 16, k = 32). Here we show the average accuracy and the accuracy on each task; the y-axis is shared. (a) Average accuracy. (b) Accuracy of each task.
4 The expert models are meta-math/MetaMath-Mistral-7B, cognitivecomputations/dolphin-2.1-mistral-7b, and uukuguy/speechless-code-mistral-7b-v1.0, respectively.
Table 6: Comparison of individual Mistral-7B models and the upscaled model on various benchmark tasks.
Model MMLU TruthfulQA GSM8K ARC Challenge
M0 (pre-trained) 59.64 42.62 38.81 53.92
M1 60.56 44.79 71.49 51.02
M2 60.56 55.88 56.93 57.00
M3 61.18 47.47 48.98 57.68
M0;123 (11.2B, kgate = 8, k = 512) 60.66 52.79 67.85 54.35
Qwen1.5-14B (reference) 66.11 52.00 69.37 49.93
As shown in Table 6, the upscaled model M0;123 improves over the pre-trained model M0 on every benchmark, benefiting in particular from M1 on the GSM8K benchmark and M3 in the ARC Challenge. This indicates that each expert brings specialized knowledge, which, when combined, enhances the overall performance on a diverse set of tasks. More details are in Appendix E.
6 Related Work
Mixture of Experts. The concept of Mixture of Experts (MoE) was first introduced by Jacobs et al. [1991], involving the training of multiple specialized models. It has gained significant attention in recent years [Jiang et al., 2024, Dai et al., 2024], with much of the innovation revolving around the design of more efficient routers. For example, the Switch Transformer [Fedus et al., 2022b] selects only the top expert for each token, simplifying the process and improving scalability. Similarly, Lewis et al. [2021] use a linear assignment to optimize token-expert affinities, ensuring an equal spread of tokens among experts. In [Ostapenko et al., 2024], the authors propose to build a library of LoRA adapters using model-based clustering and construct an MoE that selects the most relevant adapters based on the input without retraining. For detailed reviews on MoE, see [Fedus et al., 2022a], and for MoE in the context of model merging (MoErging), refer to [Yadav et al., 2024].
Deep Model Fusion. Mode connectivity reveals that different model solutions can be linked by low-loss paths in the
parameter space [Freeman and Bruna, 2016, Nagarajan and Kolter, 2019, Draxler et al., 2018, Frankle et al., 2020,
Entezari et al., 2021, Garipov et al., 2018, Tatro et al., 2020, Yunis et al., 2022, Benton et al., 2021], facilitating model
fusion by weight interpolation [Izmailov et al., 2018, Matena and Raffel, 2022, Wolf et al., 2019b, Kaddour, 2022,
Ilharco et al., 2022, Yadav et al., 2023, Yang et al., 2024c, Wu et al., 2023]. However, this strategy also poses challenges,
particularly when merging models with diverse structures. Alignment helps reduce model disparities by matching and
interpolating components [Li et al., 2015, Tatro et al., 2020]. Methods involve matching activations or weights [Stoica
et al., 2023, Jin et al., 2022, Yang et al., 2024b], using channel-wise graph matching [Liu et al., 2022], or applying
permutation invariance [Ainsworth et al., 2022]. Another line of research is model mixing, which combines models
through gating mechanisms or depth concatenation [Tang et al., 2024c, Lu et al., 2024, Tang et al., 2024b, Kim et al.,
2023], allowing for more flexible and adaptive fusion strategies.
References
Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation
symmetries. arXiv preprint arXiv:2209.04836, 2022.
Gregory Benton, Wesley Maddox, Sanae Lotfi, and Andrew Gordon Gordon Wilson. Loss surface simplexes for mode
connecting volumes and fast ensembling. In International Conference on Machine Learning, pages 769–779. PMLR,
2021.
Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the
art. Proceedings of the IEEE, 105(10):1865–1883, 2017.
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the
wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu,
Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv
preprint arXiv:2401.06066, 2024.
Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network
energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018.
Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode
connectivity of neural networks. arXiv preprint arXiv:2110.06296, 2021.
William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning. arXiv preprint
arXiv:2209.01667, 2022a.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple
and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022b.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the
lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint
arXiv:1611.01540, 2016.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding,
Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria
Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang,
and Andy Zou. A framework for few-shot language model evaluation, 07 2024.
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode
connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh,
Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. Large language models: a comprehensive survey of its applications,
challenges, limitations, and future prospects. Authorea Preprints, 2023.
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning
benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, 12(7):2217–2226, 2019.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi,
and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights
leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts.
Neural computation, 3(1):79–87, 1991.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural
networks. Advances in neural information processing systems, 31, 2018.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint
arXiv:2401.04088, 2024.
Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of
language models. arXiv preprint arXiv:2212.09849, 2022.
Jean Kaddour. Stop wasting my time! saving days of imagenet and bert training with latest weight averaging. arXiv
preprint arXiv:2209.14981, 2022.
Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim,
Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling.
arXiv preprint arXiv:2312.15166, 2023.
Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay,
Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv
preprint arXiv:2212.05055, 2022.
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In
Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of
large, sparse models. In International Conference on Machine Learning, pages 6265–6274. PMLR, 2021.
Weishi Li, Yong Peng, Miao Zhang, Liang Ding, Han Hu, and Li Shen. Deep model fusion: A survey, 2023.
Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural
networks learn the same representations? arXiv preprint arXiv:1511.07543, 2015.
Chang Liu, Chenfei Lou, Runzhong Wang, Alan Yuhan Xi, Li Shen, and Junchi Yan. Deep neural network fusion
via graph matching with applications to model ensemble and federated learning. In International Conference on
Machine Learning, pages 13857–13869. PMLR, 2022.
Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. Twin-merging: Dynamic integration
of modular expertise in model merging. arXiv preprint arXiv:2406.15479, 2024.
Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information
Processing Systems, 35:17703–17716, 2022.
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng
Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.
Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning.
Advances in Neural Information Processing Systems, 32, 2019.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural
images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning,
volume 2011, page 4. Granada, 2011.
Peter J. Olver and Chehrzad Shakiban. Applied Linear Algebra. Undergraduate Texts in Mathematics. Springer Interna-
tional Publishing, Cham, 2018. ISBN 978-3-319-91040-6 978-3-319-91041-3. doi: 10.1007/978-3-319-91041-3.
URL http://link.springer.com/10.1007/978-3-319-91041-3.
Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas
Caccia, and Alessandro Sordoni. Towards modular llms by building and reusing a library of loras. arXiv preprint
arXiv:2405.11157, 2024.
Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley interdisciplinary reviews: data mining and knowledge
discovery, 8(4):e1249, 2018.
Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine
learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
George Stoica, Daniel Bolya, Jakob Bjorner, Pratik Ramesh, Taylor Hearn, and Judy Hoffman. Zipit! merging models
from different tasks without training. arXiv preprint arXiv:2305.03053, 2023.
Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter efficient
multi-task model fusion with partial linearization. arXiv preprint arXiv:2310.04742, 2023.
Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep
model fusion. arXiv preprint arXiv:2406.03280, 2024a.
Anke Tang, Li Shen, Yong Luo, Shiwei Liu, Han Hu, and Bo Du. Towards efficient pareto set approximation via
mixture of experts based model fusion. arXiv preprint arXiv:2406.09770, 2024b.
Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, and Dacheng Tao. Merging multi-task models via weight-
ensembling mixture of experts. arXiv preprint arXiv:2402.00433, 2024c.
Norman Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, and Rongjie Lai. Optimizing mode connectivity
via neuron alignment. Advances in Neural Information Processing Systems, 33:15300–15311, 2020.
Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language
models. arXiv preprint arXiv:2401.10491, 2024a.
Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, and Wei Bi. Fusechat: Knowledge fusion of
chat models. arXiv preprint arXiv:2402.16107, 2024b.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A Multi-Task
Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium,
2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL http://aclweb.org/
anthology/W18-5446.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim
Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.
arXiv preprint arXiv:1910.03771, 2019a.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim
Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.
arXiv preprint arXiv:1910.03771, 2019b.
Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, and Ping Luo. π-tuning: Transferring
multimodal foundation models with optimal multi-task interpolation. In International Conference on Machine
Learning, pages 37713–37727. PMLR, 2023.
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene
recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition,
pages 3485–3492. IEEE, 2010.
Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Resolving interference when merging
models. arXiv preprint arXiv:2306.01708, 2023.
Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem
Choshen, and Alessandro Sordoni. A survey on model moerging: Recycling and routing among specialized experts
for collaborative learning. arXiv preprint arXiv:2408.07057, 2024.
Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging
in llms, mllms, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666,
2024a.
Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation
surgery for multi-task model merging. Forty-first International Conference on Machine Learning, 2024b.
Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging:
Adaptive model merging for multi-task learning. The Twelfth International Conference on Learning Representations,
2024c.
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities
from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024.
David Yunis, Kumar Kshitij Patel, Pedro Henrique Pamplona Savarese, Gal Vardi, Jonathan Frankle, Matthew Walter,
Karen Livescu, and Michael Maire. On convexity and linear mode connectivity in neural networks. In OPT 2022:
Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.
Hongling Zheng, Li Shen, Anke Tang, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Learn from model beyond
fine-tuning: A survey. arXiv preprint arXiv:2310.08184, 2023.
where $\alpha_{ij} = \langle A, u_i v_j^\top \rangle$ is the projection of $A$ onto the basis element $u_i v_j^\top$, and $\Sigma_A$ is a real matrix with $\Sigma_A(i,j) = \alpha_{ij}$. This decomposition provides a powerful framework for analyzing the fine-tuning process. When we fine-tune a pre-trained model, we can interpret the weight updates $\Delta W$ as modifications to the singular value matrix $\Sigma$, while the singular vectors $U$ and $V$ remain constant. Then we partition the singular value matrix $\Sigma$ into three zones:
• Zone I & Subspace I: $\{1, \ldots, r_{half}\}$, where $r_{half}$ is chosen such that $\sum_{i=1}^{r_{half}} \sigma_i \approx \frac{1}{2} \sum_{i=1}^{r} \sigma_i$. The basis of this subspace is $\{u_i v_j^\top \mid 1 \le i, j \le r_{half}\}$. The projected merged weights in this subspace can be computed as follows:
$$W_I = W + \sum_{i=1}^{r_{half}} \sum_{j=1}^{r_{half}} \langle \Delta W, u_i v_j^\top \rangle u_i v_j^\top = W + U_{r_{half}} U_{r_{half}}^\top \Delta W V_{r_{half}} V_{r_{half}}^\top, \tag{21}$$
where $U_{r_{half}}$ and $V_{r_{half}}$ are the first $r_{half}$ columns of $U$ and $V$, respectively.

• Zone II & Subspace II: $\{r_{half}+1, \ldots, r\}$, where $r = \mathrm{rank}(W)$. The basis of this subspace is $\{u_i v_j^\top \mid r_{half}+1 \le i \le r,\ r_{half}+1 \le j \le r\}$. The projected merged weights in this subspace can be computed as follows:
$$W_{II} = W + \sum_{i=r_{half}+1}^{r} \sum_{j=r_{half}+1}^{r} \langle \Delta W, u_i v_j^\top \rangle u_i v_j^\top = W + U_{r_{half}+1:r} U_{r_{half}+1:r}^\top \Delta W V_{r_{half}+1:r} V_{r_{half}+1:r}^\top, \tag{22}$$
where $U_{r_{half}+1:r}$ and $V_{r_{half}+1:r}$ are the $(r_{half}+1)$-th to $r$-th columns of $U$ and $V$, respectively.

• Zone III & Subspace II + III: The basis of this subspace is $\{u_i v_j^\top \mid r+1 \le i \le m,\ r+1 \le j \le n\}$, and the projected merged weights in this subspace can be computed as follows:
$$W_{II+III} = W + \sum_{i=r+1}^{m} \sum_{j=r+1}^{n} \langle \Delta W, u_i v_j^\top \rangle u_i v_j^\top = W + U_{r+1:m} U_{r+1:m}^\top \Delta W V_{r+1:n} V_{r+1:n}^\top, \tag{23}$$
where $U_{r+1:m}$ and $V_{r+1:n}$ are the $(r+1)$-th to $m$-th columns of $U$ and the $(r+1)$-th to $n$-th columns of $V$, respectively.
We then evaluate the performance of these modified weight matrices on the downstream tasks. The accuracy comparison in Figure 2b is obtained by using these modified weight matrices in place of the original pre-trained weights. For layers other than the linear layers, we keep the pre-trained weights unchanged.
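These projections can be implemented compactly with column slices of U and V. The sketch below (assuming PyTorch; `zone_projections` is an illustrative name) follows Eqs. (21)-(22) for W_I and W_II, and uses the subspace II & III index ranges from Section 2 for the third projection; r_half is taken as the smallest index at which the cumulative singular-value sum of W reaches half of the total:

```python
import torch

def zone_projections(W: torch.Tensor, dW: torch.Tensor):
    """Project the fine-tuning update dW onto subspace I, II, and II & III."""
    m, n = W.shape
    U, S, Vh = torch.linalg.svd(W, full_matrices=True)
    V = Vh.T
    r = int(torch.linalg.matrix_rank(W))
    # r_half: smallest index whose cumulative singular-value mass reaches 50%.
    cum = torch.cumsum(S[:r], dim=0)
    r_half = int((cum < S[:r].sum() / 2).sum()) + 1

    def project(rows, cols):
        # W + U_rows U_rows^T dW V_cols V_cols^T, cf. Eqs. (21)-(23).
        Ur, Vc = U[:, rows], V[:, cols]
        return W + Ur @ Ur.T @ dW @ Vc @ Vc.T

    W_I = project(slice(0, r_half), slice(0, r_half))        # zone I, Eq. (21)
    W_II = project(slice(r_half, r), slice(r_half, r))       # zone II, Eq. (22)
    W_II_III = project(slice(r_half, m), slice(r_half, n))   # subspace II & III (Section 2)
    return W_I, W_II, W_II_III

# A rank-deficient toy example so that zone III is non-empty.
W = torch.randn(12, 6) @ torch.randn(6, 10)
dW = 0.1 * torch.randn(12, 10)
W_I, W_II, W_II_III = zone_projections(W, dW)
```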
We decompose the network into the pre-linear layers $f_{pre}$, the linear layer $f_{linear}(x) = W_{ft} x + b_{ft}$, and the post-linear layers $f_{post}$. Therefore, the output of the network can be expressed as $f(x) = f_{post}(f_{linear}(f_{pre}(x)))$. Without loss of generality, the pre-linear layers $f_{pre}$ can be dropped, as our focus is on the fine-tuning process of the linear layer $f_{linear}$.
During fine-tuning, the model parameters are updated by minimizing a loss function $\mathcal{L}$ with respect to the model weights. Using stochastic gradient descent (SGD) as the optimization algorithm, the weight update $\Delta W$ and bias update $\Delta b$ at each step can be expressed as:
$$W^{(t+1)} - W^{(t)} = -\eta \nabla_W \mathcal{L}\left( f_{post}\left( W^{(t)} x + b^{(t)} \right) \right), \qquad b^{(t+1)} - b^{(t)} = -\eta \nabla_b \mathcal{L}\left( f_{post}\left( W^{(t)} x + b^{(t)} \right) \right), \tag{24}$$
where $\eta$ is the learning rate, and $\nabla_W \mathcal{L}$ and $\nabla_b \mathcal{L}$ are the gradients of the loss function with respect to the weights and biases, respectively. In Eq. (24), we omit the pre-linear layers $f_{pre}$ and use $x$ as the input to the linear layer $f_{linear}$ for simplicity. Let $y = Wx + b$ be the output of the linear layer, and $\mathcal{L}' = \mathcal{L} \circ f_{post}$ be the composed loss function. Starting from the SGD update rule for the weights, we have:
$$W^{(t+1)} - W^{(t)} = -\eta \nabla_W \mathcal{L}'\left( W^{(t)} x + b^{(t)} \right) \tag{25}$$
$$= -\eta \nabla_y \mathcal{L}' \cdot \nabla_W \left( W^{(t)} x + b^{(t)} \right) \tag{26}$$
$$= -\eta \nabla_y \mathcal{L}' \cdot x^T. \tag{27}$$
In practice, we typically use mini-batch SGD, where we average the gradients over a batch of samples. We can represent this as an expectation:
$$W^{(t+1)} - W^{(t)} = -\eta\, \mathbb{E}_{x \sim p(x)}\left[ \nabla_y \mathcal{L} \cdot x^T \right]. \tag{28}$$
Given this gradient update rule, we can analyze how it relates to our choice of routing logits in SMILE. Recall that we defined the routing logits as:
$$r^{(i)} = \left\| V_{k_{gate}}^{(i)\top} x \right\|_2. \tag{29}$$
Let us consider the expected weight update over many iterations when $W_{ft}$ is close to $W$:
$$\mathbb{E}[\Delta W] \approx -\eta\, \mathbb{E}_{x \sim p(x)}\left[ \nabla_y \mathcal{L} \cdot x^T \right] \tag{30}$$
$$= -\eta\, \mathbb{E}_{x \sim p(x)}\left[ g \cdot x^T \right], \tag{31}$$
where $g = \nabla_y \mathcal{L}$ is the gradient of the loss with respect to the output of the linear layer. Now, consider the singular value decomposition (SVD) of the expected weight update:
$$\mathbb{E}[\Delta W] = U \Sigma V^T. \tag{32}$$
The right singular vectors $V$ represent the directions in the input space that are most important for the weight updates. Our routing logits $r^{(i)}$ are based on projecting the input onto the top $k_{gate}$ right singular vectors of the fine-tuned weight difference $\Delta W^{(i)}$. This choice is justified because the right singular vectors of $\Delta W^{(i)}$ are likely to be similar to those of $\mathbb{E}[\Delta W]$, as both represent important directions for task-specific updates. In addition, by projecting onto these vectors, we measure how much an input aligns with the directions that were most important during fine-tuning for a specific task.
An NTK perspective. On the other hand, we can analyze the fine-tuning process from a neural tangent kernel (NTK) perspective. Following [Tang et al., 2023], and according to the linearity of the linear layer, we have:
$$f_{linear}(x; \phi_{ft}) - f_{linear}(x; \phi) \tag{33}$$
$$= \nabla_\phi f_{linear}(x; \phi)^\top (\phi_{ft} - \phi) \tag{34}$$
$$\approx -\eta\, \mathbb{E}_{x' \sim p(x)}\left[ \nabla_\phi f_{linear}(x; \phi)^\top \nabla_\phi \mathcal{L}'\left(f_{linear}(x'; \phi)\right) \right] \tag{35}$$
$$= -\eta\, \mathbb{E}_{x' \sim p(x)}\left[ \nabla_\phi f_{linear}(x; \phi)^\top \nabla_\phi f_{linear}(x'; \phi)\, \nabla_{f_{linear}} \mathcal{L}'\left(f_{linear}(x'; \phi)\right) \right] \tag{36}$$
$$= -\eta\, \mathbb{E}_{x' \sim p(x)}\left[ K(x, x'; \phi)\, \nabla_{f_{linear}} \mathcal{L}'\left(f_{linear}(x'; \phi)\right) \right], \tag{37}$$
where $\phi$ denotes the pre-trained parameters of $f_{linear}$, i.e. $W$ and $b$, and $\phi_{ft}$ denotes the fine-tuned parameters $W_{ft}$ and $b_{ft}$. $K(x, x'; \phi) = \langle \nabla_\phi f_{linear}(x; \phi), \nabla_\phi f_{linear}(x'; \phi) \rangle$ is the neural tangent kernel (NTK) [Jacot et al., 2018] of the linear layer $f_{linear}$, and $\mathcal{L}' = \mathcal{L} \circ f_{post}$ is the composed loss function. Note that for a given $x$, $K(x, x'; \phi)$ is a constant matrix.
Table 10: Performance of LoRA fine-tuned (rLoRA = 16) Flan-T5-Base models on eight downstream tasks.
Model CoLA MNLI MRPC QNLI QQP RTE SST2 STSB
Pre-trained 69.1 56.5 76.2 88.4 82.1 80.1 91.2 62.2
CoLA 69.1 39.9 75.2 89.1 81.1 81.9 90.7 54.0
MNLI 69.4 82.7 73.8 89.3 82.0 79.4 90.9 68.1
MRPC 64.0 44.9 85.5 82.6 81.0 69.0 88.6 73.6
QNLI 68.9 52.7 76.7 90.9 82.8 79.8 91.5 68.9
QQP 65.0 54.6 75.7 89.0 84.0 81.6 90.7 75.3
RTE 64.9 51.8 69.4 89.2 79.8 84.5 90.6 70.1
SST2 68.3 56.6 76.0 88.5 83.4 79.8 92.9 62.6
STSB 65.7 1.7 67.4 89.3 80.1 79.8 90.8 87.4
1. Task-Specific Improvements: Across all model types, fine-tuning consistently improves performance on the
target task compared to the pre-trained model. This demonstrates the effectiveness of task-specific adaptation.
2. Negative Transfer: In some cases, we observe negative transfer, where fine-tuning on one task harms perfor-
mance on another task. For example, in Table 7, the SVHN-tuned model performs worse on EuroSAT (9.8%)
compared to the pre-trained model (46.0%).
3. Task Relatedness: Some fine-tuned models show improved performance on related tasks. For example, in
Table 9, the MNLI-tuned model performs well on QNLI, suggesting a transfer of relevant knowledge between
these natural language inference tasks.
4. Varying Task Difficulty: The diagonal entries reveal that some tasks are inherently more challenging than
others. For instance, in the CLIP-ViT models, EuroSAT and MNIST consistently achieve very high accuracy,
while SUN397, Cars and DTD prove more challenging.
5. Model Size Impact: Comparing CLIP-ViT-B/32 and CLIP-ViT-L/14 results, we generally see improved
performance with the larger model, indicating that model capacity plays a role in both task-specific performance
and generalization.
D Hyperparameter Analysis
In this section, we present a comprehensive analysis of the hyperparameters k and kgate for the CLIP-ViT-B/32 and
CLIP-ViT-L/14 models across eight image classification datasets. We examine their impact on both model performance
(average accuracy) and model complexity (number of parameters). We also test the extreme cases when k = ∞, which
corresponds to full-rank experts, denoted as “Dense” in the figures. We normalize the number of parameters by the
number of parameters in the original model (87.5M for CLIP-ViT-B/32 and 303M for CLIP-ViT-L/14) to facilitate
comparison.
CLIP Models. From Figures 8 and 9, we observe that: (1) The performance of the upscaled models is generally better than that of the pre-trained models, which demonstrates the effectiveness of our upscaling strategy. (2) Increasing the value of k generally improves the performance of both the CLIP-ViT-B/32 and CLIP-ViT-L/14 models, though at the cost of increased model complexity. (3) Increasing kgate improves the performance of the upscaled models at first, but the performance starts to decrease when kgate is too large. This observation is consistent with our discussion in Section 4 that a larger kgate may result in a less discriminative gating mechanism. (4) Better performance preservation can be achieved with the CLIP-ViT-L/14 model than with the CLIP-ViT-B/32 model, which is consistent with our discussion in Section 3 that the larger model has more dimension redundancy and suffers less from the parameter interference problem.
In practice, when selecting hyperparameters for the upscaled models, it is crucial to balance the trade-off between
performance and parameter overhead. Taking CLIP-ViT-B/32 as an example, a good trade-off between performance
and parameter overhead is achieved with k ≈ 32 and kgate ≈ 16; for the CLIP-ViT-L/14 model, k ≈ 64 and
kgate ≈ 8 are recommended. With these settings, we obtain a multi-task model that achieves around 98% of the performance
of the fine-tuned models with only 20% of the total parameters required to maintain eight individual fine-tuned
models, one per task. Note that the upscaled SMILE model performs sparse inference: increasing the total number of parameters
by N increases the number of activated parameters per token by only about N/T. Even with an extreme focus on storage and
inference costs, a parameter overhead of only 7% achieves an average multi-task performance of about 90% of that of the
individual fine-tuned models.
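For rough planning of this trade-off, it can help to estimate the per-layer parameter cost before upscaling. The sketch below is an approximation under our own assumptions (each expert contributes a pair of rank-k factors for an n_in × n_out layer, and the router adds a rank-kgate projection per expert; biases and the exact bookkeeping of the SMILE construction are ignored), so treat its output as a ballpark figure rather than the exact counts plotted in Figures 8 and 9.

```python
def smile_layer_params(n_in: int, n_out: int, num_experts: int, k: int, k_gate: int) -> dict:
    """Approximate parameter count for one linear layer upscaled into a
    SMILE-style sparse MoE block (illustrative only; biases ignored)."""
    base = n_in * n_out                         # shared pre-trained weight
    experts = num_experts * k * (n_in + n_out)  # rank-k up/down factors per expert (assumption)
    router = num_experts * k_gate * n_in        # rank-k_gate routing projection per expert (assumption)
    added = experts + router
    return {"base": base, "added": added, "total": base + added,
            "relative_overhead": added / base}

# Hypothetical example: a 768x768 projection with 8 experts, k=32, k_gate=16.
print(smile_layer_params(768, 768, num_experts=8, k=32, k_gate=16))
```

Summing such per-layer estimates over all upscaled linear layers and dividing by the size of the original model gives a rough analogue of the normalized parameter counts reported above.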
[Figure 8 panels omitted: heatmaps and line plots of the average accuracy and the normalized number of parameters for various values of k and for kgate ∈ {1, 2, 4, 8}. (c) Heatmap of the normalized number of parameters. (d) Line plot of normalized number of parameters.]
Figure 8: Hyperparameter analysis of the CLIP-ViT-B/32 model on eight image classification datasets. Here we show
how different values of hyperparameters k and kgate affect the average performance and the number of parameters
(normalized by the number of parameters in the original model) in the upscaled model. We also show the average
accuracy of pre-trained models and individual fine-tuned models in subfigure (b).
Flan-T5 Models. Figure 5 shows the hyperparameter analysis of the Flan-T5-Base models on eight tasks from the
GLUE benchmark. We conduct the same analysis for both full fine-tuned models and LoRA fine-tuned models with
rLoRA = 16. It is observed that the performance is relatively stable, with most configurations yielding accuracies
around 85.1 to 85.6 for the full fine-tuned models and around 83.8 to 84.0 for the LoRA fine-tuned models. Upscaling
LoRA fine-tuned models is very parameter-efficient, with the number of parameters increasing by only 2% to 7%
compared to the original dense model. For a balanced trade-off between performance and parameter overhead, consider
setting k ≈ 32 and kgate ≈ 8 for the full fine-tuned model fusion, and k ≈ 8 and kgate ≈ 2 for the LoRA fine-tuned
model fusion.
[Figure 9 panels omitted: heatmaps and line plots of the average accuracy and the normalized parameter count for k ∈ {4, 8, 16, 32, 64, 128, Dense} and kgate ∈ {1, 2, 4, 8, 16, 32}. (c) Heatmap of normalized parameter count. (d) Line plot of normalized parameter count.]
Figure 9: Hyperparameter analysis of the CLIP-ViT-L/14 model on eight image classification datasets. Here we show
how different values of hyperparameters k and kgate affect the average performance and the number of parameters
(normalized by the number of parameters in the original model) in the upscaled model. We also show the average
accuracy of pre-trained models and individual fine-tuned models in subfigure (b).
For the SMILE models, the hyperparameter settings were as follows: kgate was consistently set to 8 across all
experiments, while k ranged from 8 to 512 (8, 16, 32, 64, 128, 256, 384, and 512), as shown in Figure 7.
Table 11 provides a more comprehensive view of the performance of individual models and various SMILE models
with different k values. We use EleutherAI/lm-evaluation-harness [Gao et al., 2024] to evaluate the models on
four tasks: MMLU, TruthfulQA, GSM8K, and ARC Challenge. We merge the models in host memory and evaluate
them on two NVIDIA RTX 4090 GPUs with 24GB of memory each.
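For reference, a merged checkpoint can be scored with the harness's Python API roughly as follows. This is a minimal sketch: the checkpoint path and batch size are placeholders, and task identifiers may vary across lm-evaluation-harness versions, so check the documentation of your installed release.

```python
import lm_eval

# Evaluate a merged SMILE checkpoint (path is a placeholder) on the four benchmarks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/merged-smile-mistral,dtype=bfloat16",
    tasks=["mmlu", "truthfulqa_mc2", "gsm8k", "arc_challenge"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```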
Notably, as the value of k increases, we generally see improved performance, especially on tasks like GSM8K
and TruthfulQA. The results also show a clear trade-off between model size and performance. The SMILE model with
k = 8 (7.3B parameters) already achieves results comparable to or better than those of the pre-trained model on all tasks, while
larger models (k = 512, 11.2B parameters) approach the performance of the individual expert models.
Limitations and future work for LLMs. Here we provide a brief discussion of the limitations of our experiments and
potential directions for future work.
• Limited expert model pool. In the experiments on CLIP models and Flan-T5 models, we use eight expert
models to evaluate the performance of the SMILE model. However, the Mistral-7B experiments are currently
limited to three expert models, which may not fully demonstrate the method's capabilities with a larger, more
diverse set of experts. Future work could explore the impact of additional expert models on the performance of
the SMILE model.
Table 11: Detailed performance comparison of individual models and various SMILE models with different k values.
For all upscaled models, the kgate value was set to 8.
Model MMLU TruthfulQA GSM8K ARC Challenge
M0 (pre-trained) 59.64 42.62 38.81 53.92
M1 60.56 44.79 71.49 51.02
M2 60.56 55.88 56.93 57.00
M3 61.18 47.47 48.98 57.68
M0;123 (7.3B, k = 8) 60.28 46.31 46.55 55.55
M0;123 (7.5B, k = 32) 60.37 49.49 55.04 54.52
M0;123 (8.3B, k = 128) 60.43 50.91 63.76 54.35
M0;123 (9.3B, k = 256) 60.53 51.83 65.58 54.01
M0;123 (11.2B, k = 512) 60.66 52.79 67.85 54.35
• LoRA fine-tuning. In the experiments, we use full fine-tuned Mistral-7B models as expert models, where the
linear layers are upscaled into MoE modules and the remaining parts of the model are copied directly from the
pre-trained model. The reason for this is that the top-performing Mistral-7B models available on HuggingFace
are currently fully fine-tuned. This approach, however, may not fully exploit SMILE's potential. A more effective
strategy could involve using LoRA fine-tuned models as expert models: only specific linear layers would be
fine-tuned using low-rank techniques, with the rest of the model remaining frozen. This could potentially
enhance SMILE's efficiency and effectiveness. As we have shown in the Flan-T5 experiments, LoRA fine-tuning
can significantly reduce the number of additional parameters required to achieve comparable performance; an
illustrative sketch of how LoRA experts could be upscaled is given below.
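To illustrate the direction described in the last bullet, the toy code below shows one way LoRA experts might be upscaled without data or training: each LoRA update ∆W = B A is factorized with a truncated SVD to obtain rank-k expert factors and a rank-kgate input subspace used for routing. This is our own simplified rendering under stated assumptions (in particular, scoring experts by the norm of the input's projection onto their dominant input subspace), not the exact SMILE construction; all function names are hypothetical.

```python
import torch

def lora_to_expert(A: torch.Tensor, B: torch.Tensor, k: int, k_gate: int):
    """Turn a LoRA update dW = B @ A (A: r x n_in, B: n_out x r) into
    rank-k expert factors plus a rank-k_gate routing basis via truncated SVD."""
    dW = B @ A                                  # low-rank by construction, so the SVD is cheap
    U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
    down = Vh[:k, :]                            # (k, n_in): project input onto top-k input directions
    up = U[:, :k] * S[:k]                       # (n_out, k): scaled map back to the output space
    gate = Vh[:k_gate, :]                       # (k_gate, n_in): dominant input subspace for routing
    return down, up, gate

def sparse_moe_linear(x, W0, experts, top_k=1):
    """x: (n_in,), W0: frozen pre-trained weight, experts: list of (down, up, gate)."""
    scores = torch.stack([torch.linalg.vector_norm(g @ x) for _, _, g in experts])
    weights = torch.softmax(scores, dim=0)      # real routers typically renormalize over the selected experts
    selected = torch.topk(weights, top_k).indices
    y = W0 @ x                                  # shared pre-trained path
    for i in selected:
        down, up, _ = experts[i]
        y = y + weights[i] * (up @ (down @ x))  # add the chosen low-rank expert's contribution
    return y
```

Because ∆W already has rank at most rLoRA, any k ≥ rLoRA captures the update exactly, which is consistent with the small parameter overhead observed in the LoRA Flan-T5 experiments.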