
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models

A Preprint

Anke Tang1, Li Shen2, Yong Luo1, Shuai Xie3, Han Hu4, Lefei Zhang1, Bo Du1, Dacheng Tao5
1 Wuhan University, 2 Sun Yat-sen University, 3 JD Explore Academy, 4 Beijing Institute of Technology, 5 Nanyang Technological University
1 {anketang,luoyong,zhanglefei,dubo}@whu.edu.cn, 2 mathshenli@gmail.com, 3 xieshuai@jd.com, 4 hhu@bit.edu.cn, 5 dacheng.tao@ntu.edu.sg

arXiv:2408.10174v2 [cs.LG] 26 Aug 2024

Abstract
Deep model training on extensive datasets is increasingly becoming cost-prohibitive, prompting
the widespread adoption of deep model fusion techniques to leverage knowledge from pre-existing
models. From simple weight averaging to more sophisticated methods like AdaMerging, model fusion
effectively improves model performance and accelerates the development of new models. However,
potential interference between parameters of individual models and the lack of interpretability in the
fusion process remain significant challenges. Existing methods often try to resolve the parameter
interference issue by evaluating attributes of parameters, such as their magnitude or sign, or by
parameter pruning. In this study, we begin by examining the fine-tuning of linear layers through the
lens of subspace analysis and explicitly define parameter interference as an optimization problem
to shed light on this subject. Subsequently, we introduce an innovative approach to model fusion
called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which allows for the
upscaling of source models into an MoE model without extra data or further training. Our approach
relies on the observation that fine-tuning mostly keeps the important parts from the pre-training, but it
uses less significant or unused areas to adapt to new tasks. Also, the issue of parameter interference,
which is intrinsically intractable in the original parameter space, can be managed by expanding the
dimensions. We conduct extensive experiments across diverse scenarios, such as image classification
and text generation tasks, using full fine-tuning and LoRA fine-tuning, and we apply our method to models ranging from CLIP and Flan-T5 to Mistral-7B large language models, highlighting the adaptability and scalability of SMILE. For full fine-tuned models, about 50% additional parameters achieve around 98-99% of the performance of eight individual fine-tuned ViT models, while LoRA fine-tuned Flan-T5 models maintain 99% of the individual performance with only 2% extra parameters.
Code is available at https://github.com/tanganke/fusion_bench.

Keywords Mixture of Experts · Model Fusion · Subspace Decomposition · Large Language Model

1 Introduction
In recent years, the field of deep learning has witnessed an exponential growth in model sizes and dataset scales,
making the training of large-scale deep models on extensive datasets increasingly cost-prohibitive, both in terms of
financial resources and environmental impact [Minaee et al., 2024, Hadi et al., 2023]. Deep model fusion techniques
have emerged as a promising solution, allowing the integration of knowledge from pre-existing models without the
need for extensive retraining [Li et al., 2023, Zheng et al., 2023, Yang et al., 2024a]. This approach not only reduces
computational costs but also enables the creation of more robust and versatile models by combining the strengths of
multiple models.
Following the categorization in Tang et al. [2024a], we classify model fusion methods into three main categories: model
ensemble methods, model merging methods, and model mixing methods. Model ensemble techniques aggregate the
predictions from several models to enhance performance [Sagi and Rokach, 2018]. While resource-intensive in terms of memory and computation, ensembling can also improve knowledge distillation training [Wan et al., 2024a,b]. Model merging methods,
on the other hand, combine the parameters of multiple models into a single model, often through weighted averaging
or parameter alignment [Matena and Raffel, 2022, Jin et al., 2022]. Model mixing methods involve the integration of
multiple models through gating mechanisms or depth concatenation, allowing for more flexible and adaptive fusion
strategies [Komatsuzaki et al., 2022, Kim et al., 2023]. These methods are particularly effective in multi-task learning
scenarios, where the merged model can simultaneously perform multiple tasks.
However, despite the promising advancements in model fusion, several critical challenges persist, hindering the full
realization of its potential. A primary concern is the potential interference between parameters of different models,
which leads to suboptimal performance. Additionally, the lack of interpretability in the fusion process remains a
significant hurdle, as current insights are largely confined to heuristic observations or simplified assumptions, such
as linear mode connectivity, parameter signs or importance [Ainsworth et al., 2022, Stoica et al., 2023, Yadav et al.,
2023, Yu et al., 2024]. Understanding how parameters are merged is crucial for building trust in the merged models and
for further improving fusion techniques. These challenges are particularly pronounced in complex, high-dimensional,
non-linear model architectures, where the interactions between parameters can be extremely intricate and non-intuitive.
Instead of relying on heuristic methods or simplified assumptions, we propose a novel subspace perspective on understanding and addressing the parameter interference problem in this study. We first examine the fine-tuning process in linear layers through the lens of subspace analysis using matrix decomposition in Section 2. This allows us to decompose the prediction of a fine-tuned model into distinct components, encompassing the pre-trained knowledge and task-specific adaptation. This approach provides insights into how models adapt to downstream tasks while preserving pre-trained knowledge. Drawing on these experimental observations to build a more comprehensive understanding of fine-tuning, we further formulate parameter interference as an optimization problem in Section 3, providing a more rigorous and measurable perspective.

Based on our insights, we introduce an innovative approach called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, enhancing existing source models into a more versatile MoE model. The zero-shot aspect of our approach is particularly noteworthy, as it facilitates the immediate deployment of fused models in new environments or tasks, drastically minimizing the time and resources typically required for model adaptation.

Figure 1: Multi-task model fusion experiment on eight image classification tasks using CLIP-ViT-B/32 models. Here we set kgate = 16 and k is varied from 4 to 128 to investigate the trade-off between performance and model size (average accuracy versus normalized number of parameters, compared with Pretrained, Simple Average, Task Arithmetic, Ties-Merging, Fisher Merging, RegMean, Layer-Wise AdaMerging, Weight-Ensembling MoE, and Individuals).
The effectiveness of our proposed method is rooted in two key observations derived from our subspace analysis. Firstly, we found that fine-tuning largely preserves the most important
pre-trained weights and primarily utilizes less significant or previously unused dimensions of the parameter space to
adapt to new tasks. This preservation ensures that the critical pre-training knowledge encoded in the original models is
not lost during fine-tuning and implies that the parameter subspace required to accommodate new knowledge may vary
from task to task. Secondly, we found that while parameter interference is inherently difficult to address in the original
parameter space, it becomes more manageable when we increase the model’s dimensionality. This expansion creates
additional ‘room’ for task-specific parameter updates to coexist without mutual interference.
We conducted extensive experiments across various tasks and models in both the vision and language domains, utilizing
traditional full fine-tuning as well as Low-Rank Adaptation (LoRA) [Hu et al., 2021]. The results show that for models
that undergo full fine-tuning, adding approximately 50% more parameters allows us to achieve around 98-99% of
the performance of eight individual fine-tuned models. In the case of LoRA fine-tuned models, maintaining 99% of
the individual performance requires only a 2% increase in parameters. This method also offers trade-offs between
performance and model size, as illustrated in Figure 1, where we vary the rank k of local experts.
To summarize, our contributions in this study are as follows:

• We provide a novel subspace perspective on the fine-tuning process, shedding light on how models adapt
to new tasks while preserving pre-trained knowledge. In addition, we formulate the parameter interference
problem as an optimization problem, providing a more rigorous and measurable perspective on this issue.
• We introduce a zero-shot Sparse Mixture of Low-Rank Experts (SMILE) construction approach, enabling the
fusion of existing models into unified, more versatile SMILE models. We also discuss the complexity of our
method, highlighting its potential for broader applications in deep learning research and practice.
• We demonstrate the effectiveness of our method through extensive experiments on a variety of tasks and setups,
showcasing its superior performance and efficiency compared to existing model fusion techniques.


2 Rethinking Model Fine-Tuning From a Subspace Perspective


In this study, we aim to construct a unified versatile model from multiple fine-tuned models, which can perform
multiple tasks or handle inputs from multiple domains simultaneously. We denote the number of fine-tuned models as T .
Before we delve into the proposed method’s details, we gain insights into the fine-tuning process from a singular value
decomposition (SVD) subspace perspective. In this section, we aim to (1) investigate and locate the task information in
the fine-tuned weights Wf t , and (2) understand how it is related to the pre-trained weights W .
Consider a linear layer of the pre-trained model with weight matrix W ∈ Rm×n and bias vector b ∈ Rm . After full
fine-tuning on a downstream task, the weight matrix and bias vector are updated to Wf t and bf t , respectively. To
achieve a deeper understanding of these updates, we need to employ mathematical tools that allow us to decompose the
parameter space into distinct ranges of importance, i.e. subspaces. We state the following theorem.
Theorem 1 Given two sets of orthonormal vectors $\{u_i\}_{i=1}^{p} \subset \mathbb{R}^m$ and $\{v_i\}_{i=1}^{q} \subset \mathbb{R}^n$, with $1 \le p \le m$ and $1 \le q \le n$, the set of matrices $\{u_i v_j^\top\}_{i=1,j=1}^{p,q}$ forms an orthonormal basis for a subspace of $\mathbb{R}^{m\times n}$ with dimension $pq$.

Proof 1 For simplicity, let $x_{ij} = u_i v_j^\top$. The Frobenius inner product of two matrices $x_{ab}$ and $x_{cd}$ is defined as
$$\langle x_{ab}, x_{cd} \rangle = \mathrm{tr}\big( u_a v_b^\top (u_c v_d^\top)^\top \big) = \mathrm{tr}\big( u_a v_b^\top v_d u_c^\top \big) \in \mathbb{R}. \quad (1)$$

Orthonormality: we consider three cases:

1. If $a = c$ and $b = d$, then $\langle x_{ab}, x_{cd} \rangle = \mathrm{tr}(u_a u_a^\top) = u_a^\top u_a = 1$.

2. If $b \ne d$, then $v_b^\top v_d = 0$ and $\langle x_{ab}, x_{cd} \rangle = 0$.

3. If $b = d$ and $a \ne c$, then $v_b^\top v_d = 1$ and $\langle x_{ab}, x_{cd} \rangle = \mathrm{tr}(u_a u_c^\top) = u_a^\top u_c = 0$.

Thus, $\{x_{ij}\}_{i,j=1}^{p,q}$ is orthonormal.

Linear independence: assume there exists a nonzero matrix $\alpha \in \mathbb{R}^{p\times q}$ such that $\sum_{i=1}^{p} \sum_{j=1}^{q} \alpha_{ij} x_{ij} = 0$. For any $a \in [p]$ and $b \in [q]$, take the inner product of both sides with $x_{ab}$. We obtain the following:
$$\Big\langle \sum_{i=1}^{p} \sum_{j=1}^{q} \alpha_{ij} x_{ij},\ x_{ab} \Big\rangle = \langle 0, x_{ab} \rangle = 0. \quad (2)$$

By the linearity of the inner product and the orthonormality shown above, we have
$$\sum_{i=1}^{p} \sum_{j=1}^{q} \alpha_{ij} \langle x_{ij}, x_{ab} \rangle = \alpha_{ab} \langle x_{ab}, x_{ab} \rangle = \alpha_{ab} = 0. \quad (3)$$

Since this holds for any $a$ and $b$, we conclude that all $\alpha_{ij} = 0$, which contradicts the assumption that $\alpha$ is nonzero. Therefore, the set $\{x_{ij}\}_{i,j=1}^{p,q}$ is linearly independent, which is the necessary and sufficient condition for these $pq$ elements to form a basis of a vector space with dimension $pq$.
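To make Theorem 1 concrete, here is a minimal numerical check, not part of the paper's released code, that outer products of orthonormal vectors are orthonormal under the Frobenius inner product; the dimensions and the use of PyTorch are arbitrary choices for illustration.

```python
# Minimal sketch: verify that {u_i v_j^T} is orthonormal w.r.t. the Frobenius inner product.
import torch

torch.manual_seed(0)
m, n, p, q = 6, 5, 3, 2

# Orthonormal columns obtained via QR decomposition.
U = torch.linalg.qr(torch.randn(m, p)).Q   # columns play the role of {u_i}_{i=1}^p
V = torch.linalg.qr(torch.randn(n, q)).Q   # columns play the role of {v_j}_{j=1}^q

# Stack all rank-one matrices u_i v_j^T as rows of a (p*q, m*n) matrix.
basis = torch.stack([torch.outer(U[:, i], V[:, j]).flatten()
                     for i in range(p) for j in range(q)])

# The Gram matrix of Frobenius inner products should be the identity,
# confirming orthonormality (and hence linear independence).
gram = basis @ basis.T
assert torch.allclose(gram, torch.eye(p * q), atol=1e-6)
print("max deviation from identity:", (gram - torch.eye(p * q)).abs().max().item())
```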

We start by decomposing the weight matrix $W$ using the reduced SVD as $W = U_r \Sigma_r V_r^\top$, where $U_r \in \mathbb{R}^{m\times r}$ and $V_r \in \mathbb{R}^{n\times r}$ have orthonormal columns containing the left and right singular vectors, respectively, $\Sigma_r \in \mathbb{R}^{r\times r}$ is a diagonal matrix containing the singular values sorted in descending order, and $r$ is the rank of the matrix $W$ [Olver and Shakiban, 2018]. In the case of full SVD, the matrices are $U \in \mathbb{R}^{m\times m}$, $\Sigma \in \mathbb{R}^{m\times n}$, and $V \in \mathbb{R}^{n\times n}$, which preserve all information about the matrix $W$, including its kernel (null space) and cokernel (left null space), as shown in Figure 2a.
Remark 1 According to Theorem 1, the set of matrices $\{u_i v_j^\top \mid i \in [m], j \in [n]\}$ forms an orthonormal basis for a subspace of $\mathbb{R}^{m\times n}$ with dimension $mn$. In other words, for any real matrix $A \in \mathbb{R}^{m\times n}$, we can express it as a weighted sum of the elements of the basis, i.e., $A = \sum_{i=1}^{m}\sum_{j=1}^{n} \langle A, u_i v_j^\top \rangle\, u_i v_j^\top \in \mathrm{span}(\{u_i v_j^\top\}_{i,j}^{m,n})$.

Remark 2 Let $\mathcal{U}$ and $\mathcal{V}$ be two subsets of $\{u_i\}_{i=1}^{m}$ and $\{v_i\}_{i=1}^{n}$, respectively. Then $\{u v^\top \mid u \in \mathcal{U}, v \in \mathcal{V}\}$ forms an orthonormal basis for a subspace of $\mathbb{R}^{m\times n}$ with dimension $|\mathcal{U}||\mathcal{V}| \le mn$.

Figure 2: Here we show the SVD decomposition and subspace partition of the singular value matrix Σ (a: full SVD and partition into zones I, II, and III, with kernel and cokernel), and the accuracy comparison of different subspace projection strategies discussed in Section 2 (b: Pretrained, Finetuned, Space I, Space II, and Space II & III on SUN397, Cars, RESISC45, EuroSAT, SVHN, GTSRB, MNIST, and DTD).

To gain insights into how fine-tuning modifies the pre-trained weights to adapt them to a specific task, we assume the fine-tuned linear layer accepts an input $x \in \mathbb{R}^n$ and outputs $y = W_{ft} x + b_{ft}$. Because the right singular vectors $\{v_i\}_{i=1}^{n}$ form an orthonormal basis for $\mathbb{R}^n$, we can decompose $x$ as $x = \sum_{i=1}^{n} \langle x, v_i \rangle v_i$, where $\langle\cdot,\cdot\rangle$ denotes the vector inner product. On the other hand, $W_{ft}$ and $b_{ft}$ are updated from the pre-trained weights $W$ and $b$. We can derive the following equation:
$$y = W_{ft} x + b_{ft} = (W + \Delta W)x + b + \Delta b = \underbrace{W x + b}_{\text{pre-trained part}} + \underbrace{\Delta W x + \Delta b}_{\text{fine-tuned part}}. \quad (4)$$

Now we expand the pre-trained part and the fine-tuned part in Eq. (4) separately as follows:
$$\text{pre-trained part} = W \sum_{i=1}^{n} \langle x, v_i \rangle v_i + b = \Big( \sum_{j=1}^{r} \sigma_j u_j v_j^\top \Big) \sum_{i=1}^{n} \langle x, v_i \rangle v_i + b \quad (5)$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{r} \sigma_j \langle x, v_i \rangle\, u_j v_j^\top v_i + b = \sum_{j=1}^{r} \sigma_j \langle x, v_j \rangle u_j + b, \quad (6)$$
$$\text{fine-tuned part} = (W_{ft} - W) \sum_{i=1}^{n} \langle x, v_i \rangle v_i + \Delta b = \Big( \sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk} u_j v_k^\top \Big) \sum_{i=1}^{n} \langle x, v_i \rangle v_i + \Delta b \quad (7)$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk} \langle x, v_i \rangle\, u_j v_k^\top v_i + \Delta b = \sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk} \langle x, v_k \rangle u_j + \Delta b. \quad (8)$$

where $\delta_{jk} = \langle \Delta W, u_j v_k^\top \rangle = (U^\top \Delta W V)_{jk}$ is the Frobenius inner product between the fine-tuned weight update $\Delta W$ and the rank-one matrix $u_j v_k^\top$. It quantifies how much the weight update aligns with the direction specified by $u_j v_k^\top$ and indicates, based on its sign and magnitude, which input-output transformations are enhanced, suppressed, or reversed during fine-tuning. For example, a large positive $\delta_{jk}$ suggests that the connection between the $k$-th input direction ($v_k$) and the $j$-th output direction ($u_j$) is strengthened for the downstream task. This decomposition shows how differently the pre-trained and fine-tuned parts contribute to the output. So far, we only understand that the fine-tuned update $\Delta W$ potentially uses all $mn$ dimensions, while the pre-trained part only uses $r$ dimensions.
We further split the left/right singular vectors into three distinct subsets based on the distribution of the singular values of $W$, and design an ablation study corresponding to different zones in the projection coefficient matrix $U^\top \Delta W V$: (1) The top-left zone I contains the most significant singular values that cumulatively account for 50% of the total sum of the singular values; we denote the number of singular values in this zone as $r_{half}$, i.e., $\sum_{i=1}^{r_{half}} \sigma_i \approx \frac{1}{2}\sum_{i=1}^{r} \sigma_i$. This zone is crucial for preserving pre-training information. (2) The middle zone II encompasses the singular values that make up the remaining half of the cumulative sum. These values are still important but less so than those in zone I. (3) Zone III contains no information about the pre-trained weights, as its range is beyond $\mathrm{rank}(W)$. This zone partition is illustrated on $\Sigma$ in Figure 2a. We fine-tune the pre-trained CLIP-ViT-B/32 model on eight downstream tasks from different domains, including hand-written digit images, satellite images, regular patterns, car images, and natural images. Then we project $\Delta W$ onto the different subspaces, including (1) subspace I $\mathrm{span}(\{u_i v_j^\top \mid i \in [r_{half}], j \in [r_{half}]\})$, (2) subspace II $\mathrm{span}(\{u_i v_j^\top \mid i \in [r_{half}, r], j \in [r_{half}, r]\})$, and (3) subspace II & III $\mathrm{span}(\{u_i v_j^\top \mid i \in [r_{half}, m], j \in [r_{half}, n]\})$. We compare the performance of the pre-trained model, the fine-tuned models, and modified fine-tuned models with different subspace projection strategies applied on $\Delta W$ in Figure 2b. For more details, please refer to Appendix A.
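For illustration, the following sketch shows one way such a projection-merge probe could be implemented; it follows the zone definitions above and Eq. (21) in Appendix A, but the helper names and the random stand-in matrices are our own assumptions, not the paper's released implementation.

```python
# Sketch of the projection-merge probe: add to W only the part of ∆W that lies
# in a chosen subspace of the pre-trained singular directions.
import torch

def split_rank_half(S: torch.Tensor) -> int:
    """Smallest r_half such that the top r_half singular values cover ~half of the total sum."""
    cum = torch.cumsum(S, dim=0)
    return int((cum < 0.5 * S.sum()).sum().item()) + 1

def project_update(W: torch.Tensor, W_ft: torch.Tensor, zone: str) -> torch.Tensor:
    """Return W plus the component of ∆W = W_ft - W inside the chosen zone's subspace."""
    dW = W_ft - W
    U, S, Vh = torch.linalg.svd(W, full_matrices=True)   # full SVD of the pre-trained weight
    r = int(torch.linalg.matrix_rank(W))
    r_half = split_rank_half(S[:r])
    if zone == "I":            # span{u_i v_j^T : i, j <= r_half}
        Ui, Vi = U[:, :r_half], Vh[:r_half, :].T
    elif zone == "II":         # span{u_i v_j^T : r_half < i, j <= r}
        Ui, Vi = U[:, r_half:r], Vh[r_half:r, :].T
    elif zone == "II+III":     # everything beyond the top-r_half directions
        Ui, Vi = U[:, r_half:], Vh[r_half:, :].T
    else:
        raise ValueError(zone)
    return W + Ui @ Ui.T @ dW @ Vi @ Vi.T   # projection analogous to Eq. (21)

# Example usage with random stand-in matrices.
W, W_ft = torch.randn(16, 12), torch.randn(16, 12)
W_mod = project_update(W, W_ft, zone="II+III")
```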


Figure 2b demonstrates that projecting fine-tuned updates onto subspace I results in a slight improvement in performance
on downstream tasks compared to the pre-trained model, sometimes showing no improvement at all. Projection onto
subspace II leads to moderate improvement, while projection onto subspace II & III results in significant performance
gains, nearly reaching the level of the fine-tuned model. Based on these observations, we draw the following conclusions:

Fine-tuning largely maintains the most important pre-trained features, but leverages less significant dimensions
for task-specific learning and activates or repurposes previously unused dimensions in the weight space.

3 Parameter Interference Between Task-Specific Models


From the previous section, we build an understanding of how fine-tuning modifies the pre-trained weights to adapt to
a specific downstream task. In this section, we investigate the parameter interference between models fine-tuned on
different tasks, which has been widely explored in multi-task learning and multi-task model merging, primarily within
the model parameter space. We add superscripts to denote the task index, e.g., $W_{ft}^{(i)}$ and $b_{ft}^{(i)}$ for the $i$-th task.
Assume we have $T$ tasks, and each task has a fine-tuned model. In the simplest case, we consider linear layers accepting a common input $x$ and outputting $T$ different outputs $y^{(1)}, y^{(2)}, \ldots, y^{(T)}$. According to Eq. (4) and Eq. (8), each $y^{(i)}$ can be decomposed into a pre-trained part and a fine-tuned part as follows:
$$y^{(i)} = \underbrace{\sum_{j=1}^{r} \sigma_j \langle x, v_j \rangle u_j + b}_{\text{pre-trained part}} + \underbrace{\sum_{j=1}^{m} \sum_{k=1}^{n} \delta_{jk}^{(i)} \langle x, v_k \rangle u_j + \Delta b^{(i)}}_{\text{fine-tuned part}}. \quad (9)$$

where the pre-trained term is shared across all tasks and remains constant during fine-tuning. In the context of model merging, these models are merged to construct a unified multi-task model that can perform all tasks simultaneously. A common approach is to use a weighted average of the fine-tuned weights, i.e., $W_{merged} = \sum_{l=1}^{T} \lambda_l W_{ft}^{(l)}$ and $b_{merged} = \sum_{l=1}^{T} \lambda_l b_{ft}^{(l)}$. This is equivalent to merging the fine-tuned parts of the models, while the pre-trained parts are shared across all tasks. Therefore, we express the output of the merged model as:
$$y_{merged} = \text{pre-trained part} + \sum_{j=1}^{m} \sum_{k=1}^{n} \Big( \sum_{l=1}^{T} \lambda_l \delta_{jk}^{(l)} \Big) \langle x, v_k \rangle u_j + \sum_{l=1}^{T} \lambda_l \Delta b^{(l)}. \quad (10)$$

Substituting the input $x$ with $x^{(i)}$ from the $i$-th task (domain), we aim to minimize the discrepancy between the output of the merged model and the output of the $i$-th fine-tuned model. We formulate the optimization problem as follows:
$$\min_{\lambda_l} \big\| y_{merged} - y^{(i)} \big\|_2^2 = \min_{\lambda_l} \bigg\| \sum_{j=1}^{m} \sum_{k=1}^{n} \Big[ \Big( \sum_{l=1}^{T} \lambda_l \delta_{jk}^{(l)} \Big) - \delta_{jk}^{(i)} \Big] \langle x^{(i)}, v_k \rangle u_j + \Big( \sum_{l=1}^{T} \lambda_l \Delta b^{(l)} - \Delta b^{(i)} \Big) \bigg\|_2^2. \quad (11)$$
Using the triangle inequality, we can decompose the error into two parts and assert an upper bound:
$$\big\| y_{merged} - y^{(i)} \big\|_2^2 \le \underbrace{\bigg\| \sum_{j=1}^{m} \sum_{k=1}^{n} \Big[ \Big( \sum_{l=1}^{T} \lambda_l \delta_{jk}^{(l)} \Big) - \delta_{jk}^{(i)} \Big] \langle x^{(i)}, v_k \rangle u_j \bigg\|_2^2}_{\text{weight term}} + \underbrace{\bigg\| \sum_{l=1}^{T} \lambda_l \Delta b^{(l)} - \Delta b^{(i)} \bigg\|_2^2}_{\text{bias term}}. \quad (12)$$

For the bias term, we have a closed-form solution for λl = (∆B ⊤ ∆B)−1 ∆B∆b(i) , where ∆B is a matrix with
l-th columns as ∆b(l) . Notice that this solution varies with the task (domain) index of the coming input x(i) , so
a more straightforward solution is to fix the bias term during the fine-tuning process, so that ∆b(i) = 0 for all
i. As for the weight term, parameter interference does not occur on subspace that is orthogonal to the input, i.e.
span({ui vjT |i ∈ [m], j ∈ [n] and ⟨x(i) , vj ⟩ = 0}), thus a possible strategy is to enlarge the input size to increase the
dimension of the orthogonal subspace. This explains why model merging methods perform better on larger models
with more dimension redundancy. When the input activates certain dimensions, i.e. for k such that ⟨x(i) , vk ⟩ = ̸ 0,
the interference is inevitable unless the domain gap between different tasks is large enough to make the activation
dimensions disjoint. Note that we can gain the same conclusion within the original model parameter space, by simply
replacing the basis vectors {ui }i and {vi }i in this section with the standard Euclidean basis vectors {ei }i .
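As a concrete illustration of the bias-term solution, the sketch below solves the normal equations for randomly generated bias updates; the variable names are ours, and the example assumes the $\Delta b^{(l)}$ are linearly independent, in which case the recovered coefficients are exactly one-hot as discussed above.

```python
# Minimal sketch: closed-form merging coefficients for the bias term,
# λ = (ΔB^T ΔB)^{-1} ΔB^T Δb^{(i)}, where column l of ΔB is Δb^{(l)}.
import torch

m, T, i = 8, 4, 2                      # output dim, number of tasks, target task index
delta_B = torch.randn(m, T)            # stacked bias updates Δb^{(1..T)} as columns
target = delta_B[:, i]                 # Δb^{(i)}

lam = torch.linalg.solve(delta_B.T @ delta_B, delta_B.T @ target)
print(lam)   # ≈ one-hot at index i when the Δb^{(l)} are linearly independent
```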


Figure 3: The architecture of the proposed Sparse MIxture of Low-rank Experts (SMILE) model: the overall module (left), consisting of the shared pre-trained part and the low-rank experts, and the routing mechanism (right), which computes the norms of the projected input to obtain the routing weights.

4 Resolving Parameter Interference using Sparse Mixture of Low-Rank Experts


As established in the previous section, addressing parameter interference through model merging is difficult: even for the bias terms alone, the optimal combination has a closed-form solution that varies by task. To manage this challenge, we introduce an innovative approach with a Sparse MIxture of Low-rank Experts (SMILE) model in this section, which operates in a zero-shot fashion, meaning no data or training is required. An overview is shown in Figure 3.
We upscale the linear layers from source models to the SMILE model, which consists of a shared pre-trained part, a
router, and several low-rank experts. Figure 3 is organized into two primary sections: the overall model architecture
(left) and the routing mechanism (right).
Recalling the output decomposition in Eq. (4) and Eq. (10), we can express the output of a merged model as the output of the pre-trained model plus a weighted sum of the fine-tuned parts of the individual models. If we can identify the most relevant experts for a given input, we can dynamically select the corresponding fine-tuned parts to combine with the pre-trained part, so that the merging error in Eq. (11) is minimized. Mathematically, we can express this idea as:
$$y_{merged} = \text{pre-trained part} + \sum_{j=1}^{m} \sum_{k=1}^{n} \Big( \sum_{l=1}^{T} \lambda_l(x^{(i)})\, \delta_{jk}^{(l)} \Big) \langle x^{(i)}, v_k \rangle u_j + \sum_{l=1}^{T} \lambda_l(x^{(i)})\, \Delta b^{(l)}. \quad (13)$$

Here, $\lambda$ is a function that maps the input to a one-hot probability distribution over the tasks, i.e., $\lambda_j(x^{(i)}) = 1$ if $j = i$ and $\lambda_j(x^{(i)}) = 0$ otherwise. However, a naive implementation of this idea would require a training process to learn the
parameters of the router and a large number of additional parameters to store the fine-tuned weights of all tasks. A
more efficient approach is to remove less significant terms from the fine-tuned components in Eq.(13), focusing on
retaining the most pertinent knowledge for each task. Therefore, the parameter space must be ranked by the importance
of its dimensions. However, from previous findings in Section 2, we know that the fine-tuned information is distributed
across less significant dimensions (Space II & III), which form a large portion of the whole space. We opt to use SVD to decompose the parameter difference $\Delta W^{(i)}$ for each task, and then apply a low-rank approximation to extract its most important part as follows:
$$\Delta W^{(i)} = U^{(i)} \Sigma^{(i)} V^{(i)\top} = \sum_{j=1}^{r^{(i)}} \sigma_j^{(i)} u_j^{(i)} v_j^{(i)\top} \approx \sum_{j=1}^{k} \sigma_j^{(i)} u_j^{(i)} v_j^{(i)\top} = U_k^{(i)} \Sigma_k^{(i)} V_k^{(i)\top}, \quad 1 \le k \le r^{(i)}. \quad (14)$$
where $r^{(i)}$ is the rank of the weight update $\Delta W^{(i)}$, and $k$ is the rank of the low-rank approximation, which is determined as a hyperparameter. $U_k^{(i)}$ and $V_k^{(i)}$ contain the first $k$ columns of $U^{(i)}$ and $V^{(i)}$, respectively. Here we drop the terms with indices $j > k$ in the summation, which correspond to the less significant dimensions. Let $A^{(i)} = \Sigma_k^{(i)} V_k^{(i)\top}$ and $B^{(i)} = U_k^{(i)}$; we can then express the approximation similarly to a LoRA adapter: $\Delta W^{(i)} x \approx B^{(i)} A^{(i)} x$.
The following theorem states the optimality of this low-rank approximation.


Theorem 2 Given a matrix $W \in \mathbb{R}^{m\times n}$, its low-rank approximation $W_k = U_k \Sigma_k V_k^\top$ with rank $k$ minimizes the Frobenius norm of the difference between $W$ and $W_k$, i.e., $W_k = \arg\min_{\mathrm{rank}(W') = k} \| W - W' \|_F$.
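The construction of Eq. (14) can be sketched in a few lines of PyTorch. This is an illustrative re-implementation under our own naming, not the paper's released code, and it assumes a single fully fine-tuned linear layer.

```python
# Sketch: extract a rank-k expert from a task-specific update via Eq. (14),
# storing ΔW ≈ U_k Σ_k V_k^T as LoRA-style factors B = U_k and A = Σ_k V_k^T.
import torch

def extract_expert(W: torch.Tensor, W_ft: torch.Tensor, k: int):
    dW = W_ft - W
    U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
    B = U[:, :k]                         # (m, k)
    A = torch.diag(S[:k]) @ Vh[:k, :]    # (k, n)
    return A, B                          # ΔW x ≈ B @ (A @ x)

# Example usage with random stand-in matrices.
W, W_ft = torch.randn(32, 24), torch.randn(32, 24)
A, B = extract_expert(W, W_ft, k=8)
x = torch.randn(24)
approx = B @ (A @ x)                     # low-rank expert applied to an input
```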

Another key component of the SMILE model is the router, which determines the routing weights. The routing weights
should reflect the importance of each expert for a given input, and we hypothesize that the most important dimensions
of the parameter updates have a larger probability of aligning with the input vector. We provide a rationale for this
hypothesis by examining the gradient flow throughout fine-tuning in Appendix B. Therefore, we design the routing
logits as the L2 norm of the projections of the input onto the low-rank matrices. Mathematically, we can express the
routing weights as:
$$r^{(i)} = \bigg( \sum_{j=1}^{k_{gate}} \big\langle x, v_j^{(i)} \big\rangle^2 \bigg)^{1/2} = \big\| V_{k_{gate}}^{(i)\top} x \big\|_2. \quad (15)$$

Where kgate is the number of dimensions used for routing, which is a hyperparameter. kgate should not be excessively
large, which could diminish the distinctiveness of routing weights. In the extreme case where kgate = n, r(i) = ∥x∥2 ,
which is equivalent to a uniform distribution over all experts. In our hyperparameter analysis, we find that kgate = 4 or
8 is a good choice for most tasks. To summarize, the output of the SMILE module can be expressed as:
$$y = (Wx + b) + \sum_{i=1}^{T} \frac{\lambda_i}{\sum_{j=1}^{T} \lambda_j} \Big( U_k^{(i)} \Sigma_k^{(i)} V_k^{(i)\top} x + \Delta b^{(i)} \Big) \quad (16)$$
$$\lambda_i = \begin{cases} p_i, & p_i \in \mathrm{TopK}(\{p_j\}_{j=1}^{T}, K) \\ 0, & \text{otherwise} \end{cases} \quad (17)$$
$$p_i = \mathrm{softmax}_i\big(r^{(i)}\big) = \mathrm{softmax}_i\Big( \big\| V_{k_{gate}}^{(i)\top} x \big\|_2 \Big). \quad (18)$$
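Putting Eqs. (15)-(18) together, the forward pass of a single SMILE module might look like the following sketch; the function signature and data layout are our own assumptions, and sparsity is emulated simply by skipping the experts outside the top-K set.

```python
# Sketch of one SMILE module forward pass (Eqs. (15)-(18)), assuming the
# per-task factors have already been extracted as in the previous snippet.
import torch
import torch.nn.functional as F

def smile_forward(x, W, b, experts, V_gate, delta_b, top_k=1):
    """x: (n,); W: (m, n); b: (m,); experts: list of (A_i, B_i) with A_i (k, n), B_i (m, k);
    V_gate: list of (n, k_gate) right singular vectors of ΔW^{(i)}; delta_b: list of (m,)."""
    # Routing logits: L2 norm of x projected onto each expert's top-k_gate directions, Eq. (15).
    logits = torch.stack([torch.linalg.vector_norm(V.T @ x) for V in V_gate])
    p = F.softmax(logits, dim=0)             # Eq. (18)
    vals, idx = torch.topk(p, top_k)         # Eq. (17): keep only the top-K probabilities
    weights = vals / vals.sum()              # renormalize the kept routing weights

    y = W @ x + b                            # shared pre-trained part
    for w, i in zip(weights, idx.tolist()):  # sparse low-rank expert mixture, Eq. (16)
        A, B = experts[i]
        y = y + w * (B @ (A @ x) + delta_b[i])
    return y

# Toy example with random stand-ins.
m, n, T, k, k_gate = 16, 12, 3, 4, 2
W, b, x = torch.randn(m, n), torch.randn(m), torch.randn(n)
experts, V_gate, delta_b = [], [], []
for _ in range(T):
    dW = torch.randn(m, n)
    U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
    experts.append((torch.diag(S[:k]) @ Vh[:k, :], U[:, :k]))
    V_gate.append(Vh[:k_gate, :].T)
    delta_b.append(torch.zeros(m))
y = smile_forward(x, W, b, experts, V_gate, delta_b, top_k=1)
```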

Complexity Analysis: The linear layer has $m(n+1)$ parameters. The upscaled SMILE module has $m(n+1) + T(mk + nk + m) + nTk_{gate}$ parameters, so the additional parameters have a space complexity of $O(T(mk + n(k + k_{gate})))$. For every input token, an additional $nTk_{gate} + K(mk + nk + m)$ parameters are activated, with $K$ denoting the top-$K$ hyperparameter. For instance, with $T = 8$, $k_{gate} = 4$, $k = 32$, $K = 1$ and $m = n = 1024$, the SMILE model has 565K additional parameters, which is about 53.9% of the original parameter count, and 99K additional parameters are activated for each input token, which is only about 9.4% of the original parameter count.
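These example figures can be checked with a few lines of arithmetic; the short script below reproduces the counts under the stated setting (up to rounding, the percentages come out to roughly 53.9% and 9.5%, in line with the numbers quoted above).

```python
# Parameter-count check for T=8, k_gate=4, k=32, K=1, m=n=1024.
m, n, T, k, k_gate, K = 1024, 1024, 8, 32, 4, 1

original = m * (n + 1)                              # parameters of the source linear layer
extra = T * (m * k + n * k + m) + n * T * k_gate    # additional parameters after upscaling
active = n * T * k_gate + K * (m * k + n * k + m)   # parameters activated per input token

print(extra, f"{extra / original:.2%}")    # 565248, ~53.85%
print(active, f"{active / original:.2%}")  # 99328, ~9.46%
```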
Extending to Parameter-efficient fine-tuned (PEFT) models: It is straightforward to extend SMILE upscaling to PEFT models, such as LoRA fine-tuned models. The fine-tuning update is already low rank, $\Delta W_{LoRA} = B_{LoRA} A_{LoRA}$, and we can still decompose it using SVD as in Eq. (14). Note that for parameter-efficient fine-tuned models, $k_{gate}$ should be set to a value smaller than the LoRA rank $r_{LoRA}$, and $k \le r_{LoRA}$.
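For LoRA fine-tuned source models, the expert extraction can reuse the same SVD routine on the reconstructed update. The sketch below is our own helper (it ignores the LoRA scaling factor for brevity) and simply illustrates the case $k \le r_{LoRA}$ with a smaller $k_{gate}$, as noted above.

```python
# Sketch: build a SMILE expert directly from LoRA factors of one layer.
import torch

def extract_expert_from_lora(A_lora, B_lora, k, k_gate):
    """A_lora: (r_lora, n); B_lora: (m, r_lora). Returns expert factors (A, B)
    and routing directions V_gate of shape (n, k_gate)."""
    dW = B_lora @ A_lora                          # the LoRA update, already rank <= r_lora
    U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
    A = torch.diag(S[:k]) @ Vh[:k, :]
    B = U[:, :k]
    V_gate = Vh[:k_gate, :].T                     # top-k_gate right singular vectors, Eq. (15)
    return A, B, V_gate

# Example with r_lora = 16, k = 4, k_gate = 2 (the LoRA setting used in Table 5).
m, n, r_lora = 64, 48, 16
A, B, V_gate = extract_expert_from_lora(torch.randn(r_lora, n), torch.randn(m, r_lora), k=4, k_gate=2)
```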

5 Experiments

In this section, we evaluate the effectiveness of the proposed SMILE on a variety of setups, including image classification and text generation tasks, as well as full fine-tuning and LoRA fine-tuning. Detailed information about the fine-tuned models is in Appendix C. We compare our method with several SOTA model fusion techniques, including Simple Averaging [Wolf et al., 2019b], Fisher merging [Matena and Raffel, 2022], RegMean [Jin et al., 2022], Task Arithmetic [Ilharco et al., 2022], Ties-Merging [Yadav et al., 2023], AdaMerging [Yang et al., 2024c], and WEMoE [Tang et al., 2024c]. To further demonstrate the scalability of SMILE upscaling, we also conduct experiments using off-the-shelf large language models fine-tuned from Mistral-7B-v0.1. Our code implementation is based on FusionBench [Tang et al., 2024a].

Table 1: Requirements of different model fusion methods.
Method             Validation Set    Test-Time Adaptation
Weight Averaging
Fisher-Merging     ✓
RegMean            ✓
Task Arithmetic    ✓
Ties-Merging       ✓
AdaMerging                           ✓
WEMoE                                ✓
SMILE (Ours)

5.1 Multi-Task Model Fusion on Open-Vocabulary Image Classification Tasks


Table 2: Multi-task model fusion methods on eight image classification tasks using CLIP-ViT-B/32 models. Here we
show two different hyperparameter settings for our method: (1) kgate = 16, k = 32 and the normalized parameter
count is 1.61; (2) kgate = 16, k = 128 and the normalized parameter count is 3.07.
Method SUN397 Cars RESISC45 EuroSAT SVHN GTSRB MNIST DTD Avg.
Individual 75.0 78.3 95.2 99.0 97.3 98.9 99.6 79.7 90.3 (100%)
Simple Averaging 65.4 62.6 70.8 76.9 64.5 54.9 86.3 50.9 66.5 (73.6%)
Fisher Merging 66.7 64.0 72.2 91.6 69.0 64.3 83.5 53.7 70.6 (78.2%)
RegMean 67.8 68.9 82.5 94.4 90.6 79.2 97.6 63.2 80.5 (89.1%)
Task Arithmetic 57.1 55.7 64.9 76.7 77.9 68.5 96.1 47.2 68.0 (75.3%)
Ties-Merging 67.1 64.2 74.1 76.8 77.7 69.4 94.1 54.0 72.2 (80.0%)
AdaMerging 67.9 71.3 83.5 92.7 87.4 92.9 98.2 67.0 82.6 (91.5%)
WEMoE (× 6.27) 73.7 76.8 93.4 98.2 96.8 98.2 99.6 76.6 89.2 (98.8%)
SMILE (1, ×1.61) 73.6 74.4 89.5 98.1 95.4 97.3 99.5 76.3 87.7 (97.1%)
SMILE (2, ×3.07) 73.6 77.8 92.0 98.3 96.9 98.1 99.6 78.1 89.3 (98.9%)

Table 3: Multi-task model fusion methods on eight image classification tasks using CLIP-ViT-L/14 models. Here we
show two different hyperparameter settings for our method: (1) kgate = 16, k = 32 and the normalized parameter
count is 1.47; (2) kgate = 16, k = 128 and the normalized parameter count is 2.56.
Method SUN397 Cars RESISC45 EuroSAT SVHN GTSRB MNIST DTD Avg.
Individual 82.8 92.9 97.4 99.2 97.9 99.2 99.8 85.5 94.3 (100%)
Simple Averaging 72.5 81.5 82.2 90.0 81.6 74.0 96.6 61.8 80.0 (84.8%)
Fisher Merging 70.6 79.4 84.1 98.1 74.7 85.0 89.5 61.0 80.3 (85.2%)
RegMean 75.3 88.4 90.0 97.1 95.9 92.4 98.5 72.6 88.8 (94.2%)
Task Arithmetic 72.0 79.0 80.5 86.0 87.5 83.5 98.0 58.8 80.7 (85.6%)
Ties-Merging 74.7 83.3 86.4 91.3 89.7 85.2 97.8 63.9 84.0 (89.1%)
AdaMerging 78.1 90.7 90.8 96.5 94.8 97.5 98.6 81.3 91.0 (96.5%)
WEMoE (×6.40) 81.5 92.3 96.5 98.8 97.6 99.4 99.6 84.5 93.8 (99.5%)
SMILE (1, ×1.47) 79.9 91.0 94.3 99.0 97.9 98.6 99.7 82.2 92.8 (98.4%)
SMILE (2, ×2.56) 81.9 92.3 95.5 99.1 98.0 98.9 99.7 83.6 93.6 (99.3%)

We first evaluate our proposed SMILE method on eight diverse open-vocabulary image classification tasks using CLIP models from HuggingFace¹,². For each task, the text encoder of the pre-trained model is frozen, and only the vision encoder is fine-tuned. Table 1 presents the requirements of different model fusion methods, highlighting that SMILE is a training-free model fusion method that does not require additional labeled samples or test-time adaptation.

Figures 1 and 4 illustrate the average accuracy of the merged model across different methods; for SMILE, kgate is set to 16 and k is varied from 4 to 128. These two figures demonstrate the effectiveness of SMILE and the trade-off between performance and model size.

In Tables 2 and 3, we compare the performance of various model fusion methods on CLIP-ViT-B/32 and CLIP-ViT-L/14 models, respectively. Our SMILE method achieves competitive results across all tasks. For instance, with CLIP-ViT-L/14 models, SMILE (2: kgate = 16, k = 128) achieves 99.3% of the individual model performance while using only 2.56 times the parameters of a single model, compared to eight individual fine-tuned models and Weight-Ensembling MoE, which requires 6.40 times the parameters.

Figure 4: Multi-task model fusion experiment on eight image classification tasks using CLIP-ViT-L/14 models (kgate = 16); average accuracy versus normalized number of parameters, compared against the same baselines as in Figure 1.

1 https://huggingface.co/openai/clip-vit-base-patch32
2 https://huggingface.co/openai/clip-vit-large-patch14


Table 4: Multi-task performance when merging Flan-T5-base (full fine-tuned) models on all eight tasks. Here we show
two different hyperparameter settings for our method: (1) kgate = 4, k = 16 and the normalized parameter count is
1.26; (2) kgate = 8, k = 32 and the normalized parameter count is 1.52.
Method CoLA MNLI MRPC QNLI QQP RTE SST2 STSB Avg.
Individual 75.0 83.4 87.5 91.5 85.4 85.9 93.6 88.7 86.4 (100%)
Weight Averaging 69.1 62.6 79.4 89.8 83.9 81.2 91.7 73.2 78.9 (91.3%)
Task Arithmetic 70.5 57.8 78.4 90.2 83.6 80.5 92.3 77.8 78.9 (91.3%)
Ties-Merging 70.3 65.0 78.9 90.2 83.5 81.6 91.7 78.3 79.9 (92.5%)
SMILE (1, ×1.26) 72.0 84.2 84.3 91.3 84.7 84.1 93.3 87.0 85.1 (98.5%)
SMILE (2, ×1.52) 73.2 84.2 85.0 91.3 84.9 84.8 93.5 87.3 85.5 (99.0%)

Table 5: Multi-task performance when merging Flan-T5-base (LoRA fine-tuned) models on all eight tasks. We choose
kgate = 2, k = 4 and the normalized parameter count is 1.02.
Method CoLA MNLI MRPC QNLI QQP RTE SST2 STSB Avg.
Individual 69.1 82.7 85.5 90.9 84.0 84.4 92.9 87.4 84.6 (100%)
Weight Averaging 69.7 59.7 78.9 90.1 83.8 80.5 91.2 72.0 78.2 (92.4%)
Task Arithmetic 68.8 55.2 78.7 89.8 83.7 79.1 91.5 72.4 77.4 (91.5%)
Ties-Merging 68.3 56.3 79.4 89.8 83.7 79.4 91.6 71.2 77.5 (91.6%)
SMILE (×1.02) 69.3 82.9 83.8 90.6 83.9 83.4 93.1 85.1 84.0 (99.3%)

5.2 Multi-Task Model Fusion on Text Generation Tasks

We further evaluate SMILE on text generation tasks using Flan-T5-base models 3 , which are fine-tuned on eight tasks
from the GLUE benchmark [Wang et al., 2018]. We use two different fine-tuning strategies: full fine-tuning and LoRA
fine-tuning with rLoRA = 16. We present the results in Tables 4 and 5, for full fine-tuned models and LoRA fine-tuned
models, respectively. For fully fine-tuned models, SMILE consistently outperforms other fusion methods across all
eight tasks. SMILE (2: kgate = 8, k = 32) achieves 99.0% of the individual model performance with just 1.52 times the parameters of a single model. In the LoRA fine-tuned scenario,
SMILE maintains strong performance with minimal parameter increase (1.02 times). It achieves 99.3% of the individual
model performance, significantly surpassing other multi-task model fusion methods.

5.3 Sparse Mixture of Low-Rank Experts Analysis

To better understand SMILE, we further conduct ablation studies using CLIP and Flan-T5 models.
Ablations on the low-rank approximation rank k and routing dimension kgate (hyperparameter analysis). Our
hyperparameter analysis demonstrates the flexibility and robustness of SMILE across different model architectures and
tasks. Figure 5 illustrates the impact of hyperparameters k and kgate on performance and parameter count for Flan-
T5-Base models in both full and LoRA fine-tuned scenarios. For CLIP-ViT models, Figures 8 and 9 provide detailed
heatmaps and line plots showing the relationship between hyperparameters, average accuracy, and parameter count.
Across all models, we observe a consistent trend: increasing k and kgate generally leads to improved performance,
but with diminishing returns as parameter count grows. Notably, SMILE achieves near-optimal performance with
relatively small values of k and kgate . This analysis highlights the effectiveness of SMILE in balancing performance and
efficiency, allowing users to fine-tune the trade-off based on their specific requirements. The stability of performance
across a range of hyperparameter values also underscores the robustness of our method, making it adaptable to various
multi-task fusion scenarios. For more details, please refer to Appendix D.
Ablations on Top-K routing (routing analysis). Here we compare different values of K in the top-K routing mechanism. The plots in Figure 6 illustrate the impact of varying K on both the average accuracy across all tasks (Figure 6a) and the accuracy of each individual task (Figure 6b) when using the CLIP-ViT-B/32 model across eight image classification tasks. We observe that the average accuracy decreases slightly as K increases. This suggests that larger values of K, which allow more experts to be used for each input, are not necessary for multi-task model fusion where each expert is specialized for a specific task. In general, the performance of individual tasks is relatively stable across different values of K, indicating the robustness of the routing mechanism of SMILE (Equation (15)).

3 https://huggingface.co/google/flan-t5-base

Figure 5: Hyperparameter analysis of the Flan-T5-Base models on eight tasks from the GLUE benchmark. We show how different values of hyperparameters k and kgate affect the average performance and the normalized number of parameters in the upscaled model. Subfigures (a) and (b) show the results of the full fine-tuned models, while subfigures (c) and (d) show the results of the fine-tuned models with rLoRA = 16.

Figure 6: Ablations on the Top-K routing for CLIP-ViT-B/32 models on eight image classification tasks (kgate = 16, k = 32). Here we show the average accuracy and the accuracy on each task, and the y-axis is shared.

5.4 Scalability to Large-Scale Models (Mistral-7B models)

To demonstrate the scalability of SMILE to large-scale models, we conduct experiments on Mistral-7B models. We use Mistral-7B-v0.1 as our base pre-trained model, referred to as M0, and acquire three specialized models from HuggingFace [Wolf et al., 2019a]. The expert models are respectively labeled as M1, M2, and M3⁴. Of these models, M1 stands out as the sole expert in mathematics. To demonstrate the efficacy of the zero-shot routing mechanisms, we construct four distinct series of SMILE models with various expert combinations. These series are designated as M0;1, M0;2, M0;3, and M0;123, with M0;i1...in indicating the SMILE model that combines M0 with the expert models Mi1, ..., Min.

In Figure 7, we present the GSM8K benchmark scores of the upscaled Mistral-7B models with varying rank of local experts k and a constant rank of routers kgate = 8. This plot highlights the trade-offs in selecting the expert rank k for the upscaled SMILE model: the GSM8K score generally improves as k increases, but this improvement is more pronounced for specific expert combinations, particularly for M0;1 and M0;123. This suggests that the routers succeed in selecting the proper expert for math problems. In Table 6, we compare the performance of individual models and the upscaled model on various benchmark tasks. Notably, the individual expert models show strengths in specific benchmarks, such as M1 excelling in the GSM8K benchmark and M3 in the ARC Challenge. This indicates that each expert brings specialized knowledge which, when combined, enhances the overall performance on a diverse set of tasks. More details are in Appendix E.

Figure 7: The GSM8K benchmark scores of upscaled Mistral-7B models (M0;1, M0;2, M0;3, and M0;123) with varying rank of local experts k; the GSM8K score of M1 is shown for reference.

4 The expert models are meta-math/MetaMath-Mistral-7B, cognitivecomputations/dolphin-2.1-mistral-7b and uukuguy/speechless-code-mistral-7b-v1.0, respectively.

Table 6: Comparison of individual Mistral-7B models and the upscaled model on various benchmark tasks.
Model MMLU TruthfulQA GSM8K ARC Challenge
M0 (pre-trained) 59.64 42.62 38.81 53.92
M1 60.56 44.79 71.49 51.02
M2 60.56 55.88 56.93 57.00
M3 61.18 47.47 48.98 57.68
M0;123 (11.2B, kgate = 8, k = 512) 60.66 52.79 67.85 54.35
Qwen1.5-14B (reference) 66.11 52.00 69.37 49.93

6 Related Work
Mixture of Experts. The concept of Mixture of Experts (MoE) was first introduced by Jacobs et al. [1991], involving training multiple specialized models. It has gained significant attention in recent years [Jiang et al., 2024, Dai et al., 2024], with much of the innovation focusing on routing mechanisms and expert design, in particular the design of more efficient routers. For example, the Switch Transformer [Fedus et al., 2022b] selects
only the top expert for each token, simplifying the process and improving scalability. Similarly, [Lewis et al., 2021] use
a linear assignment to optimize token-expert affinities, ensuring an equal spread of tokens among experts. In [Ostapenko
et al., 2024], the authors propose to build a library of LoRA adapters using model-based clustering and build a MoE to
select the most relevant adapters based on input without retraining. For detailed reviews on MoE, see [Fedus et al.,
2022a], and for MoE in the context of model merging (MoEerging), refer to [Yadav et al., 2024].
Deep Model Fusion. Mode connectivity reveals that different model solutions can be linked by low-loss paths in the
parameter space [Freeman and Bruna, 2016, Nagarajan and Kolter, 2019, Draxler et al., 2018, Frankle et al., 2020,
Entezari et al., 2021, Garipov et al., 2018, Tatro et al., 2020, Yunis et al., 2022, Benton et al., 2021], facilitating model
fusion by weight interpolation [Izmailov et al., 2018, Matena and Raffel, 2022, Wolf et al., 2019b, Kaddour, 2022,
Ilharco et al., 2022, Yadav et al., 2023, Yang et al., 2024c, Wu et al., 2023]. However, this strategy also poses challenges,
particularly when merging models with diverse structures. Alignment helps reduce model disparities by matching and
interpolating components [Li et al., 2015, Tatro et al., 2020]. Methods involve matching activations or weights [Stoica
et al., 2023, Jin et al., 2022, Yang et al., 2024b], using channel-wise graph matching [Liu et al., 2022], or applying
permutation invariance [Ainsworth et al., 2022]. Another line of research is model mixing, which combines models
through gating mechanisms or depth concatenation [Tang et al., 2024c, Lu et al., 2024, Tang et al., 2024b, Kim et al.,
2023], allowing for more flexible and adaptive fusion strategies.

7 Conclusion, Limitations, and Future Work


In this paper, we introduced the Sparse Mixture of Low-Rank Experts (SMILE) model as a novel approach to model fusion. Our method leverages a zero-shot mechanism, eliminating the need for additional training data or processes,
which makes it highly practical. While the MoE method is designed to be efficient through sparse activation, it still
adds extra computational overhead, especially as the number of tasks or experts increases.
Understanding which subspaces contribute most to task-specific performance could lead to more targeted and efficient
fine-tuning strategies, potentially focusing on updating specific parts of the model while leaving others intact. Addition-
ally, this approach might be applied to other areas, like multi-modal large language models, where different types of
data (modalities) are treated as separate experts. Furthermore, it would be worth exploring how SMILE can manage
multi-objective optimization by adjusting the importance of different routing weights. Moreover, future work could develop methods to dynamically adjust the number of experts K per token based on the input, potentially improving efficiency without sacrificing performance.


References
Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation
symmetries. arXiv preprint arXiv:2209.04836, 2022.
Gregory Benton, Wesley Maddox, Sanae Lotfi, and Andrew Gordon Gordon Wilson. Loss surface simplexes for mode
connecting volumes and fast ensembling. In International Conference on Machine Learning, pages 769–779. PMLR,
2021.
Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the
art. Proceedings of the IEEE, 105(10):1865–1883, 2017.
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the
wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu,
Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv
preprint arXiv:2401.06066, 2024.
Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network
energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018.
Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode
connectivity of neural networks. arXiv preprint arXiv:2110.06296, 2021.
William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning. arXiv preprint
arXiv:2209.01667, 2022a.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple
and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022b.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the
lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint
arXiv:1611.01540, 2016.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding,
Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria
Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang,
and Andy Zou. A framework for few-shot language model evaluation, 07 2024.
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode
connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh,
Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. Large language models: a comprehensive survey of its applications,
challenges, limitations, and future prospects. Authorea Preprints, 2023.
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning
benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, 12(7):2217–2226, 2019.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi,
and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights
leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts.
Neural computation, 3(1):79–87, 1991.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural
networks. Advances in neural information processing systems, 31, 2018.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint
arXiv:2401.04088, 2024.
Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of
language models. arXiv preprint arXiv:2212.09849, 2022.


Jean Kaddour. Stop wasting my time! saving days of imagenet and bert training with latest weight averaging. arXiv
preprint arXiv:2209.14981, 2022.
Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim,
Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling.
arXiv preprint arXiv:2312.15166, 2023.
Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay,
Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv
preprint arXiv:2212.05055, 2022.
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In
Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of
large, sparse models. In International Conference on Machine Learning, pages 6265–6274. PMLR, 2021.
Weishi Li, Yong Peng, Miao Zhang, Liang Ding, Han Hu, and Li Shen. Deep model fusion: A survey, 2023.
Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural
networks learn the same representations? arXiv preprint arXiv:1511.07543, 2015.
Chang Liu, Chenfei Lou, Runzhong Wang, Alan Yuhan Xi, Li Shen, and Junchi Yan. Deep neural network fusion
via graph matching with applications to model ensemble and federated learning. In International Conference on
Machine Learning, pages 13857–13869. PMLR, 2022.
Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. Twin-merging: Dynamic integration
of modular expertise in model merging. arXiv preprint arXiv:2406.15479, 2024.
Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information
Processing Systems, 35:17703–17716, 2022.
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng
Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.
Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning.
Advances in Neural Information Processing Systems, 32, 2019.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural
images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning,
volume 2011, page 4. Granada, 2011.
Peter J. Olver and Chehrzad Shakiban. Applied Linear Algebra. Undergraduate Texts in Mathematics. Springer Interna-
tional Publishing, Cham, 2018. ISBN 978-3-319-91040-6 978-3-319-91041-3. doi: 10.1007/978-3-319-91041-3.
URL http://link.springer.com/10.1007/978-3-319-91041-3.
Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas
Caccia, and Alessandro Sordoni. Towards modular llms by building and reusing a library of loras. arXiv preprint
arXiv:2405.11157, 2024.
Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley interdisciplinary reviews: data mining and knowledge
discovery, 8(4):e1249, 2018.
Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine
learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
George Stoica, Daniel Bolya, Jakob Bjorner, Pratik Ramesh, Taylor Hearn, and Judy Hoffman. Zipit! merging models
from different tasks without training. arXiv preprint arXiv:2305.03053, 2023.
Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter efficient
multi-task model fusion with partial linearization. arXiv preprint arXiv:2310.04742, 2023.
Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep
model fusion. arXiv preprint arXiv:2406.03280, 2024a.
Anke Tang, Li Shen, Yong Luo, Shiwei Liu, Han Hu, and Bo Du. Towards efficient pareto set approximation via
mixture of experts based model fusion. arXiv preprint arXiv:2406.09770, 2024b.
Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, and Dacheng Tao. Merging multi-task models via weight-
ensembling mixture of experts. arXiv preprint arXiv:2402.00433, 2024c.


Norman Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, and Rongjie Lai. Optimizing mode connectivity
via neuron alignment. Advances in Neural Information Processing Systems, 33:15300–15311, 2020.
Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language
models. arXiv preprint arXiv:2401.10491, 2024a.
Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, and Wei Bi. Fusechat: Knowledge fusion of
chat models. arXiv preprint arXiv:2402.16107, 2024b.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A Multi-Task
Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium,
2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL http://aclweb.org/
anthology/W18-5446.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim
Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.
arXiv preprint arXiv:1910.03771, 2019a.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim
Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.
arXiv preprint arXiv:1910.03771, 2019b.
Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, and Ping Luo. π-tuning: Transferring
multimodal foundation models with optimal multi-task interpolation. In International Conference on Machine
Learning, pages 37713–37727. PMLR, 2023.
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene
recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition,
pages 3485–3492. IEEE, 2010.
Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Resolving interference when merging
models. arXiv preprint arXiv:2306.01708, 2023.
Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem
Choshen, and Alessandro Sordoni. A survey on model moerging: Recycling and routing among specialized experts
for collaborative learning. arXiv preprint arXiv:2408.07057, 2024.
Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging
in llms, mllms, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666,
2024a.
Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation
surgery for multi-task model merging. Forty-first International Conference on Machine Learning, 2024b.
Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging:
Adaptive model merging for multi-task learning. The Twelfth International Conference on Learning Representations,
2024c.
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities
from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024.
David Yunis, Kumar Kshitij Patel, Pedro Henrique Pamplona Savarese, Gal Vardi, Jonathan Frankle, Matthew Walter,
Karen Livescu, and Michael Maire. On convexity and linear mode connectivity in neural networks. In OPT 2022:
Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.
Hongling Zheng, Li Shen, Anke Tang, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Learn from model beyond
fine-tuning: A survey. arXiv preprint arXiv:2310.08184, 2023.


A Projecting Fine-tuned Updates onto Subspaces


This section provides an in-depth mathematical explanation of the projection merge experiments discussed in Section 2.
These experiments aim to gain empirical insights into the distribution of task-specific information across different
subspaces of the weight matrix after fine-tuning a pre-trained model on downstream tasks.
Let W ∈ Rm×n be the weight matrix of a linear layer in the pre-trained model, and Wf t ∈ Rm×n be the corresponding
weight matrix after fine-tuning. We define the weight update as ∆W = Wf t − W . We begin by performing a full
Singular Value Decomposition (SVD) on the pre-trained weight matrix W :
W = U ΣV ⊤ (19)
where U ∈ Rm×m and V ∈ Rn×n are orthonormal matrices containing left and right singular vectors, respectively, and
Σ ∈ Rm×n contains the singular values in descending order. The first r diagonal elements of Σ are non-zero, where
r = rank(W ), while the remaining elements are zero. According to Theorem 1, we can leverage the properties of
singular value decomposition (SVD) to gain a deeper understanding of the fine-tuning process. This theorem states that
any matrix A ∈ R^{m×n} can be decomposed into a sum of rank-one matrices using the left singular vectors {u_i}_{i=1}^{m} and right singular vectors {v_i}_{i=1}^{n} as bases:

A = ∑_{i=1}^{m} ∑_{j=1}^{n} α_{ij} u_i v_j^⊤ = U Σ_A V^⊤    (20)

where α_{ij} = ⟨A, u_i v_j^⊤⟩ is the projection of A onto the basis u_i v_j^⊤, and Σ_A is a real matrix with Σ_A(i, j) = α_{ij}. This
decomposition provides a powerful framework for analyzing the fine-tuning process. When we fine-tune a pre-trained
model, we can interpret the weight updates ∆W as modifications to the singular value matrix Σ, while the singular
vectors U and V remain constant. Then we partition the singular matrix Σ into three zones:
• Zone I & Subspace I: {1, . . . , r_half}, where r_half is chosen such that ∑_{i=1}^{r_half} σ_i ≈ (1/2) ∑_{i=1}^{r} σ_i. The basis of this subspace is {u_i v_j^⊤ | 1 ≤ i, j ≤ r_half}. The projection-merged weights in this subspace can be computed as follows:

  W_I = W + ∑_{i=1}^{r_half} ∑_{j=1}^{r_half} ⟨∆W, u_i v_j^⊤⟩ u_i v_j^⊤ = W + U_{r_half} U_{r_half}^⊤ ∆W V_{r_half} V_{r_half}^⊤,    (21)

  where U_{r_half} and V_{r_half} are the first r_half columns of U and V, respectively.
• Zone II & Subspace II: {rhalf + 1, . . . , r}, where r = rank(W ). The basis of this subspace is
{ui vi⊤ }ri=rhalf +1 . The basis of subspace II & III is {ui vj⊤ |rhalf + 1 ≤ i ≤ r, rhalf + 1 ≤ j ≤ r}.
The projection merged weights in this subspace can be computed as follows:
r
X r
X
WII = W + ⟨∆W, ui vj⊤ ⟩ui vj⊤ = W + Urhalf +1:r Ur⊤half +1:r ∆W Vrhalf +1:r Vr⊤half +1:r .
i=rhalf +1 j=rhalf +1
(22)
Where Urhalf +1:r and Vrhalf +1:r are the (rhalf + 1)-th to r-th columns of U and V , respectively.
• Zone III & Subspace II + III: The basis of this subspace is {ui vj⊤ |r + 1 ≤ i ≤ m, r + 1 ≤ j ≤ n} and the
projection merged weights in this subspace can be computed as follows:
m
X n
X
WII+III = W + ⟨∆W, ui vj⊤ ⟩ui vj⊤ = W + Ur+1:m Ur+1:m
⊤ ⊤
∆W Vr+1:n Vr+1:n . (23)
i=r+1 j=r+1

Where Ur+1:m and Vr+1:n are the (r + 1)-th to m-th columns of U and (r + 1)-th to n-th columns of V .

We then evaluate the performance of these modified weight matrices on the downstream tasks. The accuracy comparison in Figure 2b is obtained by using these modified weight matrices in place of the original pre-trained weights. For all layers other than the linear layers, we keep the pre-trained weights unchanged.
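For concreteness, the following PyTorch sketch implements the three projection-merged variants above for a single linear layer. It is a minimal illustration of Eqs. (21)-(23) rather than our released implementation; the numerical-rank tolerance and the names W, W_ft, and r_half are illustrative choices.

import torch

def projection_merge(W: torch.Tensor, W_ft: torch.Tensor, tol: float = 1e-6):
    """Project the fine-tuning update dW = W_ft - W onto the three subspaces
    spanned by the singular vectors of the pre-trained weight W (Eqs. 21-23)."""
    dW = W_ft - W
    U, S, Vh = torch.linalg.svd(W, full_matrices=True)   # W = U diag(S) Vh
    V = Vh.T
    r = int((S > tol).sum())                              # numerical rank of W

    # r_half: smallest index whose cumulative singular-value mass reaches 1/2
    cum = torch.cumsum(S[:r], dim=0)
    r_half = int((cum >= 0.5 * cum[-1]).nonzero()[0]) + 1

    def project(U_sub, V_sub):
        # U_sub U_sub^T dW V_sub V_sub^T, i.e. dW restricted to the chosen subspace
        return U_sub @ (U_sub.T @ dW @ V_sub) @ V_sub.T

    W_I = W + project(U[:, :r_half], V[:, :r_half])         # Eq. (21)
    W_II = W + project(U[:, r_half:r], V[:, r_half:r])      # Eq. (22)
    W_II_III = W + project(U[:, r:], V[:, r:])              # Eq. (23)
    return W_I, W_II, W_II_III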

B The Gradient Flow During Fine-tuning


In this section, we analyze the gradient flow during the fine-tuning process to gain insights into how linear layers in a
deep neural network adapt to new tasks. We decompose a fine-tuned deep neural network f into three components:


the pre-linear layers fpre , the linear layers flinear (x) = Wf t x + bf t , and the post-linear layers fpost . Therefore, the
output of the network can be expressed as f (x) = fpost (flinear (fpre (x))). Without loss of generality, the pre-linear
layers fpre can be dropped, as our focus is on the fine-tuning process of the linear layer flinear .
During fine-tuning, the model parameters are updated by minimizing a loss function L with respect to the model
weights. Using stochastic gradient descent (SGD) as the optimization algorithm, the weight update ∆W and bias update
∆b at each step can be expressed as:

W^{(t+1)} − W^{(t)} = −η ∇_W L(f_post(W^{(t)} x + b^{(t)})),   b^{(t+1)} − b^{(t)} = −η ∇_b L(f_post(W^{(t)} x + b^{(t)})).    (24)

where η is the learning rate, and ∇_W L and ∇_b L are the gradients of the loss function with respect to the weights and
biases, respectively. In Eq. (24), we omit the pre-linear layers fpre and use x as the input to the linear layer flinear for
simplicity. Let y = W x + b be the output of the linear layer, L′ = L ◦ fpost be the composed loss function. Starting
from the SGD update rule for the weights, we have:

W^{(t+1)} − W^{(t)} = −η ∇_W L′(W^{(t)} x + b^{(t)})    (25)
                    = −η ∇_y L′ · ∇_W (W^{(t)} x + b^{(t)})    (26)
                    = −η ∇_y L′ · x^⊤.    (27)
In practice, we typically use mini-batch SGD, where we average the gradients over a batch of samples. We can represent
this as an expectation:
W^{(t+1)} − W^{(t)} = −η E_{x∼p(x)}[∇_y L · x^⊤].    (28)
Given this gradient update rule, we can analyze how it relates to our choice of routing logits in SMILE. Recall that we
defined the routing logits as:
r^{(i)} = ‖V_{k_gate}^{(i)⊤} x‖_2    (29)
Let’s consider the expected weight update over many iterations when Wf t is close to W :
E[∆W] ≈ −η E_{x∼p(x)}[∇_y L · x^⊤]    (30)
      = −η E_{x∼p(x)}[g · x^⊤],    (31)
where g = ∇y L is the gradient of the loss with respect to the output of the linear layer. Now, let’s consider the singular
value decomposition (SVD) of the expected weight update:
E[∆W] = U Σ V^⊤    (32)
The right singular vectors V represent the directions in the input space that are most important for the weight updates.
Our routing logits r(i) are based on projecting the input onto the top kgate right singular vectors of the fine-tuned weight
difference ∆W^{(i)}. This choice is justified for two reasons: first, the right singular vectors of ∆W^{(i)} are likely to be similar to those of E[∆W], as both represent important directions for task-specific updates; second, by projecting onto these vectors, we measure how closely an input aligns with the directions that were most important during fine-tuning for a specific task.
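To make this connection concrete, the following sketch shows how such routing logits could be computed for a single linear layer: the top k_gate right singular vectors of each task-specific update ∆W^{(i)} are extracted once, and the logit for expert i is the Euclidean norm of the input's projection onto that basis, as in Eq. (29). This is a simplified illustration rather than our released implementation; names such as delta_ws and k_gate are illustrative.

import torch

def build_gate_bases(delta_ws, k_gate):
    """Return the top-k_gate right singular vectors V_kgate^(i) for each
    task-specific update dW^(i) (one (n, k_gate) matrix per expert)."""
    bases = []
    for dW in delta_ws:                       # dW: (m, n) = W_ft^(i) - W
        _, _, Vh = torch.linalg.svd(dW, full_matrices=False)
        bases.append(Vh[:k_gate].T)           # (n, k_gate)
    return bases

def routing_logits(x, bases):
    """r^(i) = || V_kgate^(i)^T x ||_2 for each expert i, cf. Eq. (29)."""
    return torch.stack([torch.linalg.norm(V.T @ x) for V in bases])

# Toy usage: two "experts" on a 16-dimensional input.
m, n, k_gate = 8, 16, 4
delta_ws = [torch.randn(m, n) for _ in range(2)]
bases = build_gate_bases(delta_ws, k_gate)
x = torch.randn(n)
print(routing_logits(x, bases))               # higher logit = better-aligned expert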
A NTK perspective. On the other hand, we can analyze the fine-tuning process from a neural tangent kernel (NTK)
perspective. Following [Tang et al., 2023], and according to the linear property of the linear layer, we have:
f_linear(x; ϕ_ft) − f_linear(x; ϕ)    (33)
  = ∇_ϕ f_linear(x; ϕ) (ϕ_ft − ϕ)    (34)
  ≈ −η E_{x′∼p(x)}[ ∇_ϕ f_linear(x; ϕ) ∇_ϕ L′(f_linear(x′; ϕ)) ]    (35)
  = −η E_{x′∼p(x)}[ ∇_ϕ f_linear(x; ϕ) ∇_ϕ f_linear(x′; ϕ)^⊤ ∇_{f_linear} L′(f_linear(x′; ϕ)) ]    (36)
  = −η E_{x′∼p(x)}[ K(x, x′; ϕ) ∇_{f_linear} L′(f_linear(x′; ϕ)) ],    (37)
where ϕ denotes the pre-trained parameters of f_linear, i.e., W and b, and ϕ_ft denotes the fine-tuned parameters W_ft and b_ft. K(x, x′; ϕ) = ⟨∇_ϕ f_linear(x; ϕ), ∇_ϕ f_linear(x′; ϕ)⟩ is the neural tangent kernel (NTK) [Jacot et al., 2018] of the linear layer f_linear, and L′ = L ◦ f_post is the composed loss function. Note that for a given x, K(x, x′; ϕ) is a constant matrix.
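For a single linear layer with ϕ = (W, b), this kernel has a simple closed form, K(x, x′; ϕ) = (x^⊤x′ + 1) I_m, so it depends on the inputs only through their inner product. The short PyTorch check below verifies this identity numerically; it is an illustrative sanity check and not part of the SMILE pipeline.

import torch

torch.manual_seed(0)
m, n = 3, 5
W = torch.randn(m, n, requires_grad=True)
b = torch.randn(m, requires_grad=True)

def param_jacobian(x):
    """Stack d f_linear(x)_i / d(W, b) into a (m, m*n + m) matrix."""
    y = W @ x + b
    rows = []
    for i in range(m):
        gW, gb = torch.autograd.grad(y[i], (W, b), retain_graph=True)
        rows.append(torch.cat([gW.reshape(-1), gb.reshape(-1)]))
    return torch.stack(rows)

x, xp = torch.randn(n), torch.randn(n)
K_autograd = param_jacobian(x) @ param_jacobian(xp).T   # <grad_phi f(x), grad_phi f(x')>
K_closed = (x @ xp + 1.0) * torch.eye(m)                # (x^T x' + 1) I_m
print(torch.allclose(K_autograd, K_closed, atol=1e-5))  # True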


Table 7: Performance of fine-tuned CLIP-ViT-B/32 models on eight downstream tasks.


Model SUN397 Cars RESISC45 EuroSAT SVHN GTSRB MNIST DTD
Pre-trained 63.2 59.8 60.7 46.0 31.6 32.5 48.3 43.9
SUN397 75.0 47.0 54.3 46.5 28.3 26.4 44.3 41.6
Cars 56.6 78.3 50.9 38.4 30.2 30.6 49.7 41.8
RESISC45 52.0 47.2 95.2 56.9 23.9 24.3 39.7 35.9
EuroSAT 49.0 39.9 33.5 99.0 11.8 22.9 33.8 35.5
SVHN 40.5 36.3 18.9 9.8 97.3 27.3 81.8 23.2
GTSRB 36.8 33.0 20.6 21.3 41.2 98.9 30.9 23.9
MNIST 50.3 40.0 31.3 17.7 50.1 19.3 99.6 30.7
DTD 54.6 51.3 36.9 25.0 28.9 21.8 47.3 79.7

Table 8: Performance of fine-tuned CLIP-ViT-L/14 models on eight downstream tasks.


Model SUN397 Cars RESISC45 EuroSAT SVHN GTSRB MNIST DTD
Pre-trained 68.3 77.8 71.0 58.9 58.4 50.6 76.4 55.5
SUN397 82.8 68.4 58.1 49.9 55.0 46.3 79.5 52.8
Cars 67.8 92.9 68.7 56.4 51.7 47.7 80.5 55.6
RESISC45 65.6 69.0 97.4 64.3 38.3 46.6 77.7 49.9
EuroSAT 65.2 69.0 40.6 99.2 33.4 45.6 73.5 47.1
SVHN 66.4 69.0 54.0 19.7 97.9 48.7 92.2 50.1
GTSRB 63.4 64.8 38.7 19.6 71.0 99.2 75.1 45.8
MNIST 56.0 49.8 53.5 26.6 48.2 33.1 99.8 47.1
DTD 66.8 75.3 65.5 43.7 49.5 45.0 68.5 85.5

Table 9: Performance of full fine-tuned Flan-T5-Base models on eight downstream tasks.


Model CoLA MNLI MRPC QNLI QQP RTE SST2 STSB
Pre-trained 69.1 56.5 76.2 88.4 82.1 80.1 91.2 62.2
CoLA 75.0 37.2 72.8 87.6 80.4 76.9 91.4 63.6
MNLI 65.9 83.4 75.7 89.2 82.6 78.0 90.6 66.2
MRPC 63.4 48.3 87.5 85.8 81.1 72.6 88.1 76.1
QNLI 68.7 39.2 75.5 91.5 81.3 78.3 91.6 68.2
QQP 59.1 50.4 73.8 88.3 85.4 81.2 90.8 75.9
RTE 65.4 51.1 69.6 88.7 80.8 85.9 90.3 68.9
SST2 67.8 54.0 76.5 87.8 83.4 80.5 93.6 63.6
STSB 69.3 49.3 76.5 89.0 81.7 77.6 90.1 88.7

Table 10: Performance of LoRA fine-tuned (rLoRA = 16) Flan-T5-Base models on eight downstream tasks.
Model CoLA MNLI MRPC QNLI QQP RTE SST2 STSB
Pre-trained 69.1 56.5 76.2 88.4 82.1 80.1 91.2 62.2
CoLA 69.1 39.9 75.2 89.1 81.1 81.9 90.7 54.0
MNLI 69.4 82.7 73.8 89.3 82.0 79.4 90.9 68.1
MRPC 64.0 44.9 85.5 82.6 81.0 69.0 88.6 73.6
QNLI 68.9 52.7 76.7 90.9 82.8 79.8 91.5 68.9
QQP 65.0 54.6 75.7 89.0 84.0 81.6 90.7 75.3
RTE 64.9 51.8 69.4 89.2 79.8 84.5 90.6 70.1
SST2 68.3 56.6 76.0 88.5 83.4 79.8 92.9 62.6
STSB 65.7 1.7 67.4 89.3 80.1 79.8 90.8 87.4


C Fine-Tuned Model Performance


In this section, we present the performance of the fine-tuned models on their corresponding test sets. These results
serve as a baseline for evaluating the effectiveness of our proposed model fusion technique.
Tables 7 and 8 show the performance of fine-tuned CLIP-ViT-B/32 and CLIP-ViT-L/14 models, respectively, on
eight image classification tasks. These tasks include SUN397 [Xiao et al., 2010], Cars [Krause et al., 2013], RE-
SISC45 [Cheng et al., 2017], EuroSAT [Helber et al., 2019], SVHN [Netzer et al., 2011], GTSRB [Stallkamp et al.,
2012], MNIST [LeCun et al., 1998], and DTD [Cimpoi et al., 2014]. For image classification tasks, we report the
classification accuracy. Tables 9 and 10 present the performance of Flan-T5-Base models on eight text generation tasks
from the GLUE benchmark [Wang et al., 2018], using full fine-tuning and LoRA fine-tuning (rLoRA = 16), respectively. We report Spearman’s ρ for STSB and exact match accuracy for the other tasks. In particular, for STSB, if the text output cannot be parsed into a valid floating-point number, we assign a score of zero. The datasets and fine-tuned models are accessible
to the public on HuggingFace [Wolf et al., 2019a]. For further information, please consult FusionBench [Tang et al.,
2024a].
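As an illustration of the STSB scoring protocol, the snippet below computes Spearman's ρ while assigning a score of zero to generated outputs that cannot be parsed as floating-point numbers. It is a simplified stand-in for our actual evaluation code, which may handle parsing differently.

from scipy.stats import spearmanr

def parse_score(text: str) -> float:
    """Parse a generated STSB score; unparsable outputs contribute 0.0."""
    try:
        return float(text.strip())
    except ValueError:
        return 0.0

def stsb_spearman(predictions: list[str], references: list[float]) -> float:
    scores = [parse_score(p) for p in predictions]
    return spearmanr(scores, references).correlation

# Toy usage with hypothetical model outputs.
print(stsb_spearman(["4.2", "1.0", "not a number"], [4.0, 1.5, 3.0]))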
Several key observations can be made from these results:

1. Task-Specific Improvements: Across all model types, fine-tuning consistently improves performance on the
target task compared to the pre-trained model. This demonstrates the effectiveness of task-specific adaptation.
2. Negative Transfer: In some cases, we observe negative transfer, where fine-tuning on one task harms perfor-
mance on another task. For example, in Table 7, the SVHN-tuned model performs worse on EuroSAT (9.8%)
compared to the pre-trained model (46.0%).
3. Task Relatedness: Some fine-tuned models show improved performance on related tasks. For example, in
Table 9, the MNLI-tuned model performs well on QNLI, suggesting a transfer of relevant knowledge between
these natural language inference tasks.
4. Varying Task Difficulty: The diagonal entries reveal that some tasks are inherently more challenging than
others. For instance, in the CLIP-ViT models, EuroSAT and MNIST consistently achieve very high accuracy,
while SUN397, Cars and DTD prove more challenging.
5. Model Size Impact: Comparing CLIP-ViT-B/32 and CLIP-ViT-L/14 results, we generally see improved
performance with the larger model, indicating that model capacity plays a role in both task-specific performance
and generalization.

D Hyperparameter Analysis
In this section, we present a comprehensive analysis of the hyperparameters k and kgate for the CLIP-ViT-B/32 and
CLIP-ViT-L/14 models across eight image classification datasets. We examine their impact on both model performance
(average accuracy) and model complexity (number of parameters). We also test the extreme case where k = ∞, which
corresponds to full-rank experts, denoted as “Dense” in the figures. We normalize the number of parameters by the
number of parameters in the original model (87.5M for CLIP-ViT-B/32 and 303M for CLIP-ViT-L/14) to facilitate
comparison.
CLIP Models. From Figures 8 and 9, we observe that: (1) The performance of the upscaled models is generally better than that of the pre-trained models, which demonstrates the effectiveness of our upscaling approach. (2) Increasing the value of k generally improves the performance of both the CLIP-ViT-B/32 and CLIP-ViT-L/14 models, though at the cost of increased model complexity. (3) Increasing kgate improves the performance of the upscaled models at first, but the performance starts to decrease when kgate is too large. This observation is consistent with our discussion in Section 4 that a larger kgate may result in a less discriminative gating mechanism. (4) Better performance preservation is achieved with the CLIP-ViT-L/14 model than with the CLIP-ViT-B/32 model, which is consistent with our discussion in Section 3 that the larger model has more dimensional redundancy and is less affected by the parameter interference problem.
In practice, when selecting hyperparameters for the upscaled models, it is crucial to balance the trade-off between performance and parameter overhead. Taking CLIP-ViT-B/32 as an example, a good trade-off between performance and parameter overhead can be achieved with k ≈ 32 and kgate ≈ 16. For the CLIP-ViT-L/14 model, k ≈ 64 and kgate ≈ 8 are recommended. By doing so, we obtain a multi-task model that achieves around 98% of the performance of the fine-tuned models with only 20% of the total parameters compared to maintaining eight individual fine-tuned models, one for each task. Note that the upscaled SMILE model is sparsely activated at inference time: increasing the total number of parameters by N only increases the number of activated parameters per token by about N/T.


[Figure 8 appears here: (a) heatmap of average accuracy, (b) line plot of average accuracy, (c) heatmap of the normalized number of parameters, (d) line plot of the normalized number of parameters, each as a function of k and kgate.]
Figure 8: Hyperparameter analysis of the CLIP-ViT-B/32 model on eight image classification datasets. Here we show how different values of the hyperparameters k and kgate affect the average performance and the number of parameters (normalized by the number of parameters in the original model) in the upscaled model. We also show the average accuracy of pre-trained models and individual fine-tuned models in subfigure (b).

Even with an extreme focus on storage and inference costs, a parameter overhead of only 7% achieves an average multi-task performance of about 90% of that of the individual fine-tuned models.
Flan-T5 Models. Figure 5 shows the hyperparameter analysis of the Flan-T5-Base models on eight tasks from the
GLUE benchmark. We conduct the same analysis for both full fine-tuned models and LoRA fine-tuned models with
rLoRA = 16. It is observed that the performance is relatively stable, with most configurations yielding accuracies
around 85.1 to 85.6 for the full fine-tuned models and around 83.8 to 84.0 for the LoRA fine-tuned models. Upscaling
LoRA fine-tuned models is very parameter-efficient, with the number of parameters increasing by only 2% to 7%
compared to the original dense model. For a balanced trade-off between performance and parameter overhead, consider
setting k ≈ 32 and kgate ≈ 8 for the full fine-tuned model fusion, and k ≈ 8 and kgate ≈ 2 for the LoRA fine-tuned
model fusion.
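As a rough aid for this kind of trade-off analysis, the sketch below estimates the relative parameter overhead of upscaling a single m × n linear layer with T experts. It assumes that each low-rank expert adds k(m + n) parameters and that each per-expert gate stores kgate right singular vectors of size n; the exact accounting in our implementation (e.g., for biases or the shared pre-trained expert) may differ slightly, so the numbers should be read as approximations.

def smile_linear_overhead(m: int, n: int, num_experts: int, k: int, k_gate: int) -> float:
    """Approximate relative parameter overhead of upscaling one m x n linear layer.

    Assumed accounting (illustrative): the shared pre-trained weight keeps m * n
    parameters, each low-rank expert adds k * (m + n), and each per-expert gate
    adds n * k_gate.
    """
    extra = num_experts * (k * (m + n) + n * k_gate)
    return extra / (m * n)

# Hypothetical example: a 1024 x 1024 layer upscaled with 8 experts.
print(f"{smile_linear_overhead(1024, 1024, 8, 32, 16):.1%} extra parameters for this layer")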

E Large-Scale Model Experiments


This appendix provides additional details and results for our experiments with large-scale models, specifically the
Mistral-7B series. We used the following models in our experiments, which are available on HuggingFace:

• Base pre-trained model (M0 ): mistralai/Mistral-7B-v0.1


• Expert model M1 : meta-math/MetaMath-Mistral-7B


• Expert model M2 : cognitivecomputations/dolphin-2.1-mistral-7b

• Expert model M3 : uukuguy/speechless-code-mistral-7b-v1.0

[Figure 9 appears here: (a) heatmap of average performance, (b) line plot of average performance, (c) heatmap of the normalized parameter count, (d) line plot of the normalized parameter count, each as a function of k and kgate.]
Figure 9: Hyperparameter analysis of the CLIP-ViT-L/14 model on eight image classification datasets. Here we show how different values of the hyperparameters k and kgate affect the average performance and the number of parameters (normalized by the number of parameters in the original model) in the upscaled model. We also show the average accuracy of pre-trained models and individual fine-tuned models in subfigure (b).

For the SMILE models, the hyperparameter settings were as follows: kgate was consistently set to 8 across all
experiments, while k ranged from 8 to 512 (including 8, 16, 32, 64, 128, 256, 384, and 512), as shown in Figure 7.
Table 11 provides a more comprehensive view of the performance of individual models and various SMILE models
with different k values. We use EleutherAI/lm-evaluation-harness [Gao et al., 2024] to evaluate the models on
the four tasks: MMLU, TruthfulQA, GSM8K, and ARC Challenge. We merge the models in host memory and evaluate them on two NVIDIA RTX 4090 GPUs with 24 GB of memory each.
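For reference, the sketch below outlines the per-layer construction on these checkpoints: load the base and an expert model with HuggingFace transformers, form the update ∆W^{(i)} for a matching linear layer, and keep a rank-k truncation of it as the expert together with its top-kgate right singular vectors for the gate. This is a simplified illustration of the procedure described in the main text rather than the exact FusionBench implementation; the layer name and the square-root split of the singular values are illustrative choices, and in practice the construction is performed layer by layer to limit host-memory usage.

import torch
from transformers import AutoModelForCausalLM

def low_rank_expert(delta_w: torch.Tensor, k: int, k_gate: int):
    """Rank-k truncation of dW (the expert factors) plus its top-k_gate right
    singular vectors (the gate basis), following the construction in the main text."""
    U, S, Vh = torch.linalg.svd(delta_w.float(), full_matrices=False)
    B = U[:, :k] * S[:k].sqrt()              # (m, k) up-projection
    A = S[:k].sqrt().unsqueeze(1) * Vh[:k]   # (k, n) down-projection, so B @ A ≈ rank-k dW
    gate = Vh[:k_gate].T                     # (n, k_gate) routing basis, cf. Eq. (29)
    return A, B, gate

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16)
expert = AutoModelForCausalLM.from_pretrained("meta-math/MetaMath-Mistral-7B", torch_dtype=torch.bfloat16)

layer_name = "model.layers.0.self_attn.q_proj.weight"   # illustrative layer
dW = expert.state_dict()[layer_name] - base.state_dict()[layer_name]
A, B, gate = low_rank_expert(dW, k=32, k_gate=8)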
It is notable that as the value of k increases, we generally see improved performance, especially in tasks like GSM8K
and TruthfulQA. The results also show a clear trade-off between model size and performance. The SMILE model with
k = 8 (7.3B parameters) already achieves results comparable to or better than those of the pre-trained model on all tasks, while
larger models (k = 512, 11.2B parameters) approach the performance of individual expert models.
Table 11: Detailed performance comparison of individual models and various SMILE models with different values of k. For all upscaled models, kgate was set to 8.

Model MMLU TruthfulQA GSM8K ARC Challenge
M0 (pre-trained) 59.64 42.62 38.81 53.92
M1 60.56 44.79 71.49 51.02
M2 60.56 55.88 56.93 57.00
M3 61.18 47.47 48.98 57.68
M0;123 (7.3B, k = 8) 60.28 46.31 46.55 55.55
M0;123 (7.5B, k = 32) 60.37 49.49 55.04 54.52
M0;123 (8.3B, k = 128) 60.43 50.91 63.76 54.35
M0;123 (9.3B, k = 256) 60.53 51.83 65.58 54.01
M0;123 (11.2B, k = 512) 60.66 52.79 67.85 54.35

Limitations and future work for LLMs. Here we provide a brief discussion of the limitations of our experiments and potential directions for future work.

• Limited expert model pool. In the experiments on the CLIP models and Flan-T5 models, we use eight expert models to evaluate the performance of the SMILE model. However, the Mistral-7B experiments are currently limited to three expert models, which may not fully demonstrate the method's capabilities with a larger, more diverse set of experts. Future work could explore the impact of additional expert models on the performance of the SMILE model.
• LoRA fine-tuning. In these experiments, we use fully fine-tuned Mistral-7B models as expert models, where the linear layers are upscaled into MoE modules and the remaining parts of the model are copied directly from the pre-trained model. The reason for this is that the top-performing Mistral-7B models currently available on HuggingFace are fully fine-tuned. This approach, however, may not fully exploit SMILE's potential. A more effective strategy could involve using LoRA fine-tuned models as expert models: only specific linear layers would be fine-tuned with low-rank techniques, while the rest of the model remains frozen. This could potentially enhance SMILE's efficiency and effectiveness; as we have shown in the Flan-T5 experiments, LoRA fine-tuning can significantly reduce the number of additional parameters required to achieve comparable performance.
