MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation

Zuo, Simiao; Zhang, Qingru; Liang, Chen; He, Pengcheng; Zhao, Tuo; Chen, Weizhu

arXiv:2204.07675 (cs)

[Submitted on 15 Apr 2022 (v1), last revised 28 Apr 2022 (this version, v2)]

Abstract: Pre-trained language models have demonstrated superior performance in various natural language processing tasks. However, these models usually contain hundreds of millions of parameters, which limits their practicality because of latency requirements in real-world applications. Existing methods train small compressed models via knowledge distillation. However, performance of these small models drops significantly compared with the pre-trained models due to their reduced model capacity. We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed. We initialize MoEBERT by adapting the feed-forward neural networks in a pre-trained model into multiple experts. As such, representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated, such that speed can be improved. We also propose a layer-wise distillation method to train MoEBERT. We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks. Results show that the proposed method outperforms existing task-specific distillation algorithms. For example, our method outperforms previous approaches by over 2% on the MNLI (mismatched) dataset. Our code is publicly available at this https URL.
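The adaptation step described in the abstract (splitting a pre-trained feed-forward network into experts and activating only one expert per input) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how importance is scored, how neurons are assigned to experts, or how an expert is chosen. The names `split_into_experts`, `route`, and `ffn_forward`, the rule that the most important neurons are copied into every expert, the round-robin split of the rest, the modulo routing, and the random placeholder importance scores are all assumptions made for illustration.

```python
import math
import random


def gelu(z):
    """Exact GELU activation, the activation commonly used in BERT-style FFNs."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def ffn_forward(x, neurons, keep):
    """Apply a feed-forward block using only the hidden neurons listed in `keep`.

    Each neuron is a tuple (w_in, bias, w_out): one row of the first linear
    layer and the matching column of the second.
    """
    out = [0.0] * len(x)
    for j in keep:
        w_in, bias, w_out = neurons[j]
        h = gelu(sum(wi * xi for wi, xi in zip(w_in, x)) + bias)
        for k in range(len(out)):
            out[k] += h * w_out[k]
    return out


def split_into_experts(importance, num_experts, num_shared):
    """Assumed scheme: the `num_shared` highest-scoring neurons are copied into
    every expert; the remaining neurons are dealt out round-robin."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    shared, rest = ranked[:num_shared], ranked[num_shared:]
    return [sorted(shared + rest[e::num_experts]) for e in range(num_experts)]


def route(token_id, num_experts):
    """Hypothetical fixed routing rule: token id modulo the number of experts."""
    return token_id % num_experts


random.seed(0)
d_model, d_ff = 4, 12

# Stand-in for a pre-trained FFN: random weights, zero biases.
neurons = [
    ([random.gauss(0, 1) for _ in range(d_model)], 0.0,
     [random.gauss(0, 1) for _ in range(d_model)])
    for _ in range(d_ff)
]

# Placeholder importance scores; a real system would compute these from the model.
importance = [random.random() for _ in range(d_ff)]

experts = split_into_experts(importance, num_experts=4, num_shared=4)
x = [0.1, -0.2, 0.3, 0.4]
y = ffn_forward(x, neurons, experts[route(token_id=7, num_experts=4)])
```

In this toy setting each expert holds 6 of the 12 hidden neurons, so the per-token feed-forward cost is halved while every pre-trained neuron still belongs to at least one expert.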

Comments:	NAACL 2022
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2204.07675 [cs.CL]
	(or arXiv:2204.07675v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2204.07675

Submission history

From: Simiao Zuo
[v1] Fri, 15 Apr 2022 23:19:37 UTC (931 KB)
[v2] Thu, 28 Apr 2022 21:53:25 UTC (926 KB)
