Abstract
Recent works on multi-modal emotion recognition move towards end-to-end models, which can extract the task-specific features supervised by the target task compared with the two-phase pipeline. In this paper, we propose a novel multi-modal end-to-end transformer for emotion recognition, which can effectively model the tri-modal features interaction among the textual, acoustic, and visual modalities at the low-level and high-level. At the low-level, we propose the progressive tri-modal attention, which can model the tri-modal feature interactions by adopting a two-pass strategy and can further leverage such interactions to significantly reduce the computation and memory complexity through reducing the input token length. At the high-level, we introduce the tri-modal feature fusion layer to explicitly aggregate the semantic representations of three modalities. The experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves the state-of-the-art performance. The further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed tri-modal attention, which can help our model to achieve better performance while significantly reducing the computation and memory cost (Our code is available at https://github.com/SCIR-MSA-Team/UFMAC.).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Bagher Zadeh, A., Liang, P.P., Poria, S., Cambria, E., Morency, L.P.: Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2018
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)
Dai, W., Cahyawijaya, S., Liu, Z., Fung, P.: Multimodal end-to-end sparse model for emotion recognition. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5305–5316, June 2021
Dai, W., Liu, Z., Yu, T., Fung, P.: Modality-transferable emotion embeddings for low-resource multimodal emotion recognition. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 269–280, December 2020
Gong, Y., Chung, Y.A., Glass, J.: AST: audio spectrogram transformer. In: Proceedings Interspeech 2021, pp. 571–575 (2021)
Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations (2020)
Le, H.D., Lee, G.S., Kim, S.H., Kim, S., Yang, H.J.: Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access 11, 14742–14751 (2023)
Li, Y., Wang, Y., Cui, Z.: Decoupled multimodal distilling for emotion recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6631–6640 (2023)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6558–6569, July 2019
Yu, W., Xu, H., Yuan, Z., Wu, J.: Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 10790–10797 (2021)
Zhang, Z., Wang, L., Yang, J.: Weakly supervised video emotion detection and prediction via cross-modal temporal erasing network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18888–18897 (2023)
Zhang, Z., Yang, J.: Temporal sentiment localization: listen and look in untrimmed videos. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 199–208 (2022)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wu, Y., Peng, P., Zhang, Z., Zhao, Y., Qin, B. (2024). An End-to-End Transformer with Progressive Tri-Modal Attention for Multi-modal Emotion Recognition. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14431. Springer, Singapore. https://doi.org/10.1007/978-981-99-8540-1_32
Download citation
DOI: https://doi.org/10.1007/978-981-99-8540-1_32
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8539-5
Online ISBN: 978-981-99-8540-1
eBook Packages: Computer ScienceComputer Science (R0)