MMMLP: Multi-Modal Multilayer Perceptron For Sequential Recommendations
Zitao Liu
Guangdong Institute of Smart Education, Jinan University
liuzitao@jnu.edu.cn
ABSTRACT
Sequential recommendation aims to offer potentially interesting products to users by capturing their historical sequence of interacted items. Although it has facilitated extensive physical scenarios, sequential recommendation for multi-modal sequences has long been neglected. Multi-modal data that depicts a user's historical interactions exists ubiquitously, such as product pictures, textual descriptions, and interacted item sequences, providing semantic information from multiple perspectives that comprehensively describes a user's preferences. However, existing sequential recommendation methods either fail to directly handle multi-modality or suffer from high computational complexity. To address this, we propose a novel Multi-Modal Multi-Layer Perceptron (MMMLP) for modeling multi-modal sequences in sequential recommendation. MMMLP is a purely MLP-based architecture that consists of three modules - the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer - and has an edge on both efficacy and efficiency. Extensive experiments show that MMMLP achieves state-of-the-art performance with linear complexity. We also conduct an ablation analysis to verify the contribution of each component. Furthermore, compatibility experiments are devised, and the results show that the multi-modal representation learned by our proposed model generally benefits other recommendation models, emphasizing our model's ability to handle multi-modal information. We have made our code available online to ease reproducibility¹.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Sequential Recommendation, Multi-modal Data, Multimedia

ACM Reference Format:
Jiahao Liang, Xiangyu Zhao, Muyang Li, Zijian Zhang, Wanyu Wang, Haochen Liu, and Zitao Liu. 2023. MMMLP: Multi-modal Multilayer Perceptron for Sequential Recommendations. In Proceedings of the ACM Web Conference 2023 (WWW '23), April 30–May 04, 2023, Austin, TX, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3543507.3583378

∗ Xiangyu Zhao is the corresponding author.
¹ https://github.com/Applied-Machine-Learning-Lab/MMMLP

1 INTRODUCTION
With the rapid development of e-commerce, users are regularly inundated with diverse and trendy content, and they exhibit dynamic preferences over time. Capturing this preference variation has become a prominent task for content providers [22]. By modeling a user's historical interaction record, sequential recommendation systems (SRS) have an advantage in describing how a user's behavior changes over time. SRS have extensively facilitated modern life, including product recommendation [4, 19, 49], click prediction [24, 27], and web-page recommendation [28, 41].

With the rapid development of deep learning over the past few years, several sequential recommendation models based on deep learning have emerged [10]. Recurrent Neural Network (RNN) based [10, 11] and self-attention-based methods [15, 36] are the most representative ones. It is generally believed that RNNs are efficient in processing sequentially correlated data. However, despite achieving advanced performance [15, 36], whether they use long short-term memory units (LSTM) [12] or gated recurrent units (GRU) [3], they still suffer from the incapability of maintaining long-term dependencies and the difficulty of parallelism. The newly emerged self-attention [32] is not constrained by these limitations and can capture long-term correlations between items without relying on the relative positions of the items. Self-attention has reached state-of-the-art performance [15, 36, 37].
Although existing works [11, 17, 33, 36] emphasize using side information to accurately simulate user sequential behavior, few studies have explored multi-modal sequential recommendation, and user sequential behavior is rarely treated as multi-modal. However, in the field of recommendation systems, increasing attention is being paid to multi-modal data, which provides semantic information about user interactions from multiple perspectives. For example, a regular sequential recommender system might fail to capture the semantic information from an item's images or text descriptions, which is crucial to a user interested in a type of vehicle with a specific color. To solve this task, latent embeddings must be derived from diverse representations of items.

A typical multi-modal sequential recommender system is shown in Figure 1, where both the interaction history and sequence information show users' short- and long-term preferences. A multi-modal sequential recommender system uses this information and studies the user's preferences to recommend relevant items. Unlike item IDs, which reveal only part of a sequential pattern, multi-modal feature sequences reveal a more comprehensive view of the underlying pattern. Therefore, in order to use multi-modal features for sequential recommendation, it is increasingly common for RNN-based and self-attention-based models to integrate commodity features [11, 36]. However, RNNs cannot maintain long-term dependencies, while attention is computationally expensive.

To address the above issues, we propose a Multi-Modal Multilayer Perceptron (MMMLP) for sequential recommendation based on a pure MLP architecture, which effectively captures and fuses multi-modal information to produce informed next-item predictions. Our model consists of three layers: the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer. The Feature Mixer Layer includes three Mixer Modules, which capture the multi-modal information of items with linear complexity. The Fusion Mixer Layer mixes the information from the three modalities, and the last output is passed to the Prediction Layer to generate the next-item recommendation. We evaluate our proposed method on the MovieLens 100K and MovieLens 1M benchmark datasets and demonstrate that it outperforms existing basic sequential recommendation methods and competitive side-information integration methods. Moreover, our proposed Feature Mixer Layer can also be applied to other recommendation models to improve their ability to handle multi-modal information.

2 FRAMEWORK
In this section, we will start by describing the problem formulation of sequential recommendation tasks, and then introduce our proposed MMMLP framework for sequential recommendation systems. Specifically, we will first propose a new multi-modal MLP framework that can be used to address the above tasks with a high degree of efficiency. Then we will discuss the optimization process of the model and present the pseudo-code.

2.1 Problem Statement
Given the item set 𝒳 = {x_1, . . . , x_i, . . . , x_M}, for each user we represent his or her item interaction list as E^S_u = {x_1, . . . , x_t, . . . , x_N | x_t ∈ 𝒳} ∈ R^{N×D_S}, including N item embeddings with dimension D_S. Considering the features in multiple modalities, we denote the image feature corresponding to E^S_u as E^I ∈ R^{N×D_I}, and the textual feature as E^T ∈ R^{N×D_T}, where D_I and D_T are the embedding sizes of image and text tokens, respectively. To pursue a concise description, we omit the user subscript and represent the image feature, textual feature, and interacted item list as E^I, E^T, and E^S, respectively. It is noteworthy that our proposed method can be easily extended to other modalities.

Sequential recommendation systems aim to predict which item the user will select next based on past interactions. As such, given the user's interacted item list covering N time steps, our goal is to predict the next item at time step N + 1, based on the multi-modal features of items.
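To make the notation concrete, the following minimal sketch (ours, not part of the paper) instantiates the three input matrices for a single user. The symbols N, D_S, D_I, and D_T follow the definitions above; the concrete sizes are illustrative assumptions only.

```python
import torch

# Illustrative sizes (assumptions, not values from the paper):
N = 50       # length of the user's interaction sequence
D_S = 64     # item-ID embedding size
D_I = 512    # image feature size
D_T = 768    # text feature size (e.g., a BERT sentence embedding)

# One user's multi-modal interaction history, as defined in Section 2.1:
E_S = torch.randn(N, D_S)   # E^S: item embedding sequence
E_I = torch.randn(N, D_I)   # E^I: image feature per interacted item
E_T = torch.randn(N, D_T)   # E^T: text feature per interacted item

# The task: given (E_S, E_I, E_T) for steps 1..N, predict the item at step N+1.
print(E_S.shape, E_I.shape, E_T.shape)
```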
2.2 Overall Architecture
In this paper, we propose a multi-modal recommendation framework based on MLPs, namely MMMLP, that can explicitly learn information from various modalities. Figure 2 illustrates the architecture of MMMLP, which consists of three layers: the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer.

Our framework is flexible and can incorporate data in diverse modalities; we focus on images and texts in this paper, which are the most commonly used modalities in addition to item sequences. As shown in Figure 2, the image, text, and item sequences in the user-item interaction history are used as input, and we incorporate the Feature Mixer Layer, including three Mixer Modules, to extract and process the image, text, and item sequence information, respectively.
The Fusion Mixer Layer then concatenates the outputs Y^I, Y^T, and Y^S from the three Mixer Modules to fuse the multiple modality representations. Finally, we make predictions of the next recommendation in the Prediction Layer based on the fused representation.

Figure 2: The overall architecture of MMMLP. Image, text, and item sequences are processed by the Image, Text, and Sequence Mixers in the Feature Mixer Layer (each with LayerNorm, sequence and channel mixing, and GELU); their outputs are combined in the Fusion Mixer Layer and passed to the Prediction Layer to predict the next item.

2.3 Feature Mixer Layer
There are three Mixer Modules in the Feature Mixer Layer, which extract the image, text, and item sequence representations, respectively.
2.3.1 Image mixer. For the image mixer module, we take the image embedding Ê^I through a Mixer Module to extract the raw image features. The obtained visual embedding sequence is passed through the mixer module, where the token mixer captures the interactions between tokens, and the results are then provided to the channel mixer to capture the interactions between channels. With the image mixer, we achieve a visual representation of each sequence by fusing visual correlations into the representation of each item.

As a result of the image mixer, we have the following output:

  Û^I_{*,i} = Ê^I_{*,i} + W_2 σ(W_1 LayerNorm(Ê^I)_{*,i}),  for i = 1 . . . D_I,
  Y^I_{j,*} = Û^I_{j,*} + W_4 σ(W_3 LayerNorm(Û^I)_{j,*}),  for j = 1 . . . N,        (2)

where σ is the GELU activation function [9]. The subscript (*, i) denotes operations on the column dimension of the image feature matrix, i.e., cross-token processing, and (j, *) denotes operations on the row dimension, i.e., cross-channel processing. W_1 ∈ R^{r_N×N} and W_2 ∈ R^{N×r_N} denote the learnable weights of the first layer in the image mixer; W_3 ∈ R^{r_{D_I}×D_I} and W_4 ∈ R^{D_I×r_{D_I}} are the learnable weights of the second layer in the image mixer. r_N and r_D are the hidden sizes in the feature mixer. Y^I is the learned representation of the image modality.
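As a concrete illustration, the following PyTorch sketch (our own, not the authors' released code) implements one such mixer block in the spirit of Eq. (2): token mixing over the sequence dimension followed by channel mixing over the feature dimension, each with LayerNorm, GELU, and a residual connection. The class name and arguments are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One mixer block as in Eq. (2): token mixing, then channel mixing."""
    def __init__(self, seq_len: int, dim: int, r_n: int, r_d: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token (sequence) mixer: W_1 in R^{r_N x N}, W_2 in R^{N x r_N}
        self.token_mlp = nn.Sequential(
            nn.Linear(seq_len, r_n), nn.GELU(), nn.Linear(r_n, seq_len)
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixer: W_3 in R^{r_D x D}, W_4 in R^{D x r_D}
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, r_d), nn.GELU(), nn.Linear(r_d, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, D) -- e.g., the image embedding sequence E^I
        y = self.norm1(x).transpose(1, 2)          # (batch, D, N): mix across tokens
        x = x + self.token_mlp(y).transpose(1, 2)  # residual, back to (batch, N, D)
        x = x + self.channel_mlp(self.norm2(x))    # mix across channels, residual
        return x

# Example: an image mixer over a batch of 2 users, N=50 items, D_I=512 features.
block = MixerBlock(seq_len=50, dim=512, r_n=128, r_d=256)
out = block(torch.randn(2, 50, 512))
print(out.shape)  # torch.Size([2, 50, 512])
```

The same block, with modality-specific dimensions, serves as the text and sequence mixers described next.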
2.3.2 Text mixer. For the text mixer module, we take the text embedding Ê^T through a Mixer Module to extract the raw text features. Using the obtained text embedding sequence, the token mixer captures the interactions between tokens within channels. Then, the channel mixer takes the results to capture the interactions between channels within tokens. With the text mixer, an informative sequence representation is created by integrating text correlations into the representation of each item, yielding a text representation of each sequence as a whole.

As a result of the text mixer, we have the following output:

  Û^T_{*,i} = Ê^T_{*,i} + W_6 σ(W_5 LayerNorm(Ê^T)_{*,i}),  for i = 1 . . . D_T,
  Y^T_{j,*} = Û^T_{j,*} + W_8 σ(W_7 LayerNorm(Û^T)_{j,*}),  for j = 1 . . . N,        (3)

where W_5 ∈ R^{r_N×N} and W_6 ∈ R^{N×r_N} denote the learnable weights of the first layer in the text mixer, and W_7 ∈ R^{r_{D_T}×D_T} and W_8 ∈ R^{D_T×r_{D_T}} are the learnable weights of the second layer in the text mixer. Y^T is the learned representation of the text modality.

2.3.3 Sequence mixer. The sequence mixer applies the same mixer structure to the item embedding sequence Ê^S:

  Û^S_{*,i} = Ê^S_{*,i} + W_10 σ(W_9 LayerNorm(Ê^S)_{*,i}),  for i = 1 . . . D_S,
  Y^S_{j,*} = Û^S_{j,*} + W_12 σ(W_11 LayerNorm(Û^S)_{j,*}),  for j = 1 . . . N,        (4)

where W_9 ∈ R^{r_N×N} and W_10 ∈ R^{N×r_N} denote the learnable weights of the first layer in the sequence mixer, and W_11 ∈ R^{r_{D_S}×D_S} and W_12 ∈ R^{D_S×r_{D_S}} are the learnable weights of the second layer in the sequence mixer. Y^S is the representation of the item sequence.

2.4 Fusion Mixer Layer
We present the Fusion Mixer Layer to fuse the representations of multiple modalities. A post-fusion approach is used: the outputs of all Mixer Modules, i.e., Y^I, Y^T, and Y^S, are concatenated and passed into the mixer layer, which consists of a mixer module. This approach is also referred to as a single-stream approach, which is comparatively more effective than dual-stream methods [1]. Using the Fusion Mixer Layer, we can obtain a comprehensive representation of the user's interacted item sequence by fusing the multi-modal representations.

As a result of the Fusion Mixer Layer, we have the following output:

  Û^F_{*,i} = Ŷ_{*,i} + W_14 σ(W_13 LayerNorm(Ŷ)_{*,i}),  for i = 1 . . . D,
  Y^F_{j,*} = Û^F_{j,*} + W_16 σ(W_15 LayerNorm(Û^F)_{j,*}),  for j = 1 . . . N,        (5)

where Ŷ = Linear(Y^I ∥ Y^T ∥ Y^S), ∥ is the concatenation operation, and D = D_I + D_T + D_S. Y^F is the output of the block, which is the comprehensive representation considering multiple modalities. W_13 ∈ R^{r_N×N} and W_14 ∈ R^{N×r_N} denote the learnable weights of the first layer in the mixer, and W_15 ∈ R^{r_D×D} and W_16 ∈ R^{D×r_D} are the learnable weights of the second layer in the mixer.
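Under the same assumptions as the sketches above, the fusion step can be illustrated by concatenating the three mixer outputs along the feature dimension, projecting them with a linear layer, and reusing the MixerBlock defined earlier. This is our illustrative reading of Eq. (5), not the authors' implementation; in particular, the D→D shape of the projection is an assumption.

```python
import torch
import torch.nn as nn

# Illustrative sizes; Y_I, Y_T, Y_S stand for the Feature Mixer Layer outputs.
N, D_I, D_T, D_S = 50, 512, 768, 64
D = D_I + D_T + D_S
Y_I, Y_T, Y_S = torch.randn(2, N, D_I), torch.randn(2, N, D_T), torch.randn(2, N, D_S)

# Single-stream post-fusion: concatenate along the feature dimension,
# apply a linear projection, then one mixer block over the fused matrix.
proj = nn.Linear(D, D)                 # the "Linear" in Y_hat = Linear(Y^I || Y^T || Y^S)
fusion_block = MixerBlock(seq_len=N, dim=D, r_n=128, r_d=256)  # MixerBlock from the sketch above

Y_hat = proj(torch.cat([Y_I, Y_T, Y_S], dim=-1))   # (2, N, D) with D = D_I + D_T + D_S
Y_F = fusion_block(Y_hat)                          # comprehensive multi-modal representation
print(Y_F.shape)  # torch.Size([2, 50, 1344])
```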
2.5 Model Optimization
2.5.1 Prediction. To make fair comparisons, we adopt the most commonly used inference method in SRS. After L layers of sequence mixers, channel mixers, and feature mixers, we obtain a sequence of hidden states that contains the sequential, cross-channel, and cross-feature dependencies of each interaction. h_N represents the user's preference based on the previous N interactions. The score of each candidate item x_i is calculated by:

  ŷ_i = softmax(h_N · (E_i)^⊤),        (6)

where i = 1, . . . , N, and E_i ∈ R^{1×D} is the representation of item x_i.
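A minimal sketch of this scoring step, under two assumptions of ours made purely for illustration: that h_N is taken as the last row of the fused representation Y^F, and that candidate items are scored against a shared item-representation table.

```python
import torch
import torch.nn.functional as F

num_items, D = 1000, 1344            # illustrative candidate-set size and fused dimension
item_emb = torch.randn(num_items, D) # one representation E_i per candidate item

Y_F = torch.randn(2, 50, D)          # fused output of the Fusion Mixer Layer (batch, N, D)
h_N = Y_F[:, -1, :]                  # hidden state after the previous N interactions

scores = F.softmax(h_N @ item_emb.t(), dim=-1)  # Eq. (6): probabilities over candidates
next_item = scores.argmax(dim=-1)
print(next_item.shape)  # torch.Size([2])
```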
Table 2: Overall performance comparison on two datasets. The results are averaged over three random seeds. Bold scores indicate the best model for each metric and underlined scores indicate the second-best model.

Dataset  Metric   FPMC    BPR     GRU4Rec  SASRec  GRU4RecF+  FDSA+   SASRecF+  MLPMixer  MMMLP
ML-100K  MRR@10   0.1314  0.1513  0.1829   0.1946  0.1839     0.1980  0.2040    0.1909    0.2144*
ML-100K  NDCG@10  0.1932  0.2132  0.2521   0.2704  0.2574     0.2643  0.2758    0.2636    0.2764*
ML-1M    MRR@10   0.2419  0.2959  0.3383   0.3743  0.3614     0.3994  0.4063    0.4043    0.4129*
ML-1M    NDCG@10  0.3040  0.3535  0.4062   0.4430  0.4342     0.4668  0.4691    0.4695    0.4795*

"*" indicates statistically significant improvements (i.e., two-sided t-test with p < 0.05) over the best baseline.
A superscript "+" indicates that we improve the original model, which takes the embedded matrix of item ID, image, and text features as input and can be fairly compared with MMMLP.
• FPMC [25]: FPMC combines Markov Chains and Matrix Factorization to learn the sequential dependencies in user interaction history as well as users' general preferences.
• BPR [26]: BPR builds a matrix factorization model with a pair-wise loss function to learn from implicit feedback, and it is a classical general recommender system.
• GRU4Rec [10]: GRU4Rec uses gated recurrent units to improve the performance of the vanilla RNN, allowing it to mitigate the vanishing gradient problem to some extent.
• SASRec [15]: A sequential recommendation model based on attention that uses a self-attention network for the generation of sequential recommendations.
• GRU4RecF+ [11]: This is an improved version of GRU4RecF. In order to make a fair comparison, we replace the classical bag-of-words and TF-IDF features with a pre-trained bert-base-uncased model.
• SASRecF+: This is our improved version of SASRec, which fuses the text, image, and sequence representations of items through a concatenation operation before feeding them to the model.
• FDSA+ [36]: This is our improved version of FDSA, which fuses the text, image, and sequence representations of items through a concatenation operation before feeding them to the model.
• MLPMixer [30]: This is our improved version of MLP-Mixer [30], adapted to sequential recommendation tasks based on item embeddings.
3.3 Implementation Details
The MMMLP implementation and all baselines are based on the RecBole [38] library, an open-source recommendation system library that enables us to test and compare all methods in a fair environment, allowing our results to be replicated easily. We adjust the hyperparameters based on the original papers. The Adam optimizer [16] and an early stopping policy were adopted, and we perform cross-validation for hyperparameter selection when the original paper did not provide detailed hyperparameters.

We set the learning rate to 1e-4 and fix the batch size at 256. Moreover, to handle different item sequence lengths, we use padding for users whose interaction numbers are less than the maximum sequence length, and use the most recent interactions for users with more interactions than the maximum sequence length [15]. We only use GELU [9] as the nonlinear activation across all models for a fair comparison [11]. To achieve efficient text modeling, we incorporate the pre-trained bert-base-uncased provided by huggingface³ for text data preprocessing [5, 7]. The implementation code is available online to ease reproducibility⁴.

³ https://huggingface.co/bert-base-uncased
⁴ https://github.com/Applied-Machine-Learning-Lab/MMMLP
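As an illustration of this preprocessing (our own sketch, not the released pipeline), item text can be encoded with the pre-trained bert-base-uncased model from Hugging Face, and interaction sequences padded or truncated to a fixed maximum length; the maximum length of 50 and the example texts below are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Encode item descriptions with the pre-trained bert-base-uncased model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

texts = ["Based on McMillan's novel ...", "A classic science-fiction adventure."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    text_feat = bert(**batch).last_hidden_state[:, 0, :]  # [CLS] vectors, shape (2, 768)

# Pad or truncate a user's item-ID sequence to a fixed maximum length.
def pad_or_truncate(seq, max_len=50, pad_id=0):
    seq = seq[-max_len:]                       # keep the most recent interactions
    return [pad_id] * (max_len - len(seq)) + seq

print(text_feat.shape, len(pad_or_truncate([3, 17, 42])))
```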
3.4 Overall Performance (RQ1)
To answer RQ1, we compare MMMLP with representative baselines. The comparison results are summarized in Table 2, where models including FPMC, BPR, GRU4Rec, SASRec, and MLPMixer only consider item embeddings, while models such as GRU4RecF+, FDSA+, and SASRecF+ also involve multi-modal information.

From Table 2, we can make the following general observations: (i) Starting from GRU4Rec, deep learning-based methods largely outperform traditional methods such as BPR, which indicates that deep learning models do a better job of capturing sequential correlations in sequential recommendation. More specifically, we can also observe that: (ii) self-attention models generally perform better than RNN-based models, which can be attributed to the stronger ability of self-attention to capture sequential patterns. (iii) Models that can handle multi-modal features (models with "+"), such as GRU4RecF+, FDSA+, and SASRecF+, are generally superior to models that cannot, such as GRU4Rec and SASRec, which indicates the importance of multi-modal features in sequential recommendation. (iv) MLPMixer achieves performance comparable with SASRec, FDSA+, and other state-of-the-art methods, which indicates that a simple MLP architecture can replace the self-attention mechanism. (v) MMMLP consistently outperforms all multi-modal baselines, including SASRecF+ and FDSA+, which indicates that MLPMixer-based multi-modal information fusion is significantly effective and even rivals Transformer-based multi-modal fusion.

In summary, MMMLP's performance on both datasets is superior to that of the state-of-the-art baselines, which validates the effectiveness of our proposed MMMLP model in sequential recommendation with multi-modal information.

3.5 Parameters Analysis (RQ2)
To investigate RQ2, we analyze the parameters of MMMLP on ML-100K and ML-1M, including the image mixer layer depth and the text mixer layer depth. Table 3 shows the effect of layer depth and embedding size on MMMLP and MLPMixer.
Table 3: Performance of MMMLP with different Image Mixer and Text Mixer layer depths.

Dataset  Metric    Image mixer layer depth          Text mixer layer depth
                   2 layers  4 layers  8 layers     2 layers  4 layers  8 layers
ML-100K  MRR@10    0.2059    0.2144    0.2034       0.2082    0.2144    0.2056
ML-100K  NDCG@10   0.2676    0.2764    0.2643       0.2708    0.2764    0.2664
ML-1M    MRR@10    0.4073    0.4129    0.4039       0.4076    0.4129    0.4053
ML-1M    NDCG@10   0.4751    0.4795    0.4702       0.4763    0.4795    0.4723
3.5.1 Image Mixer layer depth. In MMMLP, we apply the Image Mixer to obtain the visual information of items. As shown in Table 3, the performance of the model is optimal when the Image Mixer layer depth is 4. Models with shallower layers, such as a layer depth of 2, suffer from poor performance because the embedding size has limited representation power. As the number of layers grows deeper, such as a depth of 8, the model is likely to end up overfitting.

3.5.2 Text Mixer layer depth. As shown in Table 3, our model achieves optimal performance with a Text Mixer layer depth of 4. However, if the text input is short, such as the title of a movie, a lighter layer can yield better results. On the other hand, the embedding size of models with a layer depth of 2 may not be sufficient for representing long text, such as long descriptions of movies, resulting in poor performance. Meanwhile, deeper layers, such as 8, may result in overfitting. In our analysis of the Text Mixer layer depth, we used the descriptions of movies, which contain long text information.
3.6 Ablation Study (RQ3)
We conducted an ablation study on ML-100K to investigate RQ3. As mentioned earlier, MMMLP achieves better performance than MLPMixer on both datasets across all metrics, with the only difference between their architectures being the Feature Mixer. Here, we investigate the need for a feature mixer by answering two important questions. Q1: Can our model maintain satisfactory performance when replacing our mixer module with a common feature extractor? Q2: What is the contribution of each module in our proposed model? In order to answer these questions, we designed the following alternatives to MMMLP and MLPMixer:
• MLPMixer: a plain MLPMixer that does not include item features.
• MMMLP-Image: a simplified MMMLP that only uses the image extractor to extract visual features.
• MMMLP-Text: a simplified MMMLP that only uses the text extractor to extract text features.
• C/B-MMMLP: CNN and BERT are used as the image and text extractors of MMMLP.

Table 4: Ablation study comparison.

Model         MRR@10   NDCG@10
MLPMixer      0.1909   0.2636
MMMLP-Image   0.2094   0.2743
MMMLP-Text    0.2092   0.2747
C/B-MMMLP     0.2108   0.2746
MMMLP         0.2144   0.2764

From Table 4, we can conclude that: (i) Without incorporating multi-modal features, MLPMixer has significantly worse performance, which confirms the importance of including multi-modal features in sequential recommendation. (ii) For the two simplified versions of MMMLP, we can observe that MMMLP always performs better than both of them, while MMMLP-Text and MMMLP-Image perform better than vanilla MLPMixer. This shows that either image or text information can enhance the effectiveness of the model. (iii) MMMLP consistently outperforms C/B-MMMLP over all metrics, which can be attributed to the fact that our proposed Feature Mixer Layer performs better than pre-trained image or text extractors.

3.7 Compatibility Study (RQ4)
To answer RQ4, we conduct an experimental analysis to examine the compatibility of MMMLP's feature mixer module on ML-100K. In particular, we aim to investigate whether using our proposed (i) text mixer (Model_T), (ii) image mixer (Model_I), and (iii) both text and image mixers (Model_TI) in other multi-modal sequential recommendation models can improve their performance.

From Figure 4, we can observe that all three modified versions outperform the original multi-modal sequential recommendation models. Additionally, GRU4RecF_TI, SASRec_TI, and FDSA_TI, which incorporate both our proposed image and text mixers, work better than GRU4RecF_T(I), SASRec_T(I), and FDSA_T(I), which incorporate either the image or the text mixer alone. These observations indicate that (i) our proposed text and image mixers capture different and complementary information from items, and (ii) our proposed mixers possess excellent compatibility and could be applied to other multi-modal sequential recommendation models to improve their performance.

4 RELATED WORK
This section briefly reviews the representative works related to ours, including sequential recommendation, multi-modal recommendation, and MLP-based models. We also discuss the advantages of our method beyond these works.

4.1 Sequential Recommendation
In the early stages of sequential recommendation, a number of approaches were based on the Markov Chain assumption [8, 14, 25, 48] and focused on modeling item-item transition relationships in order to predict the next item depending on the user's last interaction with the item.
In recent years, several works have been developed along this line, with the Reinforcement Learning (RL) approach focusing on the Markov Decision Process (MDP) being particularly noteworthy [39–46]. As neural networks developed, Hidasi et al. [10] introduced a new neural network architecture, GRU4Rec, to capture sequential patterns, and in recent years there has been a surge of work leveraging other neural network architectures, e.g., CNNs, GNNs, and Transformers, for sequential recommendation [35, 37]. Numerous studies have introduced other contextual information (e.g., item attributes and reviews). Although these existing models have been successful in sequential recommendation tasks, we find that they have not fully captured users' interests from multi-modal information. Our proposed model effectively leverages multi-modal information and achieves more comprehensive user preference modeling.

4.2 Multi-modal Recommendation
Multi-modal information has emerged as a crucial research area in the field of recommendation systems. Presently, most methods for integrating multi-modal information focus on combining features for recommendations. Some approaches have been proposed for concatenating different patterns in user behavior sequences and analyzing the transition patterns between adjacent behaviors [11, 36]. For instance, Zhang et al. [36] proposed a transition pattern model of item-level and attribute-level behaviors based on the neighboring behaviors. Multi-modal content can also be merged with item representations, which are subsequently fed into sequential models such as RNNs for next-item recommendation [2, 13, 18, 21]. According to Huang et al. [13], unified multi-type actions and multi-modal representations of content are combined to form a contextual self-attention network for sequential recommendation. However, these models do not fully exploit multi-modal information, or they are overly complex. Our proposed model compensates for these issues.

4.3 MLP-based Models
According to recent studies, MLP-based architectures have shown strong performance in Computer Vision (CV) and Natural Language Processing (NLP), and are comparable to mainstream Transformers. The field of CV includes a number of representatives, such as MLP-Mixer [30], ResMLP [31], gMLP [20], and sMLP [29]. In particular, gMLP further enhances MLP-based models with a gated version of the MLP. Among the NLP representatives are pNLP-Mixer [6] and HyperMixer [23], where gMLP and sMLP [34] can also be applied. With the use of token mixing and input weighted summation, these MLP-based models are capable of achieving functionality similar to self-attention. FMLP-Rec [47] and MLP4Rec [17] are among the pioneers in applying MLPs to sequential recommendation. Nevertheless, MLP-based models are rarely used in multi-modal sequential recommendation, and our proposed model provides an effective solution to this problem.

5 CONCLUSION
This paper proposes MMMLP, an MLP-based architecture for multi-modal sequential recommendation. Specifically, we devise a unique feature mixer layer that can extract image, text, and item sequence information simultaneously, a fusion mixer layer for fusing these representations, and a prediction layer for generating recommendations. Compared to other methods, MMMLP offers superior capabilities for extracting and fusing multi-modal information while maintaining linear computational complexity. Extensive experiments on two benchmark datasets demonstrate that MMMLP consistently outperforms the baseline methods. As a pioneering approach to multi-modal sequential recommendation, MMMLP has been shown to be highly effective at combining multi-modal information. Additionally, we provide a compatibility analysis to demonstrate that our proposed mechanism can enhance other methods that use multi-modal data.

ACKNOWLEDGEMENTS
This research was partially supported by APRC - CityU New Research Initiatives (No.9610565, Start-up Grant for New Faculty of City University of Hong Kong), SIRG - CityU Strategic Interdisciplinary Research Grant (No.7020046, No.7020074), HKIDS Early Career Research Grant (No.9360163), Huawei (Huawei Innovation Research Program) and Ant Group (CCF-Ant Research Fund), and the Key Laboratory of Smart Education of Guangdong Higher Education Institutes, Jinan University (2022LSYS003).