MMMLP: Multi-Modal Multilayer Perceptron For Sequential Recommendations
Zitao Liu
Guangdong Institute of Smart Education, Jinan University
liuzitao@jnu.edu.cn
ABSTRACT
Sequential recommendation aims to offer potentially interesting products to users by capturing their historical sequence of interacted items. Although it has facilitated extensive physical scenarios, sequential recommendation for multi-modal sequences has long been neglected. Multi-modal data that depicts a user's historical interactions exists ubiquitously, such as product pictures, textual descriptions, and interacted item sequences, providing semantic information from multiple perspectives that comprehensively describes a user's preferences. However, existing sequential recommendation methods either fail to directly handle multi-modality or suffer from high computational complexity. To address this, we propose a novel Multi-Modal Multi-Layer Perceptron (MMMLP) for modeling multi-modal sequences in sequential recommendation. MMMLP is a purely MLP-based architecture that consists of three modules - the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer - and has an edge on both efficacy and efficiency. Extensive experiments show that MMMLP achieves state-of-the-art performance with linear complexity. We also conduct an ablation analysis to verify the contribution of each component. Furthermore, compatibility experiments are devised, and the results show that the multi-modal representation learned by our proposed model generally benefits other recommendation models, emphasizing our model's ability to handle multi-modal information. We have made our code available online to ease reproducibility¹.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Sequential Recommendation, Multi-modal Data, Multimedia

ACM Reference Format:
Jiahao Liang, Xiangyu Zhao, Muyang Li, Zijian Zhang, Wanyu Wang, Haochen Liu, and Zitao Liu. 2023. MMMLP: Multi-modal Multilayer Perceptron for Sequential Recommendations. In Proceedings of the ACM Web Conference 2023 (WWW '23), April 30–May 04, 2023, Austin, TX, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3543507.3583378

∗ Xiangyu Zhao is the corresponding author.
¹ https://github.com/Applied-Machine-Learning-Lab/MMMLP

1 INTRODUCTION
With the rapid development of e-commerce, users are regularly inundated with diverse and trendy content, and they exhibit dynamic preferences over time. Capturing this preference variation has become a prominent task for content providers [22]. By modeling a user's historical interaction record, sequential recommendation systems (SRS) have an advantage in describing how a user's behavior changes over time. SRS have extensively facilitated modern life, including product recommendation [4, 19, 49], click prediction [24, 27], and web-page recommendation [28, 41].

With the rapid development of deep learning over the past few years, several sequential recommendation models based on deep learning have emerged [10]. Recurrent Neural Network (RNN) based [10, 11] and self-attention-based methods [15, 36] are the most representative ones. It is generally believed that RNNs are efficient in processing sequentially correlated data. However, despite achieving advanced performance [15, 36], whether they use long short-term memory units (LSTM) [12] or gated recurrent units (GRU) [3], they still suffer from the incapability of maintaining long-term dependencies and the difficulty of parallelism. The newly emerged self-attention [32] is not constrained by these limitations and can capture long-term correlations between items without relying on the relative positions of the items. Self-attention has reached state-of-the-art performance [15, 36, 37].
Although existing works [11, 17, 33, 36] emphasize using side information to accurately simulate user sequential behavior, few studies have explored multi-modal sequential recommendation, and user sequential behavior is rarely treated as multi-modal. However, in the field of recommendation systems, increasing attention is being paid to multi-modal data, which provides semantic information about user interactions from multiple perspectives. For example, a regular sequential recommender system might fail to capture the semantic information from an item's images or text descriptions, which is crucial to a user interested in a type of vehicle with a specific color. To solve this task, latent embeddings must be derived from diverse representations of items.

A typical multi-modal sequential recommender system is shown in Figure 1, where both the interaction history and sequence information show users' short- and long-term preferences. A multi-modal sequential recommender system uses this information and studies the user's preferences to recommend relevant items. Unlike item IDs, which reveal only part of a sequential pattern, multi-modal feature sequences reveal a more comprehensive view of the underlying pattern. Therefore, in order to use multi-modal features for sequential recommendation, it is increasingly common for RNN-based and self-attention-based models to integrate commodity features [11, 36]. However, RNNs cannot maintain long-term dependencies, while attention is computationally expensive.

To address the above issues, we propose a Multi-Modal Multilayer Perceptron (MMMLP) for sequential recommendation based on a pure MLP architecture, which effectively captures and fuses multi-modal information to produce informed next-item predictions. Our model consists of three layers: the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer. The Feature Mixer Layer includes three Mixer Modules, which capture the multi-modal information of items with linear complexity. The Fusion Mixer Layer mixes the information from the three modalities, and the last output is passed to the Prediction Layer to generate the next-item recommendation. We evaluate our proposed method on the MovieLens 100K and MovieLens 1M benchmark datasets and demonstrate that it outperforms existing basic sequential recommendation methods and competitive side-information integration methods. Moreover, our proposed Feature Mixer Layer can also be applied to other recommendation models to improve their ability to handle multi-modal information.

2 FRAMEWORK
In this section, we will start by describing the problem formulation of sequential recommendation tasks, and then introduce our proposed MMMLP framework for sequential recommendation systems. Specifically, we will first propose a new multi-modal MLP framework that can be used to address the above tasks with a high degree of efficiency. Then we will discuss the optimization process of the model and present the pseudo-code.

2.1 Problem Statement
Given the item set 𝒳 = {x_1, . . . , x_i, . . . , x_M}, for each user we represent his or her item interaction list as E^S_u = {x_1, . . . , x_t, . . . , x_N | x_t ∈ 𝒳} ∈ R^{N×D_S}, including N item embeddings with dimension D_S. Considering the features in multiple modalities, we denote the image feature corresponding to E^S_u as E^I ∈ R^{N×D_I}, and the textual feature as E^T ∈ R^{N×D_T}, where D_I and D_T are the embedding sizes of image and text tokens, respectively. To pursue a concise description, we omit the user subscript and represent the image feature, textual feature, and interacted item list as E^I, E^T, and E^S, respectively. It is noteworthy that our proposed method can be easily extended to other modalities.

Sequential recommendation systems aim to predict which item the user will select next based on past interactions. As such, given the user's interacted item list covering N time steps, our goal is to predict the next item at time step N + 1, based on the multi-modal features of items.
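To make the notation concrete, the following minimal sketch (ours, not part of the paper) instantiates the three input matrices for a single user. The symbols N, D_S, D_I, and D_T follow the definitions above; the concrete sizes are illustrative assumptions only.

```python
import torch

# Illustrative sizes (assumptions, not values from the paper):
N = 50       # length of the user's interaction sequence
D_S = 64     # item-ID embedding size
D_I = 512    # image feature size
D_T = 768    # text feature size (e.g., a BERT sentence embedding)

# One user's multi-modal interaction history, as defined in Section 2.1:
E_S = torch.randn(N, D_S)   # E^S: item embedding sequence
E_I = torch.randn(N, D_I)   # E^I: image feature per interacted item
E_T = torch.randn(N, D_T)   # E^T: text feature per interacted item

# The task: given (E_S, E_I, E_T) for steps 1..N, predict the item at step N+1.
print(E_S.shape, E_I.shape, E_T.shape)
```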
2.2 Overall Architecture
In this paper, we propose a multi-modal recommendation framework based on MLPs, namely MMMLP, that can explicitly learn information from various modalities. Figure 2 illustrates the architecture of MMMLP, which consists of three layers: the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer.

Our framework is flexible and can incorporate data in diverse modalities; we focus on images and texts in this paper, which are the most commonly used modalities in addition to item sequences. As shown in Figure 2, the image, text, and item sequences in the user-item interaction history are used as input, and we incorporate the Feature Mixer Layer, including three Mixer Modules, to extract and process the image, text, and item sequence information, respectively.
The Fusion Mixer Layer then concatenates the outputs Y^I, Y^T, and Y^S from the three Mixer Modules to fuse the multiple modality representations. Finally, we make predictions of the next recommendation in the Prediction Layer based on the fused representation.

Figure 2: The overall architecture of MMMLP. Image, text, and item sequences are processed by the Image, Text, and Sequence Mixers in the Feature Mixer Layer (each with LayerNorm, sequence and channel mixing, and GELU); their outputs are combined in the Fusion Mixer Layer and passed to the Prediction Layer to predict the next item.

2.3 Feature Mixer Layer
There are three Mixer Modules in the Feature Mixer Layer, which extract the image, text, and item sequence representations, respectively.
2.3.1 Image mixer. For the image mixer module, we take the image embedding Ê^I through a Mixer Module to extract the raw image features. The obtained visual embedding sequence is passed through the mixer module, where the token mixer captures the interactions between tokens, and the results are then provided to the channel mixer to capture the interactions between channels. With the image mixer, we achieve a visual representation of each sequence by fusing visual correlations into the representation of each item.

As a result of the image mixer, we have the following output:

  Û^I_{*,i} = Ê^I_{*,i} + W_2 σ(W_1 LayerNorm(Ê^I)_{*,i}),  for i = 1 . . . D_I,
  Y^I_{j,*} = Û^I_{j,*} + W_4 σ(W_3 LayerNorm(Û^I)_{j,*}),  for j = 1 . . . N,        (2)

where σ is the GELU activation function [9]. The subscript (*, i) denotes operations on the column dimension of the image feature matrix, i.e., cross-token processing, and (j, *) denotes operations on the row dimension, i.e., cross-channel processing. W_1 ∈ R^{r_N×N} and W_2 ∈ R^{N×r_N} denote the learnable weights of the first layer in the image mixer; W_3 ∈ R^{r_{D_I}×D_I} and W_4 ∈ R^{D_I×r_{D_I}} are the learnable weights of the second layer in the image mixer. r_N and r_D are the hidden sizes in the feature mixer. Y^I is the learned representation of the image modality.
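As a concrete illustration, the following PyTorch sketch (our own, not the authors' released code) implements one such mixer block in the spirit of Eq. (2): token mixing over the sequence dimension followed by channel mixing over the feature dimension, each with LayerNorm, GELU, and a residual connection. The class name and arguments are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One mixer block as in Eq. (2): token mixing, then channel mixing."""
    def __init__(self, seq_len: int, dim: int, r_n: int, r_d: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token (sequence) mixer: W_1 in R^{r_N x N}, W_2 in R^{N x r_N}
        self.token_mlp = nn.Sequential(
            nn.Linear(seq_len, r_n), nn.GELU(), nn.Linear(r_n, seq_len)
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixer: W_3 in R^{r_D x D}, W_4 in R^{D x r_D}
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, r_d), nn.GELU(), nn.Linear(r_d, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, D) -- e.g., the image embedding sequence E^I
        y = self.norm1(x).transpose(1, 2)          # (batch, D, N): mix across tokens
        x = x + self.token_mlp(y).transpose(1, 2)  # residual, back to (batch, N, D)
        x = x + self.channel_mlp(self.norm2(x))    # mix across channels, residual
        return x

# Example: an image mixer over a batch of 2 users, N=50 items, D_I=512 features.
block = MixerBlock(seq_len=50, dim=512, r_n=128, r_d=256)
out = block(torch.randn(2, 50, 512))
print(out.shape)  # torch.Size([2, 50, 512])
```

The same block, with modality-specific dimensions, serves as the text and sequence mixers described next.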
2.3.2 Text mixer. For the text mixer module, we take the text embedding Ê^T through a Mixer Module to extract the raw text features. Using the obtained text embedding sequence, the token mixer captures the interactions between tokens within channels. Then, the channel mixer takes the results to capture the interactions between channels within tokens. With the text mixer, an informative sequence representation is created by integrating text correlations into the representation of each item, yielding a text representation of each sequence as a whole.

As a result of the text mixer, we have the following output:

  Û^T_{*,i} = Ê^T_{*,i} + W_6 σ(W_5 LayerNorm(Ê^T)_{*,i}),  for i = 1 . . . D_T,
  Y^T_{j,*} = Û^T_{j,*} + W_8 σ(W_7 LayerNorm(Û^T)_{j,*}),  for j = 1 . . . N,        (3)

where W_5 ∈ R^{r_N×N} and W_6 ∈ R^{N×r_N} denote the learnable weights of the first layer in the text mixer, and W_7 ∈ R^{r_{D_T}×D_T} and W_8 ∈ R^{D_T×r_{D_T}} are the learnable weights of the second layer in the text mixer. Y^T is the learned representation of the text modality.

2.3.3 Sequence mixer. The sequence mixer applies the same mixer structure to the item embedding sequence Ê^S:

  Û^S_{*,i} = Ê^S_{*,i} + W_10 σ(W_9 LayerNorm(Ê^S)_{*,i}),  for i = 1 . . . D_S,
  Y^S_{j,*} = Û^S_{j,*} + W_12 σ(W_11 LayerNorm(Û^S)_{j,*}),  for j = 1 . . . N,        (4)

where W_9 ∈ R^{r_N×N} and W_10 ∈ R^{N×r_N} denote the learnable weights of the first layer in the sequence mixer, and W_11 ∈ R^{r_{D_S}×D_S} and W_12 ∈ R^{D_S×r_{D_S}} are the learnable weights of the second layer in the sequence mixer. Y^S is the representation of the item sequence.

2.4 Fusion Mixer Layer
We present the Fusion Mixer Layer to fuse the representations of multiple modalities. A post-fusion approach is used: the outputs of all Mixer Modules, i.e., Y^I, Y^T, and Y^S, are concatenated and passed into the mixer layer, which consists of a mixer module. This approach is also referred to as a single-stream approach, which is comparatively more effective than dual-stream methods [1]. Using the Fusion Mixer Layer, we can obtain a comprehensive representation of the user's interacted item sequence by fusing the multi-modal representations.

As a result of the Fusion Mixer Layer, we have the following output:

  Û^F_{*,i} = Ŷ_{*,i} + W_14 σ(W_13 LayerNorm(Ŷ)_{*,i}),  for i = 1 . . . D,
  Y^F_{j,*} = Û^F_{j,*} + W_16 σ(W_15 LayerNorm(Û^F)_{j,*}),  for j = 1 . . . N,        (5)

where Ŷ = Linear(Y^I ∥ Y^T ∥ Y^S), ∥ is the concatenation operation, and D = D_I + D_T + D_S. Y^F is the output of the block, which is the comprehensive representation considering multiple modalities. W_13 ∈ R^{r_N×N} and W_14 ∈ R^{N×r_N} denote the learnable weights of the first layer in the mixer, and W_15 ∈ R^{r_D×D} and W_16 ∈ R^{D×r_D} are the learnable weights of the second layer in the mixer.
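Under the same assumptions as the sketches above, the fusion step can be illustrated by concatenating the three mixer outputs along the feature dimension, projecting them with a linear layer, and reusing the MixerBlock defined earlier. This is our illustrative reading of Eq. (5), not the authors' implementation; in particular, the D→D shape of the projection is an assumption.

```python
import torch
import torch.nn as nn

# Illustrative sizes; Y_I, Y_T, Y_S stand for the Feature Mixer Layer outputs.
N, D_I, D_T, D_S = 50, 512, 768, 64
D = D_I + D_T + D_S
Y_I, Y_T, Y_S = torch.randn(2, N, D_I), torch.randn(2, N, D_T), torch.randn(2, N, D_S)

# Single-stream post-fusion: concatenate along the feature dimension,
# apply a linear projection, then one mixer block over the fused matrix.
proj = nn.Linear(D, D)                 # the "Linear" in Y_hat = Linear(Y^I || Y^T || Y^S)
fusion_block = MixerBlock(seq_len=N, dim=D, r_n=128, r_d=256)  # MixerBlock from the sketch above

Y_hat = proj(torch.cat([Y_I, Y_T, Y_S], dim=-1))   # (2, N, D) with D = D_I + D_T + D_S
Y_F = fusion_block(Y_hat)                          # comprehensive multi-modal representation
print(Y_F.shape)  # torch.Size([2, 50, 1344])
```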
2.5 Model Optimization
2.5.1 Prediction. To make fair comparisons, we adopt the most commonly used inference method in SRS. After L layers of sequence mixers, channel mixers, and feature mixers, we obtain a sequence of hidden states that contains the sequential, cross-channel, and cross-feature dependencies of each interaction. h_N represents the user's preference based on the previous N interactions. The score of each candidate item x_i is calculated by:

  ŷ_i = softmax(h_N · (E_i)^⊤),        (6)

where i = 1, . . . , N, and E_i ∈ R^{1×D} is the representation of item x_i.
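A minimal sketch of this scoring step, under two assumptions of ours made purely for illustration: that h_N is taken as the last row of the fused representation Y^F, and that candidate items are scored against a shared item-representation table.

```python
import torch
import torch.nn.functional as F

num_items, D = 1000, 1344            # illustrative candidate-set size and fused dimension
item_emb = torch.randn(num_items, D) # one representation E_i per candidate item

Y_F = torch.randn(2, 50, D)          # fused output of the Fusion Mixer Layer (batch, N, D)
h_N = Y_F[:, -1, :]                  # hidden state after the previous N interactions

scores = F.softmax(h_N @ item_emb.t(), dim=-1)  # Eq. (6): probabilities over candidates
next_item = scores.argmax(dim=-1)
print(next_item.shape)  # torch.Size([2])
```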
Table 2: Overall performance comparison on two datasets. The results are averaged over three random seeds. Bold scores indicate the best model for each metric and underlined scores indicate the second-best model.

Dataset  Metric   FPMC    BPR     GRU4Rec  SASRec  GRU4RecF+  FDSA+   SASRecF+  MLPMixer  MMMLP
ML-100K  MRR@10   0.1314  0.1513  0.1829   0.1946  0.1839     0.1980  0.2040    0.1909    0.2144*
ML-100K  NDCG@10  0.1932  0.2132  0.2521   0.2704  0.2574     0.2643  0.2758    0.2636    0.2764*
ML-1M    MRR@10   0.2419  0.2959  0.3383   0.3743  0.3614     0.3994  0.4063    0.4043    0.4129*
ML-1M    NDCG@10  0.3040  0.3535  0.4062   0.4430  0.4342     0.4668  0.4691    0.4695    0.4795*

"*" indicates statistically significant improvements (i.e., two-sided t-test with p < 0.05) over the best baseline.
A superscript "+" indicates that we improve the original model, which takes the embedded matrix of item ID, image, and text features as input and can be fairly compared with MMMLP.
• FPMC [25]: FPMC combines Markov Chains and Matrix Factorization to learn the sequential dependencies in user interaction history as well as users' general preferences.
• BPR [26]: BPR builds a matrix factorization model with a pair-wise loss function to learn from implicit feedback, and it is a classical general recommender system.
• GRU4Rec [10]: GRU4Rec uses gated recurrent units to improve the performance of the vanilla RNN, allowing it to mitigate the vanishing gradient problem to some extent.
• SASRec [15]: A sequential recommendation model based on attention that uses a self-attention network for the generation of sequential recommendations.
• GRU4RecF+ [11]: This is an improved version of GRU4RecF. In order to make a fair comparison, we replace the classical bag-of-words and TF-IDF features with a pre-trained bert-base-uncased model.
• SASRecF+: This is our improved version of SASRec, which fuses the text, image, and sequence representations of items through a concatenation operation before feeding them to the model.
• FDSA+ [36]: This is our improved version of FDSA, which fuses the text, image, and sequence representations of items through a concatenation operation before feeding them to the model.
• MLPMixer [30]: This is our improved version of MLP-Mixer [30], adapted to sequential recommendation tasks based on item embeddings.
3.3 Implementation Details
The MMMLP implementation and all baselines are based on the RecBole [38] library, an open-source recommendation system library that enables us to test and compare all methods in a fair environment, allowing our results to be replicated easily. We adjust the hyperparameters based on the original papers. The Adam optimizer [16] and an early stopping policy were adopted, and we perform cross-validation for hyperparameter selection when the original paper did not provide detailed hyperparameters.

We set the learning rate to 1e-4 and fix the batch size at 256. Moreover, to handle different item sequence lengths, we use padding for users whose interaction numbers are less than the maximum sequence length, and use the most recent interactions for users with more interactions than the maximum sequence length [15]. We only use GELU [9] as the nonlinear activation across all models for a fair comparison [11]. To achieve efficient text modeling, we incorporate the pre-trained bert-base-uncased provided by huggingface³ for text data preprocessing [5, 7]. The implementation code is available online to ease reproducibility⁴.

³ https://huggingface.co/bert-base-uncased
⁴ https://github.com/Applied-Machine-Learning-Lab/MMMLP
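As an illustration of this preprocessing (our own sketch, not the released pipeline), item text can be encoded with the pre-trained bert-base-uncased model from Hugging Face, and interaction sequences padded or truncated to a fixed maximum length; the maximum length of 50 and the example texts below are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Encode item descriptions with the pre-trained bert-base-uncased model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

texts = ["Based on McMillan's novel ...", "A classic science-fiction adventure."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    text_feat = bert(**batch).last_hidden_state[:, 0, :]  # [CLS] vectors, shape (2, 768)

# Pad or truncate a user's item-ID sequence to a fixed maximum length.
def pad_or_truncate(seq, max_len=50, pad_id=0):
    seq = seq[-max_len:]                       # keep the most recent interactions
    return [pad_id] * (max_len - len(seq)) + seq

print(text_feat.shape, len(pad_or_truncate([3, 17, 42])))
```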
3.4 Overall Performance (RQ1)
To answer RQ1, we compare MMMLP with representative baselines. The comparison results are summarized in Table 2, where models including FPMC, BPR, GRU4Rec, SASRec, and MLPMixer only consider item embeddings, while models such as GRU4RecF+, FDSA+, and SASRecF+ also involve multi-modal information.

From Table 2, we can make the following general observations: (i) Starting from GRU4Rec, deep learning-based methods largely outperform traditional methods such as BPR, which indicates that deep learning models do a better job of capturing sequential correlations in sequential recommendation. More specifically, we can also observe that: (ii) self-attention models generally perform better than RNN-based models, which can be attributed to the stronger ability of self-attention to capture sequential patterns. (iii) Models that can handle multi-modal features (models with "+"), such as GRU4RecF+, FDSA+, and SASRecF+, are generally superior to models that cannot, such as GRU4Rec and SASRec, which indicates the importance of multi-modal features in sequential recommendation. (iv) MLPMixer achieves performance comparable with SASRec, FDSA+, and other state-of-the-art methods, which indicates that a simple MLP architecture can replace the self-attention mechanism. (v) MMMLP consistently outperforms all multi-modal baselines, including SASRecF+ and FDSA+, which indicates that MLPMixer-based multi-modal information fusion is significantly effective and even rivals Transformer-based multi-modal fusion.

In summary, MMMLP's performance on both datasets is superior to that of the state-of-the-art baselines, which validates the effectiveness of our proposed MMMLP model in sequential recommendation with multi-modal information.

3.5 Parameters Analysis (RQ2)
To investigate RQ2, we analyze the parameters of MMMLP on ML-100K and ML-1M, including the image mixer layer depth and the text mixer layer depth. Table 3 shows the effect of layer depth and embedding size on MMMLP and MLPMixer.
Table 3: Performance of MMMLP with different Image Mixer and Text Mixer layer depths.

Dataset  Metric    Image mixer layer depth          Text mixer layer depth
                   2 layers  4 layers  8 layers     2 layers  4 layers  8 layers
ML-100K  MRR@10    0.2059    0.2144    0.2034       0.2082    0.2144    0.2056
ML-100K  NDCG@10   0.2676    0.2764    0.2643       0.2708    0.2764    0.2664
ML-1M    MRR@10    0.4073    0.4129    0.4039       0.4076    0.4129    0.4053
ML-1M    NDCG@10   0.4751    0.4795    0.4702       0.4763    0.4795    0.4723
3.5.1 Image Mixer layer depth. In MMMLP, we apply the Image Mixer to obtain the visual information of items. As shown in Table 3, the performance of the model is optimal when the Image Mixer layer depth is 4. Models with shallower layers, such as a layer depth of 2, suffer from poor performance because the embedding size has limited representation power. As the number of layers grows deeper, such as a depth of 8, the model is likely to end up overfitting.

3.5.2 Text Mixer layer depth. As shown in Table 3, our model achieves optimal performance with a Text Mixer layer depth of 4. However, if the text input is short, such as the title of a movie, a lighter layer can yield better results. On the other hand, the embedding size of models with a layer depth of 2 may not be sufficient for representing long text, such as long descriptions of movies, resulting in poor performance. Meanwhile, deeper layers, such as 8, may result in overfitting. In our analysis of the Text Mixer layer depth, we used the descriptions of movies, which contain long text information.
3.6 Ablation Study (RQ3)
We conducted an ablation study on ML-100K to investigate RQ3. As mentioned earlier, MMMLP achieves better performance than MLPMixer on both datasets across all metrics, with the only difference between their architectures being the Feature Mixer. Here, we investigate the need for a feature mixer by answering two important questions. Q1: Can our model maintain satisfactory performance when replacing our mixer module with a common feature extractor? Q2: What is the contribution of each module in our proposed model? In order to answer these questions, we designed the following alternatives to MMMLP and MLPMixer:
• MLPMixer: a plain MLPMixer that does not include item features.
• MMMLP-Image: a simplified MMMLP that only uses the image extractor to extract visual features.
• MMMLP-Text: a simplified MMMLP that only uses the text extractor to extract text features.
• C/B-MMMLP: CNN and BERT are used as the image and text extractors of MMMLP.

Table 4: Ablation study comparison.

Model         MRR@10   NDCG@10
MLPMixer      0.1909   0.2636
MMMLP-Image   0.2094   0.2743
MMMLP-Text    0.2092   0.2747
C/B-MMMLP     0.2108   0.2746
MMMLP         0.2144   0.2764

From Table 4, we can conclude that: (i) Without incorporating multi-modal features, MLPMixer has significantly worse performance, which confirms the importance of including multi-modal features in sequential recommendation. (ii) For the two simplified versions of MMMLP, we can observe that MMMLP always performs better than both of them, while MMMLP-Text and MMMLP-Image perform better than vanilla MLPMixer. This shows that either image or text information can enhance the effectiveness of the model. (iii) MMMLP consistently outperforms C/B-MMMLP over all metrics, which can be attributed to the fact that our proposed Feature Mixer Layer performs better than pre-trained image or text extractors.

3.7 Compatibility Study (RQ4)
To answer RQ4, we conduct an experimental analysis to examine the compatibility of MMMLP's feature mixer module on ML-100K. In particular, we aim to investigate whether using our proposed (i) text mixer (Model_T), (ii) image mixer (Model_I), and (iii) both text and image mixers (Model_TI) in other multi-modal sequential recommendation models can improve their performance.

From Figure 4, we can observe that all three modified versions outperform the original multi-modal sequential recommendation models. Additionally, GRU4RecF_TI, SASRec_TI, and FDSA_TI, which incorporate both our proposed image and text mixers, work better than GRU4RecF_T(I), SASRec_T(I), and FDSA_T(I), which incorporate either the image or the text mixer alone. These observations indicate that (i) our proposed text and image mixers capture different and complementary information from items, and (ii) our proposed mixers possess excellent compatibility and could be applied to other multi-modal sequential recommendation models to improve their performance.

4 RELATED WORK
This section briefly reviews the representative works related to ours, including sequential recommendation, multi-modal recommendation, and MLP-based models. We also discuss the advantages of our method beyond these works.

4.1 Sequential Recommendation
In the early stages of sequential recommendation, a number of approaches were based on the Markov Chain assumption [8, 14, 25, 48] and focused on modeling item-item transition relationships in order to predict the next item depending on the user's last interaction with the item.
In recent years, several works have been developed along this line, with the Reinforcement Learning (RL) approach focusing on the Markov Decision Process (MDP) being particularly noteworthy [39–46]. As neural networks developed, Hidasi et al. [10] introduced a new neural network architecture, GRU4Rec, to capture sequential patterns, and in recent years there has been a surge of work leveraging other neural network architectures, e.g., CNNs, GNNs, and Transformers, for sequential recommendation [35, 37]. Numerous studies have introduced other contextual information (e.g., item attributes and reviews). Although these existing models have been successful in sequential recommendation tasks, we find that they have not fully captured users' interests from multi-modal information. Our proposed model effectively leverages multi-modal information and achieves more comprehensive user preference modeling.

4.2 Multi-modal Recommendation
Multi-modal information has emerged as a crucial research area in the field of recommendation systems. Presently, most methods for integrating multi-modal information focus on combining features for recommendations. Some approaches have been proposed for concatenating different patterns in user behavior sequences and analyzing the transition patterns between adjacent behaviors [11, 36]. For instance, Zhang et al. [36] proposed a transition pattern model of item-level and attribute-level behaviors based on the neighboring behaviors. Multi-modal content can also be merged with item representations, which are subsequently fed into sequential models such as RNNs for next-item recommendation [2, 13, 18, 21]. According to Huang et al. [13], unified multi-type actions and multi-modal representations of content are combined to form a contextual self-attention network for sequential recommendation. However, these models do not fully exploit multi-modal information, or they are overly complex. Our proposed model compensates for these issues.

4.3 MLP-based Models
According to recent studies, MLP-based architectures have shown strong performance in Computer Vision (CV) and Natural Language Processing (NLP), and are comparable to mainstream Transformers. The field of CV includes a number of representatives, such as MLP-Mixer [30], ResMLP [31], gMLP [20], and sMLP [29]. In particular, gMLP further enhances MLP-based models with a gated version of the MLP. Among the NLP representatives are pNLP-Mixer [6] and HyperMixer [23], where gMLP and sMLP [34] can also be applied. With the use of token mixing and input weighted summation, these MLP-based models are capable of achieving functionality similar to self-attention. FMLP-Rec [47] and MLP4Rec [17] are among the pioneers in applying MLPs to sequential recommendation. Nevertheless, MLP-based models are rarely used in multi-modal sequential recommendation, and our proposed model provides an effective solution to this problem.

5 CONCLUSION
This paper proposes MMMLP, an MLP-based architecture for multi-modal sequential recommendation. Specifically, we devise a unique feature mixer layer that can extract image, text, and item sequence information simultaneously, a fusion mixer layer for fusing these representations, and a prediction layer for generating recommendations. Compared to other methods, MMMLP offers superior capabilities for extracting and fusing multi-modal information while maintaining linear computational complexity. Extensive experiments on two benchmark datasets demonstrate that MMMLP consistently outperforms the baseline methods. As a pioneering approach to multi-modal sequential recommendation, MMMLP has been shown to be highly effective at combining multi-modal information. Additionally, we provide a compatibility analysis to demonstrate that our proposed mechanism can enhance other methods that use multi-modal data.

ACKNOWLEDGEMENTS
This research was partially supported by APRC - CityU New Research Initiatives (No.9610565, Start-up Grant for New Faculty of City University of Hong Kong), SIRG - CityU Strategic Interdisciplinary Research Grant (No.7020046, No.7020074), HKIDS Early Career Research Grant (No.9360163), Huawei (Huawei Innovation Research Program) and Ant Group (CCF-Ant Research Fund), and the Key Laboratory of Smart Education of Guangdong Higher Education Institutes, Jinan University (2022LSYS003).