Multi-label Text Classification on TextCNN Fused BiLSTM_Attention
Research Article
DOI: https://doi.org/10.21203/rs.3.rs-3814441/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract—In order to effectively manage and utilize network text information and realize the automatic labeling of text content, this paper proposes using a variety of deep learning models to study multi-label text classification. GloVe is used to obtain the semantic features of the text data, and a convolutional neural network is fused with a BiLSTM neural network; the latter introduces the Attention mechanism, forming a parallel neural network model of TextCNN and BiLSTM_Attention. Experimental results show that the TextCNN and BiLSTM_Attention model structure combines the advantages of convolutional and recurrent neural networks and can better understand both local and global semantic information. The Attention mechanism makes text feature extraction more reasonable, so that the model focuses its attention on the features that contribute more to the text classification task, and the classification effect is better.

Index Terms—BiLSTM, Attention, TextCNN, multi-label text classification
I. INTRODUCTION

In the field of natural language processing, multi-label text classification has always been a basic and important topic, which has certain guiding significance and application value in many tasks. Examples include information retrieval [1], dialog behavior classification [2], tag recommendation and topic recognition [3], sentiment analysis [4], and question answering [5]. The specific applications of multi-label text classification can be seen in the following aspects. The first is news classification: a news article can involve multiple topics at the same time, such as science and technology, economy, and digital products, and the use of multi-label classification methods can more accurately assign the article to its different topics. The second is product recommendation: on e-commerce websites, a product can be classified into multiple categories, such as appliances, furniture, and kitchenware, and the use of multi-label classification methods can more accurately recommend related products. The third is social media analysis: in social media, a post may contain multiple topics or tags at the same time, such as yoga, fitness, and relaxation, and the use of multi-label classification methods can better understand the intention and topic of a user. Because text in the real world usually involves multiple topics, multi-label text classification methods can better understand the content of the text when analyzing it, which gives them great potential application value.

Aihua Duan is a PhD candidate of National University, Manila, 0900, Philippines (e-mail: duana@students.national-u.edu.ph).
Rodolfo C. Raga Jr. is a professor of the College of Computer Studies and Engineering, José Rizal University, Mandaluyong City, Philippines (e-mail: rodolfo.raga@jru.edu).

II. RELATED WORK

At present, the common multi-label text classification technologies based on deep learning mainly include the Recurrent Neural Network (RNN) [6], the Convolutional Neural Network (CNN) [7], and attention-based models. RNN-based text classification models process text as a sequence of words and extract semantic features for downstream classifiers from the structural information in the sequence and the dependencies between contexts. However, ordinary RNN models do not perform well and cannot support long-sequence memory of text. Among the many variants of RNN, Long Short-Term Memory (LSTM) [8] and the Gated Recurrent Unit (GRU) [9] are the most widely used model architectures, designed to better capture long-term dependencies. In 2019, Yang [10] proposed the SGM model, which used a Bi-LSTM network in the encoder structure and introduced an out-of-order prediction module in the decoder structure to solve the error-accumulation problem caused by sequential prediction in the Seq2Seq model. At the same time, in 2019, You [11] proposed the AttentionXML model, which extracts text semantic features based on a Bi-LSTM network, enhances the text semantic features by using the attention mechanism, and uses a label tree to group labels, which solves the problem of excessive computation when the number of labels is large.

Recurrent neural networks intuitively extract temporal features, while convolutional neural networks focus on spatial features and are better at extracting local context. In 2014, Kim [12] proposed a CNN-based text classification model (TextCNN), which applied a layer of multi-scale convolution to word vectors trained with the Word2Vec model and achieved good results at that time. In 2016, Kurata [13] proposed adding a hidden layer to the CNN structure and initializing the neural network based on label co-occurrence information, which improved the accuracy of multi-label classification compared with random initialization; this model applied co-occurrence information to convolutional neural networks for the first time. In 2017, Liu [14] proposed the XML-CNN model, which used a dynamic pooling mechanism; for the extreme multi-label text classification problem, a low-dimensional hidden layer was added after the pooling layer to reduce the amount of computation.
In addition to recurrent neural networks and convolutional neural networks, some studies have proposed using Graph Neural Networks (GNN) to mine the relationships between words, or between the words, documents, and labels in documents, in order to extract richer text features. Among all types of graph neural networks, the Graph Convolutional Network (GCN) [15] and its variants are the most popular, because their combination with other neural networks is efficient and convenient and achieves excellent results in many applications. GCN is essentially a convolution operation, but it uses the relationships between adjacent nodes in the graph structure, and the dependency syntax tree or word co-occurrence information can be used to obtain the internal information of the text. In 2020, Pal [16] proposed extracting the relational semantics of labels based on a graph attention network and then optimizing the text representation features.

However, the task of multi-label text classification also faces some challenges. When the dimension of the text features is very high, feature extraction becomes more difficult and carries an extremely high computational cost. At the same time, in many scenarios, multi-label data suffers from an unbalanced sample distribution: the label data usually presents a long-tail distribution, and the number of samples corresponding to some labels is very small. All of these affect the classification performance of the model, so the accuracy of the multi-label text classification task is still unable to reach an excellent level. In addition, the labels in multi-label text data usually present a hierarchical structure. How to exploit the hierarchical structure of labels or the correlation between labels is also an important research direction in multi-label text classification.
III. METHODOLOGY
In this paper, we propose to use the BiLSTM_Attention model and the Text Convolutional Neural Network (TextCNN) model for multi-label text classification experiments. BiLSTM is good at dealing with sequence structure and can take the context information of a sentence into account, but its overall running speed is slow. TextCNN is an unbiased model and has a strong ability to extract shallow features of text; however, TextCNN mainly extracts features based on a filter window, so its long-distance modeling ability is limited and it is insensitive to word order. In order to overcome the limitations of the two, this paper proposes to fuse BiLSTM with TextCNN and introduces the Attention mechanism, so that the model can focus on the text features that contribute most to the text classification results.

Fig. 1. Model Structure Diagram
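As a concrete illustration of the parallel fusion described above, the following is a minimal Keras sketch (not the authors' released code): a TextCNN branch and a BiLSTM branch with attention pooling share one embedding layer and are concatenated before a Sigmoid output layer. The embedding dimension, sequence length, filter sizes, dropout rate, and label count follow the settings reported in Section IV, while the vocabulary size, the number of convolution filters, and the LSTM hidden size are placeholders not reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20000   # placeholder: the actual vocabulary size is not reported in the paper
MAX_LEN = 500        # max length of article (Section IV)
EMB_DIM = 300        # word vector dimension (Section IV)
NUM_LABELS = 90      # total number of labels (Section IV)

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)      # GloVe weights would be loaded here

# TextCNN branch: parallel convolutions with window sizes 3, 4, 5 and 1-max pooling
conv_feats = []
for h in (3, 4, 5):
    c = layers.Conv1D(filters=128, kernel_size=h, activation="relu")(emb)  # filter count is a placeholder
    conv_feats.append(layers.GlobalMaxPooling1D()(c))                      # 1-MaxPool per feature map
cnn_branch = layers.Concatenate()(conv_feats)

# BiLSTM_Attention branch: BiLSTM sequence outputs pooled by attention weights
h_seq = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb) # hidden size is a placeholder
energy = layers.Dense(1, activation="tanh")(h_seq)        # one energy score per time step
alpha = layers.Softmax(axis=1)(energy)                    # normalized attention weights
rnn_branch = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h_seq, alpha])

# Fusion and multi-label output: Sigmoid activation with binary cross-entropy loss
merged = layers.Dropout(0.2)(layers.Concatenate()([cnn_branch, rnn_branch]))
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(merged)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```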
A. Embedding layer
The embedding layer mainly performs tokenization of the words, and the main technologies are TF-IDF, Word2Vec, GloVe, ELMo, and BERT. Different tokenizers have different effects on the performance of machine learning. GloVe has some advantages over other word embedding methods such as Word2Vec: it performs well on semantic and grammatical tasks and is better able to capture linear relationships between words. In addition, GloVe's training process is relatively simple and is capable of efficient training on large-scale corpora.
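The following sketch shows one way to turn pre-trained GloVe vectors into the weight matrix of a Keras Embedding layer; the GloVe file name, the zero-vector handling of out-of-vocabulary words, and the frozen (non-trainable) setting are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
import tensorflow as tf

EMB_DIM = 300  # word vector dimension used in this paper

def load_glove(path):
    """Read a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_layer(word_index, glove_path="glove.6B.300d.txt"):
    """word_index maps each token to an integer id, e.g. from a Keras Tokenizer."""
    glove = load_glove(glove_path)
    matrix = np.zeros((len(word_index) + 1, EMB_DIM), dtype="float32")
    for word, idx in word_index.items():
        vec = glove.get(word)
        if vec is not None:               # out-of-vocabulary words keep the zero vector
            matrix[idx] = vec
    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0],
        output_dim=EMB_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False)                  # keep the pre-trained vectors frozen
```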
B. Attention mechanism
The Attention mechanism [8] is a solution proposed by imitating human attention, which, simply put, quickly screens out high-value information from a large amount of information. Given the limitations of computing power and optimization algorithms, introducing the attention mechanism can help a neural network model deal with information overload and improve its ability to process information. In the RNN model, it is used to resolve the information-loss bottleneck caused by compressing a long sequence into a fixed-length vector, as shown in Figure 2.

Fig. 2. Diagram of Attention model with BiLSTM

Here BiLSTM is used to obtain the global features of the text input information, and the context information is fully considered to better capture the semantic information in the text. In the BiLSTM model, the current hidden layer state h_t at time t is obtained from the weighted sum of the forward hidden layer state \overrightarrow{h}_t and the backward hidden layer state \overleftarrow{h}_t. The calculation formulas are as follows.

\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1})    (1)

\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t-1})    (2)

h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t    (3)

In formula (1), x_t is the input of the current hidden layer and \overrightarrow{h}_{t-1} represents the forward hidden layer state at time t-1; \overleftarrow{h}_{t-1} represents the backward hidden layer state at time t-1 in formula (2); w_t and v_t represent the relative weight values of the forward and backward hidden layers of the BiLSTM at time t, respectively, and b_t represents the bias value of the hidden layer state at time t in formula (3).
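A minimal sketch of how the BiLSTM output matrix described by formulas (1)-(3) can be produced with Keras; the hidden size is a placeholder, and merge_mode="sum" is only the closest built-in analogue of the weighted combination in formula (3).

```python
from tensorflow.keras import layers

embedded = layers.Input(shape=(500, 300))   # (max length of article, word vector dimension)
H = layers.Bidirectional(
        layers.LSTM(128, return_sequences=True),   # 128 hidden units is a placeholder
        merge_mode="sum")(embedded)                # combine forward/backward states per time step
# H is the output matrix [h_1, ..., h_n] that feeds the Attention layer described next
```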
The output matrix H = [h_1, h_2, \ldots, h_n] of the BiLSTM model is input into the hidden layer of the Attention mechanism to obtain the attention initial state matrix S. According to the importance of each feature in S, a corresponding weight is assigned in formula (5); the different weight coefficients are then multiplied and accumulated with their corresponding state vectors, and finally the output vector Y of the Attention layer is obtained in formula (6). The formulas are given below.

e_t = \tanh(W h_t + b)    (4)

\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{n}\exp(e_j)}    (5)

Y = \sum_{t=1}^{n} \alpha_t h_t    (6)

Where W is the weight matrix, b is the bias quantity, and e_t is the energy value determined by the state vector h_t in formulas (4) and (5).
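Assuming the reconstructed form of formulas (4)-(6) above, the attention pooling can be sketched as a small custom Keras layer; the weight shapes and initializers are illustrative choices, not specified in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """Attention pooling over BiLSTM outputs, following formulas (4)-(6)."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(d, 1), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")

    def call(self, h):                               # h: (batch, n, d) BiLSTM output matrix
        e = tf.tanh(tf.matmul(h, self.W) + self.b)   # formula (4): energy e_t per time step
        alpha = tf.nn.softmax(e, axis=1)             # formula (5): normalized attention weights
        return tf.reduce_sum(alpha * h, axis=1)      # formula (6): attention output vector Y

# usage: Y = AttentionPooling()(H), where H is the BiLSTM output matrix from above
```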
The Attention mechanism is a technique in artificial neural networks that mimics cognitive attention. It gives the neural network more weight on some parts of the input data and less weight on others, allowing it to focus on the small parts of the data that are most important. It is mainly applied to recurrent neural networks, which obtain the main meaning of an article by learning the connections in the article's context. LSTM uses its gate structure to simulate the forgetting and memory mechanisms of the human brain, so as to overcome the problem of gradient vanishing or gradient explosion in the process of long-sequence training and to realize the modeling of arbitrary time series. The bidirectional LSTM can obtain even more information.

In the process of text representation, the output vectors of each moment are usually directly added and then averaged. This practice assumes that each input word makes an equal contribution to the text, but the actual situation is often the contrary. When merging these output vectors, the attention resources are reasonably allocated and different weights are assigned to each vector, so as to select the text vector features that are more important for the current classification result. The Attention mechanism essentially assigns a weight to each vector, and its output is a weighted average of all the output vectors. The weight is determined by the contribution of the term to the output result of the text content, which reduces the effect of other unrelated words and improves the calculation efficiency. Applying the Attention mechanism to the multi-label text classification model can make the text features better explained, so as to make the classification results more accurate.
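A toy NumPy illustration of the point made above: plain averaging treats every time step as equally important, while attention-weighted averaging lets one informative time step dominate (all numbers are invented purely for illustration).

```python
import numpy as np

outputs = np.array([[0.2, 0.1],    # output vector at time step 1
                    [0.9, 0.8],    # time step 2 (assume this word matters most)
                    [0.1, 0.0]])   # time step 3

plain_average = outputs.mean(axis=0)                    # every word contributes equally

energies = np.array([0.5, 2.0, 0.1])                    # higher energy = more relevant word
weights = np.exp(energies) / np.exp(energies).sum()     # softmax-normalized weights
attention_average = (weights[:, None] * outputs).sum(axis=0)

print(plain_average)       # [0.4 0.3]
print(attention_average)   # dominated by the time-step-2 vector
```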
C. TextCNN
The TextCNN convolutional neural network is a classic text classification model. Kim [6] used sliding windows of different sizes to perform convolution and pooling operations on the input text vectors, captured the local features of the text sequence for combination and screening, extracted text semantic information at different levels of abstraction, and obtained a high-level feature vector representation of the text. Its model structure is shown in Fig. 3.

Fig. 3. Diagram of TextCNN model

The TextCNN model is mainly composed of four parts: the input layer, the convolutional layer, the pooling layer, and the output layer. For a text input of length n, the convolutional layer learns text features by using sliding windows of different sizes h to convolve the text input vector, and the convolution feature value c_i is obtained by the convolution kernel at position i.

c_i = f(w \cdot T_{i:i+h-1} + b)    (7)

Where k represents the word vector dimension corresponding to each word in the text sequence, w represents a convolution kernel of size h×k, and T_{i:i+h-1} represents the sliding window consisting of rows i to i+h-1 of the input matrix; b denotes the bias parameter and f denotes the nonlinear mapping function. The pooling layer uses the 1-MaxPool maximum pooling strategy to screen out one maximum feature value from each feature map.

\hat{c} = \max\{c_1, c_2, \ldots, c_{n-h+1}\}    (8)

The concatenation layer concatenates all the pooled feature values to obtain the high-level feature vector of the text.

C = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]    (9)

Where n represents the number of words in the text sequence, so each convolution kernel produces n-h+1 convolution feature values, m is the number of convolution kernels, and C represents the text feature vector produced by the TextCNN module. After the convolution and pooling operations are completed, a fully connected neural network layer is connected in the downstream task to complete the label prediction of the text.

If it is a binary classification problem, the softmax function is used as the classification function, as shown in Figure 3. In multi-label text classification, the Sigmoid function is generally used as the activation function of the output layer, and the binary cross-entropy (BCE) function is used as the loss function. That is, each output node of the final classification layer is activated using the Sigmoid activation function, and then the cross-entropy loss is calculated between each output node and the corresponding label. The formula is as follows.

L = -\sum_{i=1}^{C}\left[y_i \log \sigma(x_i) + (1 - y_i)\log\left(1 - \sigma(x_i)\right)\right]    (10)

Where x is the input, C is the number of classification classes, i belongs to [1, C], y_i is the true label corresponding to the ith category, and σ denotes the Sigmoid function.
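To make formula (10) concrete, the small check below compares a hand-written Sigmoid + binary cross-entropy computation with Keras' BinaryCrossentropy on made-up scores for four labels; the values themselves carry no meaning.

```python
import numpy as np
import tensorflow as tf

logits = np.array([[2.0, -1.0, 0.5, -3.0]], dtype="float32")  # raw scores x_i for C = 4 labels
y_true = np.array([[1.0,  0.0, 1.0,  0.0]], dtype="float32")  # multi-hot ground-truth labels

probs = 1.0 / (1.0 + np.exp(-logits))                          # Sigmoid activation per label
manual = -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))

keras_bce = tf.keras.losses.BinaryCrossentropy()(y_true, probs).numpy()

print(float(manual), float(keras_bce))                         # the two values agree
```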
wrong. False Negative (FN) means that the actual sample TABLE I
value is Positive, the sample is fed into the prediction model, DATASET DESCRIBE
and the model output value is Negative, which is the part of Data and Labels Number
the classification model that is wrong.
Considering that multi-label text classification needs to consider the classification under each category, another evaluation metric is also selected in this paper, namely Micro-F1. Because Micro-F1 takes the number of categories into account, it is more suitable for the case of an imbalanced data distribution. Micro-F1 is calculated using the following formula.

\mathrm{Micro\text{-}F1} = \frac{2 \times P_{micro} \times R_{micro}}{P_{micro} + R_{micro}}    (12)

Where the formulas for the calculation of P_{micro} and R_{micro} are as follows.

P_{micro} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} TP_i + \sum_{i=1}^{C} FP_i}    (13)

R_{micro} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} TP_i + \sum_{i=1}^{C} FN_i}    (14)

Where C stands for the number of classes.
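The two metrics can be computed as sketched below on multi-hot prediction and ground-truth matrices; scikit-learn is used only for convenience and is not mentioned in the paper, and the toy arrays are invented.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1],      # toy ground truth for 4 samples and 3 labels
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 1, 1]])

# Label-wise accuracy, i.e. (TP + TN) / (TP + TN + FP + FN) over all label decisions (Eq. 11)
accuracy = (y_true == y_pred).mean()

# Micro-F1 pools TP/FP/FN over all classes before computing precision and recall (Eqs. 12-14)
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(accuracy, micro_f1)
```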
IV. EXPERIMENT

In this paper, the corpus is divided into a training set, a validation set, and a test set. The training set is used to train the model, the validation set facilitates model selection and hyperparameter tuning, and the test set evaluates its performance on unseen data, as shown in Fig. 4. The dataset is described in Table I.

TABLE I
DATASET DESCRIPTION

Data and Labels       Number
Training Data         6992
Validation Data       777
Test Data             3019
Total labels          90
Avg label per text    1.2336

The proposed model was built using the Python language and the TensorFlow + Keras deep learning framework. Parameter selection is very important in deep learning models. Three filter sizes are used in the TextCNN model: 3, 4, and 5. In order to prevent the model from overfitting, the dropout rate is set to 0.2, and the batch size is set to 600, which better suits the needs of gradient descent and model convergence. The best parameter settings are shown in Table II.

TABLE II
PARAMETERS SETTING OF TEXTCNN WITH BILSTM-ATTENTION MODEL

Parameters               Values
Word vector dimension    300
Max length of article    500
Filters in TextCNN       [3, 4, 5]
Dropout                  0.2
Batch size               600
Epochs                   200
Activation function      Sigmoid
Loss                     BCE
Optimizer                Adam
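Reusing the names from the model sketch in Section III, a training run with the Table II settings could look roughly as follows; the random arrays below merely stand in for the real padded word-id sequences and multi-hot label matrix.

```python
import numpy as np

# stand-in data: 32 padded word-id sequences and their multi-hot label vectors
x_train = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y_train = np.random.randint(0, 2, size=(32, NUM_LABELS)).astype("float32")

model.compile(optimizer="adam",                # Optimizer: Adam
              loss="binary_crossentropy",      # Loss: BCE
              metrics=["accuracy"])
model.fit(x_train, y_train,
          batch_size=600,                      # Batch size from Table II
          epochs=200,                          # Epochs from Table II
          validation_split=0.1)
```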
Authors
Aihua Duan was born in Qinhuangdao, China in
1979. She received the B.S. degree in computer science
and technology from Hebei University of Economics
and Business, Shijiazhuang, China, in 2002, and the
M.S. degree in computer application from Soochow
University, Jiangsu, China, in 2005. She is currently
pursuing the Ph.D. degree with the College of
Computing and Information Technologies National
University, Manila, Philippines.
From 2005 to 2023, she was a Lecturer in Anhui
University of Finance and Economics. Her research interests include
machine learning and natural language processing.