Research Paper
Corresponding Author:
Dian Palupi Rini
Universitas Sriwijaya, Jl. Srijaya Negara Bukit Besar, 30139, Palembang, South Sumatera, Indonesia
Email: dprini@unsri.ac.id
1. INTRODUCTION
Text classification is an important part of Natural Language Processing with many applications [1], such
as sentiment analysis [2][3], information search [4], ranking [5], and document classification [6]. The text
classification model is generally divided into two categories: machine learning and deep learning. Much
research on text classification has involved traditional machine learning algorithms such as k-Nearest
Neighbors [7][8], Naive Bayes [9][10], Support Vector Machine [11][12], and Logistic Regression [13]. These
traditional machine learning classifiers have high efficiency and stability, but they face certain limitations
when training on large-scale datasets [14].
Recently, neural network-based models have become increasingly popular [15][16][17]. These models achieve
excellent performance in practice, but they tend to be relatively slow during both training and testing, which
limits their use on very large datasets [14]. Several recent studies have shown that the success of deep learning
for text classification depends strongly on the effectiveness of the word embedding [17]. Specifically, Shen et al.
2018 quantitatively showed, using the concept of intrinsic dimension, that text classification based on word
embedding can have the same level of difficulty regardless of the model used [18].
Deep learning methods applied to text classification include convolutional neural networks [16][17],
autoencoders [19][20], and deep belief networks [21]. The Recurrent Neural Network (RNN) is one of the most
popular architectures in natural language processing (NLP) because its recurrent structure is well suited to
processing variable-length text. The deep learning method proposed in this study is an RNN with the Long
Short-Term Memory (LSTM) architecture. An RNN can use a distributed word representation by first converting
each token of a text into a vector; together these vectors form a matrix. LSTM was developed to solve the
exploding and vanishing gradient problems that can be faced when training a traditional RNN [22]. Beyond its
extended memory, LSTM is used for text classification in this study because it treats a sequence as an integrated
whole that cannot be cut apart, just as a text document loses its meaning when its sentences are cut apart. Word
embeddings are used as the input features of the LSTM before the text is classified.
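As an illustration of this architecture, the following is a minimal sketch in Keras: pre-trained GloVe vectors initialize an Embedding layer, an LSTM layer encodes the token sequence, and a softmax layer produces the class probabilities. The vocabulary size and the number of LSTM units are assumptions for illustration; the 300-dimensional GloVe embedding, softmax output, and categorical cross-entropy loss follow the configuration used in this study.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 300      # GloVe dimension used in this paper
NUM_CLASSES = 4      # AG News has four classes

# embedding_matrix would be filled with the GloVe vector of each vocabulary word
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    LSTM(128),                                # assumed number of LSTM units
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])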
2. RESEARCH METHOD
2.1 Methodology
The research methodology requires a clear framework for its stages. The research framework, shown in Figure 1,
consists of a literature review of research from the past one to five years; data preparation, in which the dataset
used in this study is AG News, consisting of 400,000 data samples; pre-processing of the data by removing
punctuation and tokenizing; classification with LSTM; analysis of the results; and drawing conclusions. The
classification process with LSTM consists of three sub-processes, namely training, validation, and testing.
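A minimal sketch of the pre-processing step (punctuation removal followed by tokenization and padding), assuming the Keras Tokenizer; the vocabulary cap and maximum sequence length are illustrative assumptions, not values reported in this paper.

import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def remove_punctuation(text):
    """Lower-case the text and strip all punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

texts = ["Wall St. Bears Claw Back Into the Black (Reuters)"]   # example AG News headline
clean = [remove_punctuation(t) for t in texts]

tokenizer = Tokenizer(num_words=20000)         # assumed vocabulary cap
tokenizer.fit_on_texts(clean)
sequences = tokenizer.texts_to_sequences(clean)
padded = pad_sequences(sequences, maxlen=100)  # assumed maximum sequence length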
2.3 Evaluation
The multi-label evaluation metrics are computed from the confusion matrix using the following equations:
$$\mathrm{Acc} = \frac{\sum_{i=1}^{l} \frac{TP_i + TN_i}{TP_i + FN_i + TN_i + FP_i}}{l} \times 100\% \qquad (9)$$

$$\mathrm{Precision} = \frac{\sum_{i=1}^{l} TP_i}{\sum_{i=1}^{l} (FP_i + TP_i)} \times 100\% \qquad (10)$$

$$\mathrm{Recall} = \frac{\sum_{i=1}^{l} TP_i}{\sum_{i=1}^{l} (TP_i + FN_i)} \times 100\% \qquad (11)$$

$$\mathrm{F1score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (12)$$
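As a concrete illustration of equations (9)-(12), the following NumPy sketch computes the metrics from the per-class TP, FP, FN, and TN counts of a multi-class confusion matrix; the matrix values here are illustrative, not results from this paper.

import numpy as np

# Illustrative 4-class confusion matrix (rows: true class, columns: predicted class)
cm = np.array([[50,  2,  1,  0],
               [ 3, 45,  2,  1],
               [ 0,  1, 48,  2],
               [ 1,  0,  2, 47]])

TP = np.diag(cm).astype(float)          # true positives per class
FP = cm.sum(axis=0) - TP                # false positives per class
FN = cm.sum(axis=1) - TP                # false negatives per class
TN = cm.sum() - (TP + FP + FN)          # true negatives per class
l = cm.shape[0]                         # number of classes

acc = ((TP + TN) / (TP + FN + TN + FP)).sum() / l * 100    # Eq. (9)
precision = TP.sum() / (FP + TP).sum() * 100               # Eq. (10)
recall = TP.sum() / (TP + FN).sum() * 100                  # Eq. (11)
f1 = 2 * precision * recall / (precision + recall)         # Eq. (12)

print(acc, precision, recall, f1)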
2.4 Optimization
There are several types of optimizers for deep learning models, such as SGD, Adam, and RMSProp. This paper
applies Adam and RMSProp for training. The Adam optimizer can handle sparse gradient issues [34]. It is an
extension of stochastic gradient descent that has recently seen wide adoption for deep learning applications such
as Natural Language Processing. In the Adam update, m and v denote the moving averages of the first two
moments of the gradient, and g denotes the gradient on the current mini-batch. RMSProp adapts the learning rate
for each parameter: it divides the learning rate for a weight by a running average of the magnitudes of recent
gradients for that weight [35].
$$v(w, t) := \gamma\, v(w, t-1) + (1 - \gamma)\,(\nabla Q_i(w))^2 \qquad (15)$$
where $\gamma$ is the forgetting factor, and the parameters are updated as

$$w := w - \frac{\eta}{\sqrt{v(w, t)}} \nabla Q_i(w) \qquad (16)$$

where $\eta$ is the learning rate.
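A minimal NumPy sketch of this update rule, assuming the common default values γ = 0.9 and η = 0.001 and adding a small ε to avoid division by zero (the paper itself does not report these details):

import numpy as np

def rmsprop_step(w, grad, v, gamma=0.9, eta=0.001, eps=1e-8):
    """One RMSProp update of the weights w given the current gradient."""
    v = gamma * v + (1 - gamma) * grad ** 2        # Eq. (15): running average of squared gradients
    w = w - eta / (np.sqrt(v) + eps) * grad        # Eq. (16): per-weight scaled update
    return w, v

# Toy usage with a two-parameter weight vector
w = np.array([0.5, -0.3])
v = np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, v = rmsprop_step(w, grad, v)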
2.5 Pre-Processing
Hyperparameter configurations of the eight LSTM models:

Model | Epochs | Batch Size | Learning Rate | Optimizer | Activation | Output Activation | Embedding Dim. | Loss Function
1     | 50     | 128        | 0.001         | Adam      | Relu       | Softmax           | 300            | Categorical Cross Entropy
2     | 50     | 128        | 0.001         | Adam      | Tanh       | Softmax           | 300            | Categorical Cross Entropy
3     | 50     | 128        | 0.001         | RMSProp   | Relu       | Softmax           | 300            | Categorical Cross Entropy
4     | 50     | 128        | 0.001         | RMSProp   | Tanh       | Softmax           | 300            | Categorical Cross Entropy
5     | 50     | 128        | 0.0001        | Adam      | Relu       | Softmax           | 300            | Categorical Cross Entropy
6     | 50     | 128        | 0.0001        | Adam      | Tanh       | Softmax           | 300            | Categorical Cross Entropy
7     | 50     | 128        | 0.0001        | RMSProp   | Relu       | Softmax           | 300            | Categorical Cross Entropy
8     | 50     | 128        | 0.0001        | RMSProp   | Tanh       | Softmax           | 300            | Categorical Cross Entropy
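The eight configurations in the table above are every combination of two learning rates, two optimizers, and two activation functions, with the other hyperparameters fixed. A small sketch of how this grid could be enumerated (the dictionary keys are illustrative names, not identifiers from the paper):

from itertools import product

learning_rates = [0.001, 0.0001]
optimizers = ["adam", "rmsprop"]
activations = ["relu", "tanh"]

# Fixed settings shared by all eight models
fixed = {"epochs": 50, "batch_size": 128, "output_activation": "softmax",
         "embedding_dim": 300, "loss": "categorical_crossentropy"}

configs = [
    {**fixed, "learning_rate": lr, "optimizer": opt, "activation": act}
    for lr, opt, act in product(learning_rates, optimizers, activations)
]

for i, cfg in enumerate(configs, start=1):
    print(f"Model {i}:", cfg)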
3.4.1 Model 1
Table 3 shows the evaluation performance of the LSTM training process trained with Relu activation and the
Adam optimizer with a learning rate of 0.001. The training accuracy obtained in model 1 is 95.33. The confusion
matrix is used to calculate Precision, Recall, and F1-score; the results of the test evaluation are shown in Table 4.
The results in Table 4 show that the training and testing accuracy values are not much different, at around 95,
with average Precision, Recall, and F1-score values of 95. The comparison of training and testing per epoch can
be seen in the accuracy curve in Figure 3 and the loss curve in Figure 4.
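Curves like those in Figures 3 and 4 can be produced from the per-epoch training history. A minimal matplotlib sketch, assuming a Keras-style History object with 'accuracy', 'loss', and the corresponding 'val_' entries (these names are assumptions based on common Keras defaults):

import matplotlib.pyplot as plt

def plot_curves(history):
    """Plot training vs. validation accuracy and loss per epoch."""
    for metric in ("accuracy", "loss"):
        plt.figure()
        plt.plot(history.history[metric], label=f"training {metric}")
        plt.plot(history.history[f"val_{metric}"], label=f"validation {metric}")
        plt.xlabel("epoch")
        plt.ylabel(metric)
        plt.legend()
        plt.show()

# Example usage after training:
# history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, batch_size=128)
# plot_curves(history)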
3.4.2 Model 2
Table 5 shows the results of the performance evaluation of the LSTM training process that was trained
using Tanh activation, Adam optimizer with a learning rate of 0.001. The accuracy of the training obtained in
model 2 is 95.34. Confusion matrix will be used to calculate the Precision, Recall, and F1-score, the results of
which can be seen in Table 6 as a result of the evaluation performance of the test. Based on the two models
above using the same optimizer and learning rate with both Relu and Tanh activation, the resulting value is
also not much different. The value of training and testing accuracy, average precision, recall, and f1-score of
95. Figure 5 shows the comparison curve of training and testing accuracy, and Figure 6 shows the Loss curve.
3.4.3 Model 3
In model 3 is trained with Relu activation hyperparameter, RMSprop optimizer and learning rate
0.001. The results of the training evaluation performance can be shown in Table 7, while the results of the
testing evaluation are shown in Table 8. The accuracy obtained in the training process is 94.25 with an average
value of precision, recall, and f1-score of 94. Not much different from the value of testing accuracy which is
equal to 94.37. The comparison training curve and testing of accuracy and loss can be seen in Figure 7 and
Figure 8.
3.4.4 Model 4
In model 4 is trained with Tanh activation hyperparameter, RMSprop optimizer and learning rate
0.001. The results of the training evaluation performance can be shown in Table 9, and the results of the testing
evaluation are shown in Table 10. The accuracy obtained in the training process is 94.32 with an average value
of precision, recall, and f1-score of 94. The testing accuracy is 94.56. The test results in Table 10 show that the
macro average of precision is 95, while the recall and f1-score are 94. The accuracy value in this process is 95.
The comparison training curve and accuracy testing can be seen in Figure 9 and the loss in Figure 10. Both
Adam and RMSprop optimizers trained with a learning rate of 0.001 showed results that are not much different.
3.4.5 Model 5
The LSTM model 5 was trained with the same hyperparameter with a tuning learning rate of 0.0001.
Table 11 shows the results of the training evaluation performance and Table 12 shows the performance results
of the classification testing evaluation with the activation of Relu, Adam optimizer, and 300-dimensional
GloVe word embedding. The accuracy value in the training and testing process for learning rates 0.001 and
0.0001 with the same optimizer, namely Adam gets results that are not much different, both precision, recall,
and f1-score of 95. However, the accuracy and loss curves obtained in learning a rate of 0,0001 is better than
an accuracy and loss curve with a learning rate of 0.001. It can be seen in Figure 11 and Figure 12.
3.4.6 Model 6
In Model 6, training was carried out with the same hyperparameter with Tanh activation, Adam
optimizer, and a learning rate of 0.0001. The results of training evacuation performance and confusion matrix
are shown in Table 13 with training accuracy of 95. While the results of testing evaluation performance are in
Table 14 with an average value of precision, recall, and f1-score of 95. Figure 13 shows a comparison curve
of training and testing accuracy for 50 epochs. Although the loss in the validation process continues to decrease,
at the 40th epoch the same and slightly greater than the training loss can be seen in Figure 14.
3.4.7 Model 7
In model 7, it was trained with Relu activation parameters, RMSprop optimizer, and tuning learning
rate 0,0001. Table 15 shows the results of the training evaluation performance and confusion matrix of 50
epochs obtained an accuracy of 93.24. The results of the evaluation performance of precision testing, recall,
and f1-score are in Table 16. The accuracy curve resulting from training and testing can be seen in Figure 15
and the loss curve in Figure 16, which shows that the results of the RMSprop optimizer parameter with a tuning
learning rate of 0,0001 are more fit than the RMSprop with a learning rate of 0.001 although there is a slight
up and down in accuracy and loss.
3.4.8 Model 8
Model 8 was trained with Tanh activation parameters, RMSprop optimizer, and a learning rate of
0.0001 resulting in training accuracy of 93.21. The results of the training evaluation performance can be seen
in Table 17 where there are four class confusion matrix multilabel. The results of the evaluation performance
of the test are in Table 18 with an average value of precision, recall, and f1-score of 94. The comparison training
curve and testing accuracy model can be seen in Figure 17 with the value of testing accuracy exceeding training
accuracy. While the loss model curve decreases with the passage of 50 epochs, where the test loss is smaller
than the training loss can be seen in Figure 18. Table 19 shows a comparison of the testing accuracy of the
eight LSTM models using the word embedding GloVe.
Among the eight tuned LSTM models using the GloVe word embedding feature, the highest testing accuracy was
95.17, achieved by model 6 with Tanh activation, the Adam optimizer, and a learning rate of 0.0001. The
accuracy and loss curves closest to a good fit are those of the models with a learning rate of 0.0001, with either
the Adam or the RMSProp optimizer. Table 20 shows the comparison with previous works.
4 CONCLUSION
Text classification using LSTM was carried out through trial-and-error experiments. Text classification using
LSTM with the GloVe feature applies hyperparameter tuning to obtain the best model. The LSTM structure and
hyperparameters used in the experiments are GloVe embedding features at the input, the softmax activation
function at the output, the Relu and Tanh activation functions, the categorical cross-entropy loss function,
learning rates of 0.001 and 0.0001, and 50 epochs. The highest accuracy with the GloVe feature is achieved by
the sixth model, at 95.17, with average precision, recall, and F1-score values of 95. It can be concluded that
LSTM with the GloVe feature can achieve good performance in both accuracy and the training curves.
REFERENCES
[1] L. Li, L. Xiao, W. Jin, H. Zhu, and G. Yang, “Text Classification Based on Word2vec and Convolutional Neural
Network,” Neural Information Processing, International Conference on Neural Information Processing (ICONIP),
Lecture Notes in Computer Science, vol. 11305, 2018. DOI: 10.1299/jsmemag.90.823_758
[2] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive Deep Models for
Semantic Compositionality Over a Sentiment Treebank,” Proceedings of the 2013 conference on Empirical
Methods in Natural Language Processing, pp.1631-1642, 2013, Online
[3] H. Yuan, Y. Wang, X. Feng and S. Sun, “Sentiment Analysis Based on Weighted Word2vec and Att-LSTM,”
Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, pp. 420-
424, 2018. DOI: 10.1145/3297156.3297228
[4] J. Lilleberg, Y. Zhu, and Y. Zhang, “Support vector machines and Word2vec for text classification with semantic
features,” Proceedings of 2015 IEEE 14th International Conference on Cognitive Informatics and Cognitive
Computing, ICCI*CC 2015, pp. 136-140, 2015. DOI: 10.1109/ICCI-CC.2015.7259377
[5] K. Chen, Z. Zhang, J. Long, and H. Zhang, “Turning from TF-IDF to TF-IGM for term weighting in text
classification,” Expert Systems with Applications, vol. 66, pp. 245-260, 2016. DOI: 10.1016/j.eswa.2016.09.009
[6] R. G. Rossi, A. D. A. Lopes, and S. O. Rezende, “Optimization and label propagation in bipartite heterogeneous
networks to improve transductive classification of texts,” Information Processing and Management, vol. 52, no. 2,
pp. 217-257, 2016. DOI: 10.1016/j.ipm.2015.07.004
[7] B. Y. Pratama and R. Sarno, “Personality classification based on Twitter text using Naive Bayes, KNN and SVM,”
Proceedings of 2015 International Conference on Data and Software Engineering, ICODSE, 2016. DOI:
10.1109/ICODSE.2015.7436992
[8] M. Azam, T. Ahmed, F. Sabah, and M. I. Hussain, “Feature Extraction based Text Classification using K-Nearest
Neighbor Algorithm,” IJCSNS Int. J. Comput. Sci. Netw. Secur, vol. 18, pp. 95-101, 2018. Online
[9] S. Xu, “Bayesian Naïve Bayes classifiers to text classification,” Journal of Information Science, vol. 44, no. 1,
pp.48-59. 2018. DOI: 10.1177/0165551516677946
[10] L. Jiang, C. Li, S. Wang, and L. Zhang, “Deep feature weighting for naive Bayes and its application to text
classification,” Engineering Applications of Artificial Intelligence, vol. 52, pp. 26-39, 2016. DOI:
10.1016/j.engappai.2016.02.002
[11] M. Fanjin, H. Ling, T. Jing, and W. Xinzheng, “The Research of Semantic Kernel in SVM for Chinese Text
Classification,” In Proceedings of the 2nd International Conference on Intelligent Information Processing, pp. 8,
2017. DOI: 10.1145/3144789.3144801
[12] M. Goudjil, M. Koudil, M. Bedda, and N. Ghoggali, “A novel active learning method using SVM for text
classification,” International Journal of Automation and Computing, vol. 15, no.3, pp. 290-298, 2018. DOI:
10.1007/s11633-015-0912-z
[13] A. Onan, S. Korukoğlu, and H. Bulut, “Ensemble of keyword extraction methods and classifiers in text
classification,” Expert Systems with Applications, vol. 57, pp. 232-247, 2016. DOI: 10.1016/j.eswa.2016.03.045
[14] M. Gao, T. Li, and P. Huang, “Text Classification Research Based on Improved Word2vec and CNN,” In
International Conference on Service-Oriented Computing, pp. 126-135. Springer, Cham, 2018. DOI: 10.1007/978-
3-030-17642-6_11
[15] K. Kowsari, D.E. Brown, M. Heidarysafa, K.J. Meimandi, M.S. Gerber, and L. E. Barnes, “Hdltex: Hierarchical
deep learning for text classification,” In 2017 16th IEEE International Conference on Machine Learning and
Applications (ICMLA), pp. 364-371, 2017. DOI: 10.1109/ICMLA.2017.0-134
[16] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint, arXiv:1408.5882, 2014. DOI:
10.3115/v1/D14-1181
[17] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” In Advances in
neural information processing systems, pp. 649-657, 2015. DOI: arXiv:1509.01626v3
[18] D. Shen, G. Wang, W. Wang, M.R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin, “Baseline needs more
love: On simple word-embedding-based models and associated pooling mechanisms,” arXiv preprint, 2018. DOI:
arXiv:1805.09843
[19] W. Xu, H. Sun, C. Deng, and Y. Tan, “Variational autoencoder for semi-supervised text classification,” In Thirty-
First AAAI Conference on Artificial Intelligence, 2017. Online
[20] R. G. F. Soares, “Effort Estimation via Text Classification and Autoencoders,” Proceedings of the International
Joint Conference on Neural Networks, pp. 1-8, 2018. DOI: 10.1109/IJCNN.2018.8489030
[21] P. Ruangkanokmas, T. Achalakul, and K. Akkarajitsakul, “Deep Belief Networks with Feature Selection for
Sentiment Classification,” Proceedings - International Conference on Intelligent Systems, Modelling and
Simulation, ISMS, 9-14, 2017. DOI: 10.1109/ISMS.2016.9.
[22] Y. Yan, Y. Wang, WC. Gao, BW. Zhang, C. Yang, and XC. Yin, "LSTM 2: Multi-Label Ranking for Document
Classification," Neural Processing Letters, vol. 47, no. 1, pp. 117-138, 2018. DOI: 10.1007/s11063-017-9636-0
[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their
Compositionality,” Advances in Neural Information Processing Systems, pp. 3111-3119, 2013. Online
[24] J. Pennington, R. Socher and C. Manning, “Glove: Global vectors for word representation,” In Proceedings of the
2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, 2014. Online
[25] H. Zen, and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer
for low-latency speech synthesis”. In 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 4470-4474, 2015. DOI: 10.1109/ICASSP.2015.7178816
[26] I. Sutskever, O. Vinyals, and Q.V. Le, “Sequence to sequence learning with neural networks,” In Advances in
neural information processing systems, pp. 3104-3112, 2014. Online
[27] K. Li, J. Daniels, C. Liu, P. Herrero-Vinas, and P. Georgiou, “Convolutional recurrent neural networks for
glucose prediction,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 2, February 2020. DOI:
10.1109/JBHI.2019.2908488
[28] K. Tseng, C. Ou, A. Huang, R.F. Lin, and X. Guo, “Genetic and Evolutionary Computing,” Proceedings of the
Twelfth International Conference on Genetic and Evolutionary Computing, vol. 834, 2019. DOI: 10.1007/978-981-
13-5841-8.
[29] C. Zhou, C. Sun, Z. Liu, and F. Lau, “A C-LSTM neural network for text classification,” arXiv preprint, 2015.
DOI: arXiv:1511.08630
[30] M. Pota, F. Marulli, M. Esposito, G. De Pietro, and H. Fujita, “Multilingual POS tagging by a composite deep
architecture based on character-level features and on-the-fly enriched Word Embeddings,” Knowledge-Based
Systems, vol. 164, pp. 309-323, 2019. DOI: 10.1016/j.knosys.2018.11.003
[31] C.C. Chiu, T.N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R.J. Weiss, K. Rao, E. Gonina,
and N. Jaitly, “State-of-the-art speech recognition with sequence-to-sequence models,” In 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774-4778, 2018. DOI:
10.1109/ICASSP.2018.8462105
[32] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint, 2013. DOI: arXiv:1308.0850
[33] A. Kumar, and R. Rastogi, “Attentional Recurrent Neural Networks for Sentence Classification,” In Innovations in
Infrastructure, pp. 549-559. Springer, Singapore, 2019. DOI: 10.1007/978-981-13-1966-2_49
[34] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint. 2014. DOI: arXiv:1412.6980
[35] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent
magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012. Online
[36] G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin, “Joint embedding of words and
labels for text classification,” arXiv preprint, 2018. DOI: arXiv:1805.04174