A Method of Fine-Grained Short Text Sentiment Analysis Based On Machine Learning
Abstract: Text sentiment analysis plays an important role in social network information mining. It is also the theoretical foundation of personalized recommendation, interest circle classification and public opinion analysis. Existing algorithms for feature extraction and weight calculation fail to take full account of the influence of sentiment words. This paper therefore proposes a fine-grained short text sentiment analysis method based on machine learning. It improves feature selection and weighting with two algorithms better suited to sentiment analysis, N-CHI for feature extraction and W-TF-IDF for weight calculation, which increase the proportion and weight of sentiment words among the feature words. Experimental analysis and comparison show that the classification accuracy of this method is clearly higher than that of the other methods compared.
1. Introduction
With the rapid development of the Internet, network functions have become more comprehensive and more convenient to use. The rapid development of the mobile Internet and the massive growth in mobile phone users mean that the functions of apps change constantly in a rapidly changing network environment. As networks develop and are replaced, network security issues have also received growing attention [1, 2]. With network security ensured, social platforms such as Twitter, Facebook, micro-blog and WeChat have emerged rapidly in cyberspace and have gradually developed from a single web-based terminal into a dual platform that also covers mobile terminals.
Many comments and real-time short texts contain the personal sentiment and tendencies of users, which are of great significance for personalized recommendation, interest circle division, network public opinion monitoring and privacy protection [3]. How to use computer technology to acquire and analyze the emotional information in these comments has attracted research from many experts and scholars, and involves fields such as artificial intelligence, natural language processing, and data analysis and mining [4].
The main purpose of sentiment analysis is to process, extract, summarize and analyze the information in a text by various methods, so as to infer the emotion and viewpoint expressed by its author and to classify the emotional tendency of the text from the subjective information it contains. Text sentiment analysis can be divided into three tasks: sentiment information extraction, sentiment information classification, and sentiment information retrieval and induction [5]. Sentiment information classification is one of the important tasks of text sentiment analysis and is the focus of this paper.
Among social platforms, Sina micro-blog has become a mainstream platform with more than 300 million users. Users socialize by following one another, sharing real-time information and commenting on micro-blog content. Because a micro-blog post may not exceed 140 characters, the texts are short, their structure varies [6], and they contain relatively little sentiment content. By analyzing the characteristics of such short texts, this paper proposes a machine-learning-based method of short text sentiment analysis for micro-blog, which increases the proportion of emotional words and improves the accuracy of text sentiment analysis by improving the feature selection and weight calculation algorithms.
The rest of this paper is organized as follows. The second part introduces related work; the third part introduces the methods used in this paper and explains the proposed feature extraction and weight calculation algorithms in detail; the fourth part gives the experimental results and analysis. Finally, the full text is summarized.
2. Related work
Natural language processing has an extensive research scope, which includes sentiment analysis, machine translation, text classification, semantic analysis, etc. [7–9]. Text sentiment analysis, also known as opinion mining, has become one of the hot topics in natural language processing and data mining in recent years because of its practicality. Its main research methods fall into two categories: methods based on sentiment knowledge and methods based on feature classification [10].
Methods based on sentiment knowledge use existing sentiment knowledge, such as sentiment lexicons and domain dictionaries, to classify the text. Turney et al. [11] proposed a method based on the pointwise mutual information of word co-occurrence in a corpus and used it to make sentiment judgments on words. Sentiment words are the basis of text sentiment analysis [12], so building sentiment lexicons is also an important research topic. Baccianella et al. [13] used the General Inquirer (GI) to construct a sentiment lexicon, and Gyamfi et al. [14] used the MPQA corpus [15] to establish a seed sentiment lexicon and expanded it in conjunction with WordNet. The construction of sentiment lexicons has gradually shifted from relying entirely on manual addition to adding new sentiment words automatically through various methods [16–20].
Methods based on feature classification select a large number of features that can represent the sentiment of the text and classify the texts by means of statistical or machine learning methods such as Naive Bayes (NB), K-nearest neighbor (KNN) and support vector machine (SVM). Most recent papers use this approach [21–24], and these methods have been successfully applied in different fields [25]. Pang et al. [26] classified movie reviews, using the similarity of sentences as a feature and training NB and SVM classifiers; their experiments show that classifying reviews is more difficult than classifying factual text. For subjective and objective classification of English and French, Toprak et al. [27] used words, part of speech and lexical information as features with SVM as the classifier; their experiments showed that lexical information features improve the recall of subjective and objective classification. Moraes et al. [28] used TF-IDF and GI respectively as the feature extraction algorithm and classified Pang et al.'s data sets with NB, SVM and ANN classifiers; their experiments showed that ANNs perform well on unbalanced data sets. At the same time, sentiment analysis has begun to develop gradually from coarse-grained to fine-grained [29–31]. Fink et al. [32] conducted sentiment analysis with machine learning, performing subjective and objective classification at a coarse grain (sentence level) and sentiment classification at a fine grain (clause or part of speech). Shi et al. [33] used the correlation of words and part-of-speech information for sentiment analysis of hotel reviews and achieved good results. However, methods based on feature classification do not take into account the influence of sentiment words, and there is still much room to improve feature extraction.
In addition, with the continuous development of deep learning in various fields [34], sentiment analysis research has gradually begun to move toward unsupervised classification [35–38]. Such methods are quite difficult but of great research significance, and they can save labor. However, they are not yet mature, and their classification accuracy is still too low for practical application.
The development of the network has promoted the large-scale growth of data. Less widely used foreign languages and the minority languages of China have begun to occupy a place on the network. For the protection and development of these languages, textual sentiment analysis of them has also attracted the attention of many scholars [39–42].
The existing methods still have some defects [43]. Therefore, we combine the method based on sentiment knowledge with the method based on feature classification and propose a text sentiment analysis method based on machine learning, which applies the sentiment lexicon to feature extraction to achieve better classification results.
to remove the excess interference information and retain useful information. Text preprocessing is a basic and important step in sentiment analysis, and its quality strongly affects sentiment classification and information processing in the later stages. Text preprocessing includes data cleaning, word segmentation and stop-word removal.
The content between the two hash signs (#) in "# Champion not only dynasty endless #" and "# Hengda seven consecutive #" is a micro-blog topic, and it expresses the joy of victory. The emotion expressed by the user in comment 1 is consistent with the topic, while the content of comment 2 is the opposite of what the topic expresses. Such topics therefore do not fully represent the emotional tendency of the user; they act as interference and should be deleted.
Interference items in the data set that resemble such hot topics should be cleaned up to lay a good foundation for the later extraction of emotional information. The contents to delete are shown in Tab. I.
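The topic-removal step can be sketched as follows. This is a minimal illustration, assuming that a topic is any span enclosed by a pair of `#` characters, as in the examples above; the function name `strip_topics` and the sample sentence are hypothetical and are not taken from the paper or from Tab. I.

```python
import re

# Assumption: a micro-blog topic is any text enclosed by a pair of '#' signs.
TOPIC_PATTERN = re.compile(r"#[^#]*#")


def strip_topics(text: str) -> str:
    """Remove '#...#' topic spans and collapse the leftover whitespace."""
    without_topics = TOPIC_PATTERN.sub(" ", text)
    return " ".join(without_topics.split())


if __name__ == "__main__":
    # Hypothetical comment: the topic is positive, the comment itself is not.
    sample = "# Hengda seven consecutive # what a disappointing match"
    print(strip_topics(sample))
```

A lone `#` with no closing partner is left untouched, so ordinary text containing a single hash sign is not damaged.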
one of the very important steps is to carry out Chinese word segmentation. Chinese word segmentation involves grammar, statistics and other fields of knowledge and presents some difficulties, but many mature word segmentation systems are now available. After screening and comparison, this paper uses the NLPIR Chinese word segmentation system (also known as ICTCLAS2016), which has high accuracy in Chinese word segmentation and recognizes colloquial vocabulary and new network words well. The segmentation effect is as follows:
Removing stop words: Chinese contains some words that make the text fluent, such as function words, prepositions and quantifiers. These words usually carry neither practical meaning nor emotional information, yet they occur in large numbers in the text. There is also extra information such as Roman symbols, mathematical characters and punctuation. We refer to all of these collectively as stop words. Many different stop-word lists exist; we merge the existing lists after screening and comparison, and retain some adverbs that influence the emotion of the text. Once stop words have been removed from the segmented data set, data preprocessing is complete.
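Stop-word removal on an already segmented text can be sketched as below. The stop-word set and the set of retained adverbs are tiny hypothetical examples chosen for illustration; the merged list actually used in the paper is not reproduced here, and `remove_stop_words` is an assumed helper name.

```python
from typing import Iterable, List, Set

# Hypothetical, tiny stop-word list for illustration only; a real list would be
# merged from existing stop-word lists as described above.
STOP_WORDS: Set[str] = {"的", "了", "在", "，", "。"}

# Assumption: adverbs that influence sentiment are kept even if some source
# list marks them as stop words.
KEPT_ADVERBS: Set[str] = {"很", "不", "非常"}


def remove_stop_words(tokens: Iterable[str],
                      stop_words: Set[str] = STOP_WORDS,
                      keep: Set[str] = KEPT_ADVERBS) -> List[str]:
    """Drop stop words from a segmented token list, always keeping `keep`."""
    return [t for t in tokens if t in keep or t not in stop_words]


if __name__ == "__main__":
    segmented = ["今天", "的", "比赛", "非常", "精彩", "。"]
    print(remove_stop_words(segmented))
```

Checking the retained adverbs first means that a word such as "不" survives even when a merged stop-word list contains it.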
$$\mathrm{CHI}(t, c) = \frac{N (AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}. \tag{1}$$
Here t represents the feature term, c represents the category, N represents the total number of texts, and A, B, C and D are defined in Tab. II. Since N, (A + C) and (B + D) are all constant for a given category in Eq. 1, Eq. 1 can be simplified as follows:
$$\mathrm{CHI}(t, c) = \frac{(AD - BC)^2}{(A + B)(C + D)}.$$
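Both forms of the statistic can be computed directly from the four counts. The sketch below assumes the usual contingency reading of A, B, C and D (texts with t in c, with t outside c, without t in c, and without t outside c), so that N = A + B + C + D; this reading and the example counts are assumptions, since Tab. II is not reproduced here.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Full chi-square statistic of Eq. (1) from the four contingency counts."""
    n = a + b + c + d  # assumption: N is the sum of the four cells
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: a row or column of the table is empty
    return n * (a * d - b * c) ** 2 / denominator


def chi_square_simplified(a: int, b: int, c: int, d: int) -> float:
    """Simplified score: drops N, (A + C) and (B + D), constant per category."""
    denominator = (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # Two hypothetical terms scored against the same category (A + C = 15).
    for counts in [(10, 5, 5, 10), (12, 2, 3, 13)]:
        print(counts, chi_square(*counts), chi_square_simplified(*counts))
```

For terms scored against the same category, the two functions differ only by the constant factor N / ((A + C)(B + D)), so they rank the terms identically.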
However, the chi-square test does not take the frequency of the feature words into account, so it tends to select words with low frequency, and the original method does not consider the influence of sentiment words. To address these problems, we add word frequency and sentiment words to the algorithm and improve the chi-square test as follows:
where f is the frequency of the feature word. The algorithm adds weight to the original CHI algorithm: it raises the score of sentiment words as much as possible, so that sentiment words become feature words more easily, and it reduces the probability that the CHI algorithm selects low-frequency words, so as to improve the later classification accuracy. The effect of the improvement is shown in Fig. 2.
It can be seen from the figure that, across different dimensions, the features selected by the improved algorithm contain significantly more sentiment words than those selected by the unimproved algorithm. We choose the words with higher scores as the features. The choice of feature dimension is also very important: too many or too few features will lower the accuracy. The selection of the feature dimension is discussed in the fourth section.
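Choosing the highest-scoring words as features amounts to a top-k selection, where k is the feature dimension. A minimal sketch follows; the helper name `select_top_k` and the scores are hypothetical, and the scores could come from CHI, N-CHI or any other feature selection method.

```python
import heapq
from typing import Dict, List


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k terms with the highest feature-selection scores."""
    return heapq.nlargest(k, scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical scores for four candidate terms.
    example_scores = {"good": 9.1, "bad": 8.4, "the": 0.3, "match": 2.0}
    print(select_top_k(example_scores, 2))
```

Varying `k` and measuring accuracy for each value is one simple way to study the effect of the feature dimension.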
It can be seen from the table that the improved algorithm significantly increases the weight of sentiment words and modifiers. In addition, to make calculation more convenient and faster in the later training and classification, we also normalize the weights.
The improvements to both the feature selection algorithm and the weight calculation algorithm increase the complexity somewhat compared with the original algorithms, but the running time differs little. The main purpose is to improve the final accuracy of text sentiment classification.
ti represents the i-th feature of the text and wi represents the weight of the i-th feature. This representation is often used for features in text classification because it is simple and easy to implement; it does not consider the correlation or order of the features.
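The (ti, wi) representation described above can be sketched as a fixed-order weight vector. This is an illustration only: the helper name `to_weight_vector`, the feature list and the weights are hypothetical, and giving absent features a weight of 0 is an assumption of the sketch.

```python
from typing import Dict, List, Sequence


def to_weight_vector(feature_order: Sequence[str],
                     weights: Dict[str, float]) -> List[float]:
    """Map a text's {feature: weight} dict to a fixed-length vector.

    Position i holds wi, the weight of feature ti; features that do not
    occur in the text get weight 0.0 (an assumption of this sketch).
    """
    return [weights.get(t, 0.0) for t in feature_order]


if __name__ == "__main__":
    features = ["good", "bad", "match"]          # t1, t2, t3
    text_weights = {"good": 0.7, "match": 0.2}   # weights found in one text
    print(to_weight_vector(features, text_weights))
```

Because only the weight at each position is stored, the order in which words appeared in the text and any correlation between them are discarded, as noted above.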
Construct classifiers: There are many ways to construct classifiers, such as methods based on statistics, methods based on machine learning and methods based on deep learning. Methods based on machine learning include MaxEnt (Maximum Entropy), KNN (K-Nearest Neighbor), NB and SVM (support vector machine). NB is a classic probabilistic model that determines the classification result by calculating the probability that text d belongs to class c, where p(d | c) is the conditional probability of text d given class c and p(c) is the prior probability of that class. The method has simple logic and a mature, stable classification algorithm. Given the characteristics of text classification, we use NB to construct the classifier, which achieves better classification results when compared with other mature algorithms. The formula is as follows:
$$p(c \mid d) = \frac{p(d \mid c) \cdot p(c)}{p(d)}.$$
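A classifier built on Bayes' rule can be sketched in a few lines. The version below is a multinomial Naive Bayes with add-one smoothing working in log space; the paper does not state which NB variant or smoothing it uses, so both choices, the class name `NaiveBayes` and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, docs: Sequence[Sequence[str]],
            labels: Sequence[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.term_counts: Dict[str, Counter] = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def log_joint(self, tokens: Sequence[str]) -> Dict[str, float]:
        """Return log p(d | c) + log p(c) for every class c."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c, n_c in self.class_counts.items():
            counts = self.term_counts[c]
            total_terms = sum(counts.values()) + len(self.vocab)
            score = math.log(n_c / total_docs)  # log p(c)
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are skipped
                    score += math.log((counts[t] + 1) / total_terms)
            scores[c] = score
        return scores

    def predict(self, tokens: Sequence[str]) -> str:
        # p(d) is identical for all classes, so it can be dropped when comparing.
        scores = self.log_joint(tokens)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    train_docs = [["good", "great"], ["bad", "awful"],
                  ["good", "nice"], ["bad", "poor"]]
    train_labels = ["pos", "neg", "pos", "neg"]
    model = NaiveBayes().fit(train_docs, train_labels)
    print(model.predict(["good"]), model.predict(["awful", "bad"]))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on longer texts.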
                      Positive   Negative
Judged as positive    TP         FP
Judged as negative    FN         TN
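Precision (P), recall (R) and the F-measure can be computed from the counts in the table above. The sketch below uses the standard definitions and takes F to be the balanced F1 score; this is an assumption, since the paper's own formulas are not reproduced here, and the counts in the example are hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute P, R and balanced F1 for one class from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for the positive class.
    p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
    print(round(p, 3), round(r, 3), round(f, 3))
```

Scores for the negative class follow by swapping roles, that is, by passing TN as `tp`, FN as `fp` and FP as `fn`.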
Method 3: Since the final purpose is to judge the tendency of the text, the weight is reduced directly to one of the values −1, 0 and 1 according to the tendency. X1 is the weight of the feature t in the positive class and X2 is the weight of the feature t in the negative class. The method is as follows:
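$$X^{*} = \begin{cases} -1, & X_1 = 0 \ \cup\ \dfrac{X_1}{X_2} < 0.6 \\ 0, & 0.6 \le \dfrac{X_1}{X_2} \le 1.5 \\ 1, & X_2 = 0 \ \cup\ \dfrac{X_1}{X_2} > 1.5 \end{cases}$$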
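The thresholding in Method 3 can be illustrated with a short function. This sketch rests on two assumptions that should be checked against the original paper: the compared quantity is taken to be the ratio X1 / X2, and a feature with zero weight in both classes is mapped to 0. The name `ternary_weight` is hypothetical; the thresholds 0.6 and 1.5 are those given in the text.

```python
def ternary_weight(x1: float, x2: float,
                   low: float = 0.6, high: float = 1.5) -> int:
    """Collapse positive-class weight x1 and negative-class weight x2 to -1, 0 or 1.

    Assumption: the rule compares the ratio x1 / x2 with the two thresholds.
    """
    if x1 == 0 and x2 == 0:
        return 0   # assumption: this case is not specified in the text
    if x1 == 0:
        return -1  # the feature carries weight only in the negative class
    if x2 == 0:
        return 1   # the feature carries weight only in the positive class
    ratio = x1 / x2
    if ratio < low:
        return -1
    if ratio > high:
        return 1
    return 0


if __name__ == "__main__":
    for pair in [(0.0, 0.4), (0.3, 0.3), (0.9, 0.3)]:
        print(pair, ternary_weight(*pair))
```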
Method          P      R      F
Method One      0.91   0.90   0.90
Method Two      0.89   0.88   0.88
Method Three    0.89   0.89   0.89
As shown in the above table, the first method has the highest P, R and F. It retains the advantage of sentiment words in the weight, whereas the second method ignores the role of sentiment words and, when dealing with low-frequency words, its range is too large, so that many weights become 0. The third method is more intuitive and brings the weight close to the classification, but it ignores the magnitude of the feature word weights, and its range selection has limitations and scales poorly. This paper therefore chooses Method One for weight normalization.
method, named CHI; the third group uses the improved N-CHI as the feature selection method, named N-CHI. The P, R and F-measure results of the three methods as the dimension increases are shown in Figs. 4, 5 and 6.
It can be seen from the above figures that the improvement to feature extraction is effective. Although N-CHI performs slightly worse than CHI at lower dimensions, as the dimension increases the N-CHI algorithm exceeds both pointwise mutual information and CHI in P, R and F.
CHI as the method of feature selection: the first group uses word frequency as the weight calculation method, named TF; the second group uses TF-IDF as the weight calculation method, named TF-IDF; the third group uses the modified W-TF-IDF as the weight calculation method, named W-TF-IDF. The experimental results are shown in Fig. 7.
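For reference, the TF-IDF baseline used in the second group can be sketched as follows. This is a common textbook variant (relative term frequency multiplied by log(N / df)), assumed here for illustration; it is not the paper's W-TF-IDF, whose formula is not reproduced in this text, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Return one {term: tf * idf} dict per document (textbook variant)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for tokens in docs for t in set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            t: (count / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, count in tf.items()
        })
    return weighted


if __name__ == "__main__":
    corpus = [["good", "match"], ["bad", "match"]]
    for weights in tf_idf(corpus):
        print(weights)
```

A term that appears in every document, such as "match" in the example, receives weight 0 regardless of whether it carries sentiment, which illustrates why plain TF-IDF pays no special attention to sentiment words.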
It can be seen from the figure that the proposed improvement to weight calculation is effective. Although the gain is not large, the values of P, R and F all improve.
CHI-IDF 0.86 0.92 0.89 0.88 0.88
N-CHI 0.89 0.94 0.91 0.90 0.90
W-TF-IDF 0.86 0.94 0.90 0.90 0.90
C-CHI-IDF 0.91 0.95 0.93 0.93 0.93
It can be seen from the above table that the improvements to feature extraction and weight calculation achieve good results in positive, negative and overall precision alike.
The above experimental results are averages over multiple experiments. The comparison shows that the improvements to feature extraction and weight calculation are effective and that the classification accuracy is significantly higher. Compared with the weighting improvement, the effect of the feature selection improvement is more obvious, and the improved algorithm significantly raises the accuracy on the positive class.
NB 0.92 0.94 0.93 0.93 0.93
KNN 0.90 0.94 0.92 0.92 0.92
ANN 0.91 0.95 0.93 0.93 0.93
The classification accuracy of ANN is not significantly different from that of the other two classifiers, but ANN performs better on imbalanced samples.
As can be seen from Fig. 8, the accuracy of the method on the negative class is significantly higher than on the positive class, although the number of negative sentences is much lower than the number of positive ones. This suggests that words in negative sentences are more likely to be selected as features, and that in Chinese people are more likely to use negative words when expressing negative emotions.
In this paper, we used a general evaluation data set. The results of other sentiment analysis methods on this data set, together with some evaluation results, are compared with the results of this paper in Tab. VIII.
5. Conclusion
Existing feature extraction and weight calculation methods do not consider the influence of sentiment words. This paper therefore proposed a fine-grained Chinese sentiment analysis method for short micro-blog texts, with two new algorithms, N-CHI and W-TF-IDF, which improve on CHI and TF-IDF respectively. The method effectively improves the accuracy of text sentiment classification and has a certain generality.
Literature [42] – – 0.900
Literature [6] Method 1 0.895 0.837 –
Literature [6] Method 2 0.919 0.886 –
COAE2014-sjtu 0.954 0.885 0.919
Bjut-coae2014 0.914 0.915 0.914
scool 0.769 0.800 0.785
WB-SA 0.892 0.914 0.903
hut 0.901 0.848 0.875
LEO-WH Run1 0.964 0.791 0.877
Medians 0.891 0.850 0.877
C-CHI-IDF 0.910 0.950 0.930
However, the method has some limitations: it does not consider the influence of the relationship between part of speech and words on affective information, which will be the focus of our future work.
This research is supported by the National Natural Science Foundation of China under Grant 61672210 and by the Henan Research Program of Foundation and Advanced Technology under Grant 162300410183.
