Swaraj Project 12march
on
"Feature Selection and Data Visualization"
Submitted for the degree of
Bachelor of Engineering
in
Information Technology
by
Swaraj Surendra Patel
(T150228554/4354 & TE)
Under the guidance of
Prof. Rupali Amit Bagate
CERTIFICATE
This is to certify that the project based seminar report entitled "Feature Selection and Data Visualization" being submitted by Swaraj Surendra Patel (4354 / T150228554 and TE IT) is a record of bonafide work carried out by him under the supervision and guidance of Prof. Rupali Amit Bagate in partial fulfillment of the requirement for the TE (Information Technology Engineering) - 2015 Course of Savitribai Phule Pune University, Pune in the academic year 2018-2019.
Date: 31/03/2019
Place: Pune
Dr. B. P. Patil
Principal
This Project Based Seminar report has been examined by us as per the
Savitribai Phule Pune University, Pune requirements at Army Institute of
Technology, Pune-411015 on . . . . . . . . . . .
ACKNOWLEDGMENT
I am highly indebted to my guide, Prof. Rupali Amit Bagate, for her guidance and constant supervision, for providing the necessary information regarding the seminar report, and for her support in completing it. I would like to express my special gratitude and thanks to the Seminar Coordinator, Prof. Ashwini Sapkal, for giving me her attention and time.
This acknowledgment would be incomplete without expressing my thanks to Dr. Sangeeta Jadhav, Head of the Department (Information Technology), for her support during the work. I would like to extend my heartfelt gratitude to my Principal, Dr. B. P. Patil, who provided a great deal of valuable support, often from behind the scenes of college administration.
I would also like to express my gratitude to my parents and friends for their kind cooperation and encouragement, which helped me complete this report. My thanks and appreciation also go to my colleagues who helped in developing the seminar report, and to everyone who willingly helped me with their abilities.
Abstract
Sarcasm is a sophisticated form of irony widely used in social networks and
microblogging websites. It is usually used to convey implicit information
within the message a person transmits. Sarcasm might be used for different
purposes, such as criticism or mockery. However, it is hard even for humans
to recognize. Therefore, recognizing sarcastic statements can be very useful
to improve automatic sentiment analysis of data collected from microblogging
websites or social networks. Sentiment Analysis refers to the identification
and aggregation of attitudes and opinions expressed by Internet users toward a specific topic. In this seminar, we propose a pattern-based approach to detect sarcasm on Twitter. A set of features is proposed that covers the different types of sarcasm defined; these features are used to classify text as sarcastic or non-sarcastic. The proposed approach reaches an accuracy of 83.1% with a precision of 91.1%. The importance of each set of proposed features is also studied, and its added value to the classification is evaluated.
Contents
Certificate
Acknowledgment
Abstract
Chapter Contents
List of Figures
3 Feature Selection
3.1 Statistical Based Approach
3.1.1 Document Frequency
3.1.2 Information Gain
3.1.3 Gain Ratio
3.1.4 Chi Statistics
3.1.5 Relief-F Algorithm
3.2 Lexicon Based Approach
3.2.1 Lexicon Creation
3.2.2 Automatic Lexicon Generation
4 CONCLUSION
List of Figures
Chapter 1
INTRODUCTION TO SARCASM DETECTION ON TWEETS
During the past few years, there has been a large increase in opinionated textual data on social media over the Internet. Sentiment analysis is used to analyze this opinionated text: it helps us understand the emotion behind what a writer has written. The field faces many challenges, and sarcasm detection is one of the major ones. Sarcasm is an unconventional way of conveying a message that conflicts with its context, and it can lead to a state of ambiguity. The popularity and reach of the Internet, and specifically of Web social media, have changed the way we generate, share, and utilize information. Since Twitter is a relatively new service, a somewhat lengthy description of the medium and the data is appropriate. Twitter is a very popular microblogging service. Apart from simple text, tweets may contain references to URL addresses, references to other Twitter users (these appear as @user), or content tags (called hashtags) assigned by the tweeter. The information on these media is freely available online in text format and contains the voice of the public. For example, many Internet-based discussion forums and review sites enable people to express their views about a product or service. Analyzing such
consumer-generated online text content offers tremendous business opportunities in terms of discovering consumer preferences and gathering feedback. Data pre-processing is one of the first steps implemented by many researchers; techniques such as tokenization, stemming and lemmatization, and stop-word removal are widely applied. Several research works have been done on sarcasm detection, using a variety of feature extraction techniques. Sarcasm detection may depend on sentiment and other cognitive aspects. For this reason, we incorporate both sentiment and emotion clues in our framework. Along with these, we also argue that the personality of the opinion holder is an important factor for sarcasm detection. To address all of these variables, we create a separate model for each of them: sentiment, emotion, and personality. The idea is to train each model on its corresponding benchmark dataset and then use the pre-trained models together to extract sarcasm-related features from the sarcasm datasets.
1.1.2 Motivation behind Sarcasm Detection On Tweets
The aim of this project is to identify irrelevant and sarcastic opinions of people by detecting sarcasm on social media sites like Twitter. These opinions can then be used as reviews for effective decision-making. To achieve higher classification accuracy, machine learning algorithms and multiple CNN models are ensembled together, which results in a very decent F-score. Using n-fold cross-validation, the validity of the model is enhanced.
Sentiment-based classification of text documents is a more challenging task than topic-based classification, since it involves discrimination based on the opinions, feelings, and attitudes contained in the text. The opinionated or subjective text on the Web is often non-structured or semi-structured. Furthermore, due to the high dimensionality of text content, feature selection is a crucial problem.
Applying effective and efficient feature selection can enhance the performance of sentiment analysis in terms of both accuracy and the time needed to train the classifier. The applicability of feature selection methods for sentiment-based classification is explored, and their performance is investigated in terms of recall, precision, and accuracy. Five feature selection methods (Document Frequency, Information Gain, Gain Ratio, Chi Squared, Relief-F) and three popular sentiment feature lexicons are investigated.
A support vector machine (SVM) classifier is used for the sentiment classification experiments; SVMs have a good track record in text classification.
Chapter 2
events in real time". Recently, a group of researchers from Carnegie Mellon University (CMU) came up with a study in which they tried to detect sarcasm on a contextualized basis. They noted that most approaches in the past treated sarcasm as a purely linguistic phenomenon, but that by adding extra-linguistic information such as the audience, the author, and the immediate communicative environment, the accuracy of the protocol can be further increased. They also suggested that merely considering the #sarcasm hashtag is not a natural indicator of sarcasm expressed in conversation; rather, it serves an important communicative function, showing the intent of the author to an audience that might not otherwise draw the correct idea or motive from the message. This article created waves around the globe and received a lot of media attention.
To narrow the scope, using Twitter data initially, they selected #sarcasm, #sarcastic, and other hashtags that allowed them to tailor the system to a specific niche area. This focus makes the system less generalizable but, in theory, more accurate on that particular dataset. If a sufficiently high accuracy can be attained with a niche focus, then testing the system on a more general data selection would be the next goal. Tweets will likely require pre-processing to remove erroneous hashtags and to replace proper nouns with generics, so that the sarcasm of the language is analyzed rather than the subject of the message.
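This pre-processing step can be sketched roughly as follows. The regex rules and the `preprocess` helper below are illustrative assumptions, not the authors' actual pipeline:

```python
import re

def preprocess(tweet):
    """Crude tweet cleaner: drop hashtags, anonymize mentions, mask URLs."""
    tweet = re.sub(r"#\w+", "", tweet)          # remove hashtags such as #sarcasm
    tweet = re.sub(r"@\w+", "@user", tweet)     # replace mentions with a generic token
    tweet = re.sub(r"http\S+", "<url>", tweet)  # replace URLs with a placeholder
    return re.sub(r"\s+", " ", tweet).strip()   # collapse leftover whitespace

print(preprocess("Great job @bob #sarcasm http://t.co/x"))  # Great job @user <url>
```

A real pipeline would also need to handle retweet markers and proper-noun replacement, which require more than regular expressions.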
2.4.1 Tokenization
2.4.2 Stemming
after stemming. These are examples of overstemming and understemming errors, respectively. Overstemming lowers precision and understemming lowers recall. They used the WEKA package to perform the stemming operation; WEKA contains implementations of a SnowballStemmer and a LovinsStemmer. The overall impact of stemming depends on the dataset and the stemming algorithm. After testing both implementations, they discovered that stemming reduced the accuracy of the sentiment analysis, so the stemming operation was omitted from the final implementation of the sentiment analysis algorithm.
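The two failure modes can be reproduced even with a toy stemmer. The `naive_stem` function below is a deliberately crude illustration, not WEKA's Snowball or Lovins implementation:

```python
def naive_stem(word):
    """Strip a few common suffixes; crude on purpose to show stemming errors."""
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("news"))  # "new"  -- overstemming: "news" is not the plural of "new"
print(naive_stem("ran"))   # "ran"  -- understemming: not conflated with "run"
```

Overstemming merges unrelated words into one stem (hurting precision), while understemming fails to merge related forms (hurting recall), matching the trade-off described above.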
2.4.3 N-Grams
Figure 2.2: n-grams for the sentence[6]
To extract the grams, each text goes through tokenization and stemming, is uncapitalized, and then every n-gram is added to the dataset.
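That pipeline can be sketched as follows. The whitespace tokenization and the `ngrams` helper are illustrative assumptions, and stemming is omitted for brevity:

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams over a token sequence."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

text = "The movie was not good"
tokens = text.lower().split()  # tokenize and uncapitalize
bigrams = ngrams(tokens, 2)
print(bigrams)  # [('the', 'movie'), ('movie', 'was'), ('was', 'not'), ('not', 'good')]
```

Bigrams such as `('not', 'good')` are exactly why n-grams beyond unigrams matter for sentiment: they preserve local negation context that a bag of single words loses.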
Chapter 3
Feature Selection
3.1.2 Information Gain
Info(D) = -\sum_{i=1}^{m} P_i \log_2(P_i)    (3.1) [1]

where m represents the number of classes (m = 2 for binary classification) and P_i denotes the probability that a random instance in partition D belongs to class C_i, estimated as |C_{i,D}| / |D| (i.e. the proportion of instances of each class or category). The logarithm to base 2 reflects the fact that we encode information in bits.
If we partition the instances in D on some feature A with values {a_1, a_2, ..., a_v}, D splits into v partitions {D_1, D_2, ..., D_v}. The amount of information, in bits, still required for an exact classification is measured by:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \cdot Info(D_j)    (3.2) [1]

where |D_j| / |D| acts as the weight of the jth partition. Finally, the information gain obtained by partitioning on A is:

Gain(A) = Info(D) - Info_A(D)    (3.3) [1]
Features are ranked by their information gain score, and those with the highest scores are selected. Classifying instances using those ranked features minimizes the information still needed, i.e. decreases the overall entropy.
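Equations (3.1)-(3.3) can be computed directly from label counts. The sketch below assumes a single categorical feature; the `entropy` and `info_gain` helpers are illustrative names:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D): entropy of a list of class labels, in bits -- eq. (3.1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one categorical feature -- eq. (3.3)."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Info_A(D): weighted entropy of each partition -- eq. (3.2)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

labels = ["sarcastic", "sarcastic", "plain", "plain"]
print(entropy(labels))                                # 1.0 bit for a 50/50 split
print(info_gain(["yes", "yes", "no", "no"], labels))  # 1.0: feature separates classes perfectly
```

A feature whose values perfectly separate the classes achieves the maximum gain (the full entropy of D), while an uninformative feature scores near zero.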
3.1.3 Gain Ratio
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)    (3.4) [1]

A high SplitInfo means the partitions have equal size (uniform), while a low SplitInfo means a few partitions contain most of the tuples (peaks). Finally, the gain ratio is defined as:

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}    (3.5) [1]
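Equations (3.4) and (3.5) can be sketched in the same style; `split_info` and `gain_ratio` are illustrative names, and the gain value is assumed to come from an information-gain computation like the one in Section 3.1.2:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) over the partition sizes |D_1|, ..., |D_v| -- eq. (3.4)."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D) -- eq. (3.5)."""
    return gain / split_info(partition_sizes)

print(split_info([5, 5]))       # 1.0: two equal partitions (uniform, high SplitInfo)
print(split_info([9, 1]))       # ~0.47: one partition holds most tuples (low SplitInfo)
print(gain_ratio(0.8, [5, 5]))  # 0.8
```

Dividing by SplitInfo penalizes features that split the data into many small partitions, correcting information gain's bias toward high-arity features.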
3.1.4 Chi Statistics
The chi-square (\chi^2) statistic measures the lack of independence between a term t and a category c, and can be compared to the \chi^2 distribution with one degree of freedom to judge extremeness. Using the two-way contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents, the term-goodness measure is defined as:

\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (3.6)
[1]
The \chi^2 statistic has a natural value of zero if t and c are independent. We computed, for each category, the \chi^2 statistic between each unique term in a training corpus and that category, and then combined the category-specific scores of each term into two scores:
\chi^2_{avg}(t) = \sum_{i=1}^{m} Pr(c_i) \, \chi^2(t, c_i)    (3.7) [1]

\chi^2_{max}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \}    (3.8) [1]
The computation of chi scores has a quadratic complexity, similar to IG. A major difference between \chi^2 and MI is that \chi^2 is a normalized value, and hence \chi^2 values are comparable across terms for the same category. However, this normalization breaks down (the value can no longer be accurately compared to the \chi^2 distribution) if any cell in the contingency table is lightly populated, which is the case for low-frequency terms. Hence, the \chi^2 statistic is known to be unreliable for low-frequency terms.
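Equation (3.6) translates directly into code; the `chi2_term` helper below is an illustrative name for a sketch of that formula:

```python
def chi2_term(A, B, C, D):
    """Chi-square term-goodness from the 2x2 contingency counts of eq. (3.6).

    A: t and c co-occur; B: t without c; C: c without t; D: neither.
    """
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# Independent term and category (same proportion in both rows) -> score 0.
print(chi2_term(10, 10, 10, 10))  # 0.0
# Strong association: the term appears almost only inside the category.
print(chi2_term(40, 5, 10, 45))   # a large positive score
```

Note how AD - CB is zero exactly when the rows of the contingency table are proportional, i.e. when t and c are independent, matching the "natural value of zero" property stated above.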
The quality of classification in the lexicon-based approach depends solely on the quality of the lexicon. The idea behind the lexicon-based approach is quite simple:
• Lexicon construction: a lexicon of words is built, with polarity scores assigned to them (different ways of creating lexicons are described in detail in the following section).
• A bag-of-words is created from the text that needs to be analyzed for sentiment.
• Preprocessing step: lowercasing, stop-word removal, stemming, part-of-speech tagging, negation handling.
• Sentiment score calculation: each word from the bag-of-words is compared against the lexicon. If the word is found in the lexicon, the sentiment score of that word is added to the total sentiment score of the text. For example, the total score of the text "A masterful[+0.92] film from a master[+1] filmmaker, unique[+1] in its deceptive grimness, compelling[+1] in its fatalist[-0.84] worldview." is calculated as: total sentiment score = +0.92 + 1 + 1 + 1 - 0.84 = 3.08, which means that the text expresses a positive opinion.
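The scoring step above can be sketched as follows. The tiny `lexicon` dict holds only the scored words from the example sentence; a real lexicon such as AFINN or SentiWordNet would be far larger:

```python
# Illustrative toy lexicon containing only the words from the example review.
lexicon = {"masterful": 0.92, "master": 1.0, "unique": 1.0,
           "compelling": 1.0, "fatalist": -0.84}

def sentiment_score(text, lexicon):
    """Sum the lexicon polarity of every word found in the text."""
    total = 0.0
    for word in text.lower().split():
        total += lexicon.get(word.strip(".,!?"), 0.0)  # unknown words score 0
    return total

review = ("A masterful film from a master filmmaker, unique in its "
          "deceptive grimness, compelling in its fatalist worldview.")
print(round(sentiment_score(review, lexicon), 2))  # 3.08 -> positive opinion
```

A positive total is read as a positive opinion, a negative total as a negative one; this matches the worked example's score of 3.08.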
Since the creation of a lexicon is the central part of the lexicon-based approach, many researchers have created their own sentiment lexicons, among them:
• Opinion Lexicon [10]
• SentiWordNet [10]
• AFINN Lexicon [10]
• NRC-Hashtag [9]
• Harvard Inquirer Lexicon [6]
• Loughran-McDonald Lexicon [7]
Different ways of the lexicon creation process are described below.
Hand-tagged lexicons. The most straightforward approach, but also the most time-consuming, is to manually construct a lexicon and tag its words as positive or negative. One such lexicon was constructed by reading several thousand messages and manually selecting the words that carried sentiment. A discriminant function was then used to identify words from a training dataset suitable for sentiment-classifier purposes, and the remaining words were expanded to include all potential forms of each word in the final lexicon. Another example of a hand-tagged lexicon is the Multi-Perspective Question Answering (MPQA) Opinion Corpus, which is publicly available and consists of 4,850 words manually labeled as positive or negative and as having strong or weak subjectivity. Yet another lexicon was created by extending a list of subjectivity clues; the 8,000 words comprising that lexicon were then manually tagged with their prior polarity. Another resource is SentiWordNet, an important resource for sentiment calculation.
sentScore = \frac{\#Positive\;sentences}{\#Positive\;sentences + \#Negative\;sentences}    (3.9) [9]

sentScore = \frac{122}{122 + 44} = 0.73    (3.10)
Similarly, the negative score for the word "pleasant" can be calculated by dividing the number of occurrences in negative sentences by the total number of mentions:

sentScore = \frac{\#Negative\;sentences}{\#Positive\;sentences + \#Negative\;sentences}    (3.11) [9]

sentScore = \frac{44}{122 + 44} = 0.27    (3.12)
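Equations (3.9)-(3.12) amount to a single ratio; the `sent_score` helper below is an illustrative name for that computation:

```python
def sent_score(positive_mentions, negative_mentions):
    """Fraction of a word's occurrences that fall in positive sentences -- eq. (3.9)."""
    return positive_mentions / (positive_mentions + negative_mentions)

# "pleasant" occurs in 122 positive and 44 negative sentences:
print(round(sent_score(122, 44), 2))      # 0.73 -> positive score, eq. (3.10)
print(round(1 - sent_score(122, 44), 2))  # 0.27 -> complementary negative score, eq. (3.12)
```

Because the two scores sum to one, computing the positive score immediately gives the negative one as its complement.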
Based on the negative and positive scores of the word we can make a decision about its polarity. Thus, the word "pleasant" carries a positive sentiment. Some words, however, have negative and positive polarity scores very close to each other. These words are called "neutral" and do not have an impact on the overall sentiment score of the message, since they tend to appear with equal frequency in both positive and negative texts. We checked the sentiment scores of some words and presented the results in the table given below.
As we can see from the table, the words "GOOD" and "BAD" have strongly defined polarity scores, as we would expect. The word "LIKE" has polarity scores in the range between 0.4 and 0.6 and is considered a neutral word. At first sight, "LIKE" creates the impression of a positive word, and a deeper analysis is needed to understand its "neutral" nature. The word "LIKE" generally plays one of two roles:
1. Being a verb to express preference. For example: ”I like ice-cream”.
2. Being a preposition for the purpose of comparison. For example: ”This
town looks like Brighton”.
The example sentence in case 1 has positive sentiment, but it can easily be transformed into a negative sentence: "I don't like ice-cream". This demonstrates that the word "LIKE" is used with roughly equal frequency for expressing positive and negative opinions. In case 2, the word "LIKE" plays the role of a preposition and does not play a role in determining the overall polarity of the sentence. Words from the bag-of-words with a polarity in the range [0.4, 0.6] were therefore removed, since they do not help to classify the text as positive or negative. As we can see from the table, both positive and negative scores have positive values, and in order to determine the polarity of a word the two scores need to be compared. The "goodness" scores of the words were mapped into the range [-1, 1], where negative scores were assigned to words with negative polarity and positive scores indicated words with positive polarity. The following formula was used for mapping:

mappedScore = sentScore \times 2 - 1    [9]
According to this formula, the word "LIKE", for example, will have a score of 0.446 * 2 - 1 ≈ -0.1, which is very close to 0 and indicates the neutrality of the word. If a word were absolutely positive, with a "goodness" score of 1, the final score would also be positive: 1 * 2 - 1 = 1. If a word were absolutely negative, with a "goodness" score of 0, the final score would be negative: 0 * 2 - 1 = -1.
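The mapping and the neutral-word cutoff can be sketched together; the function names below are illustrative:

```python
def map_to_polarity(goodness):
    """Map a goodness score in [0, 1] onto the polarity range [-1, 1]."""
    return goodness * 2 - 1

def is_neutral(goodness, low=0.4, high=0.6):
    """Words scoring within [0.4, 0.6] are treated as neutral and dropped."""
    return low <= goodness <= high

print(map_to_polarity(1.0))              # 1.0: absolutely positive
print(map_to_polarity(0.0))              # -1.0: absolutely negative
print(round(map_to_polarity(0.446), 3))  # -0.108: "LIKE", very close to 0
print(is_neutral(0.446))                 # True -> removed from the bag-of-words
```

The linear map keeps the ordering of the goodness scores while centering neutrality at 0, so the sign of the mapped score directly gives the word's polarity.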
Chapter 4
CONCLUSION
Bibliography
[9] O. Kolchyna, T. T. P. Souza, P. C. Treleaven and T. Aste, "Methodology for Twitter Sentiment Analysis", Department of Computer Science, UCL, Gower Street, London, UK; Systemic Risk Centre, London School of Economics and Political Sciences, London, UK
[10] M. Hu and B. Liu, "Opinion lexicon", 2004
[11] B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis", in Mining Text Data, Springer US, 2013, pp. 415-463