Swaraj Project 12march
on
"Feature Selection and Data Visualization"
Submitted for the degree of
Bachelor of Engineering
in
Information Technology
by
Swaraj Surendra Patel
(T150228554/4354 & TE)
Under the guidance of
Prof. Rupali Amit Bagate
CERTIFICATE
This is to certify that the project based seminar report entitled "Feature Selection and Data Visualization" being submitted by Swaraj Surendra Patel (4354 / T150228554 and TE IT) is a record of bonafide work carried out by him under the supervision and guidance of Prof. Rupali Amit Bagate in partial fulfillment of the requirement for the TE (Information Technology Engineering) - 2015 Course of Savitribai Phule Pune University, Pune in the academic year 2018-2019.
Date: 31/03/2019
Place: Pune
Dr. B. P. Patil
Principal
This Project Based Seminar report has been examined by us as per the
Savitribai Phule Pune University, Pune requirements at Army Institute of
Technology, Pune-411015 on . . . . . . . . . . .
ACKNOWLEDGMENT
I am highly indebted to my guide, Prof. Rupali Amit Bagate, for her guidance and constant supervision, for providing the necessary information regarding the seminar report, and for her support in completing it. I would like to express my special gratitude and thanks to the Seminar Coordinator, Prof. Ashwini Sapkal, for giving me her attention and time.
This acknowledgment would be incomplete without expressing my thanks to Dr. Sangeeta Jadhav, Head of the Department (Information Technology), for her support during the work. I would like to extend my heartfelt gratitude to my Principal, Dr. B. P. Patil, who provided a great deal of valuable support, often from behind the scenes of college administration.
I would also like to express my gratitude to my parents and friends for their kind cooperation and encouragement, which helped me complete this report. My thanks and appreciation also go to my colleagues who helped in developing the seminar report, and to everyone who willingly helped me with their abilities.
Abstract
Sarcasm is a sophisticated form of irony widely used in social networks and
microblogging websites. It is usually used to convey implicit information
within the message a person transmits. Sarcasm might be used for different
purposes, such as criticism or mockery. However, it is hard even for humans
to recognize. Therefore, recognizing sarcastic statements can be very useful
to improve automatic sentiment analysis of data collected from microblogging
websites or social networks. Sentiment Analysis refers to the identification
and aggregation of attitudes and opinions expressed by Internet users toward a specific topic. In this seminar, we propose a pattern-based approach to detect sarcasm on Twitter. A set of features is proposed that covers the different types of sarcasm defined; these features are used to classify text as sarcastic or non-sarcastic. The proposed approach reaches an accuracy of 83.1% with a precision of 91.1%. The importance of each set of proposed features is also studied, and its added value to the classification is evaluated.
Contents
Certificate
Acknowledgment
Abstract
Chapter Contents
List of Figures
3 Feature Selection
3.1 Statistical Based Approach
3.1.1 Document Frequency
3.1.2 Information Gain
3.1.3 Gain Ratio
3.1.4 Chi Statistics
3.1.5 Relief-F Algorithm
3.2 Lexicon Based Approach
3.2.1 Lexicon Creation
3.2.2 Automatic Lexicon Generation
4 CONCLUSION
List of Figures
Chapter 1
INTRODUCTION TO SARCASM DETECTION ON TWEETS
During the past few years, there has been a large increase in opinionated textual data on social media over the Internet. Sentiment analysis is used to analyze this opinionated text: it helps us understand the emotion behind what a writer has written. The field faces many challenges, and sarcasm detection is one of the major ones. Sarcasm is an unconventional way of conveying a message that conflicts with its context, and it can lead to a state of ambiguity. The popularity and reach of the Internet, and specifically of Web social media, have changed the way we generate, share, and utilize information. Since Twitter is a relatively new service, a somewhat lengthy description of the medium and the data is appropriate. Twitter is a very popular microblogging service. Apart from simple text, tweets may contain references to URL addresses, references to other Twitter users (these appear as @user), or content tags (called hashtags) assigned by the tweeter. The information on these media is freely available online in text format and contains the voice of the public. For example, many Internet-based discussion forums and review sites enable people to express their views about a product or service. Analyzing such
consumer-generated online text content offers tremendous business opportunities in terms of discovering consumer preferences and gathering feedback. Data pre-processing is one of the first steps implemented by many researchers; techniques such as tokenization, stemming and lemmatization, and stop-word removal are widely applied. Several research works have been done on sarcasm detection, using a variety of feature extraction techniques. Sarcasm detection may depend on sentiment and other cognitive aspects. For this reason, we incorporate both sentiment and emotion clues in our framework. Along with these, we also argue that the personality of the opinion holder is an important factor for sarcasm detection. To address all of these variables, we create a separate model for each of them: sentiment, emotion, and personality. The idea is to train each model on its corresponding benchmark dataset and then use the pre-trained models together to extract sarcasm-related features from the sarcasm datasets.
1.1.2 Motivation behind Sarcasm Detection On Tweets
The aim of this project is to identify irrelevant and sarcastic opinions of people by detecting sarcasm on social media sites like Twitter. These opinions can then be used as reviews for effective decision-making. To achieve higher classification accuracy, machine learning algorithms and multiple CNN models are ensembled together, which results in a very decent F-score. Using n-fold cross-validation, the validity of the model is enhanced.
Sentiment-based classification of text documents is a more challenging task than topic-based classification, since it involves discrimination based on the opinions, feelings, and attitudes contained in the text. The opinionated or subjective text on the Web is often non-structured or semi-structured. Furthermore, due to the high dimensionality of text content, feature selection is a crucial problem.
Applying effective and efficient feature selection can enhance the performance of sentiment analysis in terms of both accuracy and the time needed to train the classifier. The applicability of feature selection methods for sentiment-based classification is explored, and their performance is investigated in terms of recall, precision, and accuracy. Five feature selection methods (Document Frequency, Information Gain, Gain Ratio, Chi Squared, Relief-F) and three popular sentiment feature lexicons are investigated.
A support vector machine (SVM) classifier is used for the sentiment classification experiments; SVMs have a good track record in text classification.
Chapter 2
events in real time". Recently, a group of researchers from Carnegie Mellon University (CMU) came up with a study in which they tried to detect sarcasm on a contextualized basis. They noted that most approaches in the past treated sarcasm as a purely linguistic phenomenon, but that by adding extra-linguistic information such as the audience, the author, and the immediate communicative environment, the accuracy of the protocol can be further increased. They also suggested that merely considering the #sarcasm hashtag is not a natural indicator of sarcasm expressed in conversation; rather, it serves an important communicative function, showing the intent of the author to an audience that might not otherwise draw the correct idea or motive from the message. This article created waves around the globe and received a lot of media attention.
To narrow the scope, using Twitter data initially, they selected #sarcasm, #sarcastic, and other hashtags that allowed them to tailor the system to a specific niche area. This focus makes the system less generalizable but, in theory, more accurate on that particular dataset. If a sufficiently high accuracy can be attained with a niche focus, then testing the system on a more general data selection would be the next goal. Tweets will likely require pre-processing to remove erroneous hashtags and to replace proper nouns with generics, so that the sarcasm of the language is analyzed rather than the subject of the message.
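This pre-processing step can be sketched roughly as follows. The regex rules and the `preprocess` helper below are illustrative assumptions, not the authors' actual pipeline:

```python
import re

def preprocess(tweet):
    """Crude tweet cleaner: drop hashtags, anonymize mentions, mask URLs."""
    tweet = re.sub(r"#\w+", "", tweet)          # remove hashtags such as #sarcasm
    tweet = re.sub(r"@\w+", "@user", tweet)     # replace mentions with a generic token
    tweet = re.sub(r"http\S+", "<url>", tweet)  # replace URLs with a placeholder
    return re.sub(r"\s+", " ", tweet).strip()   # collapse leftover whitespace

print(preprocess("Great job @bob #sarcasm http://t.co/x"))  # Great job @user <url>
```

A real pipeline would also need to handle retweet markers and proper-noun replacement, which require more than regular expressions.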
2.4.1 Tokenization
2.4.2 Stemming
after stemming. These are examples of overstemming and understemming errors, respectively. Overstemming lowers precision and understemming lowers recall. They used the WEKA package to perform the stemming operation; WEKA contains implementations of a SnowballStemmer and a LovinsStemmer. The overall impact of stemming depends on the dataset and the stemming algorithm. After testing both implementations, they discovered that stemming reduced the accuracy of the sentiment analysis, so the stemming operation was omitted from the final implementation of the sentiment analysis algorithm.
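The two failure modes can be reproduced even with a toy stemmer. The `naive_stem` function below is a deliberately crude illustration, not WEKA's Snowball or Lovins implementation:

```python
def naive_stem(word):
    """Strip a few common suffixes; crude on purpose to show stemming errors."""
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("news"))  # "new"  -- overstemming: "news" is not the plural of "new"
print(naive_stem("ran"))   # "ran"  -- understemming: not conflated with "run"
```

Overstemming merges unrelated words into one stem (hurting precision), while understemming fails to merge related forms (hurting recall), matching the trade-off described above.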
2.4.3 N-Grams
Figure 2.2: n-grams for the sentence[6]
To extract the grams, each text goes through tokenization and stemming, is uncapitalized, and then every n-gram is added to the dataset.
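That pipeline can be sketched as follows. The whitespace tokenization and the `ngrams` helper are illustrative assumptions, and stemming is omitted for brevity:

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams over a token sequence."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

text = "The movie was not good"
tokens = text.lower().split()  # tokenize and uncapitalize
bigrams = ngrams(tokens, 2)
print(bigrams)  # [('the', 'movie'), ('movie', 'was'), ('was', 'not'), ('not', 'good')]
```

Bigrams such as `('not', 'good')` are exactly why n-grams beyond unigrams matter for sentiment: they preserve local negation context that a bag of single words loses.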
Chapter 3
Feature Selection
3.1.2 Information Gain
Info(D) = -\sum_{i=1}^{m} P_i \log_2(P_i)    (3.1) [1]

where m represents the number of classes (m = 2 for binary classification) and P_i denotes the probability that a random instance in partition D belongs to class C_i, estimated as |C_{i,D}| / |D| (i.e. the proportion of instances of each class or category). The logarithm to base 2 reflects the fact that we encode information in bits.
If we partition the instances in D on some feature A with values {a_1, a_2, ..., a_v}, D splits into v partitions {D_1, D_2, ..., D_v}. The amount of information, in bits, still required for an exact classification is measured by:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \cdot Info(D_j)    (3.2) [1]

where |D_j| / |D| acts as the weight of the jth partition. Finally, the information gain obtained by partitioning on A is:

Gain(A) = Info(D) - Info_A(D)    (3.3) [1]
Features are ranked by their information gain score, and those with the highest scores are selected. Classifying instances using those ranked features minimizes the information still needed, i.e. decreases the overall entropy.
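Equations (3.1)-(3.3) can be computed directly from label counts. The sketch below assumes a single categorical feature; the `entropy` and `info_gain` helpers are illustrative names:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D): entropy of a list of class labels, in bits -- eq. (3.1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one categorical feature -- eq. (3.3)."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Info_A(D): weighted entropy of each partition -- eq. (3.2)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

labels = ["sarcastic", "sarcastic", "plain", "plain"]
print(entropy(labels))                                # 1.0 bit for a 50/50 split
print(info_gain(["yes", "yes", "no", "no"], labels))  # 1.0: feature separates classes perfectly
```

A feature whose values perfectly separate the classes achieves the maximum gain (the full entropy of D), while an uninformative feature scores near zero.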
3.1.3 Gain Ratio
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)    (3.4) [1]

A high SplitInfo means the partitions have equal size (uniform), while a low SplitInfo means a few partitions contain most of the tuples (peaks). Finally, the gain ratio is defined as:

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}    (3.5) [1]
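Equations (3.4) and (3.5) can be sketched in the same style; `split_info` and `gain_ratio` are illustrative names, and the gain value is assumed to come from an information-gain computation like the one in Section 3.1.2:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) over the partition sizes |D_1|, ..., |D_v| -- eq. (3.4)."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D) -- eq. (3.5)."""
    return gain / split_info(partition_sizes)

print(split_info([5, 5]))       # 1.0: two equal partitions (uniform, high SplitInfo)
print(split_info([9, 1]))       # ~0.47: one partition holds most tuples (low SplitInfo)
print(gain_ratio(0.8, [5, 5]))  # 0.8
```

Dividing by SplitInfo penalizes features that split the data into many small partitions, correcting information gain's bias toward high-arity features.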
3.1.4 Chi Statistics
The chi-square (\chi^2) statistic measures the lack of independence between a term t and a category c, and can be compared to the \chi^2 distribution with one degree of freedom to judge extremeness. Using the two-way contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents, the term-goodness measure is defined as:

\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (3.6)
[1]
The \chi^2 statistic has a natural value of zero if t and c are independent. We computed, for each category, the \chi^2 statistic between each unique term in a training corpus and that category, and then combined the category-specific scores of each term into two scores:
\chi^2_{avg}(t) = \sum_{i=1}^{m} Pr(c_i) \, \chi^2(t, c_i)    (3.7) [1]

\chi^2_{max}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \}    (3.8) [1]
The computation of chi scores has a quadratic complexity, similar to IG. A major difference between \chi^2 and MI is that \chi^2 is a normalized value, and hence \chi^2 values are comparable across terms for the same category. However, this normalization breaks down (the value can no longer be accurately compared to the \chi^2 distribution) if any cell in the contingency table is lightly populated, which is the case for low-frequency terms. Hence, the \chi^2 statistic is known to be unreliable for low-frequency terms.
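Equation (3.6) translates directly into code; the `chi2_term` helper below is an illustrative name for a sketch of that formula:

```python
def chi2_term(A, B, C, D):
    """Chi-square term-goodness from the 2x2 contingency counts of eq. (3.6).

    A: t and c co-occur; B: t without c; C: c without t; D: neither.
    """
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# Independent term and category (same proportion in both rows) -> score 0.
print(chi2_term(10, 10, 10, 10))  # 0.0
# Strong association: the term appears almost only inside the category.
print(chi2_term(40, 5, 10, 45))   # a large positive score
```

Note how AD - CB is zero exactly when the rows of the contingency table are proportional, i.e. when t and c are independent, matching the "natural value of zero" property stated above.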
The quality of classification in the lexicon-based approach depends solely on the quality of the lexicon. The idea behind the lexicon-based approach is quite simple:
• Lexicon construction: a lexicon of words is built, with polarity scores assigned to them (different ways of creating lexicons are described in detail in the following section).
• A bag-of-words is created from the text that needs to be analyzed for sentiment.
• Preprocessing step: lowercasing, stop-word removal, stemming, part-of-speech tagging, negation handling.
• Sentiment score calculation: each word from the bag-of-words is compared against the lexicon. If the word is found in the lexicon, the sentiment score of that word is added to the total sentiment score of the text. For example, the total score of the text "A masterful[+0.92] film from a master[+1] filmmaker, unique[+1] in its deceptive grimness, compelling[+1] in its fatalist[-0.84] worldview." is calculated as: total sentiment score = +0.92 + 1 + 1 + 1 - 0.84 = 3.08, which means that the text expresses a positive opinion.
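The scoring step above can be sketched as follows. The tiny `lexicon` dict holds only the scored words from the example sentence; a real lexicon such as AFINN or SentiWordNet would be far larger:

```python
# Illustrative toy lexicon containing only the words from the example review.
lexicon = {"masterful": 0.92, "master": 1.0, "unique": 1.0,
           "compelling": 1.0, "fatalist": -0.84}

def sentiment_score(text, lexicon):
    """Sum the lexicon polarity of every word found in the text."""
    total = 0.0
    for word in text.lower().split():
        total += lexicon.get(word.strip(".,!?"), 0.0)  # unknown words score 0
    return total

review = ("A masterful film from a master filmmaker, unique in its "
          "deceptive grimness, compelling in its fatalist worldview.")
print(round(sentiment_score(review, lexicon), 2))  # 3.08 -> positive opinion
```

A positive total is read as a positive opinion, a negative total as a negative one; this matches the worked example's score of 3.08.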
Since the creation of a lexicon is the central part of the lexicon-based approach, many researchers have created their own sentiment lexicons, among them:
• Opinion Lexicon [10]
• SentiWordNet [10]
• AFINN Lexicon [10]
• NRC-Hashtag [9]
• Harvard Inquirer Lexicon [6]
• Loughran-McDonald Lexicon [7]
Different ways of the lexicon creation process are described below.
Hand-tagged lexicons. The most straightforward approach, but also the most time-consuming, is to manually construct a lexicon and tag its words as positive or negative. One such lexicon was constructed by reading several thousand messages and manually selecting the words that carried sentiment. A discriminant function was then used to identify words from a training dataset suitable for sentiment-classifier purposes, and the remaining words were expanded to include all potential forms of each word in the final lexicon. Another example of a hand-tagged lexicon is the Multi-Perspective Question Answering (MPQA) Opinion Corpus, which is publicly available and consists of 4,850 words manually labeled as positive or negative and as having strong or weak subjectivity. Yet another lexicon was created by extending a list of subjectivity clues; the 8,000 words comprising that lexicon were then manually tagged with their prior polarity. Another resource is SentiWordNet, an important resource for sentiment calculation.
sentScore = \frac{\#Positive\;sentences}{\#Positive\;sentences + \#Negative\;sentences}    (3.9) [9]

sentScore = \frac{122}{122 + 44} = 0.73    (3.10)
Similarly, the negative score for the word "pleasant" can be calculated by dividing the number of occurrences in negative sentences by the total number of mentions:

sentScore = \frac{\#Negative\;sentences}{\#Positive\;sentences + \#Negative\;sentences}    (3.11) [9]

sentScore = \frac{44}{122 + 44} = 0.27    (3.12)
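Equations (3.9)-(3.12) amount to a single ratio; the `sent_score` helper below is an illustrative name for that computation:

```python
def sent_score(positive_mentions, negative_mentions):
    """Fraction of a word's occurrences that fall in positive sentences -- eq. (3.9)."""
    return positive_mentions / (positive_mentions + negative_mentions)

# "pleasant" occurs in 122 positive and 44 negative sentences:
print(round(sent_score(122, 44), 2))      # 0.73 -> positive score, eq. (3.10)
print(round(1 - sent_score(122, 44), 2))  # 0.27 -> complementary negative score, eq. (3.12)
```

Because the two scores sum to one, computing the positive score immediately gives the negative one as its complement.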
Based on the negative and positive scores of the word we can make a decision about its polarity. Thus, the word "pleasant" carries a positive sentiment. Some words, however, have negative and positive polarity scores very close to each other. These words are called "neutral" and do not have an impact on the overall sentiment score of the message, since they tend to appear with equal frequency in both positive and negative texts. We checked the sentiment scores of some words and presented the results in the table given below.
As we can see from the table, the words "GOOD" and "BAD" have strongly defined polarity scores, as we would expect. The word "LIKE" has polarity scores in the range between 0.4 and 0.6 and is considered a neutral word. At first sight, "LIKE" creates the impression of a positive word, and a deeper analysis is needed to understand its "neutral" nature. The word "LIKE" generally plays one of two roles:
1. Being a verb to express preference. For example: ”I like ice-cream”.
2. Being a preposition for the purpose of comparison. For example: ”This
town looks like Brighton”.
The example sentence in case 1 has positive sentiment, but it can easily be transformed into a negative sentence: "I don't like ice-cream". This demonstrates that the word "LIKE" is used with roughly equal frequency for expressing positive and negative opinions. In case 2, the word "LIKE" plays the role of a preposition and does not play a role in determining the overall polarity of the sentence. Words from the bag-of-words with a polarity in the range [0.4, 0.6] were therefore removed, since they do not help to classify the text as positive or negative. As we can see from the table, both positive and negative scores have positive values, and in order to determine the polarity of a word the two scores need to be compared. The "goodness" scores of the words were mapped into the range [-1, 1], where negative scores were assigned to words with negative polarity and positive scores indicated words with positive polarity. The following formula was used for mapping:

mappedScore = sentScore \times 2 - 1    [9]
According to this formula, the word "LIKE", for example, will have a score of 0.446 * 2 - 1 ≈ -0.1, which is very close to 0 and indicates the neutrality of the word. If a word were absolutely positive, with a "goodness" score of 1, the final score would also be positive: 1 * 2 - 1 = 1. If a word were absolutely negative, with a "goodness" score of 0, the final score would be negative: 0 * 2 - 1 = -1.
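The mapping and the neutral-word cutoff can be sketched together; the function names below are illustrative:

```python
def map_to_polarity(goodness):
    """Map a goodness score in [0, 1] onto the polarity range [-1, 1]."""
    return goodness * 2 - 1

def is_neutral(goodness, low=0.4, high=0.6):
    """Words scoring within [0.4, 0.6] are treated as neutral and dropped."""
    return low <= goodness <= high

print(map_to_polarity(1.0))              # 1.0: absolutely positive
print(map_to_polarity(0.0))              # -1.0: absolutely negative
print(round(map_to_polarity(0.446), 3))  # -0.108: "LIKE", very close to 0
print(is_neutral(0.446))                 # True -> removed from the bag-of-words
```

The linear map keeps the ordering of the goodness scores while centering neutrality at 0, so the sign of the mapped score directly gives the word's polarity.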
Chapter 4
CONCLUSION
Bibliography
[9] O. Kolchyna, T. T. P. Souza, P. C. Treleaven and T. Aste, "Methodology for Twitter Sentiment Analysis", Department of Computer Science, UCL, Gower Street, London, UK; Systemic Risk Centre, London School of Economics and Political Sciences, London, UK
[10] M. Hu and B. Liu, "Opinion lexicon", 2004
[11] B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis", in Mining Text Data, Springer US, 2013, pp. 415-463