Projet Scientifique

A FAST MODEL OF SENTIMENT ANALYSIS FOR SOCIAL MEDIA
By NDJOKO A BIDIAS Cesar Junior FE21P043
SUPERVISED BY Pr. FUTE ELIE
Tuesday, 17 January 2022

OUTLINE OF THE PRESENTATION
❏ Introduction
❏ Problem Statement
❏ Objectives
❏ Main Objective
❏ Specific Objectives
❏ Literature Review
❏ Comparative Studies of the different techniques
❏ Proposed Methodology
❏ Conclusions / Further Works
❏ References
INTRODUCTION
The emergence of Web 2.0 is changing the world of social media. Not only do people use
online social media to connect and to share information and personal opinions with others,
but businesses can also communicate with customers, understand them, and improve their
products and services through social media. The number of social media users increases
every day, and it was estimated that by 2019 there would be up to 2.77 billion social
media users worldwide.
INTRODUCTION
Social media is rich in raw, unprocessed data. Improvements in technology,
especially in machine learning and artificial intelligence, allow this data to be
processed and converted into useful information that can benefit most business
organizations, especially in the domain of sentiment analysis.
INTRODUCTION
Sentiment analysis is an approach that uses Natural Language Processing (NLP)
to extract, convert and interpret opinions from a text and classify them as
positive, negative or neutral sentiment.
PROBLEM STATEMENT
The main issues with sentiment analysis include the slowness of the models, which
may be due in part to the complexity of text data: the context of words, phrases or
sentences, the cultural background of the user, negation, sarcasm and irony, and the
difficulty of determining the user’s stance. Models trained on poorly prepared data are
a further issue. These issues can affect industries such as healthcare, where they can
endanger effective communication between patients and doctors: doctors may not
understand patients’ needs or identify which services dissatisfied them, and may be
unable to monitor the side effects of medications based on social media posts and so
prevent unforeseen consequences.
OBJECTIVES
MAIN OBJECTIVE
❏ The main objective of this paper is to obtain a fast and efficient model of sentiment
analysis for social media.
SPECIFIC OBJECTIVES
❏ To assess the existing techniques and compare their efficiencies.
❏ To investigate what produced the results obtained and to address the
deficiencies, bugs or hindrances discovered during this step.
❏ After solving the issues found in the previous objective, to reassess the existing
techniques and then, based on the results, to propose another technique or a blend of
existing techniques in order to attain the main objective.
LITERATURE REVIEW
There are basically two main approaches to sentiment analysis:
❏ Lexicon-based approach
❏ Machine-Learning based approach
Let us discuss what was done with each of these approaches in
previous works.
LITERATURE REVIEW
The work of Oscar Araque et al. (DepecheMood++: a Bilingual
Emotion Lexicon Built Through Simple Yet Powerful Techniques
(2019)) is an example of the lexicon-based approach to sentiment
analysis.
LITERATURE REVIEW
The main problem this work tackles is the low performance of
lexicon-based approaches in terms of precision and coverage. To
address this, the paper investigates whether computationally cheap
techniques such as document filtering, text pre-processing and
frequency cut-off can improve the performance of rule-based
techniques, and whether machine learning techniques relying only on
lexicon emotion scores can serve as a baseline for robust, complex
and fast models that are portable across languages.
LITERATURE REVIEW
To this end, a new lexicon was built upon the publicly available DepecheMood
lexicon, which is generated from news sources distantly annotated with emotional
scores. The new lexicon, called DepecheMood++, extends the original English lexicon
using a larger dataset and adds a novel emotion lexicon targeting the Italian
language, built with the same methodology. It was evaluated and released to the
community. Experiments were performed on six datasets/tasks exhibiting a wide
diversity in terms of domain (namely news, blog posts, mental health forum posts and
Twitter), language (English and Italian), setting (both supervised and unsupervised)
and task (regression and classification).
LITERATURE REVIEW
It is an upgrade of DepecheMood (which was built upon a dataset of 25.3k documents and
13.5M words, about 530 words per document on average). It uses an expanded dataset in
order (i) to rebuild the English lexicon on a larger corpus, and (ii) to build a novel lexicon
targeting the Italian language.
LITERATURE REVIEW
First, a document-by-emotion matrix (M_DE) is produced per language,
containing the voting percentages for each document in the eight affective
dimensions available on rappler.com for English (Happy, Sad, Angry,
Annoyed, Don’t Care, Inspired, Afraid, Amused) and the six available on
corriere.it for Italian (Happy, Sad, Amused, Afraid, Angry / Annoyed,
Inspired / Don’t Care). Then, the word-by-document matrices using
normalized frequencies (M_WD) are computed, that is, the number of
occurrences of a word in a given document divided by the total number of
words in that document. Previous experiments have shown this to be the
best normalization option. After that, matrix multiplication is applied
between the document-by-emotion and word-by-document matrices
(M_DE · M_WD) to obtain a (raw) word-by-emotion matrix M_WE. This method
‘merges’ words with emotions by summing, over all documents, the product
of the weight of a word and the weight of each emotion in that document.
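The matrix product described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `word_by_emotion`, the toy documents and the two-emotion scores (Happy, Sad) are hypothetical and are not taken from the DepecheMood++ code or data.

```python
def word_by_emotion(docs, doc_emotions):
    """Build a raw word-by-emotion table.

    docs: list of token lists (one per document).
    doc_emotions: list of emotion-score lists (one per document), i.e. rows of M_DE.
    Returns a dict mapping each word to its row of M_WE.
    """
    n_emotions = len(doc_emotions[0])
    matrix = {}
    for tokens, emotions in zip(docs, doc_emotions):
        if not tokens:
            continue  # an empty document contributes nothing
        counts = {}
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
        for word, count in counts.items():
            # Normalized frequency: occurrences divided by document length (entry of M_WD).
            frequency = count / len(tokens)
            row = matrix.setdefault(word, [0.0] * n_emotions)
            for j, emotion_weight in enumerate(emotions):
                # Sum over documents of (word weight) x (emotion weight).
                row[j] += frequency * emotion_weight
    return matrix


# Hypothetical toy data: two documents, two emotions (Happy, Sad).
docs = [["good", "news", "good"], ["bad", "news"]]
doc_emotions = [[0.9, 0.1], [0.2, 0.8]]
matrix = word_by_emotion(docs, doc_emotions)
```

Here "news" appears in both documents, so its row accumulates contributions from both, which is the ‘merging’ effect described above.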
LITERATURE REVIEW
Several configurations were tested to generate the DepecheMood++
lexicon. To fairly assess the performance of each configuration,
randomly selected validation sets were employed, consisting of
25% of the articles in the data for both rappler.com and corriere.it.
For all the following evaluation experiments, the titles of these
left-out sets were used. For a given headline, a single value for each
affective dimension is computed by simply averaging the
DepecheMood++ affective scores of all the words contained in the
headline. This average value is then compared to the annotation for
the headline.
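The headline scoring step can be illustrated as follows. This is a minimal sketch, assuming whitespace tokenization and assuming that words missing from the lexicon are skipped; the function `score_headline` and the two-dimension toy lexicon are hypothetical.

```python
def score_headline(headline, lexicon, n_emotions):
    """Average the lexicon scores of the words in a headline, per affective dimension."""
    tokens = headline.lower().split()
    # Assumption: words that are not in the lexicon are ignored.
    rows = [lexicon[token] for token in tokens if token in lexicon]
    if not rows:
        return [0.0] * n_emotions
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_emotions)]


# Hypothetical lexicon with two dimensions (Happy, Sad).
toy_lexicon = {"great": [0.8, 0.2], "loss": [0.1, 0.9]}
headline_scores = score_headline("Great loss today", toy_lexicon, 2)
```

The resulting vector would then be compared with the reader annotation of the same headline.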
LITERATURE REVIEW
The first configuration is Untagged Document Filtering, where the
performance of the lexicon built on all documents (tagged and untagged)
is compared to the performance of the lexicon built only on documents
with a non-zero emotion annotation vector. Training with
emotion-annotated documents and removing non-annotated ones leads to an
improvement in results, which indicates that untagged documents
add noise to the lexicon generation process.
LITERATURE REVIEW
The second configuration is Frequency Cutoff. Here different word
frequency cutoff values are explored to find a threshold that
removes noisy items without eliminating informative ones.
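A frequency cutoff can be sketched as below. This assumes the cutoff is applied to the total number of occurrences of a word across the corpus; the function name `apply_frequency_cutoff` and the threshold value are hypothetical.

```python
from collections import Counter


def apply_frequency_cutoff(docs, min_count):
    """Return the set of words occurring at least min_count times in the corpus."""
    counts = Counter(token for tokens in docs for token in tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Hypothetical corpus: "news" occurs three times, every other word once.
corpus = [["good", "news"], ["bad", "news"], ["news", "xqzt"]]
kept_words = apply_frequency_cutoff(corpus, min_count=2)
```

Rare items such as the misspelled "xqzt" are dropped, but so are "good" and "bad" in this toy corpus, which shows why the threshold has to be tuned.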
LITERATURE REVIEW
The third configuration is Term Weighting. The text
representations are otherwise rather simple: an average of the affective
scores of the words in the given text. To improve on this, the authors
propose an importance term weighting that uses TF-IDF values to
multiply the affect scores. TF-IDF (term frequency-inverse
document frequency) is a statistical measure that evaluates
how relevant a word is to a document in a collection of documents.
It is computed by multiplying two metrics: how many times a word
appears in a document, and the inverse document frequency of the
word across a set of documents.
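TF-IDF weighting of affect scores can be sketched as follows. Several choices here are assumptions rather than details from the paper: IDF is computed as log(N / document frequency), and the weighted scores are normalized by the sum of the weights. The function names and toy data are hypothetical.

```python
import math


def idf_table(corpus):
    """Inverse document frequency, log(N / df), for every word in the corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def tfidf_weighted_scores(tokens, lexicon, idf, n_emotions):
    """Weighted average of affect scores, each word weighted by its TF-IDF value."""
    totals = [0.0] * n_emotions
    weight_sum = 0.0
    for token in sorted(set(tokens)):
        if token not in lexicon or token not in idf:
            continue
        weight = (tokens.count(token) / len(tokens)) * idf[token]
        weight_sum += weight
        for j in range(n_emotions):
            totals[j] += weight * lexicon[token][j]
    if weight_sum == 0:
        return [0.0] * n_emotions
    return [total / weight_sum for total in totals]


# Hypothetical corpus and lexicon with two dimensions (Happy, Sad).
corpus = [["the", "match", "was", "great"],
          ["the", "loss", "was", "sad"],
          ["the", "great", "escape"]]
lexicon = {"the": [0.5, 0.5], "great": [0.9, 0.1], "loss": [0.1, 0.9]}
idf = idf_table(corpus)
weighted = tfidf_weighted_scores(["the", "great", "loss"], lexicon, idf, 2)
```

In this toy example "the" appears in every document, so its IDF is zero and it no longer influences the result, while the rarer word "loss" weighs more than "great".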
LITERATURE REVIEW
Advantages of such a model include:
❏ Expanded training datasets, which helped improve
performance
❏ Higher precision and coverage
❏ Based on computationally cheap techniques
Weaknesses of such a model:
❏ The model could be improved by using embeddings (a technique
that maps words that do not initially appear in the lexicon to
words that the lexicon contains).
LITERATURE REVIEW
Machine-learning based approaches are discussed in the following
paper: “Sentiment Analysis of Social Networking Sites (SNS) Data
using Machine Learning Approach for the Measurement of
Depression” by Hassan, Hussain, Husain, Sadiq, Lee (2017).
LITERATURE REVIEW
The main problem tackled here is depression, which has become
the world’s fourth major disease. Despite this high incidence,
the rate of medical treatment for depression is very low because
mental health problems are difficult to diagnose.
According to a World Health Organization (WHO) survey in
2012, more than 350 million people were suffering from depression,
and almost 1 million people with depression end their lives each
year.
LITERATURE REVIEW
In that paper, different machine learning classifiers for sentiment classification were used,
since no single classifier is good for all kinds of datasets. The authors combined three
different classifiers using a voting approach, in which each classifier votes for a label
and the label that gets the most votes is chosen.
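Majority voting over several classifiers can be sketched as below. The three rule-based functions are hypothetical stand-ins for trained classifiers (they are not the models from the paper), and the tie-breaking behaviour is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    On a tie, the label that was encountered first among the tied ones wins.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, text):
    """Collect one vote per classifier and return the winning label."""
    return majority_vote([classify(text) for classify in classifiers])


# Hypothetical stand-ins for three trained classifiers.
def classifier_a(text):
    return "negative" if "sad" in text else "positive"


def classifier_b(text):
    return "negative" if "tired" in text else "positive"


def classifier_c(text):
    return "negative" if "alone" in text else "positive"


classifiers = [classifier_a, classifier_b, classifier_c]
label = ensemble_predict(classifiers, "i feel sad and alone")
```

Two of the three stand-in classifiers vote "negative" for this text, so the ensemble outputs "negative" even though one classifier disagrees.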
LITERATURE REVIEW
The proposed methodology has four main components:
preprocessing, feature extraction, meta learning and training data.
LITERATURE REVIEW
The models used here are Support Vector Machines, Naive Bayes,
and Maximum Entropy.
LITERATURE REVIEW
The advantages of such an approach include:
❏ The use of multiple models, as there is no one-size-fits-all
solution, helps reduce errors and makes the models
faster and more efficient
❏ Since there are multiple models, label classification is
more likely to be accurate
LITERATURE REVIEW
PROS OF LEXICON-BASED APPROACH
❏ Lexicon-based sentiment analysis methods are easily accessible as many publicly available resources (e.g.,
SentiWordNet, DepecheMood) exist.
❏ They are less expensive because they do not require implementing advanced sentiment analysis algorithms.
❏ There is no need for training data, especially if companies use a dictionary-based approach, as the tags are
determined manually, and there is quick access to the meaning of the words.
CONS
● Lexicon-based sentiment analysis methods usually do not identify sarcasm, negation, grammar mistakes,
misspellings, or irony. Thus, they may not be suitable for analyzing data gathered from social media platforms.
LITERATURE REVIEW
PROS OF MACHINE-LEARNING BASED APPROACH
● Can be trained to detect sarcasm, irony, or negation in sentiment analysis. This can ease social
media sentiment analysis.
● Learn the affective valence of the words, so they do not require a pre-determined dataset.
● Are faster than traditional sentiment analysis methods.
● Provide more accurate results.
CONS
● Companies need a large or high-quality small dataset to have accurate classifications.
● Noise (e.g., emojis, slang, or punctuation marks) can reduce accuracy.
● Costs are higher compared to traditional, rule-based methods.
PROPOSED METHODOLOGY
PROPOSED METHODOLOGY
● First, data is collected from multiple sources, e.g. Twitter.
● Then the data is pre-processed (document filtering, term
weighting, removal of stopwords, etc.).
● Features of the text are then extracted.
● The data is then passed to text-based models like DepecheMood
or VADER for first classifications.
● Then the data is passed to Machine Learning models such as
Support Vector Machines, Naïve Bayes and Logistic Regression.
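The first steps of this pipeline can be sketched in plain Python. Everything named here is a hypothetical illustration: the stopword list, the tiny polarity lexicon (a stand-in for a resource such as DepecheMood or VADER, not their actual data or API) and the feature layout that a downstream machine learning model would consume.

```python
import re

# Hypothetical resources for illustration only.
STOPWORDS = {"the", "a", "is", "and", "this", "was"}
LEXICON = {"love": 1.0, "great": 0.8, "slow": -0.5, "hate": -1.0}


def preprocess(text):
    """Lowercase, tokenize on letters and apostrophes, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def lexicon_score(tokens):
    """First, lexicon-based polarity: mean score of the words found in the lexicon."""
    hits = [LEXICON[token] for token in tokens if token in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def extract_features(text):
    """Combine the lexicon score with bag-of-words flags for a downstream classifier."""
    tokens = preprocess(text)
    features = {"lexicon_score": lexicon_score(tokens)}
    for token in tokens:
        features["has_" + token] = 1
    return features


post = "I love this phone, the camera is great!"
features = extract_features(post)
```

The resulting feature dictionary, including the lexicon score, is what would be handed to the machine learning models in the last step.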
