Analyzing the tone of tweets using TextBlob and RandomForest

This project demonstrates tone (sentiment) analysis of tweets. It involves preprocessing the text, labeling each tweet's tone with the TextBlob library, and training a RandomForestClassifier model to predict those labels.

Description

The main goal of the project is to demonstrate skills in Natural Language Processing (NLP) and machine learning for text analysis tasks.

The project includes the following steps:

Data Loading: Reading tweets from a CSV file.
Text preprocessing:
- Tokenization
- Stop word removal
- Lemmatization
Tone analysis using TextBlob: Each tweet is assigned a tone score, based on which a label ('positive', 'negative', 'neutral') is determined.
Preparing data for machine learning:
- Converting text labels into numerical labels.
- Text vectorization using TF-IDF.
Model training:
- Splitting the data into training and test sets (this code uses a specific approach, see Note).
- Training the `RandomForestClassifier`.
- Selecting hyperparameters using GridSearchCV.
Model Evaluation: Evaluating the quality of the model using the F1 score.
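
The preprocessing and labeling steps above could look roughly like the sketch below. It is not taken from the script: the function names, the English stop word list, the `isalpha()` token filter, the zero thresholds on the polarity score, and the numeric encoding are all assumptions made for illustration. The NLTK and TextBlob imports are placed inside the functions so that the pure helpers run without those libraries installed.

```python
def preprocess(text):
    """Tokenize, drop stop words, and lemmatize one tweet.

    Needs NLTK plus the 'punkt', 'stopwords', and 'wordnet' resources.
    The English stop word list and the isalpha() filter are assumptions.
    """
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    return " ".join(
        lemmatizer.lemmatize(token)
        for token in tokens
        if token.isalpha() and token not in stop_words
    )


def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to a tone label.

    The cut-offs at exactly zero are an assumption; the script may use
    different thresholds.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def label_tweet(text):
    """Score one tweet with TextBlob and return its tone label."""
    from textblob import TextBlob  # third-party, imported lazily

    return polarity_to_label(TextBlob(text).sentiment.polarity)


# Assumed numeric encoding of the three text labels for the classifier.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

if __name__ == "__main__":
    # Only the pure helpers are exercised here, so no NLTK data is needed.
    for score in (0.6, 0.0, -0.3):
        label = polarity_to_label(score)
        print(score, label, LABEL_TO_ID[label])
```

Keeping the threshold rule separate from the TextBlob call makes it easy to adjust the cut-offs without touching the scoring code.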

Requirements

The following are required to run the project:

Python 3.x
The main dependencies are listed in the requirements.txt file.

Additionally, the script loads NLTK resources: 'punkt', 'stopwords', 'wordnet'.

Installation

Clone the repository:

```
git clone https://github.com/Jim-by/tweet-sentiment-analysis.git
cd tweet-sentiment-analysis
```

(Recommended) Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate # For Linux/Mac
# venv\Scripts\activate # For Windows
```

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Make sure the submission.csv file is in the data/ folder and contains a selected_text column with the texts of the tweets.
Run the script:
```bash
python src/sentiment_analysis.py
```
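
Before running the script, the input requirement above can be verified with a small stand-alone check. This is a hypothetical helper, not part of the repository: the function name `check_input_file` is invented here, and only the path `data/submission.csv` and the column name `selected_text` come from this README.

```python
import csv
from pathlib import Path

REQUIRED_COLUMN = "selected_text"  # column name stated in this README


def check_input_file(csv_path, column=REQUIRED_COLUMN):
    """Return a description of the problem, or None if the file looks usable."""
    path = Path(csv_path)
    if not path.is_file():
        return f"file not found: {path}"
    # utf-8-sig also strips a byte-order mark if the CSV has one.
    with path.open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), None)
    if header is None:
        return f"file is empty: {path}"
    if column not in header:
        return f"column {column!r} is missing; found columns: {header}"
    return None


if __name__ == "__main__":
    problem = check_input_file("data/submission.csv")
    print(problem or "input file looks fine")
```

The check reads only the header row, so it stays fast even for a large CSV file.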

Expected output:

The console will display progress messages, the best model parameters, and the F1 score.
A tweets_sentiments.csv file will be created in the data/ folder containing the original tweets, TextBlob analysis results, and preprocessed text.

Note on model evaluation

In this script, tone labels are generated programmatically using TextBlob on the entire dataset. A machine learning model is then trained to predict these labels. The evaluation is performed on the same dataset on which the labels were generated.

This demonstrates the model's ability to approximate TextBlob logic based on TF-IDF features. Evaluation on fully independent data would require “true” tone labels assigned by a human or other reliable source.
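
To make this concrete, here is a minimal sketch of fitting a TF-IDF plus RandomForestClassifier pipeline on automatically generated labels, tuning it with GridSearchCV, and computing a macro F1 score on a held-out split. The toy texts, the labels, the parameter grid, the `train_test_split` call, and the macro averaging are assumptions for illustration and are not necessarily what the script does. Even on the held-out split, the score only measures agreement with the label source (TextBlob in this project), not with human judgment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-ins: in the project the texts are preprocessed tweets and the
# labels are produced by TextBlob, not by a human annotator.
texts = [
    "love this great day", "wonderful happy morning", "best concert ever",
    "really good coffee", "awesome friends tonight", "so happy and excited",
    "hate this awful traffic", "terrible sad news", "worst service ever",
    "really bad headache", "horrible boring meeting", "so angry and upset",
    "going to the store", "meeting starts at noon", "train leaves tomorrow",
    "reading the report", "bought new shoes today", "watching the game tonight",
]
labels = [2] * 6 + [0] * 6 + [1] * 6  # 2 = positive, 0 = negative, 1 = neutral

# Hold out 6 tweets (2 per class) that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=6, stratify=labels, random_state=42
)

# Putting the vectorizer inside the pipeline means TF-IDF is refit on each
# cross-validation fold, so no vocabulary leaks from validation data.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Small illustrative grid; the values are not taken from the script.
param_grid = {
    "clf__n_estimators": [20, 50],
    "clf__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(X_train, y_train)

# Score the best pipeline on the held-out tweets.
test_f1 = f1_score(
    y_test, search.predict(X_test), average="macro", zero_division=0
)
print("best parameters:", search.best_params_)
print("held-out macro F1:", round(test_f1, 3))
```

With only 18 toy sentences the printed score carries no meaning; the point is the structure of the split and the scoring call.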

Possible Improvements

Using more advanced text representations such as Word2Vec, GloVe, or BERT embeddings.
Trying other classification models.
Cleaning and preparing the text data more thoroughly.
