This project demonstrates sentiment analysis of tweets. It covers text preprocessing, labeling sentiment with the TextBlob library, and training a `RandomForestClassifier` model to predict those sentiment labels.
The main goal of the project is to demonstrate skills in Natural Language Processing (NLP) and machine learning for text analysis tasks.
The project includes the following steps:
- Data loading: reading tweets from a CSV file.
- Text preprocessing:
  - Tokenization
  - Stop word removal
  - Lemmatization
- Sentiment analysis using TextBlob: each tweet is assigned a sentiment score, from which a label ('positive', 'negative', or 'neutral') is derived.
- Preparing data for machine learning:
  - Converting text labels into numerical labels.
  - Text vectorization using TF-IDF.
- Model training:
  - Splitting the data into training and test sets (this code uses a specific approach, see Note).
  - Training the `RandomForestClassifier`.
  - Selecting hyperparameters using `GridSearchCV`.
- Model evaluation: measuring model quality with the F1 score.
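The labeling and label-conversion steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the zero threshold, and the numeric ids are assumptions made for the example. In the real script the scores would come from TextBlob (for instance `TextBlob(text).sentiment.polarity`, which ranges from -1.0 to 1.0); here they are hard-coded so the snippet runs with the standard library alone.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    The threshold is an assumption: scores within [-threshold, threshold]
    are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical mapping from text labels to numerical labels
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def encode_labels(labels):
    """Convert a list of text labels into their numerical ids."""
    return [LABEL_TO_ID[label] for label in labels]


# Hard-coded scores standing in for TextBlob output
polarities = [0.5, -0.25, 0.0]
labels = [label_from_polarity(p) for p in polarities]
encoded = encode_labels(labels)
print(labels, encoded)
```

Raising `threshold` slightly above zero would widen the neutral band, which may be worth trying if many weakly scored tweets end up labeled positive or negative.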
The following are required to run the project:
- Python 3.x
- The main dependencies, listed in the `requirements.txt` file.
Additionally, the script loads NLTK resources: 'punkt', 'stopwords', 'wordnet'.
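If the resources need to be fetched ahead of time (for example on a machine that will later be offline), something like the sketch below could be used once `nltk` is installed. The helper name `download_nltk_resources` is hypothetical and not part of the project; `nltk.download` is the standard NLTK call. Newer NLTK releases may additionally ask for `punkt_tab`.

```python
# Resource names taken from the list above
NLTK_RESOURCES = ("punkt", "stopwords", "wordnet")


def download_nltk_resources(resources=NLTK_RESOURCES, quiet=True):
    """Download each NLTK resource and report which downloads succeeded.

    Hypothetical helper: nltk is imported lazily so this module can be
    imported even before the dependency is installed.
    """
    import nltk

    # nltk.download returns a boolean success flag for each resource
    return {name: nltk.download(name, quiet=quiet) for name in resources}


if __name__ == "__main__":
    print(download_nltk_resources())
```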
- Clone the repository:

  ```bash
  git clone https://github.com/Jim-by/tweet-sentiment-analysis.git
  cd tweet-sentiment-analysis
  ```

- (Recommended) Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate    # For Linux/Mac
  # venv\Scripts\activate     # For Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Make sure the `submission.csv` file is in the `data/` folder and contains a `selected_text` column with the texts of the tweets.
- Run the script:

  ```bash
  python src/sentiment_analysis.py
  ```

Expected output:
- The console will display progress messages, the best model parameters, and the F1 score.
- A `tweets_sentiments.csv` file will be created in the `data/` folder, containing the original tweets, the TextBlob analysis results, and the preprocessed text.
Note: In this script, sentiment labels are generated programmatically by running TextBlob on the entire dataset. A machine learning model is then trained to predict these labels. The evaluation is performed on the same dataset on which the labels were generated.
The reported score therefore shows how well the model approximates TextBlob's logic from TF-IDF features, not how well it recognizes sentiment in general. Evaluation on fully independent data would require “true” sentiment labels assigned by a human or another reliable source.
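For comparison, an evaluation on a held-out split could look roughly like the sketch below. It is not the project's code: the toy texts and their labels are invented stand-ins for preprocessed tweets and TextBlob-derived labels, and the parameter grid, split size, and macro averaging are assumptions chosen to keep the example small. Even with a held-out split, the score still measures agreement with TextBlob, since the labels come from it.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-ins for preprocessed tweets (assumption, not project data)
texts = [
    "love great day", "happy wonderful news", "great happy love", "wonderful day love",
    "hate awful day", "sad terrible news", "awful sad hate", "terrible day hate",
    "meeting at noon", "train leaves today", "report due monday", "meeting monday noon",
]
# Hand-assigned stand-in labels: 2 = positive, 0 = negative, 1 = neutral
labels = [2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]

# Hold out a stratified test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer inside the pipeline so TF-IDF is fitted on training folds only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Small illustrative grid (assumption)
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 10]}
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(X_train, y_train)

# Score the best model on the held-out tweets
y_pred = search.predict(X_test)
held_out_f1 = f1_score(y_test, y_pred, average="macro")
print(search.best_params_, held_out_f1)
```

Keeping the vectorizer inside the `Pipeline` means cross-validation refits TF-IDF on each training fold, so no vocabulary statistics leak from validation or test data.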
Possible improvements:
- Using richer text representations, such as Word2Vec or GloVe word embeddings or contextual models like BERT.
- Trying other classification models.
- More thorough cleaning and preparation of the text data.