Proj01
NOTE:
The instructor may ask the project group or individual members questions about the actual work and
contribution after submission. The questions may be very specific (e.g., the lines of code written
by a member and an explanation of that code). This may be done via email or in person after
class, without advance notice. Failing to answer the questions satisfactorily will result in a mark
deduction. Please ensure that all group members contribute roughly evenly to the project,
especially the coding part, and fully understand the whole project.
Grouping:
You need to form groups of 4 to complete the project in this course. Sign your project group list on the
Wiki page in Blackboard.
One of the members in your group, the group leader/captain, is responsible for submitting the project.
Marks will be deducted if more than one member of your group submits a project, whether the same or a different one.
Project Description:
In this project you will be working on an unclean dataset that contains statements about
climate. Each statement may or may not be relevant to a climate action. Your task is to predict the
relevance.
Proj01.ipynb (or a zip file if you have any extra files): In this Jupyter notebook, you have to use different
cells (code/markdown) to clearly indicate and explain every step. Your Jupyter notebook should include
all the markdown text signifying the steps with correct headings, Python code, comments/analysis, and
visualizations as stated in the following instructions. Note: You need to create the appropriate markdown
headings for each section mentioned below. Code should have short comments describing each
statement. Adding a markdown cell containing text before specific actions are performed is appreciated.
(Note: Restart the kernel and re-run all cells of the notebook before submission. Substantial marks will be
deducted for cell errors.)
Fall 2024
Files Required:
climate_train.csv: This file contains the training set. It has three fields: Row Number, Text and
Label. Row Number is the unique incremental ID; Text is the content of the statement; Label is
the relevance to climate action (1 is yes, 0 is no).
climate_test.csv: This is the test set. It has the same format as the training set. The label
information of the test set must not be used in training the model; ZERO marks will be
given otherwise.
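As a rough illustration of the file layout described above, the sketch below reads a CSV with the three named fields using pandas. The two rows are hypothetical stand-ins, and the exact header spelling in the real files is an assumption; in the notebook, the in-memory text would be replaced by the path to climate_train.csv.

```python
import io

import pandas as pd

# Toy CSV text standing in for climate_train.csv.
# The rows are hypothetical and the real header spelling may differ.
toy_csv = io.StringIO(
    "Row Number,Text,Label\n"
    "1,We will cut carbon emissions by 2030,1\n"
    "2,The meeting starts at noon,0\n"
)

train_df = pd.read_csv(toy_csv)

# Separate the statement text from the relevance label (1 is yes, 0 is no).
texts = train_df["Text"].tolist()
labels = train_df["Label"].tolist()

print(train_df.shape)
print(train_df["Label"].value_counts())
```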
Part A. Planning
2. Planning
Examine the dataset carefully. Create another markdown cell. In this cell, describe your plan about
how to carry out the NLP pipelines to create classification models. Use appropriate markdown if
necessary to have a better formatting and illustrations.
In the following parts, you need to write Python code, with appropriate comments or markdown cells
to explain your work. Missing explanations will result in mark deduction.
3. Feature generation
i. In this step, use TF-IDF to represent each statement in both the training set and the test set.
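A minimal sketch of this step, assuming scikit-learn's TfidfVectorizer is used. The short lists are hypothetical stand-ins for the Text column of each file. The vectorizer is fitted on the training texts only and then reused on the test texts, so both sets share the same feature columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the Text column of the training and test sets.
train_texts = [
    "We will cut carbon emissions by 2030",
    "The meeting starts at noon",
    "Solar panels reduce emissions",
]
test_texts = ["New policy to cut emissions", "Lunch is served at noon"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training set only.
X_train = vectorizer.fit_transform(train_texts)
# Reuse the fitted vocabulary for the test set; unseen words are ignored.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```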
4. Model building
i. Use a Multinomial Naïve Bayes model to train a classifier.
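One possible way to train such a model, assuming scikit-learn's MultinomialNB on TF-IDF features. The texts and labels below are hypothetical toy data, not taken from the project files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the Text and Label columns.
train_texts = [
    "We will cut carbon emissions by 2030",
    "The meeting starts at noon",
    "Solar panels reduce emissions",
    "Lunch is served in the cafeteria",
]
train_labels = [1, 0, 1, 0]
test_texts = ["New policy to cut emissions", "Lunch starts at noon"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Fit the Multinomial Naive Bayes classifier on the training features only.
nb_model = MultinomialNB()
nb_model.fit(X_train, train_labels)

predictions = nb_model.predict(X_test)
print(predictions)
```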
6. Post analysis
i. Research how to identify the most impactful features (tokens) for classifying climate action
statements.
1. Classifier comparison
i. Using the same features you found in Step 3 of Part A, train different classifiers: Logistic
Regression, Linear SVM, K-Nearest Neighbours (you need to research how to do this).
ii. Evaluate these models on the test set.
iii. Compare their performance and draw conclusions.
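The comparison could be organised as a loop over the three classifiers, as in this sketch. It assumes scikit-learn's LogisticRegression, LinearSVC and KNeighborsClassifier; the toy data and the choice of n_neighbors=3 are illustrative assumptions, not values prescribed by the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the training and test sets.
train_texts = [
    "We will cut carbon emissions by 2030",
    "The meeting starts at noon",
    "Solar panels reduce emissions",
    "Lunch is served in the cafeteria",
    "Our plan lowers greenhouse gas emissions",
    "The office closes on Friday",
]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["New policy to cut emissions", "Lunch starts at noon"]
test_labels = [1, 0]

# Same TF-IDF features for every classifier, fitted on the training set only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    # n_neighbors=3 is an illustrative choice for this tiny toy set.
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    results[name] = accuracy_score(test_labels, clf.predict(X_test))

for name, score in results.items():
    print(f"{name}: accuracy = {score:.2f}")
```

Keeping the features fixed and varying only the classifier makes the scores directly comparable.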
Part D. Competition
In this part, your group competes with other groups on performance. In the same notebook, you need to
write Python code after Part B, with appropriate comments or markdown cells to explain your work.
Missing explanations will result in mark deduction.
1. Dataset
i. You must only use the climate_train.csv in the model training.
ii. Evaluation must be done on climate_test.csv.
2. Evaluation
i. Micro F1-measure will be used as the evaluation metric to determine which group is the
winner.
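For reference, the micro-averaged F1 pools true positives, false positives and false negatives over all classes before computing precision and recall. The hand-rolled sketch below uses hypothetical labels; the function name micro_f1 is illustrative. scikit-learn's f1_score with average="micro" computes the same quantity, and for single-label classification like this task it equals plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over every class, then combine."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == cls and t == cls:
                tp += 1
            elif p == cls and t != cls:
                fp += 1
            elif p != cls and t == cls:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical true and predicted labels for five statements.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro F1 = {score:.2f}, accuracy = {accuracy:.2f}")
```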
3. Method
i. You are free to use any data preprocessing, feature generation, and classifier in this part.
ii. Only one final pipeline should be included in the submission. For example, your group may try
different methods in another Jupyter notebook or Python program. However, your group
should only report the best one.
4. Restriction
i. The execution time of this part needs to be within 1 minute. Your group will be disqualified if
your code cannot finish within 1 minute.
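A simple way to check the limit before submitting is to time the whole pipeline with time.perf_counter, as sketched here. The function run_competition_pipeline is a hypothetical placeholder for the group's final pipeline, and timings on the grader's machine may differ from local ones.

```python
import time

TIME_LIMIT_SECONDS = 60  # the 1-minute limit stated above


def run_competition_pipeline():
    """Hypothetical placeholder for the final pipeline (load, featurize, train, evaluate)."""
    return sum(i * i for i in range(100_000))


start = time.perf_counter()
result = run_competition_pipeline()
elapsed = time.perf_counter() - start

within_limit = elapsed < TIME_LIMIT_SECONDS
print(f"Elapsed: {elapsed:.2f} s (within limit: {within_limit})")
```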
Member Contribution
In addition to the proposal, each group needs to submit a peer evaluation matrix. Each cell should be
a number between 1 and 4 that reflects how one member rates the contribution of another member.
The evaluation is open to all members of your group (i.e., everyone can see how the others
graded them), so that each member knows how to improve their contribution to the project.
(Hint: You may refer to this link to see how to create a table in a Jupyter Markdown cell.)
Member 1
Member 2
Member 3
Member 4
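As an illustration only, the matrix could be laid out as a Markdown table with one row per rater and one column per rated member. That orientation is an assumption, as is the helper name peer_evaluation_table. The sketch generates an empty template whose printed output can be pasted into a Markdown cell and filled in with scores from 1 to 4.

```python
def peer_evaluation_table(members):
    """Return an empty Markdown table: one row per rater, one column per rated member."""
    header = "| Rater | " + " | ".join(members) + " |"
    separator = "|" + " --- |" * (len(members) + 1)
    # Cells are left blank so that each rater can fill in a score from 1 to 4.
    rows = [f"| {rater} |" + "  |" * len(members) for rater in members]
    return "\n".join([header, separator] + rows)


members = ["Member 1", "Member 2", "Member 3", "Member 4"]
table = peer_evaluation_table(members)
print(table)
```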
Criteria                                                          Grading
Project submitted, named properly with all files included
in their corresponding folders to Blackboard.                     1

Part   Detail                                                     Grading
A      Planning for the analysis                                  2
B.2    Data exploration, pre-processing and cleansing             1
B.3    Feature generation                                         1