CT107-3-3-TXSA - Group Assignment
In this part of the assignment, you are required to work on the implementation of text analytics
techniques and methodologies using Python and SAS Text Miner.
Group: Form a group of three or four members to work on this assessment
component.
Use the Text Corpus.txt given in the Group Assignment Data.zip to answer Q1 – Q3 of this
assessment component.
The file Text Corpus.txt is a text corpus containing three sentences. Each sentence is
delimited by the sentence pads <s> and </s>, which mark the start and end of the sentence
respectively. You are expected to use the appropriate equations to perform the respective tasks.
Unknown words should be treated as UNK.
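The brief does not state which equations are expected. As a reference point only, the standard maximum-likelihood forms of the unigram and bigram models are given below. The symbols are assumptions introduced for illustration: C(.) is a count in the corpus, N is the total number of scored tokens, w_0 is the start pad <s>, and w_n is the end pad </s>. Course materials may use a variant, for example one with smoothing.

```latex
\begin{align}
P_{\text{unigram}}(w_1 \dots w_n) &= \prod_{i=1}^{n} P(w_i),
  & P(w_i) &= \frac{C(w_i)}{N} \\
P_{\text{bigram}}(w_1 \dots w_n) &= \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
  & P(w_i \mid w_{i-1}) &= \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
\end{align}
```

Any word that does not occur in the corpus is replaced by UNK before the counts are looked up.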
2) Manually compute the sentence probabilities using the bigram model. (5 marks)
3) Justify which language model is more suitable to calculate the sentence probabilities. (5
marks)
4) Implement and report the respective sentence probabilities in Python using both unigram
and bigram language models. (10 marks)
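As an illustration of the tasks above, the following is a minimal sketch of unsmoothed unigram and bigram models. It rests on several assumptions: `TOY_CORPUS` is a hypothetical three-sentence corpus and not the contents of Text Corpus.txt, tokens are separated by whitespace, the unigram model does not score the start pad <s>, and all function names are invented for this example.

```python
from collections import Counter

UNK = "UNK"

# Hypothetical toy corpus for illustration only (not the assignment file).
TOY_CORPUS = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]


def train(sentences):
    """Count single tokens and adjacent token pairs in padded sentences."""
    token_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        token_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return token_counts, bigram_counts


def map_unknown(tokens, token_counts):
    """Replace every token not seen in training with UNK."""
    return [tok if tok in token_counts else UNK for tok in tokens]


def unigram_probability(sentence, token_counts):
    """Product of C(w) / N over the sentence; <s> is not scored (assumption)."""
    scored = {w: c for w, c in token_counts.items() if w != "<s>"}
    total = sum(scored.values())
    prob = 1.0
    for tok in map_unknown(sentence.split(), token_counts):
        if tok == "<s>":
            continue
        prob *= scored.get(tok, 0) / total
    return prob


def bigram_probability(sentence, token_counts, bigram_counts):
    """Product of C(prev, cur) / C(prev) over adjacent token pairs."""
    tokens = map_unknown(sentence.split(), token_counts)
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if token_counts[prev] == 0:
            return 0.0  # unseen context, e.g. UNK absent from the corpus
        prob *= bigram_counts[(prev, cur)] / token_counts[prev]
    return prob


if __name__ == "__main__":
    counts, bigrams = train(TOY_CORPUS)
    test_sentence = "<s> I am Sam </s>"
    print("unigram:", unigram_probability(test_sentence, counts))
    print("bigram: ", bigram_probability(test_sentence, counts, bigrams))
```

Without smoothing, any sentence containing a word or word pair that never occurs in the corpus receives probability 0. This includes UNK itself unless the corpus contains that token.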
1. Predict the sentiment (consider 3 levels) of each review found in the data set using the
Rule-Based Unsupervised Technique (the VADER library from NLTK). Implement this
in suitable Python code and report a portion of the results. (3 marks)
2. Export the resulting data set, along with the predicted sentiments, to .csv format using
Python and report the relevant code used for this operation. (Note: The resulting data set
must be submitted via the given submission link in MOODLE.) (2 marks)
3. Use the data set exported in Q4.2 to build a supervised sentiment classification model
using the Naïve Bayes classifier from NLTK and report the following model performance
measures:
a. Accuracy
b. Precision
c. Recall
d. F1 Score
The Python code must be neat, with clear output. Provide relevant comments in the code
to explain the purpose of each code snippet.
Deliverable:
4. Use the data set exported in Q4.2 to build a supervised sentiment classification model
using SAS Text Miner and report the following model performance measures:
a. Accuracy
b. Precision
c. Recall
d. F1 Score
The process flow diagram must be neat and included in the report. Provide relevant and
necessary explanations with suitable output (screenshots) to support the answers.
Deliverable:
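For the Python items above (1 to 3), the following is a minimal sketch of the pipeline: mapping VADER compound scores to three sentiment levels, exporting the labelled reviews to .csv, training NLTK's Naive Bayes classifier, and computing the four performance measures. Several parts are assumptions: the thresholds of ±0.05 on the compound score are a commonly used convention and not something this brief specifies, the column names `review` and `sentiment` are hypothetical, the bag-of-words features are one simple choice among many, and all function names are invented for this example. The two functions that call NLTK import it lazily and need the `nltk` package (plus the `vader_lexicon` resource for VADER) to be installed.

```python
import csv


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to one of three sentiment levels."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_labels(reviews):
    """Label each review with VADER (needs nltk and its 'vader_lexicon' resource)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer  # lazy import

    analyzer = SentimentIntensityAnalyzer()
    return [label_from_compound(analyzer.polarity_scores(text)["compound"])
            for text in reviews]


def export_csv(path, reviews, labels):
    """Write the reviews and their predicted sentiments to a .csv file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review", "sentiment"])  # hypothetical column names
        writer.writerows(zip(reviews, labels))


def bag_of_words(text):
    """Turn a review into the feature dictionary NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}


def train_naive_bayes(labelled_reviews):
    """Train NLTK's Naive Bayes classifier on (review_text, label) pairs."""
    import nltk  # lazy import

    return nltk.NaiveBayesClassifier.train(
        [(bag_of_words(text), label) for text, label in labelled_reviews])


def evaluate(gold, predicted):
    """Return overall accuracy and {label: (precision, recall, f1)}."""
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return accuracy, scores
```

To use the sketch, hold out part of the labelled data as a test set, obtain predictions with `classifier.classify(bag_of_words(text))` for each held-out review, and pass the true and predicted label lists to `evaluate`. Precision, recall and F1 are reported per sentiment level here. Whether to average them into a single figure is a design choice this brief leaves open.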
Deliverables:
Report:
Word count: 2000 words
The report must be prepared in a professional manner, following proper documentation
formats.
Code: The Python code should run without any arguments and should read files from the same
directory. The output must be as specified in the respective questions. Suitable comments must
be inserted at appropriate places in the code.
Softcopy: The relevant softcopies, such as the report (.doc, .docx or .pdf) and the Python code
files (.py or .ipynb), must be uploaded via the specified submission links available in
MOODLE.
Academic Integrity
Copying or paraphrasing someone's work (code included), or permitting your own work to be
copied or paraphrased, even if only in part, is not allowed, and will result in disciplinary action.
Your grade should reflect your own work.
Basically, 'plagiarism' means representing someone else's work as if it were your own. This is a
very serious academic offence for all students under the University regulations and is particularly
reprehensible for a researcher. Please do not even consider it. Remember that accidental
plagiarism (or the appearance of it) may be avoided by referencing your work properly. This
gains you credit, not loses it! The simple rule is that you must not represent the ideas of other
people (whether they are published works or the work of other students) as your own.
Carefully read APU guidelines and policies on the Webspace. The golden rule on plagiarism is
DO NOT DO IT.