# ELG 5255 Applied Machine Learning Fall 2020
# Assignment 3 (Multivariate Method)
## Submission
You must submit your assignment online through Bright Space. This is the only method by which
we accept assignment submissions. We do not accept assignments sent by email, and we are not
able to enter a mark if the assignment is not submitted on Bright Space! The deadline is
firm, since you cannot submit an assignment past the deadline. It is the student's
responsibility to ensure that the assignment has been submitted properly. A mark of 0 will be
assigned to any missing assignment.
The assignment must be done individually. Any teamwork, and any work copied from a source
external to the student, will be considered academic fraud and will be reported to the
Faculty of Engineering as a breach of integrity. The consequence of academic fraud is, at the
very least, an F for this course. Note that we use sophisticated software to compare
assignments (with other students' work and with other sources...). Therefore, you must take
all appropriate measures to make sure that others cannot copy your assignment (hence, do not
leave your workstation unattended).
## Goal
This time we will build a multiclass multivariate Gaussian Discriminant Analysis (GDA) model
to classify a Pokémon's type from all of the Pokémon's attributes.
## Dataset
During this assignment, we keep using the Pokémon dataset. You can load it by invoking
loadSingleFileDataset(). Here is an example:
``` python
trX, teX, trY, teY = loadSingleFileDataset('Data/Pokemon.csv', 'type1', None,
                                           True, randomState, 0.3)
```
### Model
GDA is a generative model, assuming that the instances of each class are sampled from a
Gaussian distribution, so that $p(\mathbf{x} \mid C = c) = \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_c, \boldsymbol{\Sigma})$,
where $c$ is the class, e.g. normal, bug, ..., water.
### Maximum Likelihood
We have derived the formulas many times in the lecture notes and previous assignments, so this
time we only show the conclusion. For $N$ training instances of which $N_c$ belong to class
$c$, the maximum-likelihood estimates of the prior and the mean are

$$\hat{\pi}_c = \frac{N_c}{N}, \qquad \hat{\boldsymbol{\mu}}_c = \frac{1}{N_c}\sum_{i:\, y^{(i)} = c} \mathbf{x}^{(i)}.$$

Therefore, the shared covariance matrix is estimated by pooling the centred samples of all
classes:

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N} \big(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_{y^{(i)}}\big)\big(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_{y^{(i)}}\big)^{\top}.$$
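For illustration only, here is a minimal numpy sketch of these estimates. It assumes trX is an
(N, d) feature array and trY is a length-N array of string labels, as returned by
loadSingleFileDataset(); the function and variable names are illustrative, not the required
API of Main.py.

``` python
import numpy as np

def ml_estimates(trX, trY):
    """Maximum-likelihood estimates for GDA with a shared covariance matrix."""
    N, d = trX.shape
    labels = np.unique(trY)
    priors = {c: float(np.mean(trY == c)) for c in labels}    # pi_c = N_c / N
    means = {c: trX[trY == c].mean(axis=0) for c in labels}   # mu_c
    # Shared covariance: pool the outer products of the class-centred samples.
    sigma = np.zeros((d, d))
    for c in labels:
        centred = trX[trY == c] - means[c]                    # shape (N_c, d)
        sigma += centred.T.dot(centred)
    sigma /= N
    return priors, means, sigma
```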
### Classification
In the following formulas, all $\mathbf{x}$ are column vectors, and so are all
$\boldsymbol{\mu}_c$. Nonetheless, in practice, $\mathbf{x}$ and $\boldsymbol{\mu}_c$ are
usually stored as row vectors; therefore, we prefer the $\mathbf{x}^{\top}\mathbf{w}_c$ form
over $\mathbf{w}_c^{\top}\mathbf{x}$. Moreover, since both $\boldsymbol{\Sigma}$ and
$\boldsymbol{\Sigma}^{-1}$ are symmetric positive semi-definite matrices,
$\mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c = \boldsymbol{\mu}_c^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{x}$.
As for numpy, if $\mathbf{x}$ is a 1-D vector, $\mathbf{x}^{\top} = \mathbf{x}$. We will talk
about numpy usage in the next sections.

Choosing the class that maximizes the log posterior $\log p(\mathbf{x} \mid C = c) + \log \pi_c$
gives the discriminant

$$g_c(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_c)^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_c) + \log \pi_c + \text{const}.$$

Let

$$\mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c, \qquad b_c = -\tfrac{1}{2}\,\boldsymbol{\mu}_c^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c + \log \pi_c.$$

Since $\mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{x}$ is a scalar that is the same for
every class, it can be dropped (together with the constant term), and the discriminant
simplifies to $g_c(\mathbf{x}) = \mathbf{x}^{\top}\mathbf{w}_c + b_c$.

In this assignment, we will not only determine the label for each $\mathbf{x}$, but also
calculate the probability of $\mathbf{x}$ belonging to each class. Let's see the first task.
Because the terms dropped from $g_c(\mathbf{x})$ are identical for every class, they cancel
when the class likelihoods are normalized. Therefore,

$$P(C = c \mid \mathbf{x}) = \frac{\pi_c\, p(\mathbf{x} \mid C = c)}{\sum_{k} \pi_k\, p(\mathbf{x} \mid C = k)} = \frac{\exp\!\big(g_c(\mathbf{x})\big)}{\sum_{k} \exp\!\big(g_k(\mathbf{x})\big)}.$$

As a conclusion:

$$\hat{y} = \arg\max_{c}\big(\mathbf{x}^{\top}\mathbf{w}_c + b_c\big), \quad \text{where } \mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c,\ b_c = -\tfrac{1}{2}\,\boldsymbol{\mu}_c^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c + \log \pi_c,$$

$$P(C = c \mid \mathbf{x}) = \frac{\exp\!\big(\mathbf{x}^{\top}\mathbf{w}_c + b_c\big)}{\sum_{k}\exp\!\big(\mathbf{x}^{\top}\mathbf{w}_k + b_k\big)}, \quad \text{where } k \text{ ranges over all classes}.$$
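To make the conclusion concrete, the following sketch scores a single sample with the formulas
above. It reuses the hypothetical priors, means, and sigma from the previous sketch; sigma_inv
would come from the inv() helper provided in Utils.py (or, equivalently, numpy.linalg.inv). It
is only an illustration, not the required structure of Main.py.

``` python
import numpy as np

def discriminants(x, priors, means, sigma_inv):
    """g_c(x) = x^T w_c + b_c for every class c."""
    scores = {}
    for c in means:
        w_c = sigma_inv.dot(means[c])                     # w_c = Sigma^{-1} mu_c
        b_c = -0.5 * means[c].dot(sigma_inv).dot(means[c]) + np.log(priors[c])
        scores[c] = x.dot(w_c) + b_c
    return scores

def posterior(x, priors, means, sigma_inv):
    """P(C = c | x): softmax over the discriminants (numerically stabilised)."""
    scores = discriminants(x, priors, means, sigma_inv)
    classes = list(scores)
    g = np.array([scores[c] for c in classes])
    g -= g.max()                                          # avoid overflow in exp()
    p = np.exp(g) / np.exp(g).sum()
    return dict(zip(classes, p))
```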
## Instructions
1. Data
This folder stores our dataset and results. When you start the assignment, it should
contain Pokemon.csv. You must not change the content of those files; otherwise, it will
influence your final grade. After you implement all functions and methods, a folder named
GaussianDiscriminantAnalysis_RS=0 should appear, containing all the results: a confusion
matrix, a precision-recall curve, a report showing comprehensive metric scores, a CSV file
storing the predicted and true labels, and a ROC curve.
2. Main.py
This is where you implement the assignment. Please note that any function marked as
!!! Must Not Change the Content !!! must not be changed; otherwise, it will influence
your final grade. All functions or methods marked ToDo: Implement This Function should
be implemented.
3. Readme
Assignment Instructions.
4. requirements.txt
The Python packages required to finish the assignment. Please run the following command
to install them. Please note that numpy v1.19.4 currently has bugs on Windows; therefore,
numpy v1.19.3 will be used in this assignment.
``` bash
pip install -r requirements.txt
```
5. TicToc.py
6. Utils.py
It contains functions for inverting matrices, loading the dataset, and generating results.
7. NumpyDotExamples.py
numpy.dot will be used several times in this assignment, so let's talk about it a little bit.
Although the formulas strictly distinguish row and column vectors, in numpy, vectors are
normally represented as 1-D arrays with shapes like (n,). For such a 1-D vector, its transpose
is itself ($\mathbf{x}^{\top} = \mathbf{x}$). In other words, 1-D vectors do not distinguish
between row and column vectors.
When we invoke numpy.dot, if a 1-D vector appears as the first argument, it is treated as a
row vector, no matter whether it has been transposed or not; if it appears as the second
argument, it is treated as a column vector. Therefore, numpy.dot of two 1-D vectors is simply
their inner product.
If we want to multiply a column vector by a row vector to obtain a matrix (which will be used
when we calculate the covariance matrix), then the column vector must be explicitly converted
into a 2-D array of shape (n, 1). numpy.reshape should be explicitly invoked to convert a 1-D
vector into a 2-D column vector.
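The toy example below (with arbitrary values) illustrates this behaviour:

``` python
import numpy as np

a = np.array([1.0, 2.0, 3.0])      # shape (3,): a 1-D vector
M = np.eye(3)

print(a.T.shape)                   # (3,)  -- transposing a 1-D vector changes nothing
print(np.dot(a, M).shape)          # (3,)  -- a is treated as a row vector
print(np.dot(M, a).shape)          # (3,)  -- a is treated as a column vector
print(np.dot(a, a))                # 14.0  -- inner product of two 1-D vectors

# Outer product (as needed for the covariance matrix): reshape to an explicit column first.
col = a.reshape(-1, 1)             # shape (3, 1)
row = a.reshape(1, -1)             # shape (1, 3)
print(np.dot(col, row).shape)      # (3, 3)
```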
In this assignment, y is not encoded, meaning that the labels are a list of strings. If we let
numpy store a string as an item of an array whose data type it has inferred, the string may be
clipped to the inferred length, because numpy does not infer the data type correctly.
Therefore, we should use numpy.asarray to convert the string to an array item and explicitly
specify the data type as object. This trick will be used in the predictSample function.
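A small illustration of the clipping issue and of the dtype=object workaround (the label
values are arbitrary examples):

``` python
import numpy as np

# A fixed-width string dtype silently clips values longer than the inferred width.
preds = np.empty(3, dtype='<U4')                        # at most 4 characters per entry
preds[0] = 'dragon'
print(preds[0])                                         # drag  -- the label was clipped

# dtype=object keeps the full Python strings intact.
preds = np.asarray(['dragon', 'fairy', 'water'], dtype=object)
print(preds[0])                                         # dragon
```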
You must not change the signatures (the format of the parameters and return values) of the
functions that you need to implement.
You must not change the imports. No additional Python packages should be used.
You will see estimated line counts, e.g., ≈1~2 lines. These are only estimates; you do not
have to finish your code within that many lines.
1. Install the required packages
``` bash
pip install -r requirements.txt
```
2. Open Main.py
Calculate the mean vector and prior probability of each class, and store them in
self.meanDict and self.priorDict respectively.
self.meanDict should look like {'bug': array([8.03030303e-01, ...,]), ..., 'water':
array([1.02459016e+00, ...,])}, and self.priorDict should look like {'bug':
0.08991825613079019, ..., 'water': 0.16621253405994552}.
Calculate the covariance matrix and the inverse of the covariance matrix, which should be a
symmetric positive semi-definite matrix. The inv() function has been provided in Utils.py
and is already imported.
Calculate w and b of each class, and store them in self.wDict and self.bDict respectively.
Predict the label of a sample x. Hint: without explicitly indicating the type of the return
value, numpy will clip the string. Therefore, it is better to use np.asarray; the type of y
is stored in self.yType.
``` python
def predictSample(self, x):
```
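For orientation only, one possible shape of this method is sketched below. It assumes numpy is
imported as np and that self.labels, self.wDict, self.bDict, and self.yType behave as
described above; your implementation may differ.

``` python
def predictSample(self, x):
    # Score every class with g_c(x) = x . w_c + b_c and keep the best one.
    bestLabel, bestScore = None, -np.inf
    for c in self.labels:
        score = np.dot(x, self.wDict[c]) + self.bDict[c]
        if score > bestScore:
            bestLabel, bestScore = c, score
    # Return the label with an explicit type so the string is not clipped.
    return np.asarray(bestLabel, dtype=self.yType)
```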
Predict the probability of x belonging to each class. This function should take an x as input
and output an array of probabilities, one per class. They should sum to 1, and their order
should be the same as self.labels.
``` python
def predictSampleProba(self, x):
```
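Similarly, a hedged sketch of the probabilistic version, using the softmax form from the
Classification section (the same assumptions apply; it is an illustration, not the definitive
implementation):

``` python
def predictSampleProba(self, x):
    # g_c(x) for every class, in the order of self.labels.
    scores = np.array([np.dot(x, self.wDict[c]) + self.bDict[c] for c in self.labels])
    scores -= scores.max()                 # stabilise the exponentials
    expScores = np.exp(scores)
    return expScores / expScores.sum()     # probabilities summing to 1
```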
8. Run Main.py
After you implement and run the code, you should have the following results in
Data/GaussianDiscriminantAnalysis_RS=0
|        | Precision | Recall | F1 | Specificity | GMean | IbaGMean | Support |
|--------|-----------|--------|----|-------------|-------|----------|---------|
| dragon | 1 | 1 | 1 | 1 | 1 | 1 | 3 |
| fairy  | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| ground | 0 | 0 | 0 | 0.987342 | 0 | 0 | 0 |
| ice    | 0 | 0 | 0 | 0.987097 | 0 | 0 | 3 |
| normal | 1 | 1 | 1 | 1 | 1 | 1 | 17 |

Accuracy: 0.879747
``` csv
TrueY,PredY
normal,normal
normal,normal
bug,bug
water,water
```
Since the random state is fixed, you should see the same, or at least very similar, results as
the expected results shown above.
You can only receive marks for a function or method if you implement it completely; otherwise,
you will get 0 for that method.
Marks are given for implemented methods; submitting only the result files will not receive
any marks.