# ELG 5255 Applied Machine Learning

Fall 2020

# Assignment 3 (Multivariate Method)


Start Date:

Due Date: Eastern Time (US and Canada)

## Submission
You must submit your assignment online through Brightspace. This is the only method by which
we accept assignment submissions. We do not accept assignments sent via email, and we are not
able to enter a mark if the assignment is not submitted on Brightspace! The deadline is
firm: you cannot submit an assignment past the deadline. It is the student's
responsibility to ensure that the assignment has been submitted properly. A mark of 0 will be
assigned to any missing assignment.

The assignment must be done individually. Any teamwork, and any work copied from a source
external to the student, will be considered academic fraud and will be reported to the
Faculty of Engineering as a breach of integrity. The consequence of academic fraud is, at the
very least, an F for this course. Note that we use sophisticated software to
compare assignments (with other students' and with other sources...). Therefore, you must
take all appropriate measures to make sure that others cannot copy your assignment
(hence, do not leave your workstation unattended).

## Goal
This time we will build a multiclass multivariate Gaussian Discriminant Analysis (GDA) model
to classify a Pokémon's type from all of its attributes.

## Dataset
During this assignment, we keep using the Pokémon dataset. By invoking
loadSingleFileDataset(), you can load the dataset. Here is an example:

``` python
# loadSingleFileDataset() is provided in Utils.py; randomState is the fixed random seed.
trX, teX, trY, teY = loadSingleFileDataset('Data/Pokemon.csv', 'type1', None,
                                           True, randomState, 0.3)
```

## Gaussian Discriminant Analysis (GDA)

### Model

GDA is a generative model. It assumes that the instances of each class are sampled from a
Gaussian distribution, so that

$$p(x \mid C_i) = \mathcal{N}(x;\, \mu_i, \Sigma_i),$$

where $C_i$ is the class, e.g. normal, bug, ..., water.

### Maximum Likelihood

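We have derived the formulas many times in lecture notes and assignments, so this time we
only show the conclusion.

$$P(C_i) = \frac{N_i}{N}, \qquad \mu_i = \frac{1}{N_i} \sum_{t:\, y^t = C_i} x^t, \qquad \Sigma_i = \frac{1}{N_i} \sum_{t:\, y^t = C_i} (x^t - \mu_{y^t})(x^t - \mu_{y^t})^T$$

$C_i$ is one of the classes, like normal, bug, ..., water

$N_i$ is the number of samples in $C_i$

$N$ is the total number of samples

$x^t$ is the $t$-th training sample and $y^t$ is its class

$\mu_{y^t}$ is the mean of $x^t$'s corresponding class

Let's assume the models share the same covariance, $\Sigma = \sum_i P(C_i)\, \Sigma_i$.

Therefore,

$$\Sigma = \frac{1}{N} \sum_{t=1}^{N} (x^t - \mu_{y^t})(x^t - \mu_{y^t})^T$$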
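As a rough illustration of these estimates, the sketch below computes the priors, the class means, and a shared covariance on a tiny made-up dataset. The data, the variable names `prior`, `mean`, `cov`, and the label values are assumptions for the example only; they are not taken from Main.py or Pokemon.csv.

```python
import numpy as np

# Toy data (hypothetical): 6 samples, 2 features, 2 classes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.asarray(['bug', 'bug', 'bug', 'water', 'water', 'water'], dtype=object)

labels = sorted(set(y))
N = X.shape[0]

# Prior of each class: N_i / N.
prior = {c: np.sum(y == c) / N for c in labels}
# Mean vector of each class.
mean = {c: X[y == c].mean(axis=0) for c in labels}

# Shared covariance: average outer product of each sample's deviation
# from the mean of its own class.
cov = np.zeros((X.shape[1], X.shape[1]))
for xt, yt in zip(X, y):
    d = (xt - mean[yt]).reshape(-1, 1)  # explicit 2-D column vector, shape (2, 1)
    cov += np.dot(d, d.T)               # (2, 1) x (1, 2) -> (2, 2)
cov /= N
```

The explicit `reshape(-1, 1)` matters here: with plain 1-D vectors, `np.dot(d, d)` would return a scalar instead of a matrix.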

### Classification

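In the following formulas, all $x$ are column vectors, and so are all $\mu_i$. Nonetheless, in
practice, $x$ or $\mu_i$ are usually stored as row vectors. Therefore, we prefer the
$\mu_i^T \Sigma^{-1} x$ form.

Moreover, since both $\Sigma$ and $\Sigma^{-1}$ are symmetric positive semi-definite matrices,
$\Sigma^T = \Sigma$ and $(\Sigma^{-1})^T = \Sigma^{-1}$. As for numpy, if v is a 1-D vector,
v.T is the same as v. We will talk about numpy usage in the next sections.

$$P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{\sum_{k=1}^{K} p(x \mid C_k)\, P(C_k)}$$

$K$ is the number of classes

Let

$$h_i(x) = \ln\big(p(x \mid C_i)\, P(C_i)\big) = -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(C_i),$$

where $d$ is the number of attributes.

Since $x^T \Sigma^{-1} \mu_i$ is scalar, $x^T \Sigma^{-1} \mu_i = (x^T \Sigma^{-1} \mu_i)^T = \mu_i^T \Sigma^{-1} x$, so

$$(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) = x^T \Sigma^{-1} x - 2\,\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i$$

In this assignment, we will not only determine the label for each x, but also calculate the
probability of each class given x. Let's look at the first task.

$$\hat{y} = \arg\max_i P(C_i \mid x) = \arg\max_i h_i(x) = \arg\max_i \left( \mu_i^T \Sigma^{-1} x - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(C_i) + c \right)$$

, where $c = -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2} x^T \Sigma^{-1} x$ is the same for every class and won't influence the result.

Let $g_i(x) = w_i^T x + b_i$, where $w_i = \Sigma^{-1} \mu_i$, and $b_i = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(C_i)$, so that $\hat{y} = \arg\max_i g_i(x)$.

Now, it's time to calculate the probability of each class given x.

$$P(C_i \mid x) = \frac{\exp(h_i(x))}{\sum_{k=1}^{K} \exp(h_k(x))}$$

Let $h_i(x) = g_i(x) + c$, with the same class-independent $c$ as above.

Therefore,

$$P(C_i \mid x) = \frac{e^{c} \exp(g_i(x))}{e^{c} \sum_{k=1}^{K} \exp(g_k(x))} = \frac{\exp(g_i(x))}{\sum_{k=1}^{K} \exp(g_k(x))}$$

As a conclusion:

$\hat{y} = \arg\max_i g_i(x)$, where $g_i(x) = w_i^T x + b_i$, $w_i = \Sigma^{-1} \mu_i$, $b_i = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(C_i)$

$P(C_i \mid x) = \dfrac{\exp(g_i(x))}{\sum_{k=1}^{K} \exp(g_k(x))}$, where $g_k(x) = w_k^T x + b_k$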
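To check the reasoning in this section numerically, the sketch below compares the posterior obtained directly from Bayes' rule with Gaussian densities against the posterior obtained from a linear score of the form w·x + b per class, with w = Σ⁻¹μ and b = −½ μᵀΣ⁻¹μ + ln P(C). All parameter values are made up for illustration, and `np.linalg.inv` is used only as a stand-in for the inv() helper from Utils.py, whose exact behaviour is not shown here.

```python
import numpy as np

# Hypothetical fitted parameters: two classes, two features, shared covariance.
labels = ['bug', 'water']
mean = {'bug': np.array([2.0, 2.0]), 'water': np.array([7.0, 6.0])}
prior = {'bug': 0.5, 'water': 0.5}
cov = np.array([[4.0, 2.0], [2.0, 4.0]]) / 6.0
covInv = np.linalg.inv(cov)  # stand-in for inv() from Utils.py

def gaussian_pdf(x, mu):
    """Density of N(mu, cov) at x, with x and mu as 1-D vectors."""
    d = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(cov) ** -0.5
    return norm * np.exp(-0.5 * np.dot(np.dot(diff, covInv), diff))

x = np.array([4.0, 4.0])

# Posterior straight from Bayes' rule.
joint = np.array([gaussian_pdf(x, mean[c]) * prior[c] for c in labels])
posterior_bayes = joint / joint.sum()

# Same posterior from the linear scores: w = covInv mu, b = -1/2 mu' covInv mu + ln prior.
w = {c: np.dot(covInv, mean[c]) for c in labels}
b = {c: -0.5 * np.dot(np.dot(mean[c], covInv), mean[c]) + np.log(prior[c])
     for c in labels}
g = np.array([np.dot(w[c], x) + b[c] for c in labels])
posterior_linear = np.exp(g) / np.exp(g).sum()

predicted = labels[int(np.argmax(g))]
```

Both posteriors agree because the terms that the linear score drops are identical for every class and cancel in the ratio.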

## Instructions

### File Structure

1. Data

This folder stores our dataset and results. When you start the assignment, it should
contain Pokemon.csv. You must not change the content of that file; otherwise, it
will affect your final grade. After you implement all functions and methods, a
folder named GaussianDiscriminantAnalysis_RS=0 should appear. It contains all the
results: a confusion matrix, a precision-recall curve, a report comparing the
metric scores of each class, a CSV file storing predicted labels and true
labels, and a ROC curve.

2. Main.py

This is where you should implement the assignment. Please note that any function marked
as !!! Must Not Change the Content !!! must not be changed; otherwise, it will
affect your final grade. All functions or methods marked by ToDo: Implement This
Function should be fully implemented.

3. Readme

Assignment Instructions.

4. requirements.txt

The Python packages required to finish the assignment. Please run the following command to
install them. Please note that numpy v1.19.4 currently has bugs on Windows;
therefore numpy v1.19.3 will be used in this assignment.

``` bash
pip install -r requirements.txt
```

5. TicToc.py

Used for timing functions. No need to change or check.

6. Utils.py

It contains functions for inverting a matrix, loading the dataset, and generating results.

7. NumpyDotExamples.py

A demo showing how to use numpy.dot.


### Main Function (main() in Main.py)

This function will perform:

1. Load the Pokémon dataset
2. Split the dataset into a training set and a testing set
3. Train the classifier
4. Test on the testing set
5. Save the results to Data/GaussianDiscriminantAnalysis_RS=0

### Numpy Tricks

numpy.dot will be used several times in this assignment, so let's talk about it briefly.
Although the formulas strictly distinguish row and column vectors, numpy normally
represents vectors as 1-D arrays with shapes like (n,). The transpose of such a 1-D
array is itself (v.T has the same shape and values as v). In other words, a 1-D array
does not distinguish between row and column vectors.

When we invoke numpy.dot, a 1-D vector that shows up as the first argument is treated
like a row vector, no matter whether it has been transposed or not, and one that shows up as
the second argument is treated as a column vector. Therefore, for 1-D vectors a and b and a
matrix M, numpy.dot(a, b) is the inner product $a^T b$, and numpy.dot(numpy.dot(a, M), b) is
the scalar $a^T M b$.

If we want a column vector multiplied by a row vector to produce a matrix (which is
needed when we calculate the covariance matrix), the column vector must be explicitly
converted into a 2-D matrix with shape (n, 1). Use numpy.reshape to convert a 1-D vector
into a 2-D column vector.
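The behaviour described above can be tried out with a few small arrays. This is only an illustrative sketch with made-up values; it is not the content of NumpyDotExamples.py.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])      # 1-D vector, shape (3,)
A = np.arange(9.0).reshape(3, 3)   # 3x3 matrix [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

same_shape = v.T.shape == v.shape  # transposing a 1-D array changes nothing
inner = np.dot(v, v)               # inner product, a scalar
row_times_matrix = np.dot(v, A)    # v is treated as a row vector, result shape (3,)
matrix_times_col = np.dot(A, v)    # v is treated as a column vector, result shape (3,)

col = v.reshape(-1, 1)             # explicit 2-D column vector, shape (3, 1)
outer = np.dot(col, col.T)         # (3, 1) x (1, 3) -> (3, 3) matrix
```

Note that `np.dot(v, v.T)` would still be the scalar inner product, which is why the explicit reshape is needed to get the outer product.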

In this assignment, y is not encoded, meaning that the labels are a list of strings. If we
let numpy infer the data type of a string array, it picks a fixed-width string type and clips
any longer string stored into it. Therefore, we should use numpy.asarray to convert a string
to an array item and explicitly specify the data type as object. This trick will be used in
the predictSample function.
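The clipping is easy to reproduce. In the sketch below (label values chosen only for illustration), an array whose type was inferred from three-character strings truncates a longer label, while an object array keeps it intact.

```python
import numpy as np

# dtype is inferred as '<U3': every item holds at most 3 characters.
fixed = np.array(['bug', 'ice'])
fixed[0] = 'electric'        # silently clipped to 3 characters

# With dtype=object each item is an ordinary Python string of any length.
flexible = np.asarray(['bug', 'ice'], dtype=object)
flexible[0] = 'electric'     # stored in full
```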

### What We Need to Do and Implement

You must not change the signature (the format of the parameters and return values) of the
functions that you need to implement.

You must not change the imports. No additional Python packages may be used.

You will see estimated line counts, e.g., ≈1~2 lines. These are only estimates; you do not
have to finish the code within that many lines.

1. Install Required Packages

``` bash
pip install -r requirements.txt
```

2. Open Main.py

3. Implement the calculation of self.meanDict and self.priorDict within the
GaussianDiscriminantAnalysis.fit function (20 marks)

Calculate the mean vector and prior probability of each class, and store them in
self.meanDict and self.priorDict respectively.

self.meanDict should look like {'bug': array([8.03030303e-01, ...,]), ..., 'water':
array([1.02459016e+00, ...,])}, and self.priorDict should look like {'bug':
0.08991825613079019, ..., 'water': 0.16621253405994552}

4. Implement the calculation of self.cov and self.covInv within the
GaussianDiscriminantAnalysis.fit function (20 marks)

Calculate the covariance matrix and its inverse, both of which should be symmetric positive
semi-definite matrices. The inv() function has been provided in Utils.py and is already imported.

5. Implement the calculation of self.wDict and self.bDict within the
GaussianDiscriminantAnalysis.fit function (20 marks)

Calculate w and b of each class, and store them in self.wDict and self.bDict
respectively.

self.wDict should look like {'bug': array([ 1.63711141e+00, ...,]), ..., 'water':
array([-2.19318664e+00, ...,])}, and self.bDict should look like {'bug':
-542.513594693476, ..., 'water': -485.7837037892891}

6. Implement GaussianDiscriminantAnalysis.predictSample (20 marks)

Predict the label of a sample x. Hint: unless the type of the return value is specified
explicitly, numpy will clip the string. Therefore, it's better to use np.asarray with the
type of y, which is stored in self.yType.

``` python
def predictSample(self, x):
    ...  # to be implemented
```

7. Implement GaussianDiscriminantAnalysis.predictSampleProba (20 marks)

Predict the probability of each class for x. This function should take an x as input and
output an array with the probability of each class. The probabilities should sum to 1, and
their order should be the same as self.labels.

Its return value should look like array([5.83179717e-57, 2.72778464e-38, ...,1.32143347e-39])

``` python
def predictSampleProba(self, x):
    ...  # to be implemented
```

8. Run Main.py

9. Open and check results files
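For step 7, class scores of several hundred below zero (compare the self.bDict values shown in step 5) can make a direct exponential underflow to zero. One common safeguard, shown here as an optional sketch and not as something the assignment prescribes, is to subtract the largest score before exponentiating, which leaves the normalised probabilities unchanged. The function name `softmax_stable` and the score values are hypothetical.

```python
import numpy as np

def softmax_stable(g):
    """Normalise exp(g) to sum to 1 without underflowing every term."""
    g = np.asarray(g, dtype=float)
    e = np.exp(g - np.max(g))   # the largest score maps to exp(0) = 1
    return e / e.sum()

# Hypothetical class scores, all far below zero.
g = np.array([-1200.0, -1100.0, -1090.0])

naive = np.exp(g)               # every entry underflows to 0.0
proba = softmax_stable(g)       # still a valid probability vector
```

Dividing `naive` by its sum would give 0/0 here, whereas `proba` sums to 1 and keeps the ranking of the classes.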


### Expected Results

After you implement and run the code, you should have the following results in
Data/GaussianDiscriminantAnalysis_RS=0

1. Confusion Matrix (GaussianDiscriminantAnalysis_RS=0_ConfusionMatrix.png):

2. Precision-Recall Curve (GaussianDiscriminantAnalysis_RS=0_PrecisionRecallCurve.png):

3. Result Report (GaussianDiscriminantAnalysis_RS=0_Report.xlsx):

 
|  | Precision | Recall | F1 | Specificity | GMean | IbaGMean | Support |
|---|---|---|---|---|---|---|---|
| bug | 0.95 | 0.95 | 0.95 | 0.992754 | 0.971142 | 0.939084 | 20 |
| dark | 1 | 0.5 | 0.666667 | 1 | 0.707107 | 0.475 | 4 |
| dragon | 1 | 1 | 1 | 1 | 1 | 1 | 3 |
| electric | 0.8 | 1 | 0.888889 | 0.993506 | 0.996748 | 0.994152 | 4 |
| fairy | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| fighting | 0.833333 | 0.833333 | 0.833333 | 0.993421 | 0.909863 | 0.814598 | 6 |
| fire | 0.9 | 1 | 0.947368 | 0.993289 | 0.996639 | 0.993955 | 9 |
| ghost | 0.666667 | 1 | 0.8 | 0.987013 | 0.993485 | 0.988295 | 4 |
| grass | 1 | 0.857143 | 0.923077 | 1 | 0.92582 | 0.844898 | 21 |
| ground | 0 | 0 | 0 | 0.987342 | 0 | 0 | 0 |
| ice | 0 | 0 | 0 | 0.987097 | 0 | 0 | 3 |
| normal | 1 | 1 | 1 | 1 | 1 | 1 | 17 |
| poison | 1 | 0.857143 | 0.923077 | 1 | 0.92582 | 0.844898 | 7 |
| psychic | 0.9 | 1 | 0.947368 | 0.993289 | 0.996639 | 0.993955 | 9 |
| rock | 1 | 0.6 | 0.75 | 1 | 0.774597 | 0.576 | 15 |
| steel | 0.428571 | 1 | 0.6 | 0.974194 | 0.987012 | 0.976708 | 3 |
| water | 0.878788 | 0.935484 | 0.90625 | 0.968504 | 0.951851 | 0.903028 | 31 |
| Weighted Avg | 0.908828 | 0.879747 | 0.882688 | 0.990661 | 0.921653 | 0.864629 | 158 |
| Macro Avg | 0.785727 | 0.796065 | 0.772708 | 0.992377 | 0.831572 | 0.784975 | 158 |
| Accuracy | 0.879747 |  |  |  |  |  |  |

4. Results CSV (GaussianDiscriminantAnalysis_RS=0_Result.csv):

``` csv
TrueY,PredY
normal,normal
normal,normal
bug,bug
water,water
```

5. ROC Curve (GaussianDiscriminantAnalysis_RS=0_RocCurve.png):
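The derived columns of the result report can be recomputed from precision, recall, and specificity. The sketch below assumes that F1 is the harmonic mean of precision and recall, that GMean is the geometric mean of recall and specificity, and that IbaGMean is the index of balanced accuracy applied to the squared GMean with a weight of 0.1. These definitions are not stated in the handout; they are assumptions that happen to reproduce the 'dark' and 'bug' rows. The function name `derived_scores` is hypothetical.

```python
import math

def derived_scores(precision, recall, specificity, alpha=0.1):
    """Return (F1, GMean, IbaGMean) under the assumed definitions."""
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    iba_gmean = (1 + alpha * (recall - specificity)) * gmean ** 2
    return f1, gmean, iba_gmean

# 'dark' row of the report: precision 1, recall 0.5, specificity 1.
dark = derived_scores(1.0, 0.5, 1.0)
# 'bug' row of the report: precision 0.95, recall 0.95, specificity 0.992754.
bug = derived_scores(0.95, 0.95, 0.992754)
```

A class with zero recall, such as ice in the report, gets a GMean and IbaGMean of 0 under these definitions regardless of its specificity.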


### Marking Criterion

Since the random state is fixed, you should see the same or at least similar results as the
expected results.

You get marks for a function or method only if you implement it completely; otherwise, you
will get 0 for that method.

Marks are awarded for implementing the methods; submitting only the result files will not
receive any marks.

1. Implement the calculation of self.meanDict and self.priorDict within
GaussianDiscriminantAnalysis.fit function (20 marks)
2. Implement the calculation of self.cov and self.covInv within
GaussianDiscriminantAnalysis.fit function (20 marks)
3. Implement the calculation of self.wDict and self.bDict within
GaussianDiscriminantAnalysis.fit function (20 marks)
4. Implement GaussianDiscriminantAnalysis.predictSample (20 marks)
5. Implement GaussianDiscriminantAnalysis.predictSampleProba (20 marks)
