
Repairs chunk generation for unknown labels, solves issue #260 #263


Merged
merged 14 commits into from
Jan 3, 2020

Conversation


@RobinVogel RobinVogel commented Nov 14, 2019

The discussions in pull request #254 were about the generation of chunks in constraints.py not providing a map from the index of each element to its chunk in the presence of unknown labels.

To make it clearer, below is a comparison, in the presence of unknown labels, between the output I expected from Constraints.chunks, i.e. expected_chunk, and the one I actually obtain, i.e. chunks:

In [1]: from metric_learn.constraints import Constraints                                                                                                                                                        

In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]                                                        

In [3]: cons = Constraints(partial_labels)                                                             

In [4]: cons.chunks(num_chunks=3, chunk_size=2)                                                        
Out[4]: array([0, 1, 1, 0, 2, 2])

In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)                                               

In [6]: len(chunks), len(partial_labels)                                                               
Out[6]: (6, 7)

In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]

The output is not, as I expected, a map such that chunks[i] is the chunk of element i, although this might not matter inside RCA.

This pull request fixes that, which resolves #260 about RCA crashing when used with unknown labels. Since unknown labels seem to be handled in rca.py through a variable named chunk_mask, I expect rca.py to work with unknown labels now.

I had to store the partial labels in constraints.py because I need to locate the unknown labels (label < 0) in partial_labels. I also cleaned up the remains of unittest in test_constraints.py.
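
For reference, the idea looks roughly like this (an illustrative sketch, not the exact code of this PR): the output array is aligned with partial_labels, and positions holding an unknown label keep the value -1.

```python
import numpy as np

def chunks_with_unknown(partial_labels, num_chunks, chunk_size,
                        random_state=np.random):
    # Output aligned index-by-index with partial_labels; -1 means
    # "in no chunk", which covers all points with an unknown label.
    partial_labels = np.asarray(partial_labels)
    chunks = -np.ones(len(partial_labels), dtype=int)
    # Pools of indices per known class; unknown labels (< 0) are skipped.
    pools = [list(np.where(partial_labels == c)[0])
             for c in np.unique(partial_labels) if c >= 0]
    for chunk_id in range(num_chunks):
        pools = [p for p in pools if len(p) >= chunk_size]
        if not pools:
            break  # the real code would raise an informative error here
        pool = pools[random_state.randint(len(pools))]
        picked = random_state.choice(len(pool), chunk_size, replace=False)
        for i in sorted(picked, reverse=True):
            chunks[pool.pop(i)] = chunk_id
    return chunks
```

On the example above, chunks_with_unknown([1, 2, 2, 1, -1, 3, 3], num_chunks=3, chunk_size=2) returns an array of length 7 with -1 at index 4, matching expected_chunk up to a permutation of chunk ids.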


@bellet bellet left a comment


Thanks!

Some tests would be nice, in particular one which involves running RCA on a dataset which contains unlabeled points (which would crash before this PR). I think a natural candidate is to test whether RCA returns the same thing when fitted on a dataset with unlabeled points and on the same dataset without the unlabeled points.

A second test to check that unlabeled points are not assigned to any chunk would be good too.
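
For the first one, something roughly like this (an illustrative sketch; it assumes RCA_Supervised takes a random_state in its constructor, exposes the learned transformation as components_, and that chunk generation with a fixed seed depends only on the labeled points):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.datasets import make_classification
from metric_learn import RCA_Supervised

def test_rca_supervised_ignores_unlabeled_points():
    n = 200
    X, y = make_classification(n_samples=2 * n, n_features=6,
                               n_informative=6, n_redundant=0,
                               random_state=42)
    # Same dataset, but with the second half marked as unlabeled (-1).
    y_partial = np.concatenate((y[:n], -np.ones(n, dtype=int)))
    rca = RCA_Supervised(num_chunks=50, random_state=42)
    rca.fit(X[:n], y[:n])
    rca_partial = RCA_Supervised(num_chunks=50, random_state=42)
    rca_partial.fit(X, y_partial)  # crashed before this PR
    # The unlabeled points must not influence the learned metric.
    assert_array_equal(rca.components_, rca_partial.components_)
```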


RobinVogel commented Nov 28, 2019

I wrote both: the one on RCA in tests/metric_learn_test.py in the class TestRCA, and the one on the constraints in tests/test_constraints.py. I am interested in understanding why you chose these tests.

The code coverage check did not pass because the method adjacency_matrix of Constraints in constraints.py is not tested, and I added a line to that function, which yields a negative coverage difference.

After grepping all the .py files, I noticed that the function is not used anywhere in metric-learn.
I don't see why it exists. To fix the pull request, I can either:

  • Delete it altogether,
  • Write a test for that function.

For the moment, I opted for deleting it.

@RobinVogel RobinVogel requested a review from bellet November 29, 2019 10:38

@bellet bellet left a comment


Thanks. There is a problem in one of the tests. Otherwise LGTM. Indeed adjacency_matrix is not used anywhere anymore since we rewrote the API. I do not think we will ever need it again so it is safe to delete it.


bellet commented Dec 5, 2019

> I wrote both: the one on RCA in tests/metric_learn_test.py in the class TestRCA, and the one on the constraints in tests/test_constraints.py. I am interested in understanding why you chose these tests.

The test on RCA is a natural one given the change made in this PR: it checks that unlabeled points are ignored. Note also that the test would have failed in the previous version (since RCA would crash in the presence of unlabeled points). Generally, when fixing a bug, one should always write a test that would have failed before the fix, to make sure the bug is not reintroduced later without notice.

The test checking that no unlabeled point is assigned to any chunk relates to the fact that the change in this PR keeps entries corresponding to unlabeled points in the chunk array. Therefore it makes sense to check that there is no indexing issue, i.e. that no such point is actually assigned to a chunk.
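
Concretely, with the example from the top of this PR, it could be as simple as (a sketch, assuming the -1 "no chunk" convention shown in expected_chunk):

```python
import numpy as np
from metric_learn.constraints import Constraints

def test_unlabeled_points_are_in_no_chunk():
    partial_labels = np.array([1, 2, 2, 1, -1, 3, 3])
    chunks = Constraints(partial_labels).chunks(num_chunks=3, chunk_size=2)
    # The chunk array is aligned with partial_labels, and every position
    # holding an unknown label must keep the "no chunk" value -1.
    assert len(chunks) == len(partial_labels)
    assert np.all(chunks[partial_labels < 0] == -1)
```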


RobinVogel commented Dec 10, 2019

I spent some time trying to understand why I always get a covariance matrix composed of NaNs in my tests. This is caused by two issues stacked up:

  • The default parameters of scikit-learn's make_classification include n_redundant=2, which means that two of the features are linear combinations of the informative ones. In that case, the algorithm fits the data but returns a useless NaN matrix. Using only informative features solves this problem; see the documentation for more details.

  • The default value of chunk_size in RCA_Supervised is 2, and at some point one subtracts the average from the points in each chunk, e.g. if I have x1 and x2, I replace them with (x1-x2)/2 and (x2-x1)/2. It means that with chunks of size 2, the maximum rank of my covariance matrix is num_chunks, which might be below the number of features in small-scale tests.

I think both points should be addressed with proper errors. Should I raise an issue, or try to address them in this PR? (A small numerical illustration of the second point is below.)
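
To illustrate the second point numerically (a standalone numpy check, not code from this PR):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, num_chunks, chunk_size = 20, 5, 2
chunks = [rng.randn(chunk_size, n_features) for _ in range(num_chunks)]
# Center each chunk as RCA does, then stack the centered points.
centered = np.vstack([c - c.mean(axis=0) for c in chunks])
inner_cov = centered.T.dot(centered)
# Centering removes one degree of freedom per chunk, so the rank is
# at most num_chunks * (chunk_size - 1) = 5, well below n_features = 20,
# and the matrix cannot be inverted.
print(np.linalg.matrix_rank(inner_cov))  # prints 5
```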


bellet commented Dec 10, 2019

> I spent some time trying to understand why I always get a covariance matrix composed of NaNs in my tests. This is caused by two issues stacked up:
>
> * The default parameters of scikit-learn's `make_classification` include `n_redundant=2`, which means that two of the features are linear combinations of the informative ones. In that case, the algorithm fits the data but returns a useless NaN matrix. Using only informative features solves this problem; see [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) for more details.
>
> * The default value of `chunk_size` in `RCA_Supervised` is 2, and at some point one subtracts the average from the points in each chunk, e.g. if I have `x1` and `x2`, I replace them with `(x1-x2)/2` and `(x2-x1)/2`. It means that with chunks of size 2, the maximum rank of my covariance matrix is `num_chunks`, which might be below the number of features in small-scale tests.
>
> I think both points should be addressed with proper errors. Should I raise an issue, or try to address them in this PR?

Good observations.

Note that we already have a warning which is raised in this case, advising the user to first reduce the dimension of the data:

UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain NaN values. You should reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.

Regarding how to improve the errors:

  • IMO we should not check whether some features are linear combinations of others; this is too costly (and can be fixed by reducing the dimension). This is the responsibility of the user.
  • It seems that what you describe for chunk_size=2 generalizes to larger chunk sizes, in the sense that we have a general bound on the rank of the matrix as a function of chunk_size and num_chunks (centering removes one degree of freedom per chunk, so the rank is at most num_chunks * (chunk_size - 1)); when this bound is below the number of features, we know for sure the fit will fail based only on the input parameters. If this is indeed the case, then I agree we should throw a specific error and explain the bound to the user so she can adjust the parameters (see the sketch below).
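
For instance, something like this (a hypothetical sketch; the names and the message are illustrative, not final code):

```python
def check_enough_chunked_points(n_features, num_chunks, chunk_size):
    # Centering removes one degree of freedom per chunk, so the inner
    # covariance matrix has rank at most num_chunks * (chunk_size - 1).
    max_rank = num_chunks * (chunk_size - 1)
    if max_rank < n_features:
        raise ValueError(
            'Cannot learn a full-rank transformation: %d chunks of size %d '
            'give an inner covariance matrix of rank at most %d, which is '
            'below the number of features (%d). Increase num_chunks and/or '
            'chunk_size, or reduce the dimensionality of the input.'
            % (num_chunks, chunk_size, max_rank, n_features))
```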


bellet commented Dec 10, 2019

For the first point, we could simply adapt the current warning:

UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain NaN values. You should remove any linearly dependent features and/or reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.

@RobinVogel RobinVogel requested a review from bellet December 23, 2019 09:03

RobinVogel commented Dec 23, 2019

I made all the expected modifications and added the remark about the bound on the rank.


@bellet bellet left a comment


LGTM, thanks!

@bellet bellet merged commit 0370198 into scikit-learn-contrib:master Jan 3, 2020
@RobinVogel RobinVogel deleted the issue_260 branch January 13, 2020 20:47
Successfully merging this pull request may close these issues.

[Bug] RCA_Supervised crashes when fit on dataset with unlabeled points