
Repairs chunk generation for unknown labels, solves issue #260 #263


Merged
merged 14 commits into from
Jan 3, 2020

Conversation


@RobinVogel RobinVogel commented Nov 14, 2019

The discussions in pull request #254 were about the generation of chunks in constraints.py not providing a map from the index of each element to its chunk in the presence of unknown labels.

To make it clearer, below is a comparison, in the presence of unknown labels, between the output I expected from Constraints.chunks, i.e. expected_chunk, and the one I actually obtain, i.e. chunks:

In [1]: from metric_learn.constraints import Constraints                                                                                                                                                        

In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]                                                        

In [3]: cons = Constraints(partial_labels)                                                             

In [4]: cons.chunks(num_chunks=3, chunk_size=2)                                                        
Out[4]: array([0, 1, 1, 0, 2, 2])

In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)                                               

In [6]: len(chunks), len(partial_labels)                                                               
Out[6]: (6, 7)

In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]

The output is not, as I expected, a map such that chunks[i] is the chunk of element i, although this might not matter inside RCA.

This pull request fixes that, which resolves #260 about RCA crashing when used with unknown labels. Since unknown labels seem to be handled in rca.py through a variable named chunk_mask, I expect rca.py to work with unknown labels now.

I had to store the partial labels in constraints.py because I need to locate the unknown labels (label < 0) in partial_labels. I also cleaned up the remains of unittest in test_constraints.py.
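
For reference, the idea looks roughly like this (an illustrative sketch, not the exact code of this PR): the output array is aligned with partial_labels, and positions holding an unknown label keep the value -1.

```python
import numpy as np

def chunks_with_unknown(partial_labels, num_chunks, chunk_size,
                        random_state=np.random):
    # Output aligned index-by-index with partial_labels; -1 means
    # "in no chunk", which covers all points with an unknown label.
    partial_labels = np.asarray(partial_labels)
    chunks = -np.ones(len(partial_labels), dtype=int)
    # Pools of indices per known class; unknown labels (< 0) are skipped.
    pools = [list(np.where(partial_labels == c)[0])
             for c in np.unique(partial_labels) if c >= 0]
    for chunk_id in range(num_chunks):
        pools = [p for p in pools if len(p) >= chunk_size]
        if not pools:
            break  # the real code would raise an informative error here
        pool = pools[random_state.randint(len(pools))]
        picked = random_state.choice(len(pool), chunk_size, replace=False)
        for i in sorted(picked, reverse=True):
            chunks[pool.pop(i)] = chunk_id
    return chunks
```

On the example above, chunks_with_unknown([1, 2, 2, 1, -1, 3, 3], num_chunks=3, chunk_size=2) returns an array of length 7 with -1 at index 4, matching expected_chunk up to a permutation of chunk ids.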


@bellet bellet left a comment


Thanks!

Some tests would be nice, in particular one which involves running RCA on a dataset which contains unlabeled points (which would crash before this PR). I think a natural candidate is to test whether RCA returns the same thing when fitted on a dataset with unlabeled points and on the same dataset without the unlabeled points.

A second test to check that unlabeled points are not assigned to any chunk would be good too.
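
For the first one, something roughly like this (an illustrative sketch; it assumes RCA_Supervised takes a random_state in its constructor, exposes the learned transformation as components_, and that chunk generation with a fixed seed depends only on the labeled points):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.datasets import make_classification
from metric_learn import RCA_Supervised

def test_rca_supervised_ignores_unlabeled_points():
    n = 200
    X, y = make_classification(n_samples=2 * n, n_features=6,
                               n_informative=6, n_redundant=0,
                               random_state=42)
    # Same dataset, but with the second half marked as unlabeled (-1).
    y_partial = np.concatenate((y[:n], -np.ones(n, dtype=int)))
    rca = RCA_Supervised(num_chunks=50, random_state=42)
    rca.fit(X[:n], y[:n])
    rca_partial = RCA_Supervised(num_chunks=50, random_state=42)
    rca_partial.fit(X, y_partial)  # crashed before this PR
    # The unlabeled points must not influence the learned metric.
    assert_array_equal(rca.components_, rca_partial.components_)
```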


RobinVogel commented Nov 28, 2019

I wrote both: the one on RCA in tests/metric_learn_test.py in the class TestRCA, and the one on the constraints in tests/test_constraints.py. I am interested in understanding why you chose these tests.

The code coverage check did not pass because the method adjacency_matrix of Constraints in constraints.py is not tested, and I added a line to that function, which yields a negative coverage difference.

After grepping all the .py files, I noticed that the function is not used anywhere in metric-learn.
I don't see why it exists. To fix the pull request, I can either:

  • Delete it altogether,
  • Write a test for that function.

For the moment, I opted for deleting it.

@RobinVogel RobinVogel requested a review from bellet November 29, 2019 10:38

@bellet bellet left a comment


Thanks. There is a problem in one of the tests. Otherwise LGTM. Indeed adjacency_matrix is not used anywhere anymore since we rewrote the API. I do not think we will ever need it again so it is safe to delete it.


bellet commented Dec 5, 2019

> I wrote both: the one on RCA in tests/metric_learn_test.py in the class TestRCA, and the one on the constraints in tests/test_constraints.py. I am interested in understanding why you chose these tests.

The test on RCA is a natural one given the change made in this PR: it checks that unlabeled points are ignored. Note also that the test would have failed in the previous version (since RCA would crash in the presence of unlabeled points). Generally, when fixing a bug, one should always write a test that would have failed before the fix, to make sure the bug is not reintroduced later without notice.

The test checking that no unlabeled point is assigned to any chunk relates to the fact that the change in this PR keeps entries corresponding to unlabeled points in the chunk array. Therefore it makes sense to check that there is no indexing issue, i.e. that no such point is actually assigned to a chunk.
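
Concretely, with the example from the top of this PR, it could be as simple as (a sketch, assuming the -1 "no chunk" convention shown in expected_chunk):

```python
import numpy as np
from metric_learn.constraints import Constraints

def test_unlabeled_points_are_in_no_chunk():
    partial_labels = np.array([1, 2, 2, 1, -1, 3, 3])
    chunks = Constraints(partial_labels).chunks(num_chunks=3, chunk_size=2)
    # The chunk array is aligned with partial_labels, and every position
    # holding an unknown label must keep the "no chunk" value -1.
    assert len(chunks) == len(partial_labels)
    assert np.all(chunks[partial_labels < 0] == -1)
```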


RobinVogel commented Dec 10, 2019

I spent some time trying to understand why I always get a covariance matrix composed of NaNs in my tests. This is caused by two issues stacked up:

  • The default parameters of scikit-learn's make_classification include n_redundant=2, which means that two of the features are linear combinations of the informative ones. In that case, the algorithm fits the data but returns a useless NaN matrix. Using only informative features solves this problem; see the documentation for more details.

  • The default value of chunk_size in RCA_Supervised is 2, and at some point one subtracts the average from the points in each chunk, e.g. if I have x1 and x2, I replace them with (x1-x2)/2 and (x2-x1)/2. It means that with chunks of size 2, the maximum rank of my covariance matrix is num_chunks, which might be below the number of features in small-scale tests.

I think both points should be addressed with proper errors. Should I raise an issue, or try to address them in this PR? (A small numerical illustration of the second point is below.)
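
To illustrate the second point numerically (a standalone numpy check, not code from this PR):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, num_chunks, chunk_size = 20, 5, 2
chunks = [rng.randn(chunk_size, n_features) for _ in range(num_chunks)]
# Center each chunk as RCA does, then stack the centered points.
centered = np.vstack([c - c.mean(axis=0) for c in chunks])
inner_cov = centered.T.dot(centered)
# Centering removes one degree of freedom per chunk, so the rank is
# at most num_chunks * (chunk_size - 1) = 5, well below n_features = 20,
# and the matrix cannot be inverted.
print(np.linalg.matrix_rank(inner_cov))  # prints 5
```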


bellet commented Dec 10, 2019

> I spent some time trying to understand why I always get a covariance matrix composed of NaNs in my tests. This is caused by two issues stacked up:
>
> * The default parameters of scikit-learn's `make_classification` include `n_redundant=2`, which means that two of the features are linear combinations of the informative ones. In that case, the algorithm fits the data but returns a useless NaN matrix. Using only informative features solves this problem; see [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) for more details.
>
> * The default value of `chunk_size` in `RCA_Supervised` is 2, and at some point one subtracts the average from the points in each chunk, e.g. if I have `x1` and `x2`, I replace them with `(x1-x2)/2` and `(x2-x1)/2`. It means that with chunks of size 2, the maximum rank of my covariance matrix is `num_chunks`, which might be below the number of features in small-scale tests.
>
> I think both points should be addressed with proper errors. Should I raise an issue, or try to address them in this PR?

Good observations.

Note that we already have a warning which is raised in this case, advising the user to first reduce the dimension of the data:

UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain NaN values. You should reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.

Regarding how to improve the errors:

  • IMO we should not check whether some features are linear combinations of others; this is too costly (and can be fixed by reducing the dimension). This is the responsibility of the user.
  • It seems that what you describe for chunk_size=2 generalizes to larger chunk sizes, in the sense that we have a general bound on the rank of the matrix as a function of chunk_size and num_chunks (centering removes one degree of freedom per chunk, so the rank is at most num_chunks * (chunk_size - 1)); when this bound is below the number of features, we know for sure the fit will fail based only on the input parameters. If this is indeed the case, then I agree we should throw a specific error and explain the bound to the user so she can adjust the parameters (see the sketch below).
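
For instance, something like this (a hypothetical sketch; the names and the message are illustrative, not final code):

```python
def check_enough_chunked_points(n_features, num_chunks, chunk_size):
    # Centering removes one degree of freedom per chunk, so the inner
    # covariance matrix has rank at most num_chunks * (chunk_size - 1).
    max_rank = num_chunks * (chunk_size - 1)
    if max_rank < n_features:
        raise ValueError(
            'Cannot learn a full-rank transformation: %d chunks of size %d '
            'give an inner covariance matrix of rank at most %d, which is '
            'below the number of features (%d). Increase num_chunks and/or '
            'chunk_size, or reduce the dimensionality of the input.'
            % (num_chunks, chunk_size, max_rank, n_features))
```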


bellet commented Dec 10, 2019

For the first point, we could simply adapt the current warning:

UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain NaN values. You should remove any linearly dependent features and/or reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.

@RobinVogel RobinVogel requested a review from bellet December 23, 2019 09:03

RobinVogel commented Dec 23, 2019

I made all the expected modifications and added the remark about the bound on the rank.


@bellet bellet left a comment


LGTM, thanks!

@bellet bellet merged commit 0370198 into scikit-learn-contrib:master Jan 3, 2020
@RobinVogel RobinVogel deleted the issue_260 branch January 13, 2020 20:47
Successfully merging this pull request may close these issues.

[Bug] RCA_Supervised crashes when fit on dataset with unlabeled points