Skip to content

Break chunks generation in RCA when not enough possible chunks, fixes issue #200 #254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Nov 13, 2019

Conversation

RobinVogel
Copy link
Contributor

@RobinVogel RobinVogel commented Oct 29, 2019

Fixes #200
See comment #198 (comment)

I count for each set of instances of each label how many chunks it holds, and sum those.
If it is lower than the number of requested chunks, I raise an error.

@RobinVogel RobinVogel changed the title fixes issue 200 Break chunks generation in RCA when not enough possible chunks, fixes issue #200 Oct 29, 2019
@bellet
Copy link
Member

bellet commented Oct 29, 2019

Thanks for getting started on this. The solution looks fine but you need to write some tests to ensure that everything works as intended. You should test that the error is raised when we expect it, but it would also be useful to make sure everything goes through in cases when there is enough points. Generally it is good to test edge cases (e.g., just enough points, one point missing, etc).

@bellet
Copy link
Member

bellet commented Oct 29, 2019

Regarding where these tests should be put, I think they could either go to the class Test_RCA in metric_learn_test.py (as RCA_Supervised is currently the only algorithm using chunks), or to a new file test_constraints.py.

I think the second option is better for two reasons:

  • You can test the behavior of the method chunks directly, which will cover all other algorithms using chunks if we add any in the future
  • test_constraints.py would be the place where we can add more tests in the future for the methods in constraints.py (currently I think we only test the other methods indirectly through calls to supervised variants of specific algorithms, which is not ideal)

@RobinVogel
Copy link
Contributor Author

I took some time to get familiar with tests and wrote one for the two untested cases in the chunk generation that you discussed. I tried to model what I wrote on the existing tests, and wrote something relatively general, at the risk of being verbose.

I separated the generation of labels from the tests and wrote it myself, since it needs to satisfy some constraints for edge cases.

I hesitated to group the tests in a class, as in test_sklearn_compat.py, since test_constraints.py is supposed to test the class Constraints.

Copy link
Member

@bellet bellet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Indeed you could put the tests into a class, maybe simply TestChunks or something

constraints = Constraints(labels)
chunks = constraints.chunks(num_chunks=num_chunks, chunk_size=chunk_size,
random_state=SEED)
return chunks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of simply returning chunks, it would be nice to test that chunks contains num_chunks chunks, each composed of chunk_size points

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add it.

A remark, I expected chunks to map the datapoint index (input order) to the chunk number, but chunks removes -1 instances (unknown labels, according to line 21), which means that one has to consider datapoint indexes when removing -1's. But it must be taken into account in the implementation of RCA.

Copy link
Member

@bellet bellet Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this? Not sure I understand what you mean.

Do we test at all RCA_Supervised in the presence of unlabeled points? At first sight, it looks like it might fail due to unlabeled points being removed from the chunk array as pointed out by @RobinVogel. @wdevazelhes, any comment?

Copy link
Contributor Author

@RobinVogel RobinVogel Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just add an example to make it clearer. Below is a comparison in the presence of unknown labels between the output that I expected from Constraints.chunks, i.e. expected_chunk and the one that I obtain from Constraints.chunks, i.e. chunks:

In [1]: from metric_learn.constraints import Constraints                                                                                                                                                        

In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]                                                        

In [3]: cons = Constraints(partial_labels)                                                             

In [4]: cons.chunks(num_chunks=3, chunk_size=2)                                                        
Out[4]: array([0, 1, 1, 0, 2, 2])

In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)                                               

In [6]: len(chunks), len(partial_labels)                                                               
Out[6]: (6, 7)

In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]

The output is not a map such that expected_chunk[i] is the chunk of element i, which I expected, but might not matter in RCA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. And this is indeed a problem for RCA_Supervised, see the issue #260 that I just added. I have also added another issue (#261) to make clear in the doc that -1 is interpreted as unlabeled. You're very welcome to address those @RobinVogel while you have everything well in mind :-)

Copy link
Member

@bellet bellet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RobinVogel ! I think this will be good to merge when the two small points above are fixed

Copy link
Member

@bellet bellet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, merging

@bellet bellet merged commit 7700529 into scikit-learn-contrib:master Nov 13, 2019
@wdevazelhes
Copy link
Member

This looks great !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Break chunks generation in RCA when needed
3 participants