-
Notifications
You must be signed in to change notification settings - Fork 228
Break chunks generation in RCA when not enough possible chunks, fixes issue #200 #254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for getting started on this. The solution looks fine but you need to write some tests to ensure that everything works as intended. You should test that the error is raised when we expect it, but it would also be useful to make sure everything goes through in cases when there is enough points. Generally it is good to test edge cases (e.g., just enough points, one point missing, etc). |
Regarding where these tests should be put, I think they could either go to the class I think the second option is better for two reasons:
|
I took some time to get familiar with tests and wrote one for the two untested cases in the chunk generation that you discussed. I tried to model what I wrote on the existing tests, and wrote something relatively general, at the risk of being verbose. I separated the generation of labels from the tests and wrote it myself, since it needs to satisfy some constraints for edge cases. I hesitated to group the tests in a class, as in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Indeed you could put the tests into a class, maybe simply TestChunks
or something
test/test_constraints.py
Outdated
constraints = Constraints(labels) | ||
chunks = constraints.chunks(num_chunks=num_chunks, chunk_size=chunk_size, | ||
random_state=SEED) | ||
return chunks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of simply returning chunks
, it would be nice to test that chunks
contains num_chunks
chunks, each composed of chunk_size
points
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add it.
A remark, I expected chunks to map the datapoint index (input order) to the chunk number, but chunks removes -1 instances (unknown labels, according to line 21), which means that one has to consider datapoint indexes when removing -1's. But it must be taken into account in the implementation of RCA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on this? Not sure I understand what you mean.
Do we test at all RCA_Supervised
in the presence of unlabeled points? At first sight, it looks like it might fail due to unlabeled points being removed from the chunk array as pointed out by @RobinVogel. @wdevazelhes, any comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just add an example to make it clearer. Below is a comparison in the presence of unknown labels between the output that I expected from Constraints.chunks
, i.e. expected_chunk
and the one that I obtain from Constraints.chunks
, i.e. chunks
:
In [1]: from metric_learn.constraints import Constraints
In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]
In [3]: cons = Constraints(partial_labels)
In [4]: cons.chunks(num_chunks=3, chunk_size=2)
Out[4]: array([0, 1, 1, 0, 2, 2])
In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)
In [6]: len(chunks), len(partial_labels)
Out[6]: (6, 7)
In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]
The output is not a map such that expected_chunk[i]
is the chunk of element i
, which I expected, but might not matter in RCA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. And this is indeed a problem for RCA_Supervised
, see the issue #260 that I just added. I have also added another issue (#261) to make clear in the doc that -1
is interpreted as unlabeled. You're very welcome to address those @RobinVogel while you have everything well in mind :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RobinVogel ! I think this will be good to merge when the two small points above are fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, merging
This looks great ! |
Fixes #200
See comment #198 (comment)
I count for each set of instances of each label how many chunks it holds, and sum those.
If it is lower than the number of requested chunks, I raise an error.