ENH Array API support for LabelEncoder #27381
Conversation
Hi, I don't have time to finalize my review now, but here is a first round of feedback:
Some more feedback.
@OmarManzoor sorry, I had not seen that you had taken my first pass of review into account. This PR now needs a bunch of conflict resolution because of concurrent changes that were merged in.
I would not mind working on it, but I might need some time since I would have to revisit what I had done here and what further needs to be done after the conflict resolution. 😊
Some more feedback below:
@betatim I think the tests are failing because of array-api-strict. Should we handle array-api-strict too?
@ogrisel It seems that the tests are failing because of array-api-strict. It doesn't have anything called argsort, which we are using in the isin function. From your comment above it seems we don't need to include array-api-strict in this, but then how do we ensure that the tests don't fail?
I pushed a few commits to fix some of the failures (including failures related to devices that are only possible to trigger with torch and non-CPU devices). But we still need to work with libraries such as array-api-strict that do not yet implement xp.searchsorted, see below:
Yes we should. It's the easiest and fastest way to detect non-compliance problems in our code.
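For illustration, a minimal sketch of the kind of pure Array API fallback that could stand in for the missing functions; the helper name searchsorted_fallback is hypothetical and this is not the code used in scikit-learn or in this PR:

import array_api_strict as xp  # any Array API namespace works here

def searchsorted_fallback(xp, sorted_values, query):
    # Hypothetical sketch: emulate searchsorted(side="left") using only basic
    # Array API operations, by counting for each query value how many sorted
    # values are strictly smaller. Quadratic in memory, so only a stopgap for
    # namespaces that lack xp.searchsorted (and xp.argsort).
    if hasattr(xp, "searchsorted"):
        return xp.searchsorted(sorted_values, query)
    smaller = sorted_values[None, :] < query[:, None]
    return xp.sum(xp.astype(smaller, xp.int64), axis=1)

classes = xp.asarray([1, 3, 5])
y = xp.asarray([5, 1, 3, 3])
print(searchsorted_fallback(xp, classes, y))  # codes: [2, 0, 1, 1]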
There is also another problem with CuPy that I cannot debug because I temporarily lost access to my CUDA host, but the short error message was:
FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y0-cupy-None-None] - TypeError: unhashable type: 'ndarray'
Unfortunately, I no longer have access to the full traceback. I will need to wait for the machine to be free again to re-run that test.
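For context, that error is what Python raises whenever an array ends up somewhere that requires hashing (a set member, a dict key, ...). A minimal reproduction of the message, unrelated to the actual CuPy code path, which is unknown here:

import numpy as np

lookup = {}
key = np.asarray([0, 1])
# Arrays define no __hash__, so using one as a dict key raises
# "TypeError: unhashable type" (the exact type name in the message depends
# on the array library).
lookup[key] = "value"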
In the meantime:
I tested this PR on all supported namespaces and device combinations and all tests pass.
@betatim I think this is ready for another round of review on your end.
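As a rough sketch of what such a round-trip check can look like with array-api-strict (assuming array-api-compat is installed and array API dispatch is enabled via config_context; this is not the actual test added by the PR):

import numpy as np
import array_api_strict

from sklearn import config_context
from sklearn.preprocessing import LabelEncoder

y_np = np.asarray([1, 5, 3, 5, 1])
y_xp = array_api_strict.asarray(y_np)

with config_context(array_api_dispatch=True):
    encoder = LabelEncoder().fit(y_xp)
    codes = encoder.transform(y_xp)
    round_trip = encoder.inverse_transform(codes)

# Encoding then decoding should give back the original labels, in the same
# array namespace as the input.
assert bool(array_api_strict.all(round_trip == y_xp))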
Hi @betatim, does this look okay now?
Looking good!
Another round of comments. Sorry for not including them last time already.
xp_label_fit = xp_label.fit(xp_y)
xp_transformed = xp_label_fit.transform(xp_y)
xp_inv_transformed = xp_label_fit.inverse_transform(xp_transformed)
I couldn't work out why the xp_label_fit is needed, so suggesting to remove it (same for np_label_fit later on):
Suggested change:
- xp_label_fit = xp_label.fit(xp_y)
- xp_transformed = xp_label_fit.transform(xp_y)
- xp_inv_transformed = xp_label_fit.inverse_transform(xp_transformed)
+ xp_label = xp_label.fit(xp_y)
+ xp_transformed = xp_label.transform(xp_y)
+ xp_inv_transformed = xp_label.inverse_transform(xp_transformed)
xp_label is defined above as xp_label = LabelEncoder(), and we are using it below to test fit_transform: xp_label.fit_transform(xp_y). If we assigned the fit result to xp_label again, it would override the original estimator and replace it with the fitted one.
Okay, I think the estimator is already modified when we run fit.
@betatim Could you kindly check if this looks okay now?
Indeed it is used again, but as you found out, est.fit(X, y) modifies (and returns) est. The reason it works to reuse it later is that est.fit() and est.fit_transform() reset the state of the estimator before fitting: after calling fit() again, the state is as if the first fit() call never happened (there are some exceptions with warm starting, but not here).
Thanks a lot!
Reference Issues/PRs
Towards #26024
Related to #27369
What does this implement/fix? Explain your changes.
Any other comments?
CC: @betatim @ogrisel