-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Fix ColumnTransformer in parallel with joblib's auto memmapping #28822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I opened #28824 to discuss the read-only situation more globally. |
with parallel_backend("loky", max_nbytes=1): | ||
Xt = transformer.fit_transform(X) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI is failing because this is only doable in joblib>=1.13 and our min is 1.12.
I can use a bigger array for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could skip the test on that joblib though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this might be a bigger issue, but this is a minimal change that fixes a few cases. So LGTM.
with parallel_backend("loky", max_nbytes=1): | ||
Xt = transformer.fit_transform(X) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could skip the test on that joblib though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as well.
Fixes #28781
When running in parallel, ColumnTransformer will crash if joblib's auto memmap triggers and copies are not made in time.
Currently we index X when declaring the jobs. It means we have copy then read-only memmap. Then if the transformer fails to do inplace transfo, or fails earlier in case of dataframe (see #28781 (comment)).
The fix here proposes to index X within each job instead. This way we have read-only memmap then copy, and the transformer can do inplace transfo.
Disclaimer: it doesn't solve the underlying problem completely. If you select columns by slice it still fails because it creates a view and not a copy. I'm starting to think that the issue is more profound, and lies between the
copy
parameter andcheck_array
, for all estimators. I thinkcheck_array
should always make a copy if the array is read-only, even ifcopy=False
because when an estimator has acopy
parameter, it's because it wants to do inplace modifications.