-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
CI Fix pandas copy-on-write issues in scipy-dev #28348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This has been added in 3.9.17
1c0a47c
to
c2c22f5
Compare
c564835
to
22f0d69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would really rather have at least the ignore
ones in setup.cfg
since I rung tests with -Werror::DeprecationWarning -Werror:FutureWarning
. But I guess I'm losing this one.
I need to test a few things, but there is probably a way to put all the warnings definition in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should allow inplace modification of the input dataframe in either case, see the motivation below:
sklearn/utils/validation.py
Outdated
# pandas dataframe or series, see | ||
# https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays | ||
# for more details about pandas copy-on-write | ||
if _is_pandas_df_or_series(array_orig) and copy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _is_pandas_df_or_series(array_orig) and copy: | |
if _is_pandas_df_or_series(array_orig): |
If the user has set copy=False
explicitly in the estimator, we also have his/her permission to write to the original dataframe as well via the numpy view.
…ray is dense array
CI is green, but Also for some weird unknown reasons the scipy-dev CI in this PR takes a lot longer than usual (~50 minutes instead of ~25 minutes). This seems to be mostly the datasets download before tests actually run but I am not 100% sure ... and I don't know how the changes in this PR could cause this. |
Maybe this is related to the failing codecov uploads (e.g. network issue maybe?). Can you try to trigger the scipy-dev CI on more time on this PR to check if it's resolved? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming CI will become green again, LGTM.
I don't think so, looking at the raw log for example of this build log you have timestamps for each line of output and somehow there is 15 minutes between the time pytest is executed until the "Downloading the file 'face.dat'". Since
Compare this to a scipy-dev in main build log. There is about 45s between the pytest launch and the Downloading face.dat line. The test run itself from the "test session starts" to the end takes about ~12 minutes.
|
The EDIT: |
By looking at the diff of the PR one could have expected the change to use So maybe this is related to the use of the change from Python 3.11 to Python 3.12 from the defaults channel? Let me push a commit to check if we can still observe this problem by using conda-forge to install Python 3.12. |
Oops, I forgot to update the lock file, let me do that now. |
…s (update lock files)
It's still very slow to run the tests for the scipy-dev job... |
Regarding the timing, did we by hazard cached already the dataset in the past and the modification of |
I don't think there is a cache for |
Maybe we can split this PR in two: one for Python 3.12 / tarball filtering and another for pandas' CoW update? This way, merge at least one incremental fix and narrow down on the cause of the slowdown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as it is, it's already a net improvement. Let's investigate the slow tests with Python 3.12 in a dedicated PR.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
It seems the 2 remaining errors are due to a change in pandas dev that makes
np.asarray(dataframe)
read-only:This passes in pandas 2.2 and fails on pandas dev:
They are actually related to copy-on-write in pandas dev that started to be enabled by default recently in pandas-dev/pandas#56633.
The reason it was not caught in common tests seems to be #26272. It is not clear whether estimator_checks is non-test code:
check_estimator
By the way these two remaining errors were exactly the same ones as noted some time ago in #27879 (comment).