Skip to content

CI Fix pandas copy-on-write issues in scipy-dev #28348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Feb 6, 2024

Conversation

lesteve
Copy link
Member

@lesteve lesteve commented Feb 2, 2024

It seems the 2 remaining errors are due to a change in pandas dev that makes np.asarray(dataframe) read-only:

This passes in pandas 2.2 and fails on pandas dev:

import pandas as pd
df = pd.DataFrame({'col': [1, 2, 3]})

from sklearn.utils import check_array
assert check_array(df).flags.writeable

They are actually related to copy-on-write in pandas dev that started to be enabled by default recently in pandas-dev/pandas#56633.

The reason it was not caught in common tests seems to be #26272. It is not clear whether estimator_checks is non-test code:

  • it is non-test code if you are a 3rd party library that uses check_estimator
  • it is test code if you are a scikit-learn developer running the common tests

By the way these two remaining errors were exactly the same ones as noted some time ago in #27879 (comment).

Copy link

github-actions bot commented Feb 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4ab3f24. Link to the linter CI: here

@lesteve lesteve force-pushed the scipy-dev-python-3.12 branch from 1c0a47c to c2c22f5 Compare February 2, 2024 08:48
@lesteve lesteve marked this pull request as ready for review February 2, 2024 16:52
@lesteve lesteve force-pushed the scipy-dev-python-3.12 branch from c564835 to 22f0d69 Compare February 2, 2024 17:21
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would really rather have at least the ignore ones in setup.cfg since I rung tests with -Werror::DeprecationWarning -Werror:FutureWarning. But I guess I'm losing this one.

@lesteve
Copy link
Member Author

lesteve commented Feb 2, 2024

But I guess I'm losing this one.

I need to test a few things, but there is probably a way to put all the warnings definition in sklearn/conftest.py and either do something like SKLEARN_CHECK_WARNINGS=true pytest or maybe something like pytest --check-warnings. Not sure that would make you perfectly happy, but that would at least make it easier to run locally with the same "warnings as errors" settings as in the CI.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should allow inplace modification of the input dataframe in either case, see the motivation below:

# pandas dataframe or series, see
# https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays
# for more details about pandas copy-on-write
if _is_pandas_df_or_series(array_orig) and copy:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if _is_pandas_df_or_series(array_orig) and copy:
if _is_pandas_df_or_series(array_orig):

If the user has set copy=False explicitly in the estimator, we also have his/her permission to write to the original dataframe as well via the numpy view.

@lesteve lesteve changed the title CI Unpin Python in scipy-dev CI Fix pandas copy-on-write issues and use Python 3.12 in scipy-dev Feb 3, 2024
@lesteve
Copy link
Member Author

lesteve commented Feb 5, 2024

CI is green, but filter argument to tarfile.TarFile.extractall has been added in 3.9.17 see https://docs.python.org/3.9/library/tarfile.html#tarfile.TarFile.extractall. I guess we need a fixes for this since we support Python 3.9 so we can not make assumptions about x in 3.9.x.

Also for some weird unknown reasons the scipy-dev CI in this PR takes a lot longer than usual (~50 minutes instead of ~25 minutes). This seems to be mostly the datasets download before tests actually run but I am not 100% sure ... and I don't know how the changes in this PR could cause this.

@ogrisel
Copy link
Member

ogrisel commented Feb 5, 2024

Also for some weird unknown reasons the scipy-dev CI in this PR takes a lot longer than usual (~50 minutes instead of ~25 minutes). This seems to be mostly the datasets download before tests actually run but I am not 100% sure ... and I don't know how the changes in this PR could cause this.

Maybe this is related to the failing codecov uploads (e.g. network issue maybe?). Can you try to trigger the scipy-dev CI on more time on this PR to check if it's resolved?

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming CI will become green again, LGTM.

@lesteve
Copy link
Member Author

lesteve commented Feb 5, 2024

Maybe this is related to the failing codecov uploads (e.g. network issue maybe?). Can you try to trigger the scipy-dev CI on more time on this PR to check if it's resolved?

I don't think so, looking at the raw log for example of this build log you have timestamps for each line of output and somehow there is 15 minutes between the time pytest is executed until the "Downloading the file 'face.dat'". Since SKLEARN_SKIP_NETWORK_TESTS=0 in this build, datasets are downloaded in sklearn/conftest.py. Locally this takes ~10 minutes so maybe part of the reason? The test run itself from the "test session starts" takes about ~20 minutes.

2024-02-03T10:28:59.3787629Z + eval 'python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config='\''/home/vsts/work/1/s/.coveragerc'\'' --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W '\''ignore:pkg_resources is deprecated as an API:DeprecationWarning'\'' -W '\''ignore:Deprecated call to `pkg_resources:DeprecationWarning'\'' -W '\''ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning'\'' -Werror::pytest.PytestUnraisableExceptionWarning -Wignore:datetime.datetime.utcfromtimestamp:DeprecationWarning -W '\''ignore:ast.Num is deprecated:DeprecationWarning'\'' -W '\''ignore:Attribute n is deprecated:DeprecationWarning'\'' -n2 --maxfail=10 --pyargs sklearn'
2024-02-03T10:28:59.3791897Z ++ python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config=/home/vsts/work/1/s/.coveragerc --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W 'ignore:pkg_resources is deprecated as an API:DeprecationWarning' -W 'ignore:Deprecated call to `pkg_resources:DeprecationWarning' -W 'ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning' -Werror::pytest.PytestUnraisableExceptionWarning -Wignore:datetime.datetime.utcfromtimestamp:DeprecationWarning -W 'ignore:ast.Num is deprecated:DeprecationWarning' -W 'ignore:Attribute n is deprecated:DeprecationWarning' -n2 --maxfail=10 --pyargs sklearn
2024-02-03T10:43:28.6573448Z Downloading file 'face.dat' from 'https://raw.githubusercontent.com/scipy/dataset-face/main/face.dat' to '/home/vsts/.cache/scipy-data'.
2024-02-03T10:48:19.4658006Z �[1m============================= test session starts ==============================�[0m
2024-02-03T10:48:19.4712683Z platform linux -- Python 3.12.1, pytest-8.0.0, pluggy-1.4.0
2024-02-03T10:48:19.4713193Z rootdir: /home/vsts/work/tmp_folder
2024-02-03T10:48:19.4713491Z configfile: setup.cfg
2024-02-03T10:48:19.4713806Z plugins: xdist-3.5.0, cov-4.1.0
2024-02-03T10:48:19.4714061Z created: 2/2 workers
2024-02-03T10:48:19.4714339Z 2 workers [35953 items]
2024-02-03T10:48:19.4714598Z 
2024-02-03T10:48:19.7751336Z �[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[33m [  0%]
.
.
.
2024-02-03T11:09:53.2867063Z �[31m= �[31m�[1m21 failed�[0m, �[32m33072 passed�[0m, �[33m2600 skipped�[0m, �[33m92 xfailed�[0m, �[33m47 xpassed�[0m, �[33m3666 warnings�[0m�[31m in 2452.71s (0:40:52)�[0m�[31m =�[0m

Compare this to a scipy-dev in main build log. There is about 45s between the pytest launch and the Downloading face.dat line. The test run itself from the "test session starts" to the end takes about ~12 minutes.

2024-02-04T02:39:03.9678190Z + eval 'python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config='\''/home/vsts/work/1/s/.coveragerc'\'' --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W '\''ignore:pkg_resources is deprecated as an API:DeprecationWarning'\'' -W '\''ignore:Deprecated call to `pkg_resources:DeprecationWarning'\'' -W '\''ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning'\'' -Werror::pytest.PytestUnraisableExceptionWarning -n2 --maxfail=10 --pyargs sklearn'
2024-02-04T02:39:03.9681406Z ++ python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config=/home/vsts/work/1/s/.coveragerc --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W 'ignore:pkg_resources is deprecated as an API:DeprecationWarning' -W 'ignore:Deprecated call to `pkg_resources:DeprecationWarning' -W 'ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning' -Werror::pytest.PytestUnraisableExceptionWarning -n2 --maxfail=10 --pyargs sklearn
2024-02-04T02:39:48.0536940Z Downloading file 'face.dat' from 'https://raw.githubusercontent.com/scipy/dataset-face/main/face.dat' to '/home/vsts/.cache/scipy-data'.
2024-02-04T02:44:29.0459575Z �[1m============================= test session starts ==============================�[0m
2024-02-04T02:44:29.0577313Z platform linux -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0
2024-02-04T02:44:29.0578169Z To reproduce this test run, set the following environment variable:
2024-02-04T02:44:29.0578506Z     SKLEARN_TESTS_GLOBAL_RANDOM_SEED="70"
2024-02-04T02:44:29.0578988Z See: https://scikit-learn.org/dev/computing/parallelism.html#sklearn-tests-global-random-seed
2024-02-04T02:44:29.0579339Z rootdir: /home/vsts/work/tmp_folder
2024-02-04T02:44:29.0579623Z configfile: setup.cfg
2024-02-04T02:44:29.0579969Z plugins: xdist-3.5.0, cov-4.1.0
2024-02-04T02:44:29.0580229Z created: 2/2 workers
2024-02-04T02:44:29.0580484Z 2 workers [35954 items]
2024-02-04T02:44:29.0580647Z 
2024-02-04T02:44:29.2534268Z �[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�
.
.
.
2024-02-04T02:56:24.0518797Z �[31m= �[31m�[1m2 failed�[0m, �[32m33910 passed�[0m, �[33m1903 skipped�[0m, �[33m92 xfailed�[0m, �[33m47 xpassed�[0m, �[33m4011 warnings�[0m�[31m in 1039.15s (0:17:19)�[0m�[31m =�[0m

@ogrisel
Copy link
Member

ogrisel commented Feb 5, 2024

The [scipy-dev] tests are still very slow on the last commit. I opened #28363 to check if this is really related to a change in this PR or not.

EDIT: [scipy-dev] tests are still much faster to run on main so there is definitely something wrong in this PR.

@ogrisel ogrisel mentioned this pull request Feb 5, 2024
@ogrisel
Copy link
Member

ogrisel commented Feb 5, 2024

By looking at the diff of the PR one could have expected the change to use filter="data" to be the culprit but it's not consistent with the fact that the non scipy-dev jobs do not seem to have slowed down (even those running Python > 3.9).

So maybe this is related to the use of the change from Python 3.11 to Python 3.12 from the defaults channel?

Let me push a commit to check if we can still observe this problem by using conda-forge to install Python 3.12.

@ogrisel
Copy link
Member

ogrisel commented Feb 5, 2024

Oops, I forgot to update the lock file, let me do that now.

@ogrisel
Copy link
Member

ogrisel commented Feb 5, 2024

It's still very slow to run the tests for the scipy-dev job...

@glemaitre
Copy link
Member

Regarding the timing, did we by hazard cached already the dataset in the past and the modification of extract_all invalidate the cache somehow?

@lesteve
Copy link
Member Author

lesteve commented Feb 5, 2024

Regarding the timing, did we by hazard cached already the dataset in the past and the modification of extract_all invalidate the cache somehow?

I don't think there is a cache for ~/scikit_learn_data in the tests (at least I looked and I have not found it in configuration files). There is a cache for ~/scikit_learn_data for doc builds.

@ogrisel
Copy link
Member

ogrisel commented Feb 5, 2024

Maybe we can split this PR in two: one for Python 3.12 / tarball filtering and another for pandas' CoW update?

This way, merge at least one incremental fix and narrow down on the cause of the slowdown.

@lesteve lesteve changed the title CI Fix pandas copy-on-write issues and use Python 3.12 in scipy-dev CI Fix pandas copy-on-write issues in scipy-dev Feb 5, 2024
@lesteve
Copy link
Member Author

lesteve commented Feb 6, 2024

I reverted the Python 3.12 changes, scipy-dev tests take ~24 minutes which seems in the same ball-park as some builds in main for example this build and this build.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as it is, it's already a net improvement. Let's investigate the slow tests with Python 3.12 in a dedicated PR.

@ogrisel ogrisel merged commit b0edfdf into scikit-learn:main Feb 6, 2024
@lesteve lesteve deleted the scipy-dev-python-3.12 branch February 6, 2024 09:23
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 13, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit that referenced this pull request Feb 13, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants