CI Fix pandas copy-on-write issues in scipy-dev #28348

lesteve · 2024-02-02T05:31:20Z

It seems the 2 remaining errors are due to a change in pandas dev that makes np.asarray(dataframe) read-only:

This passes in pandas 2.2 and fails on pandas dev:

import pandas as pd
df = pd.DataFrame({'col': [1, 2, 3]})

from sklearn.utils import check_array
assert check_array(df).flags.writeable

They are actually related to copy-on-write in pandas dev that started to be enabled by default recently in pandas-dev/pandas#56633.

The reason it was not caught in common tests seems to be #26272. It is not clear whether estimator_checks is non-test code:

it is non-test code if you are a 3rd party library that uses check_estimator
it is test code if you are a scikit-learn developer running the common tests

By the way these two remaining errors were exactly the same ones as noted some time ago in #27879 (comment).

github-actions · 2024-02-02T05:32:59Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 4ab3f24. Link to the linter CI: here}

This has been added in 3.9.17

adrinjalali

Would really rather have at least the ignore ones in setup.cfg since I rung tests with -Werror::DeprecationWarning -Werror:FutureWarning. But I guess I'm losing this one.

lesteve · 2024-02-02T19:06:28Z

But I guess I'm losing this one.

I need to test a few things, but there is probably a way to put all the warnings definition in sklearn/conftest.py and either do something like SKLEARN_CHECK_WARNINGS=true pytest or maybe something like pytest --check-warnings. Not sure that would make you perfectly happy, but that would at least make it easier to run locally with the same "warnings as errors" settings as in the CI.

ogrisel

I think we should allow inplace modification of the input dataframe in either case, see the motivation below:

ogrisel · 2024-02-02T19:50:58Z

sklearn/utils/validation.py

+    # pandas dataframe or series, see
+    # https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays
+    # for more details about pandas copy-on-write
+    if _is_pandas_df_or_series(array_orig) and copy:


Suggested change

if _is_pandas_df_or_series(array_orig) and copy:

if _is_pandas_df_or_series(array_orig):

If the user has set copy=False explicitly in the estimator, we also have his/her permission to write to the original dataframe as well via the numpy view.

…ray is dense array

lesteve · 2024-02-05T07:23:16Z

CI is green, but filter argument to tarfile.TarFile.extractall has been added in 3.9.17 see https://docs.python.org/3.9/library/tarfile.html#tarfile.TarFile.extractall. I guess we need a fixes for this since we support Python 3.9 so we can not make assumptions about x in 3.9.x.

Also for some weird unknown reasons the scipy-dev CI in this PR takes a lot longer than usual (~50 minutes instead of ~25 minutes). This seems to be mostly the datasets download before tests actually run but I am not 100% sure ... and I don't know how the changes in this PR could cause this.

ogrisel · 2024-02-05T08:45:07Z

Also for some weird unknown reasons the scipy-dev CI in this PR takes a lot longer than usual (~50 minutes instead of ~25 minutes). This seems to be mostly the datasets download before tests actually run but I am not 100% sure ... and I don't know how the changes in this PR could cause this.

Maybe this is related to the failing codecov uploads (e.g. network issue maybe?). Can you try to trigger the scipy-dev CI on more time on this PR to check if it's resolved?

ogrisel

Assuming CI will become green again, LGTM.

sklearn/utils/validation.py

lesteve · 2024-02-05T09:18:19Z

Maybe this is related to the failing codecov uploads (e.g. network issue maybe?). Can you try to trigger the scipy-dev CI on more time on this PR to check if it's resolved?

I don't think so, looking at the raw log for example of this build log you have timestamps for each line of output and somehow there is 15 minutes between the time pytest is executed until the "Downloading the file 'face.dat'". Since SKLEARN_SKIP_NETWORK_TESTS=0 in this build, datasets are downloaded in sklearn/conftest.py. Locally this takes ~10 minutes so maybe part of the reason? The test run itself from the "test session starts" takes about ~20 minutes.

2024-02-03T10:28:59.3787629Z + eval 'python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config='\''/home/vsts/work/1/s/.coveragerc'\'' --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W '\''ignore:pkg_resources is deprecated as an API:DeprecationWarning'\'' -W '\''ignore:Deprecated call to `pkg_resources:DeprecationWarning'\'' -W '\''ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning'\'' -Werror::pytest.PytestUnraisableExceptionWarning -Wignore:datetime.datetime.utcfromtimestamp:DeprecationWarning -W '\''ignore:ast.Num is deprecated:DeprecationWarning'\'' -W '\''ignore:Attribute n is deprecated:DeprecationWarning'\'' -n2 --maxfail=10 --pyargs sklearn'
2024-02-03T10:28:59.3791897Z ++ python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config=/home/vsts/work/1/s/.coveragerc --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W 'ignore:pkg_resources is deprecated as an API:DeprecationWarning' -W 'ignore:Deprecated call to `pkg_resources:DeprecationWarning' -W 'ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning' -Werror::pytest.PytestUnraisableExceptionWarning -Wignore:datetime.datetime.utcfromtimestamp:DeprecationWarning -W 'ignore:ast.Num is deprecated:DeprecationWarning' -W 'ignore:Attribute n is deprecated:DeprecationWarning' -n2 --maxfail=10 --pyargs sklearn
2024-02-03T10:43:28.6573448Z Downloading file 'face.dat' from 'https://raw.githubusercontent.com/scipy/dataset-face/main/face.dat' to '/home/vsts/.cache/scipy-data'.
2024-02-03T10:48:19.4658006Z �[1m============================= test session starts ==============================�[0m
2024-02-03T10:48:19.4712683Z platform linux -- Python 3.12.1, pytest-8.0.0, pluggy-1.4.0
2024-02-03T10:48:19.4713193Z rootdir: /home/vsts/work/tmp_folder
2024-02-03T10:48:19.4713491Z configfile: setup.cfg
2024-02-03T10:48:19.4713806Z plugins: xdist-3.5.0, cov-4.1.0
2024-02-03T10:48:19.4714061Z created: 2/2 workers
2024-02-03T10:48:19.4714339Z 2 workers [35953 items]
2024-02-03T10:48:19.4714598Z 
2024-02-03T10:48:19.7751336Z �[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[32m.�[0m�[33m [  0%]
.
.
.
2024-02-03T11:09:53.2867063Z �[31m= �[31m�[1m21 failed�[0m, �[32m33072 passed�[0m, �[33m2600 skipped�[0m, �[33m92 xfailed�[0m, �[33m47 xpassed�[0m, �[33m3666 warnings�[0m�[31m in 2452.71s (0:40:52)�[0m�[31m =�[0m

Compare this to a scipy-dev in main build log. There is about 45s between the pytest launch and the Downloading face.dat line. The test run itself from the "test session starts" to the end takes about ~12 minutes.

2024-02-04T02:39:03.9678190Z + eval 'python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config='\''/home/vsts/work/1/s/.coveragerc'\'' --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W '\''ignore:pkg_resources is deprecated as an API:DeprecationWarning'\'' -W '\''ignore:Deprecated call to `pkg_resources:DeprecationWarning'\'' -W '\''ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning'\'' -Werror::pytest.PytestUnraisableExceptionWarning -n2 --maxfail=10 --pyargs sklearn'
2024-02-04T02:39:03.9681406Z ++ python -m pytest --showlocals --durations=20 --junitxml=test-data.xml --cov-config=/home/vsts/work/1/s/.coveragerc --cov sklearn --cov-report= -Werror::DeprecationWarning -Werror::FutureWarning -Werror::sklearn.utils.fixes.VisibleDeprecationWarning -W 'ignore:pkg_resources is deprecated as an API:DeprecationWarning' -W 'ignore:Deprecated call to `pkg_resources:DeprecationWarning' -W 'ignore:The --rsyncdir command line argument and rsyncdirs config variable are deprecated.:DeprecationWarning' -Werror::pytest.PytestUnraisableExceptionWarning -n2 --maxfail=10 --pyargs sklearn
2024-02-04T02:39:48.0536940Z Downloading file 'face.dat' from 'https://raw.githubusercontent.com/scipy/dataset-face/main/face.dat' to '/home/vsts/.cache/scipy-data'.
2024-02-04T02:44:29.0459575Z �[1m============================= test session starts ==============================�[0m
2024-02-04T02:44:29.0577313Z platform linux -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0
2024-02-04T02:44:29.0578169Z To reproduce this test run, set the following environment variable:
2024-02-04T02:44:29.0578506Z     SKLEARN_TESTS_GLOBAL_RANDOM_SEED="70"
2024-02-04T02:44:29.0578988Z See: https://scikit-learn.org/dev/computing/parallelism.html#sklearn-tests-global-random-seed
2024-02-04T02:44:29.0579339Z rootdir: /home/vsts/work/tmp_folder
2024-02-04T02:44:29.0579623Z configfile: setup.cfg
2024-02-04T02:44:29.0579969Z plugins: xdist-3.5.0, cov-4.1.0
2024-02-04T02:44:29.0580229Z created: 2/2 workers
2024-02-04T02:44:29.0580484Z 2 workers [35954 items]
2024-02-04T02:44:29.0580647Z 
2024-02-04T02:44:29.2534268Z �[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[33ms�[0m�[32m.�
.
.
.
2024-02-04T02:56:24.0518797Z �[31m= �[31m�[1m2 failed�[0m, �[32m33910 passed�[0m, �[33m1903 skipped�[0m, �[33m92 xfailed�[0m, �[33m47 xpassed�[0m, �[33m4011 warnings�[0m�[31m in 1039.15s (0:17:19)�[0m�[31m =�[0m

ogrisel · 2024-02-05T13:30:39Z

The [scipy-dev] tests are still very slow on the last commit. I opened #28363 to check if this is really related to a change in this PR or not.

EDIT: [scipy-dev] tests are still much faster to run on main so there is definitely something wrong in this PR.

ogrisel · 2024-02-05T14:04:58Z

By looking at the diff of the PR one could have expected the change to use filter="data" to be the culprit but it's not consistent with the fact that the non scipy-dev jobs do not seem to have slowed down (even those running Python > 3.9).

So maybe this is related to the use of the change from Python 3.11 to Python 3.12 from the defaults channel?

Let me push a commit to check if we can still observe this problem by using conda-forge to install Python 3.12.

ogrisel · 2024-02-05T14:08:26Z

Oops, I forgot to update the lock file, let me do that now.

…s (update lock files)

ogrisel · 2024-02-05T15:06:08Z

It's still very slow to run the tests for the scipy-dev job...

glemaitre · 2024-02-05T15:42:50Z

Regarding the timing, did we by hazard cached already the dataset in the past and the modification of extract_all invalidate the cache somehow?

lesteve · 2024-02-05T15:48:56Z

Regarding the timing, did we by hazard cached already the dataset in the past and the modification of extract_all invalidate the cache somehow?

I don't think there is a cache for ~/scikit_learn_data in the tests (at least I looked and I have not found it in configuration files). There is a cache for ~/scikit_learn_data for doc builds.

ogrisel · 2024-02-05T17:06:34Z

Maybe we can split this PR in two: one for Python 3.12 / tarball filtering and another for pandas' CoW update?

This way, merge at least one incremental fix and narrow down on the cause of the slowdown.

lesteve · 2024-02-06T06:06:58Z

I reverted the Python 3.12 changes, scipy-dev tests take ~24 minutes which seems in the same ball-park as some builds in main for example this build and this build.

ogrisel

LGTM as it is, it's already a net improvement. Let's investigate the slow tests with Python 3.12 in a dedicated PR.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

lesteve added 2 commits February 2, 2024 06:30

CI Unpin Python in scipy-dev

8aa6096

[scipy-dev]

1843188

github-actions bot added the Build / CI label Feb 2, 2024

lesteve added 2 commits February 2, 2024 07:20

[scipy-dev] Use filter="data" for tarfile extraction.

bb0051d

This has been added in 3.9.17

[scipy-dev] Ignore dateutil warning

c2c22f5

lesteve force-pushed the scipy-dev-python-3.12 branch from 1c0a47c to c2c22f5 Compare February 2, 2024 08:48

lesteve added 3 commits February 2, 2024 17:11

[scipy-dev] fix

e4ece94

[scipy-dev] fix

c606a2d

[scipy-dev] tweak comment

874e35a

lesteve marked this pull request as ready for review February 2, 2024 16:52

[scipy-dev] Handle Python 3.12 DeprecationWarning in joblib release

22f0d69

lesteve force-pushed the scipy-dev-python-3.12 branch from c564835 to 22f0d69 Compare February 2, 2024 17:21

adrinjalali approved these changes Feb 2, 2024

View reviewed changes

ogrisel reviewed Feb 2, 2024

View reviewed changes

[scipy-dev] make resulting array writeable even when copy=False

a2de6c1

lesteve changed the title ~~CI Unpin Python in scipy-dev~~ CI Fix pandas copy-on-write issues and use Python 3.12 in scipy-dev Feb 3, 2024

[scipy-dev] only set array flags when they exist, typically output ar…

810a80c

…ray is dense array

[scipy-dev] [doc build] add filter="data" to sklearn.utils.fixes

92f7a58

ogrisel approved these changes Feb 5, 2024

View reviewed changes

sklearn/utils/validation.py Outdated Show resolved Hide resolved

Typo [scipy-dev]

326d52d

lesteve mentioned this pull request Feb 5, 2024

CI Update codecov uploader to 0.7.1 #28361

Merged

[scipy-dev] SettingWithCopyWarning has been removed in pandas 3.0.0.dev

3010408

lesteve added the No Changelog Needed label Feb 5, 2024

ogrisel mentioned this pull request Feb 5, 2024

DEBUG [scipy-dev] #28363

Closed

Install [scipy-dev]'s Python 3.12 from conda-forge instead of defaults

9c31187

Install [scipy-dev]'s Python 3.12 from conda-forge instead of default…

bf8d86f

…s (update lock files)

[scipy-dev] try without pytest-xdist

02b5dd0

Revert Python 3.12 changes

8acd832

lesteve changed the title ~~CI Fix pandas copy-on-write issues and use Python 3.12 in scipy-dev~~ CI Fix pandas copy-on-write issues in scipy-dev Feb 5, 2024

[scipy-dev]

4ab3f24

ogrisel approved these changes Feb 6, 2024

View reviewed changes

ogrisel merged commit b0edfdf into scikit-learn:main Feb 6, 2024

lesteve deleted the scipy-dev-python-3.12 branch February 6, 2024 09:23

lesteve mentioned this pull request Feb 6, 2024

CI Use environment variable to turn warnings into errors in tests and doc build #28372

Merged

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024

CI Fix pandas copy-on-write issues in scipy-dev (scikit-learn#28348)

f69fff3

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 13, 2024

CI Fix pandas copy-on-write issues in scipy-dev (scikit-learn#28348)

589fc41

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre pushed a commit that referenced this pull request Feb 13, 2024

CI Fix pandas copy-on-write issues in scipy-dev (#28348)

cd2e47f

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ankatiyar mentioned this pull request Mar 1, 2024

Investigate why Spaceflights project failing with ParallelRunner kedro-org/kedro#3674

Closed

glemaitre mentioned this pull request Apr 9, 2024

ColumnTransformer throws error with n_jobs > 1 input dataframes and joblib auto-memmapping (regression in 1.4.1.post1) #28781

Closed

thomasjpfan mentioned this pull request May 7, 2024

RFC Trigger a copy when copy=False and X is read-only #28824

Closed

jeremiedbb mentioned this pull request May 14, 2024

FEA Add writeable parameter to check_array #29018

Merged

	if _is_pandas_df_or_series(array_orig) and copy:
	if _is_pandas_df_or_series(array_orig):

Uh oh!

CI Fix pandas copy-on-write issues in scipy-dev #28348

CI Fix pandas copy-on-write issues in scipy-dev #28348

Uh oh!

Conversation

lesteve commented Feb 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Feb 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

adrinjalali left a comment

Choose a reason for hiding this comment

Uh oh!

lesteve commented Feb 2, 2024

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

ogrisel Feb 2, 2024

Choose a reason for hiding this comment

Uh oh!

lesteve commented Feb 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Feb 5, 2024

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lesteve commented Feb 5, 2024

Uh oh!

ogrisel commented Feb 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Feb 5, 2024

Uh oh!

ogrisel commented Feb 5, 2024

Uh oh!

ogrisel commented Feb 5, 2024

Uh oh!

glemaitre commented Feb 5, 2024

Uh oh!

lesteve commented Feb 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Feb 5, 2024

Uh oh!

lesteve commented Feb 6, 2024

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lesteve commented Feb 2, 2024 •

edited

Loading

github-actions bot commented Feb 2, 2024 •

edited

Loading

lesteve commented Feb 5, 2024 •

edited

Loading

ogrisel commented Feb 5, 2024 •

edited

Loading

lesteve commented Feb 5, 2024 •

edited

Loading