Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values #29314

MarcoGorelli · 2024-06-20T13:16:24Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

This simplifies things a bit, but does involve allocating

arr = np.array(params_list)

Given that lists people are doing grid-search over are presumably small-ish (not in the millions / billions?), and that this logic is getting complicated enough, it's probably worth it?

github-actions · 2024-06-20T13:17:44Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4b89d5d. Link to the linter CI: here

MarcoGorelli · 2024-06-21T11:02:29Z

@lesteve @thomasjpfan @GridSearchCVFriends fancy taking a look?

The logic is now simpler so my hope is that this'll fix it for good

EDIT: will look at increasing test coverage

lesteve · 2024-06-21T13:07:39Z

> Given that lists people are doing grid-search over are presumably small-ish (not in the millions / billions?), and that this logic is getting complicated enough, it's probably worth it?

About performance, I think the array was created with np.result_type anyway, at least that is what I understand from @seberg's comment in numpy/numpy#26612:

> result_type is badly overloaded, note that it converts to arrays anyway internally if you pass a non dtype.

adrinjalali · 2024-06-21T19:15:58Z

This is what happens when you poke at VERY old code lol

thomasjpfan

Thank you for the PR!

thomasjpfan · 2024-06-22T20:25:42Z

sklearn/model_selection/_search.py

+                except ValueError:
+                    # Fall back to iterating over `param_result.items()` below
+                    pass
+                else:


Can this whole block be written as:

    if len(param_list) == n_candidates:
        with suppress(ValueError):
            ma = MaskedArray(param_list, mask=False, dtype=arr_dtype)
            if ma.ndim <= 1:
                results[key] = ma
                continue
        arr_dtype = object

(I'm trying to have less indenting)
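The control flow behind this suggestion can be illustrated without NumPy. The sketch below is only an analogy: `parse_all` is a hypothetical function, and `int()` stands in for the array construction that may raise `ValueError`. It shows how `contextlib.suppress` combined with `continue` replaces a `try` / `except` / `else` ladder with a single level of nesting.

```python
from contextlib import suppress


def parse_all(raw_values):
    """Parse each string as an int when possible, else keep it unchanged."""
    results = []
    for raw in raw_values:
        with suppress(ValueError):
            # Fast path: if int() raises ValueError, the rest of this block
            # is skipped and execution resumes after the `with` statement.
            results.append(int(raw))
            continue  # success, so skip the fallback below
        # Fallback path, reached only when the fast path raised ValueError.
        results.append(raw)
    return results


parsed = parse_all(["1", "two", "3"])
print(parsed)  # [1, 'two', 3]
```

The trade-off is that the fallback is no longer visually tied to an `except` clause, so a short comment on the fallback path helps readers follow it.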

sklearn/model_selection/_search.py

lesteve · 2024-06-24T06:41:39Z

As I mentioned in #29179 (comment), I am wondering whether it is time to move this tricky code into its own function so that it can be tested more easily?
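A standalone helper of the kind suggested here might look like the sketch below. It is not the code merged in this PR: `masked_array_per_param` is a hypothetical name, the dtype rule (keep the inferred dtype only for 1-D boolean or numeric results) is an assumption, and `np.ma.masked_all` is used here for brevity. The point is that a plain function taking a list of parameter dicts can be tested directly, without running a grid search.

```python
from collections import defaultdict

import numpy as np


def masked_array_per_param(candidate_params):
    """Yield (key, 1-D masked array) pairs, one per parameter name.

    Hypothetical sketch; the name and the dtype rule are illustrative only.
    """
    n_candidates = len(candidate_params)
    # Map "param_<name>" -> {candidate index: value}.
    values_by_param = defaultdict(dict)
    for index, params in enumerate(candidate_params):
        for name, value in params.items():
            values_by_param[f"param_{name}"][index] = value

    for key, indexed_values in values_by_param.items():
        try:
            arr = np.array(list(indexed_values.values()))
        except ValueError:
            # Values of different sizes cannot form a regular array.
            dtype = object
        else:
            # Keep the inferred dtype only for a 1-D boolean/numeric result;
            # anything else (strings, nested sequences) is stored as objects.
            is_simple = arr.ndim == 1 and arr.dtype.kind in "biuf"
            dtype = arr.dtype if is_simple else object

        # Start fully masked; assigning a value at an index unmasks it, so
        # candidates that lack this parameter stay masked.
        ma = np.ma.masked_all(n_candidates, dtype=dtype)
        for index, value in indexed_values.items():
            ma[index] = value
        yield key, ma


candidates = [
    {"degree": 2, "knots": np.array([0.0, 1.0])},
    {"degree": 3, "knots": np.array([0.0, 0.5, 1.0])},
    {"degree": 4},
]
result = dict(masked_array_per_param(candidates))
print(result["param_knots"].mask.tolist())  # [False, False, True]
```

Here the `knots` values have different sizes, so they end up in a 1-D object array, and the third candidate, which has no `knots` entry, remains masked.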

lesteve · 2024-06-27T09:29:59Z

sklearn/model_selection/tests/test_get_params_masked_array.py

+
+# If we construct this directly via `MaskedArray`, the list of tuples
+# gets auto-converted to a 2D array.
+ma = np.ma.MaskedArray(np.empty(2), mask=True, dtype=object)


It looks like something like this works:

    my_iter = iter([(1, 2), (3, 4)])
    masked_array = np.fromiter(my_iter, dtype=object)

but it's definitely not that easy to tell numpy to stick to 1d array with dtype object ...

...but not in old versions of numpy 😭
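For reference, one way to get a 1-D object array of tuples that does not depend on `np.fromiter` accepting `dtype=object` is to allocate the array first and fill it item by item. This is a minimal sketch of that idea with made-up variable names, not the code used in the test file.

```python
import numpy as np

pairs = [(1, 2), (3, 4)]

# Passing the list straight to the constructor yields a 2-D array, even with
# dtype=object, because NumPy unpacks the nested tuples.
two_d = np.array(pairs, dtype=object)
print(two_d.shape)  # (2, 2)

# Allocating the 1-D object array first and filling it element by element
# keeps each tuple intact as a single item.
one_d = np.empty(len(pairs), dtype=object)
for i, pair in enumerate(pairs):
    one_d[i] = pair
print(one_d.shape)  # (2,)
```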

lesteve · 2024-06-28T12:14:38Z

Hmmm, I pushed a commit to simplify, plus a few tweaks, but it fails on Windows and Debian 32-bit, probably because of the default int dtype ... the test can likely be relaxed a bit to make it pass for this particular edge case.

I'll try to have another fresh look on Monday.

MarcoGorelli · 2024-06-28T14:13:39Z

> probably something about the default int dtype

yup, just pushed a commit to address it 👍
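As background on the default int dtype issue: an integer array created without an explicit dtype can be 32-bit or 64-bit depending on the platform and NumPy version, so a test that pins `int64` may fail on some CI targets. The sketch below shows one possible way to relax such a comparison. `same_values_and_kind` is a hypothetical helper and is not what the commit in this PR does.

```python
import numpy as np


def same_values_and_kind(actual, expected):
    """Hypothetical check: equal values and same dtype kind, ignoring bit width."""
    actual = np.asarray(actual)
    expected = np.asarray(expected)
    same_kind = actual.dtype.kind == expected.dtype.kind
    return same_kind and np.array_equal(actual, expected)


default_ints = np.array([1, 2, 3])  # int32 or int64, depending on platform
pinned = np.array([1, 2, 3], dtype=np.int64)

print(same_values_and_kind(default_ints, pinned))  # True
```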

lesteve

LGTM, thanks a lot!

I agree with Adrin that this is what happens when you touch very old code, but at the same time we are in a better situation now: the code has better tests and is arguably slightly more readable.

Oh well, the future will tell if this was too optimistic or not 😉

jeremiedbb · 2024-07-01T12:12:22Z

sklearn/model_selection/_search.py

+def _yield_masked_array_for_each_param(
+    candidate_params: Sequence[dict[str, Any]],
+) -> Iterator[tuple[str, MaskedArray]]:
+    """


We've decided to not add type hints so far. Since the input and output types of this function are quite simple I'd not add them here.

jeremiedbb

LGTM. Thanks @MarcoGorelli

… of different sizes as parameter values (scikit-learn#29314) Co-authored-by: Loïc Estève <loic.esteve@ymail.com> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>

github-actions bot added the module:model_selection label Jun 20, 2024

MarcoGorelli force-pushed the oh-no-not-another-one branch from 08b6b27 to ce49cb0 Compare June 20, 2024 13:39

MarcoGorelli changed the title ~~fix cv_results_ in GridSearch when params are arrays of varying sizes~~ fix: fix cv_results_ in GridSearch when params are arrays of varying sizes Jun 20, 2024

MarcoGorelli changed the title ~~fix: fix cv_results_ in GridSearch when params are arrays of varying sizes~~ FIX cv_results_ in GridSearch when params are arrays of varying sizes Jun 20, 2024

MarcoGorelli changed the title ~~FIX cv_results_ in GridSearch when params are arrays of varying sizes~~ Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values Jun 21, 2024

MarcoGorelli force-pushed the oh-no-not-another-one branch 2 times, most recently from 4d97ec6 to f2f4eeb Compare June 21, 2024 10:07

Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values

e7357e2

MarcoGorelli force-pushed the oh-no-not-another-one branch from f2f4eeb to e7357e2 Compare June 21, 2024 10:08

MarcoGorelli marked this pull request as ready for review June 21, 2024 10:20

jeremiedbb added this to the 1.5.1 milestone Jun 21, 2024

thomasjpfan reviewed Jun 22, 2024

lesteve reviewed Jun 23, 2024

it gets simpler

36e5ba2

MarcoGorelli added 4 commits June 25, 2024 20:33

refactor, add unit tests

3d4cc14

Merge remote-tracking branch 'upstream/main' into oh-no-not-another-one

ae3ca59

actually add the test file

1048434

🎨

087b355

lesteve reviewed Jun 27, 2024

MarcoGorelli added 6 commits June 27, 2024 11:22

Merge remote-tracking branch 'upstream/main' into oh-no-not-another-one

f264528

simplify

3c455d0

rename, add docstring

1af52de

rename test too

8acde65

rename

1e21fd0

write it so that old versions of numpy can still run the test

44d996e

lesteve reviewed Jun 27, 2024

lesteve added 3 commits June 28, 2024 11:21

Move test to test_search.py

4215640

Simplify

1a2849b

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn into oh-no-not-another-one

af0e584

remove some unnecessary dtype pinning

4555635

Simplify test

9069c0c

lesteve approved these changes Jul 1, 2024

fix what's new + add test docstring

d61a963

jeremiedbb reviewed Jul 1, 2024

remove type hints

4b89d5d

jeremiedbb approved these changes Jul 1, 2024

jeremiedbb enabled auto-merge (squash) July 1, 2024 13:17

jeremiedbb merged commit bf08cb3 into scikit-learn:main Jul 1, 2024
28 checks passed

jeremiedbb mentioned this pull request Jul 2, 2024

Release 1.5.1 #29382

Merged

11 tasks
