Skip to content

Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values #29314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jul 1, 2024

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Jun 20, 2024

Reference Issues/PRs

fixes #29277

What does this implement/fix? Explain your changes.

Any other comments?

This simplifies things a bit, but does involve allocating

arr = np.array(params_list)

Given that lists people are doing grid-search over are presumably small-ish (not in the millions / billions?), and that this logic is getting complicated enough, it's probably worth it?

Copy link

github-actions bot commented Jun 20, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4b89d5d. Link to the linter CI: here

@MarcoGorelli MarcoGorelli force-pushed the oh-no-not-another-one branch from 08b6b27 to ce49cb0 Compare June 20, 2024 13:39
@MarcoGorelli MarcoGorelli changed the title fix cv_results_ in GridSearch when params are arrays of varying sizes fix: fix cv_results_ in GridSearch when params are arrays of varying sizes Jun 20, 2024
@MarcoGorelli MarcoGorelli changed the title fix: fix cv_results_ in GridSearch when params are arrays of varying sizes FIX cv_results_ in GridSearch when params are arrays of varying sizes Jun 20, 2024
@MarcoGorelli MarcoGorelli changed the title FIX cv_results_ in GridSearch when params are arrays of varying sizes Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values Jun 21, 2024
@MarcoGorelli MarcoGorelli force-pushed the oh-no-not-another-one branch 2 times, most recently from 4d97ec6 to f2f4eeb Compare June 21, 2024 10:07
@MarcoGorelli MarcoGorelli force-pushed the oh-no-not-another-one branch from f2f4eeb to e7357e2 Compare June 21, 2024 10:08
@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 21, 2024 10:20
@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Jun 21, 2024

@lesteve @thomasjpfan @GridSearchCVFriends fancy taking a look?

The logic is now simpler so my hope is that this'll fix it for good

EDIT: will look at increasing test coverage

@lesteve
Copy link
Member

lesteve commented Jun 21, 2024

Given that lists people are doing grid-search over are presumably small-ish (not in the millions / billions?), and that this logic is getting complicated enough, it's probably worth it?

About performance, I think the array was created with np.result_type anyway, at least that is what I understand from @seberg's comment numpy/numpy#26612 (comment)

result_type is badly overloaded, note that it converts to arrays anyway internally if you pass a non dtype.

@jeremiedbb jeremiedbb added this to the 1.5.1 milestone Jun 21, 2024
@adrinjalali
Copy link
Member

This is what happens when you poke at VERY old code lol

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

except ValueError:
# Fall back to iterating over `param_result.items()` below
pass
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this whole block be written as:

if len(param_list) == n_candidates:
    with suppress(ValueError):
        ma = MaskedArray(param_list, mask=False, dtype=arr_dtype)

    if ma.ndim <= 1:
        results[key] = ma
        continue

    arr_dtype = object

(I'm trying to have less indenting)

@lesteve
Copy link
Member

lesteve commented Jun 24, 2024

As I mentioned in #29179 (comment), I am wondering whether it is time to do move this tricky code to its own function so that it can be more easily tested?


# If we construct this directly via `MaskedArray`, the list of tuples
# gets auto-converted to a 2D array.
ma = np.ma.MaskedArray(np.empty(2), mask=True, dtype=object)
Copy link
Member

@lesteve lesteve Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like something like this works:

my_iter = iter([(1, 2), (3, 4)])
masked_array = np.fromiter(my_iter, dtype=object)

but it's definitely not that easy to tell numpy to stick to 1d array with dtype object ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...but not in old versions of numpy 😭

@lesteve
Copy link
Member

lesteve commented Jun 28, 2024

Hmmm I pushed a commit to simplify + a few tweaks but it fails on Windows and debian 32 probably something about the default int dtype ... the test can likely be relaxed a bit to make it pass for this particular edge case.

I'll try to have another fresh look on Monday.

@MarcoGorelli
Copy link
Contributor Author

probably something about the default int dtype

yup, just pushed a commit to address it 👍

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot!

I agree with Adrin that is what happens when you touch very old code but at the same time, we are in a better situation now since the code has better tests and arguably the code is slightly more readable.

Oh well, future will tell if this was too optimistic or not 😉

Comment on lines 384 to 387
def _yield_masked_array_for_each_param(
candidate_params: Sequence[dict[str, Any]],
) -> Iterator[tuple[str, MaskedArray]]:
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've decided to not add type hints so far. Since the input and output types of this function are quite simple I'd not add them here.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @MarcoGorelli

@jeremiedbb jeremiedbb enabled auto-merge (squash) July 1, 2024 13:17
@jeremiedbb jeremiedbb merged commit bf08cb3 into scikit-learn:main Jul 1, 2024
28 checks passed
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
… of different sizes as parameter values (scikit-learn#29314)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
… of different sizes as parameter values (scikit-learn#29314)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
jeremiedbb added a commit that referenced this pull request Jul 2, 2024
… of different sizes as parameter values (#29314)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024
… of different sizes as parameter values (scikit-learn#29314)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GridSearchCV fails when parameters are arrays with different sizes
5 participants