-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values #29314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values #29314
Conversation
08b6b27
to
ce49cb0
Compare
4d97ec6
to
f2f4eeb
Compare
… of different sizes as parameter values
f2f4eeb
to
e7357e2
Compare
@lesteve @thomasjpfan @GridSearchCVFriends fancy taking a look? The logic is now simpler so my hope is that this'll fix it for good EDIT: will look at increasing test coverage |
About performance, I think the array was created with
|
This is what happens when you poke at VERY old code lol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
sklearn/model_selection/_search.py
Outdated
except ValueError: | ||
# Fall back to iterating over `param_result.items()` below | ||
pass | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this whole block be written as:
if len(param_list) == n_candidates:
with suppress(ValueError):
ma = MaskedArray(param_list, mask=False, dtype=arr_dtype)
if ma.ndim <= 1:
results[key] = ma
continue
arr_dtype = object
(I'm trying to have less indenting)
As I mentioned in #29179 (comment), I am wondering whether it is time to do move this tricky code to its own function so that it can be more easily tested? |
|
||
# If we construct this directly via `MaskedArray`, the list of tuples | ||
# gets auto-converted to a 2D array. | ||
ma = np.ma.MaskedArray(np.empty(2), mask=True, dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like something like this works:
my_iter = iter([(1, 2), (3, 4)])
masked_array = np.fromiter(my_iter, dtype=object)
but it's definitely not that easy to tell numpy to stick to 1d array with dtype object ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...but not in old versions of numpy 😭
sklearn/model_selection/tests/test_yield_masked_array_for_each_param.py
Outdated
Show resolved
Hide resolved
…nto oh-no-not-another-one
Hmmm I pushed a commit to simplify + a few tweaks but it fails on Windows and debian 32 probably something about the default int dtype ... the test can likely be relaxed a bit to make it pass for this particular edge case. I'll try to have another fresh look on Monday. |
yup, just pushed a commit to address it 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks a lot!
I agree with Adrin that is what happens when you touch very old code but at the same time, we are in a better situation now since the code has better tests and arguably the code is slightly more readable.
Oh well, future will tell if this was too optimistic or not 😉
sklearn/model_selection/_search.py
Outdated
def _yield_masked_array_for_each_param( | ||
candidate_params: Sequence[dict[str, Any]], | ||
) -> Iterator[tuple[str, MaskedArray]]: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've decided to not add type hints so far. Since the input and output types of this function are quite simple I'd not add them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @MarcoGorelli
… of different sizes as parameter values (scikit-learn#29314) Co-authored-by: Loïc Estève <loic.esteve@ymail.com> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
… of different sizes as parameter values (scikit-learn#29314) Co-authored-by: Loïc Estève <loic.esteve@ymail.com> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
… of different sizes as parameter values (#29314) Co-authored-by: Loïc Estève <loic.esteve@ymail.com> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
… of different sizes as parameter values (scikit-learn#29314) Co-authored-by: Loïc Estève <loic.esteve@ymail.com> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Reference Issues/PRs
fixes #29277
What does this implement/fix? Explain your changes.
Any other comments?
This simplifies things a bit, but does involve allocating
Given that lists people are doing grid-search over are presumably small-ish (not in the millions / billions?), and that this logic is getting complicated enough, it's probably worth it?