-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+1] Make ParameterSampler sample without replacement #3850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Make ParameterSampler sample without replacement #3850
Conversation
grid_size = np.prod([len(v) for v in self.param_distributions.values()]) | ||
if grid_size < self.n_iter: | ||
raise ValueError("The total space of parameters passed is smaller than n_iter. " | ||
"For exhaustive searches, use GridSearchCV.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be worth mentioning the concrete values for the size of the space of the parameters and the value of n_iter
in the error message.
Done. Do you think we should introduce an option? |
# do sampling without replacement only if all_lists | ||
# otherwise distributions with finite support might | ||
# cause infinite loops | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you think that this way of doing sampling without replacement might be too costly for values of n_iter
close to the size of the parameter space? The last few samples will have have a very high likelihood of being rejected. Maybe we should check the case self.n_iter > 0.1 * grid_size
and special purpose that by returning the n_iter
first element a permutation of the materialized list of parameter combinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, in the worst case a single sample could have time complexity n_iter
. I thought that probably is not an issue but I can just special case the whole finite grid-size path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up implementing your heuristic ;)
I have no strong opinion. Maybe a warning is enough. |
How would you warn so that it doesn't pop up all the time? |
5f22d5e
to
133f3fd
Compare
if all_lists and self.n_iter > 0.1 * grid_size: | ||
# get complete grid and yield from it | ||
if grid_size < self.n_iter: | ||
raise ValueError("The total space of parameters %d is smaller than n_iter=%d. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
80 columns :)
Apart from my cosmetics comments, +1 on my side. |
any other reviews? |
I think this probably should be an option. Not altogether certain, though. |
There is not really a benefit in doing sampling with replacement, but maybe it would be less magic. |
Okay. I recall the issue of ParameterSampler sampling with replacement On 6 January 2015 at 04:01, Andreas Mueller notifications@github.com
|
param_grid = list(ParameterGrid(self.param_distributions)) | ||
grid_size = len(param_grid) | ||
|
||
if all_lists and self.n_iter > 0.1 * grid_size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use sample_without_replacement
at this point? It uses a similar heuristic to decide algorithm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh never mind. I didn't read properly that you still do sampling without replacement when RVs are involved, which may be necessary for discrete RVs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, could just use that function for the all_lists
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
c9b4094
to
302feed
Compare
param_grid = list(ParameterGrid(self.param_distributions)) | ||
grid_size = len(param_grid) | ||
|
||
if all_lists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if block can be collapsed with the previous one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Should the new behaviour be documented in |
…e given as lists.
297a9d0
to
2e2eebf
Compare
+1 for merge (again). |
@jnothman Documented it in both places. |
Do we think sampling without replacement is more appropriate for e.g. poisson sampling than for lists? |
Sorry I don't understand your question. For distributions we do sampling with replacement as we can't know if the space was exhausted yet. |
Of course, wasn't thinking (although the alternative is to promise the user On 18 January 2015 at 07:00, Andreas Mueller notifications@github.com
|
Merging. |
[MRG] Make ParameterSampler sample without replacement
Thanks @amueller. |
Thanks for the reviews :) |
Thanks for this @amueller ! |
np |
Hi @amueller @adinh and I ran into an issue as a result of this change. It makes the ParameterSampler super memory hungry even if you have a small number of categorical parameters to search on, because it now tries to generate all possible combination in memory. This code below uses 600MB just for setting up the parameter search, as a couple more parameters and you quickly hit many-of-gigabytes. from sklearn.grid_search import ParameterSampler
from memory_profiler import profile
@profile
def sample_params(params):
my_sampler = ParameterSampler(params, 10)
for sample in my_sampler:
print sample
if __name__ == '__main__':
parameter_distribution = {
'param_a': range(10),
'param_b': range(10),
'param_c': range(10),
'param_d': range(10),
'param_e': range(10),
'param_f': range(10),
}
sample_params(parameter_distribution) Output
You mentioned somewhere above about adding the ability to choose whether or not to sample with replacement, I think that would be a good option. Or at least we should update the doc string to point out whats going on here. I'm not sure whether to file a bug report for this or is it expected behaviour? Thoughts? |
I think this is a "bug" or at least an issue. A parameter to fix this might help, but really the problem is in realising the parameter grid in def __getitem__(self, ind):
"""
>>> grid = ParameterGrid([{'kernel': ['linear']},
... {'kernel': ['rbf'], 'gamma': [1, 10]}])
>>> [grid[i] for i in range(len(grid))] == list(grid)
True
"""
for sub_grid in self.param_grid:
# XXX: could memoize information used here
keys, values = zip(*sorted(sub_grid.items()))
sizes = [len(v) for v in values]
modulo = np.cumprod(np.hstack([1, sizes[::-1]]))[::-1]
if ind >= modulo[0]:
ind -= modulo[0]
else:
offsets = ind // modulo[1:] % sizes
return {k: v[o] for k, v, o in zip(keys, values, offsets)}
raise ValueError and below change param_grid = list(ParameterGrid(self.param_distributions)) to param_grid = ParameterGrid(self.param_distributions) |
Fixes #3792.
I think this issue came up before so maybe it is worth addressing.
This makes the ParametersSampler sample without replacement if all parameters are lists.
This has a couple of gotchas:
n_iter
.'bla' : bernoulli(0.5)
is not equivalent any more to'bla': [0, 1]
(the first does sampling with replacement, the second without)Possibilities to resolve these would be:
n_iter
>grid_size
and produce a smaller grid (or sample with replacement in this case?). Might invalidate the__len__
implementation.with_replacement
, set it toNone
, and go to a deprecation cycle to set it toFalse
. The parameter would only do something in the case all parameters are lists, though.