[MRG+1] Make ParameterSampler sample without replacement #3850


Merged

Conversation

amueller
Member

Fixes #3792.
I think this issue came up before, so it is probably worth addressing.
This makes the ParameterSampler sample without replacement if all parameters are given as lists.
This has a couple of gotchas:

  • It changes current behavior; in particular, I raise an error if there are fewer possible settings than n_iter.
  • I don't know of a way to detect finite support in a scipy distribution, so we cannot do it there. That means 'bla': bernoulli(0.5) is no longer equivalent to 'bla': [0, 1] (the first samples with replacement, the second without).

Possibilities to resolve these would be:

  • give a warning instead of a ValueError when n_iter > grid_size and produce a smaller grid (or sample with replacement in this case?). This might invalidate the __len__ implementation.
  • add a parameter with_replacement, set it to None, and go through a deprecation cycle to set it to False. The parameter would only do something when all parameters are lists, though.
  • hack into the scipy dists, check whether something is discrete, and use lower and upper to get the support. If a user implements their own distribution with finite support, we would still be screwed.
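For concreteness, the all-lists sampling-without-replacement idea can be sketched in isolation like this (a minimal standalone sketch via rejection sampling; the function name and interface are illustrative, not the PR's actual code):

```python
import numpy as np

def sample_lists_without_replacement(param_lists, n_iter, random_state=0):
    """Yield n_iter distinct settings from an all-lists parameter grid.

    Illustrative sketch only; not the scikit-learn implementation.
    """
    rng = np.random.RandomState(random_state)
    grid_size = int(np.prod([len(v) for v in param_lists.values()]))
    if grid_size < n_iter:
        raise ValueError("The total space of parameters (%d) is smaller "
                         "than n_iter=%d." % (grid_size, n_iter))
    keys = sorted(param_lists)
    seen = set()
    while len(seen) < n_iter:
        # Draw one candidate setting; reject it if it was drawn before.
        candidate = tuple(param_lists[k][rng.randint(len(param_lists[k]))]
                          for k in keys)
        if candidate not in seen:
            seen.add(candidate)
            yield dict(zip(keys, candidate))
```

Note the rejection loop: once most of the grid has been drawn, candidates are likely to be repeats, which is exactly the performance concern raised in the review below.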

grid_size = np.prod([len(v) for v in self.param_distributions.values()])
if grid_size < self.n_iter:
    raise ValueError("The total space of parameters passed is smaller than n_iter. "
                     "For exhaustive searches, use GridSearchCV.")
Member

Would be worth mentioning the concrete values for the size of the space of the parameters and the value of n_iter in the error message.

@amueller
Member Author

Done. Do you think we should introduce an option?

# do sampling without replacement only if all_lists
# otherwise distributions with finite support might
# cause infinite loops
continue
Member

Don't you think that this way of doing sampling without replacement might be too costly for values of n_iter close to the size of the parameter space? The last few samples will have a very high likelihood of being rejected. Maybe we should check for the case self.n_iter > 0.1 * grid_size and special-case it by returning the first n_iter elements of a permutation of the materialized list of parameter combinations.
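The heuristic suggested here (materialize the grid once, permute it, take the first n_iter entries) could be sketched roughly like this; the function is illustrative, not the PR's actual code:

```python
import itertools
import numpy as np

def sample_via_permutation(param_lists, n_iter, random_state=0):
    # Materialize the full grid once, then return the first n_iter entries
    # of a random permutation of it -- no rejection loop needed.
    keys = sorted(param_lists)
    grid = [dict(zip(keys, combo))
            for combo in itertools.product(*(param_lists[k] for k in keys))]
    rng = np.random.RandomState(random_state)
    return [grid[i] for i in rng.permutation(len(grid))[:n_iter]]
```

This trades memory (the whole grid is built up front) for a guaranteed bound on running time, which is fine when n_iter is close to grid_size anyway.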

Member Author

Well, in the worst case a single sample could have time complexity n_iter. I thought that probably is not an issue, but I can just special-case the whole finite-grid-size path.

Member Author

Ended up implementing your heuristic ;)

@ogrisel
Member

ogrisel commented Nov 27, 2014

Do you think we should introduce an option?

I have no strong opinion. Maybe a warning is enough.

@amueller
Member Author

How would you warn so that it doesn't pop up all the time?

@amueller amueller force-pushed the randomized_search_no_replacement branch 3 times, most recently from 5f22d5e to 133f3fd Compare December 17, 2014 16:38
if all_lists and self.n_iter > 0.1 * grid_size:
    # get complete grid and yield from it
    if grid_size < self.n_iter:
        raise ValueError("The total space of parameters %d is smaller than n_iter=%d. "
Member

80 columns :)

@ogrisel
Member

ogrisel commented Dec 18, 2014

Apart from my cosmetics comments, +1 on my side.

@ogrisel ogrisel changed the title MRG Make ParameterSampler sample without replacement [MRG+1] Make ParameterSampler sample without replacement Dec 18, 2014
@amueller
Member Author

any other reviews?

@ogrisel
Member

ogrisel commented Jan 3, 2015

ping @jnothman @larsmans @arjoly @mblondel.

@jnothman
Member

jnothman commented Jan 4, 2015

I think this probably should be an option. Not altogether certain, though.

@amueller
Member Author

amueller commented Jan 5, 2015

There is not really a benefit in doing sampling with replacement, but maybe it would be less magic.

@jnothman
Member

jnothman commented Jan 5, 2015

Okay. I recall the issue of ParameterSampler sampling with replacement
being brought up a year or so ago, and I think Gaël suggested that multiple
runs of settings might be useful. I do think it's likely a weak argument in
the context of parameter search...


param_grid = list(ParameterGrid(self.param_distributions))
grid_size = len(param_grid)

if all_lists and self.n_iter > 0.1 * grid_size:
Member

Why not just use sample_without_replacement at this point? It uses a similar heuristic to decide algorithm...

Member

Oh never mind. I didn't read properly that you still do sampling without replacement when RVs are involved, which may be necessary for discrete RVs.

Member

Still, could just use that function for the all_lists case.
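For reference, the function mentioned here is sklearn.utils.random.sample_without_replacement, which draws distinct indices from a population, so the all_lists case reduces to sampling indices into the materialized grid. A minimal usage sketch:

```python
from sklearn.utils.random import sample_without_replacement

# Draw 5 distinct indices out of a population of 100; each index can then
# be used to look up one parameter combination in the grid.
indices = sample_without_replacement(n_population=100, n_samples=5,
                                     random_state=0)
assert len(set(indices)) == 5
assert all(0 <= i < 100 for i in indices)
```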

Member Author

True.

Member Author

fixed

@amueller amueller force-pushed the randomized_search_no_replacement branch 3 times, most recently from c9b4094 to 302feed Compare January 9, 2015 18:24
param_grid = list(ParameterGrid(self.param_distributions))
grid_size = len(param_grid)

if all_lists:
Member

This if block can be collapsed with the previous one.

Member Author

done

@jnothman
Member

Should the new behaviour be documented in ParameterSampler and/or RandomizedSearchCV docstrings?

@amueller amueller force-pushed the randomized_search_no_replacement branch from 297a9d0 to 2e2eebf Compare January 15, 2015 15:55
@ogrisel
Member

ogrisel commented Jan 16, 2015

+1 for merge (again).

@amueller
Member Author

@jnothman Documented it in both places.

@jnothman
Member

Do we think sampling without replacement is more appropriate for e.g. poisson sampling than for lists?

@amueller
Member Author

Sorry, I don't understand your question. For distributions we sample with replacement, as we can't know whether the space has been exhausted yet.

@jnothman
Member

Of course, wasn't thinking (although the alternative is to promise the user
they will land in an infinite loop if they provide bad parameters!). Well,
the PR LGTM then.


@jnothman
Member

Merging.

jnothman added a commit that referenced this pull request Jan 18, 2015
[MRG] Make ParameterSampler sample without replacement
@jnothman jnothman merged commit 6c07cd4 into scikit-learn:master Jan 18, 2015
@jnothman
Member

Thanks @amueller.

@amueller
Member Author

Thanks for the reviews :)

@amueller amueller deleted the randomized_search_no_replacement branch January 19, 2015 19:05
@antoine-lizee

Thanks for this @amueller !

@amueller
Member Author

np

@chausler

chausler commented Jun 9, 2015

Hi @amueller

@adinh and I ran into an issue as a result of this change. It makes the ParameterSampler extremely memory hungry even with a small number of categorical parameters to search over, because it now tries to generate all possible combinations in memory.

The code below uses 600MB just for setting up the parameter search; add a couple more parameters and you quickly hit many gigabytes.

from sklearn.grid_search import ParameterSampler
from memory_profiler import profile


@profile
def sample_params(params): 
  my_sampler = ParameterSampler(params, 10)
  for sample in my_sampler:
    print sample

if __name__ == '__main__':
  parameter_distribution = {
      'param_a': range(10),
      'param_b': range(10),
      'param_c': range(10),
      'param_d': range(10),
      'param_e': range(10),
      'param_f': range(10),
      }
  sample_params(parameter_distribution)

Output

$ python param_search.py                                                                                                 [15:54:29]
{'param_e': 5, 'param_d': 2, 'param_f': 1, 'param_a': 2, 'param_c': 4, 'param_b': 6}
{'param_e': 9, 'param_d': 1, 'param_f': 2, 'param_a': 9, 'param_c': 4, 'param_b': 9}
{'param_e': 9, 'param_d': 3, 'param_f': 3, 'param_a': 0, 'param_c': 2, 'param_b': 8}
{'param_e': 0, 'param_d': 8, 'param_f': 3, 'param_a': 3, 'param_c': 4, 'param_b': 0}
{'param_e': 8, 'param_d': 1, 'param_f': 5, 'param_a': 7, 'param_c': 1, 'param_b': 5}
{'param_e': 1, 'param_d': 9, 'param_f': 7, 'param_a': 3, 'param_c': 1, 'param_b': 0}
{'param_e': 5, 'param_d': 6, 'param_f': 3, 'param_a': 3, 'param_c': 9, 'param_b': 4}
{'param_e': 2, 'param_d': 2, 'param_f': 9, 'param_a': 5, 'param_c': 2, 'param_b': 3}
{'param_e': 1, 'param_d': 9, 'param_f': 8, 'param_a': 5, 'param_c': 1, 'param_b': 6}
{'param_e': 1, 'param_d': 5, 'param_f': 9, 'param_a': 4, 'param_c': 1, 'param_b': 1}
Filename: param_search.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     32.9 MiB      0.0 MiB   @profile
     6                             def sample_params(params):
     7     32.9 MiB      0.0 MiB     my_sampler = ParameterSampler(params, 10)
     8    633.6 MiB    600.7 MiB     for sample in my_sampler:
     9    633.6 MiB      0.0 MiB       print sample

You mentioned somewhere above adding the ability to choose whether or not to sample with replacement; I think that would be a good option. At the very least we should update the docstring to point out what's going on here.

I'm not sure whether to file a bug report for this or whether it's expected behaviour. Thoughts?
We're happy to help out with a fix.

@jnothman
Member

jnothman commented Jun 9, 2015

I think this is a "bug" or at least an issue. A parameter to fix this might help, but really the problem is in realising the parameter grid in list(ParameterGrid(...)). We can instead compute a parameter setting dynamically, given an index. With a bit of polishing, I think this might suffice:

    def __getitem__(self, ind):
        """
        >>> grid = ParameterGrid([{'kernel': ['linear']},
        ...                       {'kernel': ['rbf'], 'gamma': [1, 10]}])
        >>> [grid[i] for i in range(len(grid))] == list(grid)
        True
        """
        for sub_grid in self.param_grid:
            # XXX: could memoize information used here
            keys, values = zip(*sorted(sub_grid.items()))
            sizes = [len(v) for v in values]
            modulo = np.cumprod(np.hstack([1, sizes[::-1]]))[::-1]
            if ind >= modulo[0]:
                ind -= modulo[0]
            else:
                offsets = ind // modulo[1:] % sizes
                return {k: v[o] for k, v, o in zip(keys, values, offsets)}
        raise ValueError

and below change

param_grid = list(ParameterGrid(self.param_distributions))

to

param_grid = ParameterGrid(self.param_distributions)
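With index-based lookup in place (ParameterGrid did eventually gain __getitem__ support in scikit-learn), sampling becomes lazy and memory-friendly. A usage sketch; the import path assumes a modern scikit-learn, where these utilities live in sklearn.model_selection rather than the old sklearn.grid_search:

```python
from sklearn.model_selection import ParameterGrid
from sklearn.utils.random import sample_without_replacement

# 10**6 combinations, but nothing is materialized up front: draw 10
# distinct indices, then look each setting up lazily by index.
grid = ParameterGrid({'param_%s' % c: list(range(10)) for c in 'abcdef'})
indices = sample_without_replacement(len(grid), 10, random_state=0)
samples = [grid[int(i)] for i in indices]
assert len(samples) == 10
```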

Successfully merging this pull request may close these issues.

ParameterSampler launches several times a CV for the same sets of parameters when the distributions are discrete.
5 participants