[MRG+1] Make ParameterSampler sample without replacement #3850


Merged

Conversation

amueller
Member

Fixes #3792.
I think this issue came up before, so it is probably worth addressing.
This makes the ParameterSampler sample without replacement if all parameters are given as lists.
This has a couple of gotchas:

  • It changes current behavior; in particular, I raise an error if there are fewer possible settings than n_iter.
  • I don't know of a way to detect finite support in a scipy distribution, so we cannot do it there. That means 'bla': bernoulli(0.5) is no longer equivalent to 'bla': [0, 1] (the first samples with replacement, the second without).

Possibilities to resolve these would be:

  • give a warning instead of a ValueError when n_iter > grid_size and produce a smaller grid (or sample with replacement in this case?). This might invalidate the __len__ implementation.
  • add a parameter with_replacement, set it to None, and go through a deprecation cycle to set it to False. The parameter would only do something when all parameters are lists, though.
  • hack into the scipy dists, check whether something is discrete, and use lower and upper to get the support. If a user implements their own distribution with finite support, we would still be screwed.
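For concreteness, the all-lists sampling-without-replacement idea can be sketched in isolation like this (a minimal standalone sketch via rejection sampling; the function name and interface are illustrative, not the PR's actual code):

```python
import numpy as np

def sample_lists_without_replacement(param_lists, n_iter, random_state=0):
    """Yield n_iter distinct settings from an all-lists parameter grid.

    Illustrative sketch only; not the scikit-learn implementation.
    """
    rng = np.random.RandomState(random_state)
    grid_size = int(np.prod([len(v) for v in param_lists.values()]))
    if grid_size < n_iter:
        raise ValueError("The total space of parameters (%d) is smaller "
                         "than n_iter=%d." % (grid_size, n_iter))
    keys = sorted(param_lists)
    seen = set()
    while len(seen) < n_iter:
        # Draw one candidate setting; reject it if it was drawn before.
        candidate = tuple(param_lists[k][rng.randint(len(param_lists[k]))]
                          for k in keys)
        if candidate not in seen:
            seen.add(candidate)
            yield dict(zip(keys, candidate))
```

Note the rejection loop: once most of the grid has been drawn, candidates are likely to be repeats, which is exactly the performance concern raised in the review below.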

grid_size = np.prod([len(v) for v in self.param_distributions.values()])
if grid_size < self.n_iter:
    raise ValueError("The total space of parameters passed is smaller than n_iter. "
                     "For exhaustive searches, use GridSearchCV.")
Member

Would be worth mentioning the concrete values for the size of the space of the parameters and the value of n_iter in the error message.

@amueller
Member Author

Done. Do you think we should introduce an option?

# do sampling without replacement only if all_lists
# otherwise distributions with finite support might
# cause infinite loops
continue
Member

Don't you think that this way of doing sampling without replacement might be too costly for values of n_iter close to the size of the parameter space? The last few samples will have a very high likelihood of being rejected. Maybe we should check for the case self.n_iter > 0.1 * grid_size and special-case it by returning the first n_iter elements of a permutation of the materialized list of parameter combinations.
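The heuristic suggested here (materialize the grid once, permute it, take the first n_iter entries) could be sketched roughly like this; the function is illustrative, not the PR's actual code:

```python
import itertools
import numpy as np

def sample_via_permutation(param_lists, n_iter, random_state=0):
    # Materialize the full grid once, then return the first n_iter entries
    # of a random permutation of it -- no rejection loop needed.
    keys = sorted(param_lists)
    grid = [dict(zip(keys, combo))
            for combo in itertools.product(*(param_lists[k] for k in keys))]
    rng = np.random.RandomState(random_state)
    return [grid[i] for i in rng.permutation(len(grid))[:n_iter]]
```

This trades memory (the whole grid is built up front) for a guaranteed bound on running time, which is fine when n_iter is close to grid_size anyway.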

Member Author

Well, in the worst case a single sample could have time complexity n_iter. I thought that probably is not an issue, but I can just special-case the whole finite-grid-size path.

Member Author

Ended up implementing your heuristic ;)

@ogrisel
Member

ogrisel commented Nov 27, 2014

Do you think we should introduce an option?

I have no strong opinion. Maybe a warning is enough.

@amueller
Member Author

How would you warn so that it doesn't pop up all the time?

@amueller amueller force-pushed the randomized_search_no_replacement branch 3 times, most recently from 5f22d5e to 133f3fd Compare December 17, 2014 16:38
if all_lists and self.n_iter > 0.1 * grid_size:
    # get complete grid and yield from it
    if grid_size < self.n_iter:
        raise ValueError("The total space of parameters %d is smaller than n_iter=%d. "
Member

80 columns :)

@ogrisel
Member

ogrisel commented Dec 18, 2014

Apart from my cosmetics comments, +1 on my side.

@ogrisel ogrisel changed the title MRG Make ParameterSampler sample without replacement [MRG+1] Make ParameterSampler sample without replacement Dec 18, 2014
@amueller
Member Author

any other reviews?

@ogrisel
Member

ogrisel commented Jan 3, 2015

ping @jnothman @larsmans @arjoly @mblondel.

@jnothman
Member

jnothman commented Jan 4, 2015

I think this probably should be an option. Not altogether certain, though.

@amueller
Member Author

amueller commented Jan 5, 2015

There is not really a benefit in doing sampling with replacement, but maybe it would be less magic.

@jnothman
Member

jnothman commented Jan 5, 2015

Okay. I recall the issue of ParameterSampler sampling with replacement
being brought up a year or so ago, and I think Gaël suggested that multiple
runs of settings might be useful. I do think it's likely a weak argument in
the context of parameter search...


param_grid = list(ParameterGrid(self.param_distributions))
grid_size = len(param_grid)

if all_lists and self.n_iter > 0.1 * grid_size:
Member

Why not just use sample_without_replacement at this point? It uses a similar heuristic to decide algorithm...

Member

Oh never mind. I didn't read properly that you still do sampling without replacement when RVs are involved, which may be necessary for discrete RVs.

Member

Still, could just use that function for the all_lists case.
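For reference, the function mentioned here is sklearn.utils.random.sample_without_replacement, which draws distinct indices from a population, so the all_lists case reduces to sampling indices into the materialized grid. A minimal usage sketch:

```python
from sklearn.utils.random import sample_without_replacement

# Draw 5 distinct indices out of a population of 100; each index can then
# be used to look up one parameter combination in the grid.
indices = sample_without_replacement(n_population=100, n_samples=5,
                                     random_state=0)
assert len(set(indices)) == 5
assert all(0 <= i < 100 for i in indices)
```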

Member Author

True.

Member Author

fixed

@amueller amueller force-pushed the randomized_search_no_replacement branch 3 times, most recently from c9b4094 to 302feed Compare January 9, 2015 18:24
param_grid = list(ParameterGrid(self.param_distributions))
grid_size = len(param_grid)

if all_lists:
Member

This if block can be collapsed with the previous one.

Member Author

done

@jnothman
Member

Should the new behaviour be documented in ParameterSampler and/or RandomizedSearchCV docstrings?

@amueller amueller force-pushed the randomized_search_no_replacement branch from 297a9d0 to 2e2eebf Compare January 15, 2015 15:55
@ogrisel
Member

ogrisel commented Jan 16, 2015

+1 for merge (again).

@amueller
Member Author

@jnothman Documented it in both places.

@jnothman
Member

Do we think sampling without replacement is more appropriate for e.g. poisson sampling than for lists?

@amueller
Member Author

Sorry, I don't understand your question. For distributions we sample with replacement, as we can't know whether the space has been exhausted yet.

@jnothman
Member

Of course, wasn't thinking (although the alternative is to promise the user
they will land in an infinite loop if they provide bad parameters!). Well,
the PR LGTM then.


@jnothman
Member

Merging.

jnothman added a commit that referenced this pull request Jan 18, 2015
[MRG] Make ParameterSampler sample without replacement
@jnothman jnothman merged commit 6c07cd4 into scikit-learn:master Jan 18, 2015
@jnothman
Member

Thanks @amueller.

@amueller
Member Author

Thanks for the reviews :)

@amueller amueller deleted the randomized_search_no_replacement branch January 19, 2015 19:05
@antoine-lizee

Thanks for this @amueller !

@amueller
Member Author

np

@chausler

chausler commented Jun 9, 2015

Hi @amueller

@adinh and I ran into an issue as a result of this change. It makes the ParameterSampler extremely memory hungry even with a small number of categorical parameters to search over, because it now tries to generate all possible combinations in memory.

The code below uses 600MB just for setting up the parameter search; add a couple more parameters and you quickly hit many gigabytes.

from sklearn.grid_search import ParameterSampler
from memory_profiler import profile


@profile
def sample_params(params): 
  my_sampler = ParameterSampler(params, 10)
  for sample in my_sampler:
    print sample

if __name__ == '__main__':
  parameter_distribution = {
      'param_a': range(10),
      'param_b': range(10),
      'param_c': range(10),
      'param_d': range(10),
      'param_e': range(10),
      'param_f': range(10),
      }
  sample_params(parameter_distribution)

Output

$ python param_search.py                                                                                                 [15:54:29]
{'param_e': 5, 'param_d': 2, 'param_f': 1, 'param_a': 2, 'param_c': 4, 'param_b': 6}
{'param_e': 9, 'param_d': 1, 'param_f': 2, 'param_a': 9, 'param_c': 4, 'param_b': 9}
{'param_e': 9, 'param_d': 3, 'param_f': 3, 'param_a': 0, 'param_c': 2, 'param_b': 8}
{'param_e': 0, 'param_d': 8, 'param_f': 3, 'param_a': 3, 'param_c': 4, 'param_b': 0}
{'param_e': 8, 'param_d': 1, 'param_f': 5, 'param_a': 7, 'param_c': 1, 'param_b': 5}
{'param_e': 1, 'param_d': 9, 'param_f': 7, 'param_a': 3, 'param_c': 1, 'param_b': 0}
{'param_e': 5, 'param_d': 6, 'param_f': 3, 'param_a': 3, 'param_c': 9, 'param_b': 4}
{'param_e': 2, 'param_d': 2, 'param_f': 9, 'param_a': 5, 'param_c': 2, 'param_b': 3}
{'param_e': 1, 'param_d': 9, 'param_f': 8, 'param_a': 5, 'param_c': 1, 'param_b': 6}
{'param_e': 1, 'param_d': 5, 'param_f': 9, 'param_a': 4, 'param_c': 1, 'param_b': 1}
Filename: param_search.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     32.9 MiB      0.0 MiB   @profile
     6                             def sample_params(params):
     7     32.9 MiB      0.0 MiB     my_sampler = ParameterSampler(params, 10)
     8    633.6 MiB    600.7 MiB     for sample in my_sampler:
     9    633.6 MiB      0.0 MiB       print sample

You mentioned somewhere above adding the ability to choose whether or not to sample with replacement; I think that would be a good option. At the very least we should update the docstring to point out what's going on here.

I'm not sure whether to file a bug report for this or whether it's expected behaviour. Thoughts?
We're happy to help out with a fix.

@jnothman
Member

jnothman commented Jun 9, 2015

I think this is a "bug" or at least an issue. A parameter to fix this might help, but really the problem is in realising the parameter grid in list(ParameterGrid(...)). We can instead compute a parameter setting dynamically, given an index. With a bit of polishing, I think this might suffice:

    def __getitem__(self, ind):
        """
        >>> grid = ParameterGrid([{'kernel': ['linear']},
        ...                       {'kernel': ['rbf'], 'gamma': [1, 10]}])
        >>> [grid[i] for i in range(len(grid))] == list(grid)
        True
        """
        for sub_grid in self.param_grid:
            # XXX: could memoize information used here
            keys, values = zip(*sorted(sub_grid.items()))
            sizes = [len(v) for v in values]
            modulo = np.cumprod(np.hstack([1, sizes[::-1]]))[::-1]
            if ind >= modulo[0]:
                ind -= modulo[0]
            else:
                offsets = ind // modulo[1:] % sizes
                return {k: v[o] for k, v, o in zip(keys, values, offsets)}
        raise ValueError

and below change

param_grid = list(ParameterGrid(self.param_distributions))

to

param_grid = ParameterGrid(self.param_distributions)
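With index-based lookup in place (ParameterGrid did eventually gain __getitem__ support in scikit-learn), sampling becomes lazy and memory-friendly. A usage sketch; the import path assumes a modern scikit-learn, where these utilities live in sklearn.model_selection rather than the old sklearn.grid_search:

```python
from sklearn.model_selection import ParameterGrid
from sklearn.utils.random import sample_without_replacement

# 10**6 combinations, but nothing is materialized up front: draw 10
# distinct indices, then look each setting up lazily by index.
grid = ParameterGrid({'param_%s' % c: list(range(10)) for c in 'abcdef'})
indices = sample_without_replacement(len(grid), 10, random_state=0)
samples = [grid[int(i)] for i in indices]
assert len(samples) == 10
```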

Successfully merging this pull request may close these issues.

ParameterSampler launches several times a CV for the same sets of parameters when the distributions are discrete.
5 participants