
[MRG] Expose random seed in Hashingvectorizer #14605


Merged
26 commits merged into scikit-learn:master on Oct 7, 2019

Conversation

TwsThomas
Contributor

@TwsThomas TwsThomas commented Aug 8, 2019

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

Expose the seed from murmur_hash in HashingVectorizer to allow minhash techniques (https://en.wikipedia.org/wiki/MinHash).
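
For context, a rough sketch of the MinHash idea this enables, using sklearn.utils.murmurhash3_32 directly; the helper below is illustrative only, with the per-permutation seeds playing the role of the seed exposed here:

import numpy as np
from sklearn.utils import murmurhash3_32

def minhash_signature(tokens, n_perm=8):
    # One MinHash value per "permutation", each defined by a murmur seed.
    return np.array([
        min(murmurhash3_32(tok, seed=s, positive=True) for tok in tokens)
        for s in range(n_perm)
    ])

doc_a = "the quick brown fox jumps".split()
doc_b = "the quick brown dog jumps".split()
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
# The fraction of matching signature entries estimates the Jaccard similarity.
print(np.mean(sig_a == sig_b))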

Any other comments?

Ping @glemaitre, @GaelVaroquaux

@TwsThomas TwsThomas changed the title from "[WIP] Expose random seed in Hashingvectorizer" to "[MRG] Expose random seed in Hashingvectorizer" on Aug 8, 2019
Member

@rth rth left a comment


Thanks! I understand the motivation, and this PR does enforce scikit-learn API conventions. My concern is that previously HashingVectorizer was stateless: we repeatedly say so in the documentation and, for better or worse, illustrate its use there as HashingVectorizer().transform(X) (i.e. without calling .fit). Now this PR would error if fit isn't called, which is technically a backward-incompatible change.

So I'm wondering if we couldn't just default random_state=0 to be backward compatible without a warning, and possibly do this checking in transform.

Not asking to do these changes now, let's hear some more opinions about it.

(BTW, tests are failing because we fail on uncaught FutureWarnings and this PR produces some in tests).

@TwsThomas
Contributor Author

Thank you for the feedback.

So I'm wondering if we couldn't just default random_state=0 to be backward compatible without a warning, and possibly do this checking in transform.

Yes, it might be counter-productive to enforce a fit in HashingVectorizer only for a temporary warning.

Member

@NicolasHug NicolasHug left a comment


A first pass.

@TwsThomas for catching warnings while testing you can check out https://scikit-learn.org/stable/developers/tips.html#useful-pytest-aliases-and-flags

Basically I agree with @rth's suggestion.

We can't enforce using fit since we explicitly just call transform in the docs. Validating self.random_state in transform seems to be a good solution.

I'm also OK with setting the default to 0 to preserve backward compatibility. It's inconsistent with other estimators (None is arguably a not-so-good default anyway, cf #14042), but if the default is 'warn' then every single user would get a warning.


How useful is it for users to set the random state though? If the ultimate goal is only to implement minhash techniques, then maybe we could make the random_state part of a private API, or have a private child class that implements it?
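
Purely as an illustration of the private-child-class idea (names are hypothetical, not actual scikit-learn API):

from sklearn.feature_extraction.text import HashingVectorizer

class _SeededHashingVectorizer(HashingVectorizer):
    # Hypothetical private subclass: adds a seed on top of the public
    # HashingVectorizer; forwarding it to the murmur hash would rely on
    # the internals touched by this PR.
    def __init__(self, seed=0, n_features=2**20):
        super().__init__(n_features=n_features)
        self.seed = seed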

This method doesn't do anything. It exists purely for compatibility
with the scikit-learn transformer API.
This method doesn't do anything except validate self.random_state.
And it compute the seed (int) from the random_state
Member


No need to go into the details of the random state IMO.

@@ -123,6 +131,20 @@ def fit(self, X=None, y=None):
"""
# repeat input validation for grid search (which calls set_params)
self._validate_params(self.n_features, self.input_type)

# set self.random_state
Member


You shouldn't set parameters that were passed to __init__. Use e.g. self._random_state instead.
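
I.e. the usual pattern, sketched on a made-up minimal class rather than the actual PR code:

from sklearn.utils import check_random_state

class _Example:
    def __init__(self, random_state=0):
        # Constructor arguments are stored verbatim and never overwritten.
        self.random_state = random_state

    def fit(self, X=None, y=None):
        # Derived/validated values live on a separate attribute.
        self._random_state = check_random_state(self.random_state)
        return self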

self.random_state = check_random_state(self.random_state)

# extract the seed (int) from the RandomState instance
self.seed = self.random_state.get_state()[1][0]
Member


make it private?

@TwsThomas
Contributor Author

As suggested by @NicolasHug, the seed is now private.
The previous behaviour is unchanged.

@glemaitre glemaitre self-requested a review August 13, 2019 14:11
Member

@glemaitre glemaitre left a comment


Otherwise LGTM

@rth
Member

rth commented Sep 11, 2019

In the end I think I would be +1 to just expose random_state=0 as a public parameter. An undocumented private attribute is not ideal.

How useful is it for users to set the random state though?

Well, aside from minhash, it's useful to know that the HashingVectorizer output does depend on an arbitrarily fixed random_state. For instance, checking that the classifier output is robust to this random state is something that could be worthwhile. Since it changes the words for which collisions happen, the effect can be non-negligible, particularly for low n_features.
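
Concretely, the kind of check I mean could look like this (a sketch only; it assumes the random_state parameter proposed in this PR, and the dataset/classifier are arbitrary):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])

for seed in (0, 1, 2):
    # random_state here is the parameter proposed in this PR; a small
    # n_features makes hash collisions, and thus the seed, matter more.
    X = HashingVectorizer(n_features=2**10, random_state=seed).transform(data.data)
    scores = cross_val_score(SGDClassifier(random_state=0), X, data.target, cv=3)
    print(seed, scores.mean())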

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 11, 2019 via email

@NicolasHug
Member

For what it's worth, we don't need a private attribute, we could just introduce a new private class.

If we expose the random state, we need to make sure that None is forbidden, and that a RandomState instance is forbidden too. Otherwise, calling transform twice on the same data gives different embeddings.

For this reason, I'd rather introduce a seed parameter, rather than random_state. No need to worry about API inconsistency anymore.

@rth
Member

rth commented Sep 11, 2019

If we expose the random state, we need to make sure that None is forbidden, and RandomStateInstance is forbidden too. Else, calling transform twice on the same data gives different embeddings.

Why? Can't we just clone it in transform and use it as a local variable?

@rth
Member

rth commented Sep 11, 2019

Maybe something like the following in transform

if isinstance(self.random_state, int):
    seed = self.random_state
else:
    random_state = check_random_state(self.random_state)

    # Create a seed from the internal state of the RandomState
    # object
    seed = random_state.get_state()[1].sum()

? random_state=0 would be backward compatible but then indeed it would not be equivalent to random_state=RandomState(0). Not sure we make such promises anywhere though.

@NicolasHug
Member

As far as I understand, it does not make any sense for the FeatureHasher to give different outputs for the same X?

@rth
Member

rth commented Sep 11, 2019

As far as I understand, it does not make any sense for the FeatureHasher to give different outputs for the same X?

So with the above code #14605 (comment) I think providing RandomState(int) should consistently return the same output.

random_state=None is indeed an issue. It's not possible to return consistent output without storing the random_state in .fit. OK, maybe disallow random_state=None then?

I'd rather introduce a seed parameter, rather than random_state.

We can still call it random_state even if the None option is not supported, can't we?

@NicolasHug
Member

We can still call it random_state even if the None option is not supported, can't we?

That + your snippet above seems like a lot of work only for the benefit of calling the parameter random_state, which would have an inconsistent behaviour w.r.t. all the other random_state parameters of scikit-learn.

What we need is a random seed, why not just call it seed? It's simple, and it does what users will expect.

@rth
Member

rth commented Sep 11, 2019

A simplified version that should be more consistent (about random_state=int vs random_state=RandomState(int) cf SO) while preserving backward compatibility for random_state=0,

def get_seed(random_state):
    """Get a seed from the internal state of the RandomState"""
    return int(random_state.get_state()[1][0])

if self.random_state is None:
    raise ValueError('random_state=None is not supported.')
elif isinstance(self.random_state, int):
    seed = self.random_state
else:
    random_state = check_random_state(self.random_state)
    seed = get_seed(random_state)

What we need is a random seed, why not just call it seed? It's simple, and it does what users will expect.

Well, because in >80% of cases users pass an int as the random_state, and in that use case it behaves exactly like a seed, so there is no reason to call it differently.

I agree some work is needed to extract the seed from the RandomState, but I would argue that the additional few lines are a negligible cost compared to keeping the parameter naming consistent.
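
For reference, the reason get_state()[1][0] works: with NumPy's legacy integer seeding, the first word of the Mersenne Twister key is the seed itself (my understanding of the legacy seeding path, worth double-checking):

import numpy as np

rs = np.random.RandomState(42)
# get_state() -> ('MT19937', key, pos, has_gauss, cached_gaussian);
# with legacy integer seeding, key[0] equals the seed that was passed.
print(int(rs.get_state()[1][0]))  # 42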

@rth
Member

rth commented Sep 11, 2019

Although, initializing the RandomState seems to allocate a numpy array of size 624, which is non-negligible, so an optimized path for random_state=int is still probably worth it. Updated the above comment.
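
(For reference, the allocation in question is the Mersenne Twister key array:)

import numpy as np
# The MT19937 key is 624 uint32 words, allocated on every RandomState init.
print(np.random.RandomState(0).get_state()[1].shape)  # (624,)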

@NicolasHug
Member

I'm sorry but I really don't understand why you want to push for random_state here.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 11, 2019 via email

@rth
Member

rth commented Sep 11, 2019

I'm sorry but I really don't understand why you want to push for random_state here.

Well, seed is not a great name, because it doesn't indicate anything about how it impacts the output. You could call it random_seed, but that's confusing with the existing random_state variable. Then if we call it random_state, it kind of implies that RandomState instances should be supported. But maybe you are right about not supporting that use case, I don't have a strong opinion about it. My point is about keeping a consistent name.

An important point though is that we have a number of common tests that check how an estimator should behave when random_state is provided. If we call it something different, it means that we need to implement those checks manually. For instance I wonder if the common tests would pass with the above implementation.

Indeed, as mentioned by Nicolas, we want entropy but not randomness. Two calls on the same input should return the same output.

As far as I understood, the mutability of self.random_state is considered a bug and #14042 aims to fix it? In which case repeated calls would always return the same result in both cases, or am I missing something?

Besides, aren't there some analogies between FeatureHasher and random projections, in terms of reducing the dimensionality of the input? The latter does use the random_state parameter.
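
E.g., random projections already fix their projection matrix through random_state (a small self-contained illustration):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.RandomState(0).rand(20, 100)
# Same random_state -> same projection matrix -> identical output.
X_a = GaussianRandomProjection(n_components=5, random_state=0).fit_transform(X)
X_b = GaussianRandomProjection(n_components=5, random_state=0).fit_transform(X)
assert np.allclose(X_a, X_b)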

@NicolasHug
Member

Well seed is not a great name, because it doesn't indicate anything about how it impacts the output.

I 100% agree on this, and it is also true for the great majority of the random_state parameters: they are poorly documented. It's very easy to fix this problem, we just need a good docstring for seed.

As you noted, #14042 points out that passing a RandomState instance has usually undesirable effects, just like passing None. I see that as an argument in favor of using seed.

Member

@jeremiedbb jeremiedbb left a comment


1 comment. Otherwise looks good

Member

@glemaitre glemaitre left a comment


You will need to add an entry in doc/whats_new/v0.22.rst

@@ -17,7 +17,8 @@ from ..utils.fixes import sp_version
np.import_array()


def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign=1):
def transform(raw_X, Py_ssize_t n_features, dtype,
bint alternate_sign=1, unsigned int seed=0):
Member


Suggested change
bint alternate_sign=1, unsigned int seed=0):
              bint alternate_sign=1, unsigned int seed=0):

@@ -123,6 +128,11 @@ def fit(self, X=None, y=None):
"""
# repeat input validation for grid search (which calls set_params)
self._validate_params(self.n_features, self.input_type)

# optional if random_state is left to an integer seed
Member


optional is weird since there is nothing preventing the check from running even when this is an integer. Shall we just remove the comment, then?

@@ -148,9 +158,19 @@ def transform(self, raw_X):
raw_X = (_iteritems(d) for d in raw_X)
elif self.input_type == "string":
raw_X = (((f, 1) for f in x) for x in raw_X)
# expose the seed for HashingVectorizer
if isinstance(self.random_state, int):
Member


I don't think that it will work with np.int32 or another integer type.

from numbers import Integral
Suggested change
if isinstance(self.random_state, int):
if isinstance(self.random_state, Integral):

@@ -148,9 +158,19 @@ def transform(self, raw_X):
raw_X = (_iteritems(d) for d in raw_X)
elif self.input_type == "string":
raw_X = (((f, 1) for f in x) for x in raw_X)
# expose the seed for HashingVectorizer
if isinstance(self.random_state, int):
seed = self.random_state # stateless estimator
Member


to force it to a 64-bit integer, for sure

Suggested change
seed = self.random_state # stateless estimator
seed = int(self.random_state) # stateless estimator

seed = self.seed_
else:
raise NotFittedError("FeaturHasher needs to be fitted"
"when random_state is not a fixed integer"
Member


Suggested change
"when random_state is not a fixed integer"
"when random_state is not a fixed integer."

X = getattr(hv, method)(text)
assert_array_equal(X.indices, [639749, 784967, 784967, 784967, 945982])

# assert seed influence
Member


remove comment

Member


Instead of the comment, you can test the following, which is what you are trying to explain:

hashing_vectorizer = HashingVectorizer()
X_default = getattr(hashing_vectorizer, method)(text)
hashing_vectorizer = HashingVectorizer(random_state=0)
X_seed_0 = getattr(hashing_vectorizer, method)(text)
hashing_vectorizer = HashingVectorizer(random_state=1)
X_seed_1 = getattr(hashing_vectorizer, method)(text)

assert_allclose_dense_sparse(X_default, X_seed_0)
with pytest.raises(AssertionError, match='Not equal'):
    assert_allclose_dense_sparse(X_default, X_seed_1)
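
A self-contained version of that suggestion could look roughly like this (it assumes the random_state parameter proposed in this PR; the assert_allclose_dense_sparse import path differs across scikit-learn versions):

import pytest
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.utils._testing import assert_allclose_dense_sparse

@pytest.mark.parametrize('method', ['transform', 'fit_transform'])
def test_hashing_vectorizer_random_state(method):
    text = ['hello world', 'hello hello', 'hello goodbye']
    X_default = getattr(HashingVectorizer(), method)(text)
    X_seed_0 = getattr(HashingVectorizer(random_state=0), method)(text)
    X_seed_1 = getattr(HashingVectorizer(random_state=1), method)(text)

    # The default stays backward compatible with random_state=0 ...
    assert_allclose_dense_sparse(X_default, X_seed_0)
    # ... while a different seed changes the hashed output.
    with pytest.raises(AssertionError):
        assert_allclose_dense_sparse(X_default, X_seed_1)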

Member


I would probably use the same test for FeatureHasher. It avoids hard-coding the output.

def test_hashingvectorizer_random_state_not_int(random_state):
text = ['hello world', 'hello hello', 'hello goodbye']
# assert random_seed=None should be fitted
hv = HashingVectorizer(random_state=random_state)
Member


check my comment for the FeatureHasher

@@ -715,6 +720,10 @@ def fit(self, X, y=None):
self._warn_for_unused_params()
self._validate_params()

# optional if random_state is left to an integer seed
Member


no need

@@ -741,6 +750,14 @@ def transform(self, X):
self._validate_params()

analyzer = self.build_analyzer()

if isinstance(self.random_state, int):
Member


Suggested change
if isinstance(self.random_state, int):
if isinstance(self.random_state, Integral):

@@ -741,6 +750,14 @@ def transform(self, X):
self._validate_params()

analyzer = self.build_analyzer()

if isinstance(self.random_state, int):
self.seed_ = self.random_state # stateless estimator
Member


Suggested change
self.seed_ = self.random_state # stateless estimator
self.seed_ = int(self.random_state) # stateless estimator

@NicolasHug
Member

Sorry but I really think the logic we are implementing here is overkill, confusing and inconsistent.

  • random_state doesn't behave like the other random_state parameters
  • It's not clear at all what a fitted FeatureHasher is. Users will not understand unless we very carefully explain all this (complex) random_state logic. We're adding a lot of cognitive overload for users and for us in the code. I don't think it's worth it.

I don't even see the point of tuning the seed of a FeatureHasher. This is encouraging users to treat the random seed as a hyperparameter.

Again, can we just make all this logic private? At least until we figure out a general solution regarding #14042.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 24, 2019 via email

@rth
Member

rth commented Sep 24, 2019

Fine if it's a blocker we can make it private again for now.

I don't even see the point of tuning the seed of a FeatureHasher. This is encouraging users to treat the random seed as a hyper parameter.

Aside from the minhash use case, I have explained the motivation in #14605 (comment). Do you disagree with that use case, or find it not compelling enough?

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 24, 2019 via email

@glemaitre
Member

glemaitre commented Sep 24, 2019 via email

Member

@glemaitre glemaitre left a comment


LGTM.

Member

@rth rth left a comment


LGTM, thanks for your patience @TwsThomas !

@NicolasHug good to merge?

@NicolasHug NicolasHug merged commit 9b045d4 into scikit-learn:master Oct 7, 2019
@NicolasHug
Member

Looks good, thanks @TwsThomas

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 7, 2019 via email

@glemaitre
Member

glemaitre commented Oct 7, 2019 via email

rth added a commit that referenced this pull request Oct 10, 2019
@@ -4,6 +4,7 @@
import pytest

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction._hashing import transform as _hashing_transform
Member


This actually broke PyPy CI, because in PyPy the _hashing submodule is not built/supported and so we can't import it at top level. Should be fixed in 4ef679b on master.
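
For reference, the usual pattern to avoid this is a lazy import that skips the test when the extension is unavailable, e.g. (a generic sketch, not necessarily what 4ef679b does):

import pytest

def test_hashing_transform_seed():
    # Imported lazily so test collection doesn't fail on builds (e.g. PyPy)
    # where the compiled _hashing extension isn't available; skip instead.
    _hashing = pytest.importorskip('sklearn.feature_extraction._hashing')
    assert hasattr(_hashing, 'transform')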
