
[MRG] Expose random seed in Hashingvectorizer #14605


Merged
26 commits merged into scikit-learn:master on Oct 7, 2019

Conversation

TwsThomas
Contributor

@TwsThomas TwsThomas commented Aug 8, 2019

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

Expose the seed from murmur_hash in HashingVectorizer to allow minhash techniques (https://en.wikipedia.org/wiki/MinHash).
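
For context, a rough sketch of the MinHash idea this enables, using sklearn.utils.murmurhash3_32 directly; the helper below is illustrative only, with the per-permutation seeds playing the role of the seed exposed here:

import numpy as np
from sklearn.utils import murmurhash3_32

def minhash_signature(tokens, n_perm=8):
    # One MinHash value per "permutation", each defined by a murmur seed.
    return np.array([
        min(murmurhash3_32(tok, seed=s, positive=True) for tok in tokens)
        for s in range(n_perm)
    ])

doc_a = "the quick brown fox jumps".split()
doc_b = "the quick brown dog jumps".split()
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
# The fraction of matching signature entries estimates the Jaccard similarity.
print(np.mean(sig_a == sig_b))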

Any other comments?

Ping @glemaitre, @GaelVaroquaux

@TwsThomas TwsThomas changed the title from "[WIP] Expose random seed in Hashingvectorizer" to "[MRG] Expose random seed in Hashingvectorizer" on Aug 8, 2019
Member

@rth rth left a comment


Thanks! I understand the motivation, and this PR does enforce scikit-learn API conventions. My concern is that previously HashingVectorizer was stateless: we repeatedly say so in the documentation and, for better or worse, illustrate its use there as HashingVectorizer().transform(X) (i.e. without calling .fit). Now this PR would error if fit isn't called, which is technically a backward-incompatible change.

So I'm wondering if we couldn't just default random_state=0 to be backward compatible without a warning, and possibly do this checking in transform.

Not asking to do these changes now, let's hear some more opinions about it.

(BTW, tests are failing because we fail on uncaught FutureWarnings and this PR produces some in tests).

@TwsThomas
Contributor Author

Thank you for the feedback.

So I'm wondering if we couldn't just default random_state=0 to be backward compatible without a warning, and possibly do this checking in transform.

Yes, it might be counter-productive to enforce a fit in HashingVectorizer only for a temporary warning.

Member

@NicolasHug NicolasHug left a comment


A first pass.

@TwsThomas for catching warnings while testing you can check out https://scikit-learn.org/stable/developers/tips.html#useful-pytest-aliases-and-flags

Basically I agree with @rth's suggestion.

We can't enforce using fit since we explicitly just call transform in the docs. Validating self.random_state in transform seems to be a good solution.

I'm also OK with setting the default to 0 to preserve backward compatibility. It's inconsistent with other estimators (None is arguably a not-so-good default anyway, cf #14042), but if the default is 'warn' then every single user would get a warning.


How useful is it for users to set the random state though? If the ultimate goal is only to implement minhash techniques, then maybe we could make the random_state part of a private API, or have a private child class that implements it?
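
Purely as an illustration of the private-child-class idea (names are hypothetical, not actual scikit-learn API):

from sklearn.feature_extraction.text import HashingVectorizer

class _SeededHashingVectorizer(HashingVectorizer):
    # Hypothetical private subclass: adds a seed on top of the public
    # HashingVectorizer; forwarding it to the murmur hash would rely on
    # the internals touched by this PR.
    def __init__(self, seed=0, n_features=2**20):
        super().__init__(n_features=n_features)
        self.seed = seed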

This method doesn't do anything. It exists purely for compatibility
with the scikit-learn transformer API.
This method doesn't do anything except validate self.random_state.
And it compute the seed (int) from the random_state
Member


No need to go into the details of the random state IMO.

@@ -123,6 +131,20 @@ def fit(self, X=None, y=None):
"""
# repeat input validation for grid search (which calls set_params)
self._validate_params(self.n_features, self.input_type)

# set self.random_state
Member


You shouldn't set parameters that were passed to __init__. Use e.g. self._random_state instead.
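
I.e. the usual pattern, sketched on a made-up minimal class rather than the actual PR code:

from sklearn.utils import check_random_state

class _Example:
    def __init__(self, random_state=0):
        # Constructor arguments are stored verbatim and never overwritten.
        self.random_state = random_state

    def fit(self, X=None, y=None):
        # Derived/validated values live on a separate attribute.
        self._random_state = check_random_state(self.random_state)
        return self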

self.random_state = check_random_state(self.random_state)

# extract the seed (int) from the RandomState instance
self.seed = self.random_state.get_state()[1][0]
Member


make it private?

@TwsThomas
Contributor Author

As suggested by @NicolasHug, the seed is now private.
The previous behaviour is unchanged.

@glemaitre glemaitre self-requested a review August 13, 2019 14:11
Member

@glemaitre glemaitre left a comment


Otherwise LGTM

@rth
Member

rth commented Sep 11, 2019

In the end I think I would be +1 to just expose random_state=0 as a public parameter. An undocumented private attribute is not ideal.

How useful is it for users to set the random state though?

Well, aside from minhash, it's useful to know that the HashingVectorizer output does depend on an arbitrarily fixed random_state. For instance, checking that the classifier output is robust to this random state is something that could be worthwhile. Since it changes the words for which collisions happen, the effect can be non-negligible, particularly for low n_features.
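
Concretely, the kind of check I mean could look like this (a sketch only; it assumes the random_state parameter proposed in this PR, and the dataset/classifier are arbitrary):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])

for seed in (0, 1, 2):
    # random_state here is the parameter proposed in this PR; a small
    # n_features makes hash collisions, and thus the seed, matter more.
    X = HashingVectorizer(n_features=2**10, random_state=seed).transform(data.data)
    scores = cross_val_score(SGDClassifier(random_state=0), X, data.target, cv=3)
    print(seed, scores.mean())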

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 11, 2019 via email

@NicolasHug
Member

For what it's worth, we don't need a private attribute, we could just introduce a new private class.

If we expose the random state, we need to make sure that None is forbidden, and that a RandomState instance is forbidden too. Otherwise, calling transform twice on the same data gives different embeddings.

For this reason, I'd rather introduce a seed parameter, rather than random_state. No need to worry about API inconsistency anymore.

@rth
Member

rth commented Sep 11, 2019

If we expose the random state, we need to make sure that None is forbidden, and RandomStateInstance is forbidden too. Else, calling transform twice on the same data gives different embeddings.

Why? Can't we just clone it in transform and use it as a local variable?

@rth
Member

rth commented Sep 11, 2019

Maybe something like the following in transform

if isinstance(self.random_state, int):
    seed = self.random_state
else:
    random_state = check_random_state(self.random_state)

    # Create a seed from the internal state of the RandomState
    # object
    seed = random_state.get_state()[1].sum()

? random_state=0 would be backward compatible but then indeed it would not be equivalent to random_state=RandomState(0). Not sure we make such promises anywhere though.

@NicolasHug
Member

As far as I understand, it does not make any sense for the FeatureHasher to give different outputs for the same X?

@rth
Member

rth commented Sep 11, 2019

As far as I understand, it does not make any sense for the FeatureHasher to give different outputs for the same X?

So with the above code #14605 (comment) I think providing RandomState(int) should consistently return the same output.

random_state=None is indeed an issue. It's not possible to return consistent output without storing the random_state in .fit. OK, maybe disallow random_state=None then?

I'd rather introduce a seed parameter, rather than random_state.

We can still call it random_state even if the None option is not supported, can't we?

@NicolasHug
Member

We can still call it random_state even if the None option is not supported, can't we?

That + your snippet above seems like a lot of work only for the benefit of calling the parameter random_state, which would have an inconsistent behaviour w.r.t. all the other random_state parameters of scikit-learn.

What we need is a random seed, why not just call it seed? It's simple, and it does what users will expect.

@rth
Member

rth commented Sep 11, 2019

A simplified version that should be more consistent (about random_state=int vs random_state=RandomState(int) cf SO) while preserving backward compatibility for random_state=0,

def get_seed(random_state):
    """Get a seed from the internal state of the RandomState"""
    return int(random_state.get_state()[1][0])

if self.random_state is None:
    raise ValueError('random_state=None is not supported.')
elif isinstance(self.random_state, int):
    seed = self.random_state
else:
    random_state = check_random_state(self.random_state)
    seed = get_seed(random_state)

What we need is a random seed, why not just call it seed? It's simple, and it does what users will expect.

Well, because in >80% of cases users pass an int as the random_state, and in that use case it behaves exactly like a seed, so there is no reason to call it differently.

I agree some work is needed to extract the seed from the RandomState, but I would argue that the additional few lines are a negligible cost compared to keeping the parameter naming consistent.
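
For reference, the reason get_state()[1][0] works: with NumPy's legacy integer seeding, the first word of the Mersenne Twister key is the seed itself (my understanding of the legacy seeding path, worth double-checking):

import numpy as np

rs = np.random.RandomState(42)
# get_state() -> ('MT19937', key, pos, has_gauss, cached_gaussian);
# with legacy integer seeding, key[0] equals the seed that was passed.
print(int(rs.get_state()[1][0]))  # 42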

@rth
Member

rth commented Sep 11, 2019

Although, initializing the RandomState seems to allocate a numpy array of size 624, which is non-negligible, so an optimized path for random_state=int is still probably worth it. Updated the above comment.
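
(For reference, the allocation in question is the Mersenne Twister key array:)

import numpy as np
# The MT19937 key is 624 uint32 words, allocated on every RandomState init.
print(np.random.RandomState(0).get_state()[1].shape)  # (624,)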

@NicolasHug
Member

I'm sorry but I really don't understand why you want to push for random_state here.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 11, 2019 via email

@rth
Member

rth commented Sep 11, 2019

I'm sorry but I really don't understand why you want to push for random_state here.

Well, seed is not a great name, because it doesn't indicate anything about how it impacts the output. You could call it random_seed, but that's confusing with the existing random_state variable. Then if we call it random_state, it kind of implies that RandomState instances should be supported. But maybe you are right about not supporting that use case, I don't have a strong opinion about it. My point is about keeping a consistent name.

An important point though is that we have a number of common tests that check how an estimator should behave when random_state is provided. If we call it something different, it means that we need to implement those checks manually. For instance I wonder if the common tests would pass with the above implementation.

Indeed, as mentioned by Nicolas, we want entropy but not randomness. Two calls on the same input should return the same output.

As far as I understood, the mutability of self.random_state is considered a bug and #14042 aims to fix it? In which case repeated calls would always return the same result in both cases, or am I missing something?

Besides, aren't there some analogies between FeatureHasher and random projections, in terms of reducing the dimensionality of the input? The latter does use the random_state parameter.
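
E.g., random projections already fix their projection matrix through random_state (a small self-contained illustration):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.RandomState(0).rand(20, 100)
# Same random_state -> same projection matrix -> identical output.
X_a = GaussianRandomProjection(n_components=5, random_state=0).fit_transform(X)
X_b = GaussianRandomProjection(n_components=5, random_state=0).fit_transform(X)
assert np.allclose(X_a, X_b)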

@NicolasHug
Member

Well seed is not a great name, because it doesn't indicate anything about how it impacts the output.

I 100% agree on this, and it is also true for the great majority of the random_state parameters: they are poorly documented. It's very easy to fix this problem, we just need a good docstring for seed.

As you noted, #14042 points out that passing a RandomState instance has usually undesirable effects, just like passing None. I see that as an argument in favor of using seed.

Member

@jeremiedbb jeremiedbb left a comment


1 comment. Otherwise looks good

Member

@glemaitre glemaitre left a comment


You will need to add an entry in doc/whats_new/v0.22.rst

@@ -17,7 +17,8 @@ from ..utils.fixes import sp_version
np.import_array()


def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign=1):
def transform(raw_X, Py_ssize_t n_features, dtype,
bint alternate_sign=1, unsigned int seed=0):
Member


Suggested change
bint alternate_sign=1, unsigned int seed=0):
              bint alternate_sign=1, unsigned int seed=0):

@@ -123,6 +128,11 @@ def fit(self, X=None, y=None):
"""
# repeat input validation for grid search (which calls set_params)
self._validate_params(self.n_features, self.input_type)

# optional if random_state is left to an integer seed
Member


optional is weird since there is nothing preventing the check from running even when this is an integer. Shall we just remove the comment, then?

@@ -148,9 +158,19 @@ def transform(self, raw_X):
raw_X = (_iteritems(d) for d in raw_X)
elif self.input_type == "string":
raw_X = (((f, 1) for f in x) for x in raw_X)
# expose the seed for HashingVectorizer
if isinstance(self.random_state, int):
Member


I don't think that it will work with np.int32 or another integer type.

from numbers import Integral
Suggested change
if isinstance(self.random_state, int):
if isinstance(self.random_state, Integral):

@@ -148,9 +158,19 @@ def transform(self, raw_X):
raw_X = (_iteritems(d) for d in raw_X)
elif self.input_type == "string":
raw_X = (((f, 1) for f in x) for x in raw_X)
# expose the seed for HashingVectorizer
if isinstance(self.random_state, int):
seed = self.random_state # stateless estimator
Member


to force it to a 64-bit integer, for sure

Suggested change
seed = self.random_state # stateless estimator
seed = int(self.random_state) # stateless estimator

seed = self.seed_
else:
raise NotFittedError("FeaturHasher needs to be fitted"
"when random_state is not a fixed integer"
Member


Suggested change
"when random_state is not a fixed integer"
"when random_state is not a fixed integer."

X = getattr(hv, method)(text)
assert_array_equal(X.indices, [639749, 784967, 784967, 784967, 945982])

# assert seed influence
Member


remove comment

Member


Instead of the comment, you can test the following, which is what you are trying to explain:

hashing_vectorizer = HashingVectorizer()
X_default = getattr(hashing_vectorizer, method)(text)
hashing_vectorizer = HashingVectorizer(random_state=0)
X_seed_0 = getattr(hashing_vectorizer, method)(text)
hashing_vectorizer = HashingVectorizer(random_state=1)
X_seed_1 = getattr(hashing_vectorizer, method)(text)

assert_allclose_dense_sparse(X_default, X_seed_0)
with pytest.raises(AssertionError, match='Not equal'):
    assert_allclose_dense_sparse(X_default, X_seed_1)
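
A self-contained version of that suggestion could look roughly like this (it assumes the random_state parameter proposed in this PR; the assert_allclose_dense_sparse import path differs across scikit-learn versions):

import pytest
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.utils._testing import assert_allclose_dense_sparse

@pytest.mark.parametrize('method', ['transform', 'fit_transform'])
def test_hashing_vectorizer_random_state(method):
    text = ['hello world', 'hello hello', 'hello goodbye']
    X_default = getattr(HashingVectorizer(), method)(text)
    X_seed_0 = getattr(HashingVectorizer(random_state=0), method)(text)
    X_seed_1 = getattr(HashingVectorizer(random_state=1), method)(text)

    # The default stays backward compatible with random_state=0 ...
    assert_allclose_dense_sparse(X_default, X_seed_0)
    # ... while a different seed changes the hashed output.
    with pytest.raises(AssertionError):
        assert_allclose_dense_sparse(X_default, X_seed_1)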

Member


I would probably use the same test for FeatureHasher. It avoids hard-coding the output.

def test_hashingvectorizer_random_state_not_int(random_state):
text = ['hello world', 'hello hello', 'hello goodbye']
# assert random_seed=None should be fitted
hv = HashingVectorizer(random_state=random_state)
Member


check my comment for the FeatureHasher

@@ -715,6 +720,10 @@ def fit(self, X, y=None):
self._warn_for_unused_params()
self._validate_params()

# optional if random_state is left to an integer seed
Member


no need

@@ -741,6 +750,14 @@ def transform(self, X):
self._validate_params()

analyzer = self.build_analyzer()

if isinstance(self.random_state, int):
Member


Suggested change
if isinstance(self.random_state, int):
if isinstance(self.random_state, Integral):

@@ -741,6 +750,14 @@ def transform(self, X):
self._validate_params()

analyzer = self.build_analyzer()

if isinstance(self.random_state, int):
self.seed_ = self.random_state # stateless estimator
Member


Suggested change
self.seed_ = self.random_state # stateless estimator
self.seed_ = int(self.random_state) # stateless estimator

@NicolasHug
Member

Sorry but I really think the logic we are implementing here is overkill, confusing and inconsistent.

  • random_state doesn't behave like the other random_state parameters
  • It's not clear at all what a fitted FeatureHasher is. Users will not understand unless we very carefully explain all this (complex) random_state logic. We're adding a lot of cognitive overload for users and for us in the code. I don't think it's worth it.

I don't even see the point of tuning the seed of a FeatureHasher. This is encouraging users to treat the random seed as a hyperparameter.

Again, can we just make all this logic private? At least until we figure out a general solution regarding #14042.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 24, 2019 via email

@rth
Member

rth commented Sep 24, 2019

Fine if it's a blocker we can make it private again for now.

I don't even see the point of tuning the seed of a FeatureHasher. This is encouraging users to treat the random seed as a hyper parameter.

Aside from the minhash use case, I have explained the motivation in #14605 (comment). Do you disagree with that use case, or find it not compelling enough?

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 24, 2019 via email

@glemaitre
Member

glemaitre commented Sep 24, 2019 via email

Member

@glemaitre glemaitre left a comment


LGTM.

Member

@rth rth left a comment


LGTM, thanks for your patience @TwsThomas !

@NicolasHug good to merge?

@NicolasHug NicolasHug merged commit 9b045d4 into scikit-learn:master Oct 7, 2019
@NicolasHug
Member

Looks good, thanks @TwsThomas

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 7, 2019 via email

@glemaitre
Member

glemaitre commented Oct 7, 2019 via email

rth added a commit that referenced this pull request Oct 10, 2019
@@ -4,6 +4,7 @@
import pytest

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction._hashing import transform as _hashing_transform
Member


This actually broke PyPy CI, because in PyPy the _hashing submodule is not built/supported and so we can't import it at top level. Should be fixed in 4ef679b on master.
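
For reference, the usual pattern to avoid this is a lazy import that skips the test when the extension is unavailable, e.g. (a generic sketch, not necessarily what 4ef679b does):

import pytest

def test_hashing_transform_seed():
    # Imported lazily so test collection doesn't fail on builds (e.g. PyPy)
    # where the compiled _hashing extension isn't available; skip instead.
    _hashing = pytest.importorskip('sklearn.feature_extraction._hashing')
    assert hasattr(_hashing, 'transform')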
