[MRG] Expose random seed in HashingVectorizer #14605
Conversation
Thanks! I understand the motivation, and this PR does enforce scikit-learn API conventions. My concern is that previously HashingVectorizer was stateless: we repeatedly say so in the documentation and, for better or worse, illustrate its use as HashingVectorizer().transform(X) in the documentation (i.e. without calling .fit). Now this PR would error if fit isn't called, which is technically a backward-incompatible change.
So I'm wondering if we couldn't just default random_state=0 to be backward compatible without a warning, and possibly do this checking in transform.
Not asking to do these changes now, let's hear some more opinions about it.
(BTW, tests are failing because we fail on uncaught FutureWarnings and this PR produces some in tests.)
Thank you for the feedback.
Yes, it might be counter-productive to enforce a fit in HashingVectorizer only for a temporary warning.
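For reference, this is the stateless usage pattern being discussed (a minimal sketch, not the exact example from the docs):

from sklearn.feature_extraction.text import HashingVectorizer

texts = ['hello world', 'hello hello', 'hello goodbye']
# stateless: transform is called directly, without a prior call to fit
X = HashingVectorizer(n_features=2 ** 10).transform(texts)
print(X.shape)  # (3, 1024)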
A first pass.
@TwsThomas for catching warnings while testing you can check out https://scikit-learn.org/stable/developers/tips.html#useful-pytest-aliases-and-flags
Basically I agree with @rth's suggestion. We can't enforce using fit since we explicitly just call transform in the docs. Validating self.random_state in transform seems to be a good solution.
I'm also OK with setting the default to 0 to preserve backward compatibility. It's inconsistent with other estimators (None is arguably a not-so-good default anyway, cf #14042), but if the default is 'warn' then every single user would get a warning.
How useful is it for users to set the random state though? If the ultimate goal is only to implement minhash techniques, then maybe we could make the random_state part of a private API, or have a private child class that implements it?
This method doesn't do anything. It exists purely for compatibility
with the scikit-learn transformer API.
This method doesn't do anything except validate self.random_state
and compute the seed (int) from the random_state.
No need to go into the details of the random state IMO.
@@ -123,6 +131,20 @@ def fit(self, X=None, y=None):
    """
    # repeat input validation for grid search (which calls set_params)
    self._validate_params(self.n_features, self.input_type)

    # set self.random_state
You shouldn't set parameters that were passed to __init__. Use e.g. self._random_state instead.
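A minimal sketch of the convention being pointed out here (class and attribute names are illustrative, not the PR's code):

from sklearn.utils import check_random_state

class ExampleEstimator:
    def __init__(self, random_state=None):
        # __init__ only stores parameters unchanged, so that
        # get_params()/set_params() round-trip correctly
        self.random_state = random_state

    def fit(self, X=None, y=None):
        # derived values go to a separate attribute instead of
        # overwriting the public parameter
        self._random_state = check_random_state(self.random_state)
        return self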
    self.random_state = check_random_state(self.random_state)

    # extract the seed (int) from the RandomState instance
    self.seed = self.random_state.get_state()[1][0]
make it private?
According to @NicolasHug, the seed is now private.
Otherwise LGTM
In the end I think I would be +1 to just expose random_state=0 as a public parameter.
Well, aside from minhash, it's useful to know that
In the end I think I would be +1 to just expose random_state=0 as a public parameter.
I agree. Good thinking, @rth!
For what it's worth, we don't need a private attribute, we could just introduce a new private class. If we expose the random state, we need to make sure that None is forbidden, and a RandomState instance is forbidden too. Else, calling transform twice on the same data gives different embeddings. For this reason, I'd rather introduce a seed parameter, rather than random_state.
Why? Can't we just clone it in transform and use it as a local variable?
Maybe something like the following in transform?

if isinstance(self.random_state, int):
    seed = self.random_state
else:
    random_state = check_random_state(self.random_state)
    # Create a seed from the internal state of the RandomState
    # object
    seed = random_state.get_state()[1].sum()
As far as I understand, it does not make any sense for the FeatureHasher to give different outputs for the same X?
So with the above code #14605 (comment) I think providing a RandomState instance can still be supported. We can still call it random_state even if the None option is not supported, can't we?
That + your snippet above seems like a lot of work only for the benefit of calling the parameter random_state. What we need is a random seed, why not just call it seed?
A simplified version that should be more consistent (about how the seed is extracted from the RandomState):

def get_seed(random_state):
    """Get a seed from the internal state of the RandomState"""
    return int(random_state.get_state()[1][0])

if self.random_state is None:
    raise ValueError('random_state=None is not supported.')
elif isinstance(self.random_state, int):
    seed = self.random_state
else:
    random_state = check_random_state(self.random_state)
    seed = get_seed(random_state)
Well, because in >80% of cases users pass an int as the random_state. I agree some work is needed to extract the seed from the RandomState.
Although, initializing the RandomState seems to allocate a numpy array of size 624, which is non-negligible, so an optimized path for int seeds makes sense.
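To make the RandomState details concrete, here is a small self-contained check (assuming NumPy's default Mersenne Twister generator):

import numpy as np
from sklearn.utils import check_random_state

rs = check_random_state(42)
name, keys, pos, has_gauss, cached = rs.get_state()
print(name)           # 'MT19937'
print(len(keys))      # 624: the internal key array allocated per instance
print(int(keys[0]))   # deterministic for a given integer seed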
I'm sorry but I really don't understand why you want to push for random_state.
I'd rather introduce a seed parameter, rather than random_state.
We can still call it random_state even if the None option is not supported, can't we?
Now that I think of it, I think that the semantics of the parameter is
that of a "seed", and not a RandomState. Indeed, as mentioned by Nicolas,
we want entropy but not randomness. Two calls on the same input should
return the same output.
Hence I am in favor of renaming it to "seed" (or "salt").
WDYT?
Well, an important point though is that we have a number of common tests that check how an estimator should behave when random_state is set.
As far as I understood, the mutability of the RandomState is the main concern. Besides, aren't there some analogies between FeatureHasher and random projections, in terms of reducing the dimensionality of the input? The latter does use the random_state parameter.
I 100% agree on this, and this is also true for the great majority of the random_state parameters. They are poorly documented. It's very easy to fix this problem, we just need a good docstring for random_state. As you noted, #14042 points out that passing a RandomState instance usually has undesirable effects, just like passing None. I see that as an argument in favor of using seed.
1 comment. Otherwise looks good
You will need to add an entry in doc/whats_new/v0.22.rst
@@ -17,7 +17,8 @@ from ..utils.fixes import sp_version
np.import_array()


def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign=1):
def transform(raw_X, Py_ssize_t n_features, dtype,
              bint alternate_sign=1, unsigned int seed=0):
@@ -123,6 +128,11 @@ def fit(self, X=None, y=None):
    """
    # repeat input validation for grid search (which calls set_params)
    self._validate_params(self.n_features, self.input_type)

    # optional if random_state is left to an integer seed
optional is weird since there is nothing preventing the check even if this is an integer.
Shall we just remove the comment then?
@@ -148,9 +158,19 @@ def transform(self, raw_X):
    raw_X = (_iteritems(d) for d in raw_X)
elif self.input_type == "string":
    raw_X = (((f, 1) for f in x) for x in raw_X)
# expose the seed for HashingVectorizer
if isinstance(self.random_state, int):
I don't think that it will work with np.int32 or another integer type.

from numbers import Integral

Suggested change:
if isinstance(self.random_state, int):
if isinstance(self.random_state, Integral):
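A quick illustration of the point (NumPy scalar integers are not Python int, but they do register with numbers.Integral):

from numbers import Integral
import numpy as np

x = np.int32(3)
print(isinstance(x, int))       # False
print(isinstance(x, Integral))  # True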
@@ -148,9 +158,19 @@ def transform(self, raw_X):
    raw_X = (_iteritems(d) for d in raw_X)
elif self.input_type == "string":
    raw_X = (((f, 1) for f in x) for x in raw_X)
# expose the seed for HashingVectorizer
if isinstance(self.random_state, int):
    seed = self.random_state  # stateless estimator
to force it to a 64-bit integer, to be sure

Suggested change:
seed = self.random_state  # stateless estimator
seed = int(self.random_state)  # stateless estimator
    seed = self.seed_
else:
    raise NotFittedError("FeatureHasher needs to be fitted "
                         "when random_state is not a fixed integer"
"when random_state is not a fixed integer" | |
"when random_state is not a fixed integer." |
X = getattr(hv, method)(text)
assert_array_equal(X.indices, [639749, 784967, 784967, 784967, 945982])

# assert seed influence
remove comment
Instead of the comment, you can test the following, which is what you are trying to explain:
hashing_vectorizer = HashingVectorizer()
X_default = getattr(hashing_vectorizer, method)(text)
hashing_vectorizer = HashingVectorizer(random_state=0)
X_seed_0 = getattr(hashing_vectorizer, method)(text)
hashing_vectorizer = HashingVectorizer(random_state=1)
X_seed_1 = getattr(hashing_vectorizer, method)(text)
assert_allclose_dense_sparse(X_default, X_seed_0)
with pytest.raises(AssertionError, match='Not equal'):
    assert_allclose_dense_sparse(X_default, X_seed_1)
I would probably use the same test for FeatureHasher. It avoids hard-coding the output.
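An analogous test for FeatureHasher could look like the sketch below (hypothetical: it assumes FeatureHasher gains the same random_state parameter with a default equivalent to 0, and the assert_allclose_dense_sparse import path may differ between versions):

import pytest
from sklearn.feature_extraction import FeatureHasher
from sklearn.utils._testing import assert_allclose_dense_sparse


def test_feature_hasher_random_state():
    raw_X = [{'foo': 1, 'bar': 2}, {'baz': 3}]
    X_default = FeatureHasher().transform(raw_X)
    X_seed_0 = FeatureHasher(random_state=0).transform(raw_X)
    X_seed_1 = FeatureHasher(random_state=1).transform(raw_X)
    # the default should match an explicit seed of 0...
    assert_allclose_dense_sparse(X_default, X_seed_0)
    # ...while a different seed should change the output
    with pytest.raises(AssertionError, match='Not equal'):
        assert_allclose_dense_sparse(X_default, X_seed_1)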
def test_hashingvectorizer_random_state_not_int(random_state):
    text = ['hello world', 'hello hello', 'hello goodbye']
    # assert random_seed=None should be fitted
    hv = HashingVectorizer(random_state=random_state)
check my comment for the FeatureHasher
sklearn/feature_extraction/text.py
@@ -715,6 +720,10 @@ def fit(self, X, y=None):
    self._warn_for_unused_params()
    self._validate_params()

    # optional if random_state is left to an integer seed
no need
sklearn/feature_extraction/text.py
@@ -741,6 +750,14 @@ def transform(self, X):
    self._validate_params()

    analyzer = self.build_analyzer()

    if isinstance(self.random_state, int):
Suggested change:
if isinstance(self.random_state, int):
if isinstance(self.random_state, Integral):
sklearn/feature_extraction/text.py
@@ -741,6 +750,14 @@ def transform(self, X):
    self._validate_params()

    analyzer = self.build_analyzer()

    if isinstance(self.random_state, int):
        self.seed_ = self.random_state  # stateless estimator
Suggested change:
self.seed_ = self.random_state  # stateless estimator
self.seed_ = int(self.random_state)  # stateless estimator
Sorry, but I really think the logic we are implementing here is overkill, confusing and inconsistent.
I don't even see the point of tuning the seed of a FeatureHasher. This is encouraging users to treat the random seed as a hyperparameter. Again, can we just make all this logic private? At least until we figure out a general solution regarding #14042.
Again, can we just make all this logic private? At least until we figure out a general solution regarding #14042.
I'm fine with that. For purposes of implementing min-hash (which is our main driver here), I believe that only the modifications to sklearn/feature_extraction/_hashing.pyx are needed, and not the modifications to HashingVectorizer.
Fine, if it's a blocker we can make it private again for now.
Aside from the minhash usecase, I have explained the motivation in #14605 (comment). Do you disagree with that use case or find it not compelling enough?
Aside from the minhash usecase, I have explained the motivation in #14605 (comment) Do you disagree with that use case or find it not compelling enough?
I completely agree with you. However, I am trying to move forward and avoid controversies.
Maybe I'm impatient. I just do not like to have too many balls in the air, and we have a lot :).
I just do not like to have too many balls in the air, and we have a lot :).
Let's shoot at them then :). I'll review once the changes are done.
LGTM.
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
LGTM, thanks for your patience @TwsThomas !
@NicolasHug good to merge?
Looks good, thanks @TwsThomas
Bravo @TwsThomas !
Great @TwsThomas
@@ -4,6 +4,7 @@
import pytest

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction._hashing import transform as _hashing_transform
This actually broke PyPy CI, because in PyPy the _hashing submodule is not built/supported and so we can't import it at top level. Should be fixed in 4ef679b on master.
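One common way to keep such a top-level import from breaking test collection on platforms without the compiled module is pytest.importorskip; a sketch (not necessarily what 4ef679b does):

import pytest


def test_hashing_transform_is_available():
    # skip (rather than fail) where the compiled submodule is not built,
    # e.g. on PyPy
    _hashing = pytest.importorskip("sklearn.feature_extraction._hashing")
    assert hasattr(_hashing, "transform")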
Reference Issues/PRs
None
What does this implement/fix? Explain your changes.
Expose the seed from murmur_hash in HashingVectorizer to allow minhash techniques (https://en.wikipedia.org/wiki/MinHash).
Any other comments?
Ping @glemaitre, @GaelVaroquaux
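To illustrate the MinHash motivation, here is a rough sketch that uses sklearn.utils.murmurhash3_32 with different seeds directly (not this PR's API): the fraction of matching per-seed minima approximates the Jaccard similarity of the two token sets.

from sklearn.utils import murmurhash3_32


def minhash_signature(tokens, n_hashes=8):
    # one minimum per seed, i.e. one per hash function
    return [min(murmurhash3_32(t, seed=s, positive=True) for t in set(tokens))
            for s in range(n_hashes)]


a = minhash_signature("the cat sat on the mat".split())
b = minhash_signature("the cat sat on a mat".split())
print(sum(x == y for x, y in zip(a, b)) / len(a))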