[MRG+1] idf_ setter for TfidfTransformer. #10899
Conversation
Thanks for this PR @serega! A few comments are below.
copy = TfidfTransformer()
copy.idf_ = orig.idf_
assert_array_equal(
    copy.fit_transform(X).toarray(),
A fit_transform will overwrite the idf_.
Right. I changed it to transform.
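For context, a minimal sketch of the adjusted check (not the exact test from the PR; the toy corpus and names below are made up), assuming a scikit-learn version that includes this setter: idf_ is copied from a fitted transformer into a fresh one, and transform is used so the copied idf_ is not overwritten by a refit.

from numpy.testing import assert_array_almost_equal
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus turned into a count matrix
docs = ["the cat sat", "the dog sat", "the cat ran"]
X = CountVectorizer().fit_transform(docs)

orig = TfidfTransformer().fit(X)
copy = TfidfTransformer()
copy.idf_ = orig.idf_  # uses the setter added in this PR

# transform (not fit_transform), so the copied idf vector is actually used
assert_array_almost_equal(orig.transform(X).toarray(),
                          copy.transform(X).toarray())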
@idf_.setter
def idf_(self, value):
    value = np.asarray(value, dtype=np.float64)
    n_features = value.shape[0]
n_features is the vocabulary size; we should check here that the provided shape is consistent with it.
TfidfTransformer does not have the vocabulary; TfidfVectorizer does. I added validation to TfidfVectorizer.
Yes, right, thanks.
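To illustrate the behaviour being discussed, a small demonstration (assuming a scikit-learn release that includes this PR; the corpus below is made up): assigning an idf_ vector whose length does not match the vocabulary size raises a ValueError.

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(use_idf=True)
vect.fit(["the cat sat", "the dog sat on the mat"])

# A fresh vectorizer that shares the fitted vocabulary
copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)

try:
    copy.idf_ = [1.0] * (len(vect.idf_) + 1)  # one element too many
except ValueError as exc:
    print(exc)  # the length check added to TfidfVectorizer fires here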
sklearn/feature_extraction/text.py
Outdated
----------
idf_ : array, shape = [n_features], or None
    The learned idf vector (global term weights)
    when ``use_idf`` is set to True, None otherwise.
idf_ is not defined when use_idf=False.
sklearn/feature_extraction/text.py
Outdated
self._validate_vocabulary()
if hasattr(self, 'vocabulary_'):
    if len(self.vocabulary_) != len(value):
        raise ValueError("idf length = %d must be equal to vocabulary size = %d" % (len(value), len(self.vocabulary)))
flake8 fails here because the line is longer than 80 chars
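One possible way to bring the line under the limit (a sketch only, not necessarily the exact wording that was merged; _check_idf_length is a hypothetical helper used just to keep the fragment self-contained) is to split the message across adjacent string literals:

def _check_idf_length(value, vocabulary_):
    # Hypothetical helper; the point is the wrapped error message.
    if len(vocabulary_) != len(value):
        raise ValueError("idf length = %d must be equal "
                         "to vocabulary size = %d"
                         % (len(value), len(vocabulary_)))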
sklearn/feature_extraction/text.py
Outdated
Attributes
----------
idf_ : array, shape = [n_features], or None
    The learned idf vector (global term weights)
I'm not sure about 'global term weights'; maybe just:

    The inverse document frequency (IDF) vector; only defined
    if ``use_idf`` is True.
LGTM apart from #10899 (comment)
sklearn/feature_extraction/text.py
Outdated
@@ -1062,6 +1062,12 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    idf_ : array, shape = [n_features], or None
Also here:

    idf_ : array, shape [n_features]
Right. In fact, this was copied and pasted from the TfidfVectorizer documentation, which is incorrect today. I changed it in both places to "not defined".
Otherwise LGTM
copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)
expected_idf_len = len(vect.idf_)
invalid_idf = [1.0] * (expected_idf_len + 1)
assert_raises(ValueError, copy.__setattr__, 'idf_', invalid_idf)
Please use setattr(copy, ...) instead of copy.__setattr__.
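Applied to the test quoted above, the suggestion amounts to something like the following sketch, with the built-in setattr passed as the callable to assert_raises:

copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)
expected_idf_len = len(vect.idf_)
invalid_idf = [1.0] * (expected_idf_len + 1)
assert_raises(ValueError, setattr, copy, 'idf_', invalid_idf)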
I think there are comments from @rth that may still need addressing
sklearn/feature_extraction/text.py
Outdated
@@ -1295,9 +1308,9 @@ class TfidfVectorizer(CountVectorizer):
    vocabulary_ : dict
        A mapping of terms to feature indices.

-   idf_ : array, shape = [n_features], or None
+   idf_ : array, shape = [n_features], or not defined
"not defined" is not a python object, so better to just leave it out here. It's sufficiently documented in the docsting below.
Also,
- please remove "=" after "shape" and make it a tuple while we are at it (cf. e.g. https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/linear_model/logistic.py#L1113)
- [MRG+1] idf_ setter for TfidfTransformer. #10899 (comment) still needs addressing
The same applies to the TfidfTransformer docstring. Thanks.
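For reference, the attribute entry in the tuple-based numpydoc style pointed to above would look roughly like this (a sketch, not necessarily the exact text that ended up being merged):

idf_ : array, shape (n_features,)
    The inverse document frequency (IDF) vector; only defined
    if ``use_idf`` is True.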
It seems that throughout the sklearn code base None is used when a variable is not set to any value. Should idf_ be set to None when use_idf=False?
Note, I don't have much Python experience, and I am still learning it. Thanks for the comments.
None is frequently used in scikit-learn as a default value for parameters, but here this attribute will not return None; it will raise an AttributeError instead when use_idf=False.
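A short illustration of that behaviour (assuming a scikit-learn release in which idf_ is only populated when use_idf=True; the tiny count matrix is made up):

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

tr = TfidfTransformer(use_idf=False).fit(sp.csr_matrix([[1, 0], [0, 1]]))
try:
    tr.idf_
except AttributeError:
    print("idf_ is not defined when use_idf=False")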
Merging, thanks @serega!
Reference Issues/PRs
Fixes #7102.
What does this implement/fix? Explain your changes.
The existing property idf_, which returns the array of learned idf values, does not have a corresponding setter. Internally, the idf vector is stored as a diagonal matrix. The change transforms the setter value into that diagonal matrix using spdiags. The setter allows restoring the state of a fitted TfidfTransformer from an idf vector stored elsewhere.
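As a rough sketch of the mechanism described above (paraphrasing, not the exact merged code; set_idf is a hypothetical standalone helper mirroring the property setter), the 1-D idf vector is placed on the diagonal of a sparse matrix with scipy.sparse.spdiags and stored as the transformer's internal _idf_diag:

import numpy as np
import scipy.sparse as sp

def set_idf(transformer, value):
    # Sketch of the TfidfTransformer.idf_ setter added in this PR
    value = np.asarray(value, dtype=np.float64)
    n_features = value.shape[0]
    transformer._idf_diag = sp.spdiags(value, diags=0, m=n_features,
                                       n=n_features, format='csr')

In the PR-era implementation, transform then multiplies the tf matrix by this diagonal so that each column is scaled by its idf weight.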