[MRG+1] idf_ setter for TfidfTransformer. #10899


Merged: 8 commits, Apr 5, 2018

Conversation

@serega (Contributor) commented Apr 1, 2018

Reference Issues/PRs

Fixes #7102.

What does this implement/fix? Explain your changes.

The existing property idf_, which returns the array of learned idf values, does not have a corresponding setter. Internally, the idf vector is stored as a diagonal matrix.
This change transforms the value passed to the setter into a diagonal matrix using spdiags. The setter makes it possible to restore the state of a fitted TfidfTransformer from an idf vector stored elsewhere.
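The mechanism the description refers to can be sketched as follows (a minimal standalone sketch, not the actual scikit-learn source; the class name TfidfTransformerSketch is made up for illustration):

```python
import numpy as np
import scipy.sparse as sp


class TfidfTransformerSketch:
    """Minimal sketch of an ``idf_`` property/setter pair that wraps an
    internal diagonal sparse matrix, as the PR describes."""

    @property
    def idf_(self):
        # Recover the idf vector from the diagonal of the stored matrix.
        return self._idf_diag.diagonal()

    @idf_.setter
    def idf_(self, value):
        value = np.asarray(value, dtype=np.float64)
        n_features = value.shape[0]
        # Store the vector as an n_features x n_features diagonal matrix,
        # using scipy.sparse.spdiags as mentioned in the description.
        self._idf_diag = sp.spdiags(value, 0, n_features, n_features,
                                    format='csr')


t = TfidfTransformerSketch()
t.idf_ = [1.0, 2.0, 3.0]
print(t.idf_)  # -> [1. 2. 3.]
```

With this round trip, an idf vector serialized elsewhere can be assigned back to a fresh transformer without refitting.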

@rth (Member) left a comment

Thanks for this PR @serega !

A few comments are below.

copy = TfidfTransformer()
copy.idf_ = orig.idf_
assert_array_equal(
    copy.fit_transform(X).toarray(),
Member

A fit_transform will overwrite the idf_.

Contributor Author

Right. I changed it to transform.
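The distinction matters because fit_transform re-fits the estimator and recomputes idf_ from the input. A quick sketch of the behaviour the corrected test relies on (assuming a recent scikit-learn; the toy count matrix is illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0]])

orig = TfidfTransformer().fit(counts)

copy = TfidfTransformer()
copy.idf_ = orig.idf_  # restore state through the new setter

# transform() uses the idf values we just set; fit_transform() would
# re-fit on `counts` and overwrite them, defeating the point of the test.
np.testing.assert_array_almost_equal(copy.transform(counts).toarray(),
                                     orig.transform(counts).toarray())
```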

@idf_.setter
def idf_(self, value):
    value = np.asarray(value, dtype=np.float64)
    n_features = value.shape[0]
Member

n_features is the vocabulary size; we should check here that the provided shape is consistent with it.

Contributor Author

TfidfTransformer does not have the vocabulary; TfidfVectorizer does. I added the validation to TfidfVectorizer.

Member

Yes, right, thanks.

----------
idf_ : array, shape = [n_features], or None
The learned idf vector (global term weights)
when ``use_idf`` is set to True, None otherwise.
Member

idf_ is not defined when use_idf=False

self._validate_vocabulary()
if hasattr(self, 'vocabulary_'):
    if len(self.vocabulary_) != len(value):
        raise ValueError("idf length = %d must be equal to vocabulary size = %d" % (len(value), len(self.vocabulary_)))
Member

flake8 fails here because the line is longer than 80 chars

Attributes
----------
idf_ : array, shape = [n_features], or None
The learned idf vector (global term weights)
Member

I'm not sure about 'global term weights', maybe just

The inverse document frequency (IDF) vector; only defined
if  ``use_idf`` is True.

@rth (Member) commented Apr 3, 2018

LGTM apart from #10899 (comment)

@rth rth changed the title idf_ setter for TfidfTransformer. Fixes #7102 [MRG+1] idf_ setter for TfidfTransformer. Fixes #7102 Apr 3, 2018
@rth rth changed the title [MRG+1] idf_ setter for TfidfTransformer. Fixes #7102 [MRG+1] idf_ setter for TfidfTransformer. Apr 3, 2018
@@ -1062,6 +1062,12 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
sublinear_tf : boolean, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

Attributes
----------
idf_ : array, shape = [n_features], or None
Member

also here,

idf_ : array, shape [n_features]

Contributor Author

Right. In fact, this was copied and pasted from the TfidfVectorizer documentation, which is incorrect today. I changed it in both places to "not defined".

@jnothman (Member) left a comment

Otherwise LGTM

copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)
expected_idf_len = len(vect.idf_)
invalid_idf = [1.0] * (expected_idf_len + 1)
assert_raises(ValueError, copy.__setattr__, 'idf_', invalid_idf)
Member

Please use setattr(copy, 'idf_', invalid_idf) instead of copy.__setattr__.
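The suggested style with the builtin setattr can be exercised directly (a sketch assuming scikit-learn is installed; the documents and variable names are illustrative, not the PR's actual test data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vect = TfidfVectorizer(use_idf=True).fit(docs)

copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)
invalid_idf = [1.0] * (len(vect.idf_) + 1)  # one entry too many

# The builtin setattr reads more naturally than calling the dunder
# copy.__setattr__ directly, which is what the reviewer is pointing out.
try:
    setattr(copy, 'idf_', invalid_idf)
except ValueError as exc:
    print(exc)  # idf length must equal the vocabulary size
```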

@jnothman (Member) commented Apr 3, 2018

I think there are comments from @rth that may still need addressing

@@ -1295,9 +1308,9 @@ class TfidfVectorizer(CountVectorizer):
vocabulary_ : dict
A mapping of terms to feature indices.

-    idf_ : array, shape = [n_features], or None
+    idf_ : array, shape = [n_features], or not defined
Member

"not defined" is not a Python object, so it is better to just leave it out here. It's sufficiently documented in the docstring below.

Also, the same applies to the TfidfTransformer docstring. Thanks.

Contributor Author

It seems that throughout the scikit-learn code base, None is used when a variable is not set to any value. Should idf_ be set to None when use_idf=False?

Note, I don't have much Python experience, and I am still learning. Thanks for the comments.

Member

None is frequently used in scikit-learn as a default value for parameters, but here this attribute will not return None; it will raise an AttributeError instead when use_idf=False.
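This behaviour is easy to confirm (a small sketch assuming scikit-learn is installed; the toy count matrix is illustrative):

```python
from sklearn.feature_extraction.text import TfidfTransformer

# With use_idf=False, fit() never computes an idf vector, so accessing
# idf_ afterwards raises AttributeError rather than returning None.
tf = TfidfTransformer(use_idf=False).fit([[1, 2, 0]])
try:
    tf.idf_
except AttributeError:
    print("idf_ is not defined when use_idf=False")
```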

@rth (Member) commented Apr 5, 2018

Merging, thanks @serega !


Successfully merging this pull request may close these issues.

Setting idf_ is impossible
3 participants