[MRG+1] idf_ setter for TfidfTransformer. #10899
Conversation
Thanks for this PR @serega! A few comments are below.
copy = TfidfTransformer()
copy.idf_ = orig.idf_
assert_array_equal(
    copy.fit_transform(X).toarray(),
A fit_transform will overwrite the idf_.
Right. I changed it to transform.
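For context, a minimal sketch of the adjusted check (not the exact test from the PR; the toy corpus and names below are made up), assuming a scikit-learn version that includes this setter: idf_ is copied from a fitted transformer into a fresh one, and transform is used so the copied idf_ is not overwritten by a refit.

from numpy.testing import assert_array_almost_equal
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus turned into a count matrix
docs = ["the cat sat", "the dog sat", "the cat ran"]
X = CountVectorizer().fit_transform(docs)

orig = TfidfTransformer().fit(X)
copy = TfidfTransformer()
copy.idf_ = orig.idf_  # uses the setter added in this PR

# transform (not fit_transform), so the copied idf vector is actually used
assert_array_almost_equal(orig.transform(X).toarray(),
                          copy.transform(X).toarray())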
@idf_.setter
def idf_(self, value):
    value = np.asarray(value, dtype=np.float64)
    n_features = value.shape[0]
n_features is the vocabulary size; we should check here that the provided shape is consistent with it.
TfidfTransformer does not have the vocabulary; TfidfVectorizer does. I added validation to TfidfVectorizer.
Yes, right, thanks.
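To illustrate the behaviour being discussed, a small demonstration (assuming a scikit-learn release that includes this PR; the corpus below is made up): assigning an idf_ vector whose length does not match the vocabulary size raises a ValueError.

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(use_idf=True)
vect.fit(["the cat sat", "the dog sat on the mat"])

# A fresh vectorizer that shares the fitted vocabulary
copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)

try:
    copy.idf_ = [1.0] * (len(vect.idf_) + 1)  # one element too many
except ValueError as exc:
    print(exc)  # the length check added to TfidfVectorizer fires here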
sklearn/feature_extraction/text.py
Outdated
----------
idf_ : array, shape = [n_features], or None
    The learned idf vector (global term weights)
    when ``use_idf`` is set to True, None otherwise.
idf_ is not defined when use_idf=False.
sklearn/feature_extraction/text.py
Outdated
self._validate_vocabulary()
if hasattr(self, 'vocabulary_'):
    if len(self.vocabulary_) != len(value):
        raise ValueError("idf length = %d must be equal to vocabulary size = %d" % (len(value), len(self.vocabulary)))
flake8 fails here because the line is longer than 80 chars
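One possible way to bring the line under the limit (a sketch only, not necessarily the exact wording that was merged; _check_idf_length is a hypothetical helper used just to keep the fragment self-contained) is to split the message across adjacent string literals:

def _check_idf_length(value, vocabulary_):
    # Hypothetical helper; the point is the wrapped error message.
    if len(vocabulary_) != len(value):
        raise ValueError("idf length = %d must be equal "
                         "to vocabulary size = %d"
                         % (len(value), len(vocabulary_)))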
sklearn/feature_extraction/text.py
Outdated
Attributes
----------
idf_ : array, shape = [n_features], or None
    The learned idf vector (global term weights)
I'm not sure about 'global term weights'; maybe just:

    The inverse document frequency (IDF) vector; only defined
    if ``use_idf`` is True.
LGTM apart from #10899 (comment)
sklearn/feature_extraction/text.py
Outdated
@@ -1062,6 +1062,12 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    idf_ : array, shape = [n_features], or None
Also here:

    idf_ : array, shape [n_features]
Right. In fact, this was copied and pasted from the TfidfVectorizer documentation, which is incorrect today. I changed it in both places to "not defined".
Otherwise LGTM
copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)
expected_idf_len = len(vect.idf_)
invalid_idf = [1.0] * (expected_idf_len + 1)
assert_raises(ValueError, copy.__setattr__, 'idf_', invalid_idf)
Please use setattr(copy, ...) instead of copy.__setattr__.
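Applied to the test quoted above, the suggestion amounts to something like the following sketch, with the built-in setattr passed as the callable to assert_raises:

copy = TfidfVectorizer(vocabulary=vect.vocabulary_, use_idf=True)
expected_idf_len = len(vect.idf_)
invalid_idf = [1.0] * (expected_idf_len + 1)
assert_raises(ValueError, setattr, copy, 'idf_', invalid_idf)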
I think there are comments from @rth that may still need addressing
sklearn/feature_extraction/text.py
Outdated
@@ -1295,9 +1308,9 @@ class TfidfVectorizer(CountVectorizer):
    vocabulary_ : dict
        A mapping of terms to feature indices.

-   idf_ : array, shape = [n_features], or None
+   idf_ : array, shape = [n_features], or not defined
"not defined" is not a python object, so better to just leave it out here. It's sufficiently documented in the docsting below.
Also,
- please remove "=" after "shape" and make it a tuple while we are at it (cf. e.g. https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/linear_model/logistic.py#L1113)
- [MRG+1] idf_ setter for TfidfTransformer. #10899 (comment) still needs addressing
The same applies to the TfidfTransformer docstring. Thanks.
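For reference, the attribute entry in the tuple-based numpydoc style pointed to above would look roughly like this (a sketch, not necessarily the exact text that ended up being merged):

idf_ : array, shape (n_features,)
    The inverse document frequency (IDF) vector; only defined
    if ``use_idf`` is True.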
It seems that throughout the sklearn code base None is used when a variable is not set to any value. Should idf_ be set to None when use_idf=False?
Note, I don't have much Python experience, and I am still learning it. Thanks for the comments.
None is frequently used in scikit-learn as a default value for parameters, but here this attribute will not return None; it will raise an AttributeError instead when use_idf=False.
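A short illustration of that behaviour (assuming a scikit-learn release in which idf_ is only populated when use_idf=True; the tiny count matrix is made up):

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

tr = TfidfTransformer(use_idf=False).fit(sp.csr_matrix([[1, 0], [0, 1]]))
try:
    tr.idf_
except AttributeError:
    print("idf_ is not defined when use_idf=False")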
Merging, thanks @serega!
Reference Issues/PRs
Fixes #7102.
What does this implement/fix? Explain your changes.
The existing property idf_, which returns the array of learned idf values, does not have a corresponding setter. Internally, the idf vector is stored as a diagonal matrix. The change transforms the setter value into that diagonal matrix using spdiags. The setter allows restoring the state of a fitted TfidfTransformer from an idf vector stored elsewhere.
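As a rough sketch of the mechanism described above (paraphrasing, not the exact merged code; set_idf is a hypothetical standalone helper mirroring the property setter), the 1-D idf vector is placed on the diagonal of a sparse matrix with scipy.sparse.spdiags and stored as the transformer's internal _idf_diag:

import numpy as np
import scipy.sparse as sp

def set_idf(transformer, value):
    # Sketch of the TfidfTransformer.idf_ setter added in this PR
    value = np.asarray(value, dtype=np.float64)
    n_features = value.shape[0]
    transformer._idf_diag = sp.spdiags(value, diags=0, m=n_features,
                                       n=n_features, format='csr')

In the PR-era implementation, transform then multiplies the tf matrix by this diagonal so that each column is scaled by its idf weight.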