[MRG] improve tfidfvectorizer documentation #12822

reshamas · 2018-12-18T22:52:58Z

Referencing PR / Issue

This closes #12204
This also closes #6766 and closes #9369
This also closes #12811 (which is including the wrong file in my PR).

Note

This adds more information in the TfidfVectorizer documentation. It now includes comments about CountVectorizer and TfidfTransformer.

cc: @blooraspberry
#wimlds

…into doc_tdidf

adrinjalali · 2018-12-19T10:30:07Z

I think conventionally we put detailed information such as this one into the user guides. Having a lot of documentation here makes the source code pretty lengthy, but on the other side it makes nice class documentation pages, so, I'm personally torn whether we should put these here or in the guides.

I also know that our "User Guide" hyper link is kinda hidden there in the page and many people don't really notice it unless they know such a thing exists almost on all pages.

adrinjalali · 2018-12-19T10:30:54Z

sklearn/feature_extraction/text.py

+    relative to the appearance of a document.
+
+    The formula that is used to compute the tf-idf of term t is
+    tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as


I think it'll look much nicer if you put the formula in a math block, what do you think?

…into doc_tdidf

reshamas · 2018-12-20T02:31:39Z

I suspect there are formatting errors (though it did pass the flake8 test), but I am not sure how to determine where they are.

jnothman · 2018-12-20T10:38:33Z

I thought the point of the issues you are claiming to fix here is that you need to update the Parameters section, not the summary!

adrinjalali · 2018-12-20T10:46:47Z

@reshamas as for formatting, I found going through PEP8 for the code part, and numpydoc for the docstrings pretty helpful. Specially for the docstrings, we have some minor modifications on how we write them, but they're mostly numpydoc standards. It may even be useful to have these in the contributing guide if they're not there already.

reshamas · 2018-12-20T17:50:05Z

I thought the point of the issues you are claiming to fix here is that you need to update the Parameters section, not the summary!

@jnothman I am looking through the trail of old issues and PRs, and it's not clear to me what needs to be updated. I also don't entirely understand "parameters", "summary" and "docstring".
Is a "docstring" within a class, but "parameter" section is within a "function"? Or is the "summary" within a "class"??

And, it looks like the train has been moving on this track for a while now.

@adrinjalali thanks for sharing the reference. I will read it and sort it out.

jnothman · 2018-12-23T11:34:51Z

Docstring can be in a class or a function. Parameters are listed under the "Parameters" heading. TfidfVectorizer combines the parameters of CountVectorizer and TfidfTransformer, so should have similar descriptions to them.

…into doc_tdidf

adrinjalali · 2018-12-25T14:07:07Z

sklearn/feature_extraction/text.py

@@ -1440,6 +1440,28 @@ class TfidfVectorizer(CountVectorizer):
    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

+    CountVectorizer : sparse matrix, (n_samples, n_features)


Putting these here would suggest that CountVectorizer is an __init__ parameter to the class, which it isn't. That's why travis is failing.

These information you have here are useful (IMO), but I guess the idea here is to also explain what these classes do by better explaining the parameters already listed in the docstring. For instance, the norm parameter was mentioned as badly documented in the issue. As an example, you can have a look at the parameter descriptions of LogisticRegression, I hope it helps.

@adrinjalali
I updated some descriptions. Is this better?

adrinjalali · 2019-01-06T16:36:36Z

I have no idea why that doesn't work, but an idea is to leave the Notes section formula free, and only keep an intuitive short high level description of TfidfTransformer there, and make sure that the docstring of TfidfTransformer itself is clear an accurate. WDYT?

adrinjalali · 2019-01-06T16:38:02Z

sklearn/feature_extraction/text.py

+    norm : 'l1', 'l2' or None, optional (default=’l2’)
+        Norm is used to normalize term vectors to have unit norm.
+        ``norm='l2'`` uses cosine similarity.
+        ``norm='l1'`` uses the Euclidean distance.


This is interesting. Intuitively I wouldn't think that the Euclidean distance would result in an l1 normalization. Is it really true?

Maybe it should be:
L2: Euclidean
L1: Manhattan

The norm is passed to the normalize function here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/data.py#L1518

That function also doesn't have a good explanation for the norm parameter. I suggest you figure out what exactly it does there, and fix the description in both places.

Is this the level of detail that is needed:

The vector type to be used for normalization of the matrix is specified by the user (column (feature) or row (data sample point)). The norms are computed as follows: The L1 norm: 1. converts each item in the vector to a positive value 2. computes the total sum of these absolute values 3. divides each item in the vector by total sum of absolute values The L2 norm: 1. converts each item in the vector to its square 2. computes the total sum of these squared values 3. divides each item in the vector by total sum of squared values

I don't think that level of detail is appropriate.

l2 does imply cosine similarity when the dot product between vectors is taken, but that context is not mentioned.

l1 does not have anything to do with euclidean and does not result in a unit vector, so both statements are wrong. l1 might be used in text processing as a simple form of length normalisation, in the case that raw counts (not tf-idf) are used. Otherwise it is not commonly used in text processing.

I think referencing normalize/Normalizer and mentioning cosine similarity are appropriate. No motivation needs to be given for l1.

(I think "vectors in matrices with unit norm" is a bit confusing.)

instead of each output row, should it be

each output row/column OR

each output vector

This parameter doesn't say anything about each column's distribution

Ok, I will make the change in the text you included and exclude :math: since it is not rendering with \log or \frac{}{}. Does that work?

reshamas · 2019-01-06T16:56:36Z

I have no idea why that doesn't work, but an idea is to leave the Notes section formula free, and only keep an intuitive short high level description of TfidfTransformer there, and make sure that the docstring of TfidfTransformer itself is clear an accurate. WDYT?

That's sounds fine. Was that part of the issue for this PR? To add in the formula?

jnothman · 2019-01-07T22:08:03Z

sklearn/feature_extraction/text.py

+    norm : 'l1', 'l2' or None, optional (default=’l2’)
+        Norm is used to normalize term vectors to have unit norm.
+        ``norm='l2'`` uses cosine similarity.
+        ``norm='l1'`` uses the Euclidean distance.


I don't think that level of detail is appropriate.

l2 does imply cosine similarity when the dot product between vectors is taken, but that context is not mentioned.

l1 does not have anything to do with euclidean and does not result in a unit vector, so both statements are wrong. l1 might be used in text processing as a simple form of length normalisation, in the case that raw counts (not tf-idf) are used. Otherwise it is not commonly used in text processing.

I think referencing normalize/Normalizer and mentioning cosine similarity are appropriate. No motivation needs to be given for l1.

jnothman · 2019-01-07T22:08:21Z

sklearn/feature_extraction/text.py

+
+        The formula that is used to compute the tf-idf of term t is
+        :math:`tf-idf(d, t) = tf(t) * idf(d, t)` and the idf is computed
+        as `idf(d, t) = log(\frac{n}{df(d, t)} + 1)`


should this also have :math:?

I can add :math: back in. It's that the html does not render wherever there is \log or \frac and I'm not sure what to do about that.

jnothman · 2019-01-07T22:08:38Z

sklearn/feature_extraction/text.py

+        occur in all documents in a training set, will not be entirely ignored.
+        (Note that the idf formula above differs from the standard textbook
+        notation that defines the idf as
+        `idf(d, t) = log{\frac{n}{df(d, t) + 1}}`


should this also have :math:?

…into doc_tdidf

jnothman · 2019-01-09T10:09:28Z

sklearn/feature_extraction/text.py

-        Norm used to normalize term vectors. None for no normalization.
+    norm : 'l1', 'l2' or None, optional (default=’l2’)
+        Each output row will have unit norm, either:
+        * 'l2': Sum of squares of vector elements is 1. The cosine


This needs a blank line before it. See broken rendering at https://42814-843222-gh.circle-artifacts.com/0/doc/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

jnothman · 2019-01-09T10:09:29Z

sklearn/feature_extraction/text.py

        Type of the matrix returned by fit_transform() or transform().

-    norm : 'l1', 'l2' or None, optional
-        Norm used to normalize term vectors. None for no normalization.
+    norm : 'l1', 'l2' or None, optional (default=’l2’)


If you're changing this in TfidfVectorizer, it also should be changed in TfidfTrasnformer. Isn't that consistency the main purpose of this PR?

jnothman · 2019-01-09T10:10:18Z

sklearn/feature_extraction/text.py

+    CountVectorizer : Produces a sparse representation of the counts using
+        scipy.sparse.csr_matrix.
+
+    TfidfTransformer : Converts the count matrix from CountVectorizer to a


In the context of TfidfVectorizer this is far too much detail.

Just say "Performs the TF-IDF transformation from a provided matrix of counts"

jnothman · 2019-01-09T10:11:23Z

sklearn/feature_extraction/text.py

-    TfidfTransformer
-        Apply Term Frequency Inverse Document Frequency normalization to a
-        sparse matrix of occurrence counts.
+    CountVectorizer : Produces a sparse representation of the counts using


In the context of TfidfVectorizer this is not explaining the difference well. Just "Transforms text into a sparse matrix of n-gram counts" is sufficient. So is the current wording.

reshamas · 2019-01-09T15:28:41Z

@jnothman
I was not sure whether I should shorten the description for class TfidfTransformer on lines 1137 to 1174.

jnothman · 2019-01-09T22:36:10Z

I was not sure whether I should shorten the description for class TfidfTransformer on lines 1137 to 1174.

I was commenting on See Also, not on object descriptions. See Also's purpose is to compare other objects to the present object, so it should often not be the same as the description. No, you should not shorten the description for TfidfTransformer; that is totally out of scope for this issue.

jnothman · 2019-01-09T22:38:10Z

sklearn/feature_extraction/text.py

@@ -1322,17 +1333,17 @@ class TfidfVectorizer(CountVectorizer):
        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

-    encoding : string, 'utf-8' by default.
+    encoding : string, 'utf-8' (default='utf-8')


This is no longer consistent with CountVectorizer. The point of this PR is to make TfidfVectorizer's parameter descriptions identical to those in either CountVectorizer or TfidfTransformer because TfidfVectorizer is a composition of those two.

…into doc_tdidf

jnothman

Otherwise LGTM

jnothman · 2019-01-13T10:18:59Z

sklearn/feature_extraction/text.py

@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

    Equivalent to CountVectorizer followed by TfidfTransformer.

+    CountVectorizer : Transforms text into a sparse matrix of n-gram counts


These see also entries don't belong here. Just remove them.

jnothman · 2019-01-13T10:19:29Z

sklearn/feature_extraction/text.py

@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

    Equivalent to CountVectorizer followed by TfidfTransformer.


please put backticks around the class names, so they will be hyperlinked in the online docs.

…into doc_tdidf

adrinjalali · 2019-01-13T17:01:19Z

sklearn/feature_extraction/text.py

@@ -1305,7 +1310,7 @@ def idf_(self, value):
 class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

-    Equivalent to CountVectorizer followed by TfidfTransformer.
+    Equivalent to ``CountVectorizer`` followed by ``TfidfTransformer``.


Here you can use single backticks and reference the actual docs of the two classes.

Should it be:
Equivalent to :class:CountVectorizer followed by :class:TfidfTransformer

or can you point me to an example?

yes, if it doesn't work , this should do:

:class:`sklearn.feature_extraction.CountVectorizer`

git grep \` sklearn/ should also give you more examples.

jnothman · 2019-01-14T06:15:43Z

sklearn/feature_extraction/text.py

@@ -1310,7 +1310,8 @@ def idf_(self, value):
 class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

-    Equivalent to ``CountVectorizer`` followed by ``TfidfTransformer``.
+    Equivalent to :class:`sklearn.feature_extraction.CountVectorizer` followed
+    by :class:`sklearn.feature_extraction.TfidfTransformer`.


It should be sufficient to just do `CountVectorizer`

jnothman · 2019-01-16T06:50:09Z

Thanks @reshamas

reshamas · 2019-01-16T13:21:38Z

thank you @jnothman and @adrinjalali

also, thanks to @nderzsy

This reverts commit a02c265.

reshamas added 4 commits December 17, 2018 23:07

cleaning up flake8 errors in text.py

7ee45a3

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn

8153489

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

2a34a37

…into doc_tdidf

fixed flake8 errors in tdidf description

453f1f4

adrinjalali reviewed Dec 19, 2018

View reviewed changes

reshamas added 2 commits December 19, 2018 15:44

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

25da828

…into doc_tdidf

put tf-idf formula in a math block'

d9cadb8

reshamas added 14 commits December 24, 2018 14:04

moved CountVec and TfidfT desc from class to 'parameter' section

df770cc

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

27907c1

…into doc_tdidf

remove blank spaces

1492f5c

adding space after colon

3a2c30f

formatting the :math: expressoin

3c74ea7

still fixing formatting

c020cbb

still fixing formatting

70cf279

still fixing formatting

a70893c

remove :math: and do step-by-step check of docstring error

5bf1808

added 1 :math: block in; this should work

72fe738

added 1 :math: block in; this should work

d4c14b3

delete line, move (if `smooth_id up)

14c5186

for CountVectorizer, added parameter type

4feee7b

for TfidfTransformer, added parameter type

c451532

adrinjalali reviewed Dec 25, 2018

View reviewed changes

reshamas added 2 commits December 25, 2018 19:25

moved CountV and TfidfT from 'Parameters' to 'See also'

4c6eaf0

expanded the definition of 'norm' in Parameters section

403acf3

adrinjalali reviewed Jan 6, 2019

View reviewed changes

jnothman reviewed Jan 7, 2019

View reviewed changes

reshamas added 2 commits January 8, 2019 20:50

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

5feaf73

…into doc_tdidf

removed math blocks; norm def in class TfidfV

674135c

reshamas changed the title ~~[MRG] improve tfidfvectorizer documentation (#wimlds)~~ [MRG] improve tfidfvectorizer documentation Jan 9, 2019

jnothman reviewed Jan 9, 2019

View reviewed changes

shortened definitions

42c4532

jnothman reviewed Jan 9, 2019

View reviewed changes

reshamas added 2 commits January 10, 2019 19:56

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

a403b1e

…into doc_tdidf

minor text edits

c957f69

jnothman reviewed Jan 13, 2019

View reviewed changes

reshamas added 2 commits January 13, 2019 09:40

cleaning up, formatting

9d5f842

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

c62539a

…into doc_tdidf

adrinjalali reviewed Jan 13, 2019

View reviewed changes

single backticks around the class names for hyperlinking'

83c534a

jnothman reviewed Jan 14, 2019

View reviewed changes

set to ':class:'

32c1eea

jnothman merged commit ff46f6e into scikit-learn:master Jan 16, 2019

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019

DOC improve tfidfvectorizer documentation (scikit-learn#12822)

df2979a

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

DOC improve tfidfvectorizer documentation (scikit-learn#12822)

a02c265

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "DOC improve tfidfvectorizer documentation (scikit-learn#12822)"

098b2a3

This reverts commit a02c265.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "DOC improve tfidfvectorizer documentation (scikit-learn#12822)"

fb1cf43

This reverts commit a02c265.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

DOC improve tfidfvectorizer documentation (scikit-learn#12822)

bd1351f

		@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

		Equivalent to CountVectorizer followed by TfidfTransformer.

		CountVectorizer : Transforms text into a sparse matrix of n-gram counts

[MRG] improve tfidfvectorizer documentation #12822

[MRG] improve tfidfvectorizer documentation #12822

Conversation

reshamas commented Dec 18, 2018

Referencing PR / Issue

Note

adrinjalali commented Dec 19, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reshamas commented Dec 20, 2018

jnothman commented Dec 20, 2018

adrinjalali commented Dec 20, 2018

reshamas commented Dec 20, 2018 • edited Loading

jnothman commented Dec 23, 2018 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adrinjalali commented Jan 6, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reshamas Jan 6, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reshamas commented Jan 6, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reshamas Jan 7, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reshamas commented Jan 9, 2019

jnothman commented Jan 9, 2019

Choose a reason for hiding this comment

jnothman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jnothman commented Jan 16, 2019

reshamas commented Jan 16, 2019

reshamas commented Dec 20, 2018 •

edited

Loading

reshamas Jan 6, 2019 •

edited

Loading

reshamas Jan 7, 2019 •

edited

Loading