
[MRG] improve tfidfvectorizer documentation #12822


Merged
merged 42 commits into from
Jan 16, 2019

Conversation

reshamas
Member

Referencing PR / Issue

This closes #12204
This also closes #6766 and closes #9369
This also closes #12811 (which is including the wrong file in my PR).

Note

This adds more information in the TfidfVectorizer documentation. It now includes comments about CountVectorizer and TfidfTransformer.

cc: @blooraspberry
#wimlds

@adrinjalali
Member

I think conventionally we put detailed information such as this into the user guide. Having a lot of documentation here makes the source code pretty lengthy, but on the other hand it makes for nice class documentation pages, so I'm personally torn on whether we should put these here or in the guides.

I also know that our "User Guide" hyperlink is somewhat hidden on the page, and many people don't notice it unless they already know it exists on almost every page.

relative to the appearance of a document.

The formula that is used to compute the tf-idf of term t is
tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
Member

I think it'll look much nicer if you put the formula in a math block, what do you think?

Member Author

done ✅

@reshamas
Member Author

I suspect there are formatting errors (though it did pass the flake8 test), but I am not sure how to determine where they are.

@jnothman
Member

I thought the point of the issues you are claiming to fix here is that you need to update the Parameters section, not the summary!

@adrinjalali
Member

@reshamas as for formatting, I found going through PEP8 for the code part, and numpydoc for the docstrings, pretty helpful. Especially for the docstrings, we have some minor modifications on how we write them, but they're mostly numpydoc standards. It may even be useful to have these in the contributing guide if they're not there already.

@reshamas
Member Author

reshamas commented Dec 20, 2018

I thought the point of the issues you are claiming to fix here is that you need to update the Parameters section, not the summary!

@jnothman I am looking through the trail of old issues and PRs, and it's not clear to me what needs to be updated. I also don't entirely understand "parameters", "summary" and "docstring".
Is a "docstring" within a class, while the "Parameters" section is within a function? Or is the "summary" within a class?

And, it looks like the train has been moving on this track for a while now.

@adrinjalali thanks for sharing the reference. I will read it and sort it out.

@jnothman
Member

jnothman commented Dec 23, 2018 via email

@@ -1440,6 +1440,28 @@ class TfidfVectorizer(CountVectorizer):
sublinear_tf : boolean, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

CountVectorizer : sparse matrix, (n_samples, n_features)
Member

Putting these here would suggest that CountVectorizer is an __init__ parameter to the class, which it isn't. That's why Travis is failing.

The information you have here is useful (IMO), but I guess the idea here is also to explain what these classes do by better explaining the parameters already listed in the docstring. For instance, the norm parameter was mentioned as badly documented in the issue. As an example, you can have a look at the parameter descriptions of LogisticRegression; I hope it helps.

Member Author

@adrinjalali
I updated some descriptions. Is this better?

@adrinjalali
Member

I have no idea why that doesn't work, but one idea is to leave the Notes section formula-free, keep only an intuitive, short, high-level description of TfidfTransformer there, and make sure that the docstring of TfidfTransformer itself is clear and accurate. WDYT?

norm : 'l1', 'l2' or None, optional (default=’l2’)
Norm is used to normalize term vectors to have unit norm.
``norm='l2'`` uses cosine similarity.
``norm='l1'`` uses the Euclidean distance.
Member

This is interesting. Intuitively I wouldn't think that the Euclidean distance would result in an l1 normalization. Is it really true?

Member Author

Maybe it should be:
L2: Euclidean
L1: Manhattan

Member

The norm is passed to the normalize function here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/data.py#L1518

That function also doesn't have a good explanation for the norm parameter. I suggest you figure out what exactly it does there, and fix the description in both places.

Member Author
@reshamas reshamas Jan 6, 2019

Is this the level of detail that is needed:

The vector type to be used for normalization of the matrix is specified by the user (column (feature) or row (data sample point)).  The norms are computed as follows:

The L1 norm:
1.  takes the absolute value of each item in the vector
2.  computes the total sum of these absolute values
3.  divides each item in the vector by that sum

The L2 norm:
1.  squares each item in the vector
2.  computes the total sum of these squared values
3.  divides each item in the vector by the square root of that sum

Member

I don't think that level of detail is appropriate.

l2 does imply cosine similarity when the dot product between vectors is taken, but that context is not mentioned.

l1 does not have anything to do with euclidean and does not result in a unit vector, so both statements are wrong. l1 might be used in text processing as a simple form of length normalisation, in the case that raw counts (not tf-idf) are used. Otherwise it is not commonly used in text processing.

I think referencing normalize/Normalizer and mentioning cosine similarity are appropriate. No motivation needs to be given for l1.
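As a side note for readers, the behaviour described above can be sketched with `sklearn.preprocessing.normalize` (a minimal illustration with made-up numbers, not part of the PR):

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# norm='l2': each row is divided by its Euclidean length, giving unit vectors;
# a plain dot product between l2-normalized rows is then their cosine similarity.
X_l2 = normalize(X, norm="l2")   # row [3, 4] has length 5 -> [0.6, 0.8]
cosine = X_l2 @ X_l2.T           # diagonal is 1.0 (each row vs. itself)

# norm='l1': each row is divided by the sum of absolute values; the result is
# a simple length normalization, not a unit Euclidean vector in general.
X_l1 = normalize(X, norm="l1")   # row [3, 4] -> [3/7, 4/7]
```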

Member

(I think "vectors in matrices with unit norm" is a bit confusing.)

Member Author

instead of each output row, should it be

  • each output row/column OR
  • each output vector

Member

This parameter doesn't say anything about each column's distribution

Member Author

Ok, I will make the change in the text you included and exclude :math: since it is not rendering with \log or \frac{}{}. Does that work?

Member

Okay

@reshamas
Member Author

reshamas commented Jan 6, 2019

I have no idea why that doesn't work, but one idea is to leave the Notes section formula-free, keep only an intuitive, short, high-level description of TfidfTransformer there, and make sure that the docstring of TfidfTransformer itself is clear and accurate. WDYT?

That sounds fine. Was that part of the issue for this PR, to add in the formula?


The formula that is used to compute the tf-idf of term t is
:math:`tf-idf(d, t) = tf(t) * idf(d, t)` and the idf is computed
as `idf(d, t) = log(\frac{n}{df(d, t)} + 1)`
Member

should this also have :math:?

Member Author
@reshamas reshamas Jan 7, 2019

I can add :math: back in. It's that the html does not render wherever there is \log or \frac and I'm not sure what to do about that.

occur in all documents in a training set, will not be entirely ignored.
(Note that the idf formula above differs from the standard textbook
notation that defines the idf as
`idf(d, t) = log{\frac{n}{df(d, t) + 1}}`
Member

should this also have :math:?
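For readers following the formula discussion: the variant scikit-learn actually implements can be checked numerically against TfidfTransformer. A sketch (the counts matrix and variable names are illustrative, not part of the PR):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Term counts for 3 documents and 2 terms; term 0 occurs in every document.
counts = np.array([[2, 0],
                   [1, 1],
                   [3, 0]])
n = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency of each term

# smooth_idf=False: idf(t) = ln(n / df(t)) + 1
idf_plain = np.log(n / df) + 1
# smooth_idf=True (the default): idf(t) = ln((1 + n) / (1 + df(t))) + 1,
# as if one extra document contained every term exactly once, so terms that
# occur in all documents are not entirely ignored.
idf_smooth = np.log((1 + n) / (1 + df)) + 1

# The fitted idf_ attribute matches the smoothed formula.
tfidf = TfidfTransformer(smooth_idf=True, norm=None).fit(counts)
```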

@reshamas reshamas changed the title [MRG] improve tfidfvectorizer documentation (#wimlds) [MRG] improve tfidfvectorizer documentation Jan 9, 2019
Norm used to normalize term vectors. None for no normalization.
norm : 'l1', 'l2' or None, optional (default=’l2’)
Each output row will have unit norm, either:
* 'l2': Sum of squares of vector elements is 1. The cosine
Member

Type of the matrix returned by fit_transform() or transform().

norm : 'l1', 'l2' or None, optional
Norm used to normalize term vectors. None for no normalization.
norm : 'l1', 'l2' or None, optional (default=’l2’)
Member

If you're changing this in TfidfVectorizer, it also should be changed in TfidfTransformer. Isn't that consistency the main purpose of this PR?

CountVectorizer : Produces a sparse representation of the counts using
scipy.sparse.csr_matrix.

TfidfTransformer : Converts the count matrix from CountVectorizer to a
Member

In the context of TfidfVectorizer this is far too much detail.

Just say "Performs the TF-IDF transformation from a provided matrix of counts"

TfidfTransformer
Apply Term Frequency Inverse Document Frequency normalization to a
sparse matrix of occurrence counts.
CountVectorizer : Produces a sparse representation of the counts using
Member

In the context of TfidfVectorizer this is not explaining the difference well. Just "Transforms text into a sparse matrix of n-gram counts" is sufficient. So is the current wording.

@reshamas
Member Author

reshamas commented Jan 9, 2019

@jnothman
I was not sure whether I should shorten the description for class TfidfTransformer on lines 1137 to 1174.

@jnothman
Member

jnothman commented Jan 9, 2019

I was not sure whether I should shorten the description for class TfidfTransformer on lines 1137 to 1174.

I was commenting on See Also, not on object descriptions. See Also's purpose is to compare other objects to the present object, so it should often not be the same as the description. No, you should not shorten the description for TfidfTransformer; that is totally out of scope for this issue.

@@ -1322,17 +1333,17 @@ class TfidfVectorizer(CountVectorizer):
Otherwise the input is expected to be the sequence strings or
bytes items are expected to be analyzed directly.

encoding : string, 'utf-8' by default.
encoding : string, 'utf-8' (default='utf-8')
Member

This is no longer consistent with CountVectorizer. The point of this PR is to make TfidfVectorizer's parameter descriptions identical to those in either CountVectorizer or TfidfTransformer because TfidfVectorizer is a composition of those two.
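The composition described here can be demonstrated directly. A sketch with made-up documents (with default parameters, the one-step and two-step routes produce identical matrices):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]

# One step: TfidfVectorizer.
direct = TfidfVectorizer().fit_transform(docs)

# Two steps: CountVectorizer followed by TfidfTransformer.
counts = CountVectorizer().fit_transform(docs)
composed = TfidfTransformer().fit_transform(counts)
```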

Member
@jnothman jnothman left a comment

Otherwise LGTM

@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

Equivalent to CountVectorizer followed by TfidfTransformer.

CountVectorizer : Transforms text into a sparse matrix of n-gram counts
Member

These see also entries don't belong here. Just remove them.

@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

Equivalent to CountVectorizer followed by TfidfTransformer.
Member

please put backticks around the class names, so they will be hyperlinked in the online docs.

@@ -1305,7 +1310,7 @@ def idf_(self, value):
class TfidfVectorizer(CountVectorizer):
"""Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.
Equivalent to ``CountVectorizer`` followed by ``TfidfTransformer``.
Member

Here you can use single backticks and reference the actual docs of the two classes.

Member Author

Should it be:
Equivalent to :class:CountVectorizer followed by :class:TfidfTransformer

or can you point me to an example?

Member

yes, if it doesn't work, this should do:

:class:`sklearn.feature_extraction.CountVectorizer` 

`git grep \` sklearn/` should also give you more examples.

@@ -1310,7 +1310,8 @@ def idf_(self, value):
class TfidfVectorizer(CountVectorizer):
"""Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to ``CountVectorizer`` followed by ``TfidfTransformer``.
Equivalent to :class:`sklearn.feature_extraction.CountVectorizer` followed
by :class:`sklearn.feature_extraction.TfidfTransformer`.
Member

It should be sufficient to just do `CountVectorizer`

@jnothman jnothman merged commit ff46f6e into scikit-learn:master Jan 16, 2019
@jnothman
Member

Thanks @reshamas

@reshamas
Member Author

thank you @jnothman and @adrinjalali

also, thanks to @nderzsy

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Successfully merging this pull request may close these issues.

TfidfVectorizer documentation doesn't track TfidfTransformer