[MRG] improve tfidfvectorizer documentation #12822
Conversation
I think conventionally we put detailed information such as this into the user guide. Having a lot of documentation here makes the source code pretty lengthy, but on the other hand it makes for nice class documentation pages, so I'm personally torn about whether we should put these here or in the guides. I also know that our "User Guide" hyperlink is kind of hidden on the page, and many people don't notice it unless they know such a thing exists on almost all pages.
sklearn/feature_extraction/text.py
Outdated
relative to the appearance of a document.

The formula that is used to compute the tf-idf of term t is
tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
I think it'll look much nicer if you put the formula in a math block, what do you think?
done ✅
I suspect there are formatting errors (though it did pass the flake8 test), but I am not sure how to determine where they are.
I thought the point of the issues you are claiming to fix here is that you need to update the Parameters section, not the summary!
@reshamas as for formatting, I found going through PEP8 for the code part, and numpydoc for the docstrings, pretty helpful. Especially for the docstrings, we have some minor modifications to how we write them, but they're mostly numpydoc standards. It may even be useful to have these in the contributing guide if they're not there already.
@jnothman I am looking through the trail of old issues and PRs, and it's not clear to me what needs to be updated. I also don't entirely understand "parameters", "summary" and "docstring". And it looks like the train has been moving on this track for a while now. @adrinjalali thanks for sharing the reference. I will read it and sort it out.
A docstring can be in a class or a function. Parameters are listed under the "Parameters" heading. TfidfVectorizer combines the parameters of CountVectorizer and TfidfTransformer, so it should have similar descriptions to them.
…into doc_tdidf
sklearn/feature_extraction/text.py
Outdated
@@ -1440,6 +1440,28 @@ class TfidfVectorizer(CountVectorizer):
sublinear_tf : boolean, default=False
    Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

CountVectorizer : sparse matrix, (n_samples, n_features)
Putting these here would suggest that CountVectorizer is an `__init__` parameter to the class, which it isn't. That's why travis is failing.
This information you have here is useful (IMO), but I guess the idea here is to also explain what these classes do by better explaining the parameters already listed in the docstring. For instance, the `norm` parameter was mentioned as badly documented in the issue. As an example, you can have a look at the parameter descriptions of LogisticRegression; I hope it helps.
@adrinjalali
I updated some descriptions. Is this better?
I have no idea why that doesn't work, but an idea is to leave the Notes section formula-free, and only keep an intuitive short high-level description of
sklearn/feature_extraction/text.py
Outdated
norm : 'l1', 'l2' or None, optional (default='l2')
    Norm is used to normalize term vectors to have unit norm.
    ``norm='l2'`` uses cosine similarity.
    ``norm='l1'`` uses the Euclidean distance.
This is interesting. Intuitively I wouldn't think that the Euclidean distance would result in an `l1` normalization. Is it really true?
Maybe it should be:
L2: Euclidean
L1: Manhattan
The norm is passed to the `normalize` function here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/data.py#L1518
That function also doesn't have a good explanation for the `norm` parameter. I suggest you figure out what exactly it does there, and fix the description in both places.
Is this the level of detail that is needed?
The vector type to be used for normalization of the matrix is specified by the user (column (feature) or row (data sample point)). The norms are computed as follows:
The L1 norm:
1. converts each item in the vector to its absolute value
2. computes the total sum of these absolute values
3. divides each item in the vector by the total sum of absolute values
The L2 norm:
1. converts each item in the vector to its square
2. computes the total sum of these squared values
3. divides each item in the vector by the square root of this total sum
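For what it's worth, those steps can be sketched in plain NumPy (a rough illustration only, not the actual `normalize` implementation; note the L2 case divides by the square root of the sum of squares):

```python
import numpy as np

def l1_normalize(v):
    # Sum of absolute values, then divide each element by that sum.
    return v / np.abs(v).sum()

def l2_normalize(v):
    # Sum of squares, take the square root, then divide each element by it.
    return v / np.sqrt((v ** 2).sum())

v = np.array([3.0, -4.0])
print(l1_normalize(v))  # abs-sum of the result is 1
print(l2_normalize(v))  # Euclidean length of the result is 1
```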
I don't think that level of detail is appropriate.
l2 does imply cosine similarity when the dot product between vectors is taken, but that context is not mentioned.
l1 does not have anything to do with euclidean and does not result in a unit vector, so both statements are wrong. l1 might be used in text processing as a simple form of length normalisation, in the case that raw counts (not tf-idf) are used. Otherwise it is not commonly used in text processing.
I think referencing `normalize`/`Normalizer` and mentioning cosine similarity are appropriate. No motivation needs to be given for l1.
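To make the cosine-similarity connection concrete: once two vectors are l2-normalized, their plain dot product equals the cosine similarity of the originals (a small NumPy sketch, not scikit-learn code):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Cosine similarity of the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# l2-normalize each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# The dot product of the unit vectors equals the cosine similarity.
print(np.isclose(a_unit @ b_unit, cosine))  # True
```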
(I think "vectors in matrices with unit norm" is a bit confusing.)
instead of `each output row`, should it be `each output row/column` or `each output vector`?
This parameter doesn't say anything about each column's distribution
Ok, I will make the change in the text you included and exclude `:math:` since it is not rendering with `\log` or `\frac{}{}`. Does that work?
Okay
That sounds fine. Was that part of the issue for this PR? To add in the formula?
sklearn/feature_extraction/text.py
Outdated
The formula that is used to compute the tf-idf of term t is
:math:`tf-idf(d, t) = tf(t) * idf(d, t)` and the idf is computed
as `idf(d, t) = log(\frac{n}{df(d, t)}) + 1`
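As a quick sanity check on the formula under discussion (assuming the intended reading puts the +1 outside the log, as in the Notes of TfidfTransformer; this sketch is an illustration, not scikit-learn's code):

```python
import numpy as np

n = 4                     # number of documents in the training set
df = np.array([1, 2, 4])  # document frequency of each term

# Non-smoothed idf as documented: idf(t) = ln(n / df(t)) + 1
idf = np.log(n / df) + 1

# A term occurring in every document (df == n) still gets idf == 1,
# so it is "not entirely ignored".
print(idf[-1])  # 1.0
```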
should this also have `:math:`?
I can add `:math:` back in. It's just that the HTML does not render wherever there is `\log` or `\frac`, and I'm not sure what to do about that.
sklearn/feature_extraction/text.py
Outdated
occur in all documents in a training set, will not be entirely ignored.
(Note that the idf formula above differs from the standard textbook
notation that defines the idf as
`idf(d, t) = log{\frac{n}{df(d, t) + 1}}`
should this also have `:math:`?
Norm used to normalize term vectors. None for no normalization.
norm : 'l1', 'l2' or None, optional (default='l2')
    Each output row will have unit norm, either:
    * 'l2': Sum of squares of vector elements is 1. The cosine
This needs a blank line before it. See broken rendering at https://42814-843222-gh.circle-artifacts.com/0/doc/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
sklearn/feature_extraction/text.py
Outdated
Type of the matrix returned by fit_transform() or transform().

norm : 'l1', 'l2' or None, optional
    Norm used to normalize term vectors. None for no normalization.
norm : 'l1', 'l2' or None, optional (default='l2')
If you're changing this in TfidfVectorizer, it also should be changed in TfidfTransformer. Isn't that consistency the main purpose of this PR?
sklearn/feature_extraction/text.py
Outdated
CountVectorizer : Produces a sparse representation of the counts using
    scipy.sparse.csr_matrix.

TfidfTransformer : Converts the count matrix from CountVectorizer to a
In the context of TfidfVectorizer this is far too much detail.
Just say "Performs the TF-IDF transformation from a provided matrix of counts"
sklearn/feature_extraction/text.py
Outdated
TfidfTransformer
    Apply Term Frequency Inverse Document Frequency normalization to a
    sparse matrix of occurrence counts.
CountVectorizer : Produces a sparse representation of the counts using
In the context of TfidfVectorizer this is not explaining the difference well. Just "Transforms text into a sparse matrix of n-gram counts" is sufficient. So is the current wording.
@jnothman
I was commenting on See Also, not on object descriptions. See Also's purpose is to compare other objects to the present object, so it should often not be the same as the description. No, you should not shorten the description for TfidfTransformer; that is totally out of scope for this issue.
sklearn/feature_extraction/text.py
Outdated
@@ -1322,17 +1333,17 @@ class TfidfVectorizer(CountVectorizer):
    Otherwise the input is expected to be the sequence strings or
    bytes items are expected to be analyzed directly.

encoding : string, 'utf-8' by default.
encoding : string, 'utf-8' (default='utf-8')
This is no longer consistent with CountVectorizer. The point of this PR is to make TfidfVectorizer's parameter descriptions identical to those in either CountVectorizer or TfidfTransformer because TfidfVectorizer is a composition of those two.
Otherwise LGTM
sklearn/feature_extraction/text.py
Outdated
@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

Equivalent to CountVectorizer followed by TfidfTransformer.

CountVectorizer : Transforms text into a sparse matrix of n-gram counts
These see also entries don't belong here. Just remove them.
sklearn/feature_extraction/text.py
Outdated
@@ -1307,6 +1312,12 @@ class TfidfVectorizer(CountVectorizer):

Equivalent to CountVectorizer followed by TfidfTransformer.
please put backticks around the class names, so they will be hyperlinked in the online docs.
sklearn/feature_extraction/text.py
Outdated
@@ -1305,7 +1310,7 @@ def idf_(self, value):
class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.
Equivalent to ``CountVectorizer`` followed by ``TfidfTransformer``.
Here you can use single backticks and reference the actual docs of the two classes.
Should it be:
Equivalent to :class:`CountVectorizer` followed by :class:`TfidfTransformer`
or can you point me to an example?
yes, if it doesn't work, this should do:
:class:`sklearn.feature_extraction.CountVectorizer`
`git grep \` sklearn/` should also give you more examples.
sklearn/feature_extraction/text.py
Outdated
@@ -1310,7 +1310,8 @@ def idf_(self, value):
class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to ``CountVectorizer`` followed by ``TfidfTransformer``.
Equivalent to :class:`sklearn.feature_extraction.CountVectorizer` followed
by :class:`sklearn.feature_extraction.TfidfTransformer`.
It should be sufficient to just do `CountVectorizer`
Thanks @reshamas
thank you @jnothman and @adrinjalali; also, thanks to @nderzsy
This reverts commit a02c265.
Referencing PR / Issue
This closes #12204
This also closes #6766 and closes #9369
This also closes #12811 (which included the wrong file in my PR).
Note
This adds more information in the TfidfVectorizer documentation. It now includes comments about CountVectorizer and TfidfTransformer.
cc: @blooraspberry
#wimlds