-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings #9369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This seems like an improvement to me, but @amueller will have to speak as to if it addresses all the concerns he brought up. |
that contain term t. The effect of adding "1" to the idf in the equation | ||
above is that terms with zero idf, i.e., terms that occur in all documents | ||
|
||
tf-idf(d, t, D) = tf(t, d) * idf(t, D) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you maybe reproduce the formula also in the TfidfVectorizer docs? Otherwise looks good :) I feel like the two should largely have the same docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did that!
sklearn/feature_extraction/text.py
Outdated
@@ -1008,6 +1026,8 @@ class TfidfTransformer(BaseEstimator, TransformerMixin): | |||
Normalization is "c" (cosine) when ``norm='l2'``, "n" (none) | |||
when ``norm=None``. | |||
|
|||
By default, the l2 norm is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be at the explanation of the norm parameter, the first line should end with something like (default='l2')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't decide whether or not to go down a rabbit hole of l2 vs. l1 norm, partly because my first pass at a google search didn't reveal the most cogent explanations; feels out of scope for a docstring, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I mostly wanted the clarification at the explanation of the parameter.
sklearn/feature_extraction/text.py
Outdated
@@ -1005,6 +1023,9 @@ class TfidfTransformer(BaseEstimator, TransformerMixin): | |||
Tf is "n" (natural) by default, "l" (logarithmic) when | |||
``sublinear_tf=True``. | |||
Idf is "t" when use_idf is given, "n" (none) otherwise. | |||
The "norm" parameter provides the "how" of how each input vector is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this is redundant here as it is explained below, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is - I was trying to address it as it seemed like it was one of the points of discussion in the previous issue. Since norm discussed as a parameter, though, it makes more sense to discuss the default there.
LGTM, maybe you should do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like is too verose and detailed for a docstring. We have the user guide links to provide such detail, and that particularly helps us avoid maintaining duplicate documentation!
Rather this should be limited to saying that tfidf reweights vectors of term counts to emphasise terms that are frequent within a document and infrequent across the collection if documents. The rest in the user guide
|
||
Inverse-document-frequency (idf) = 1 / (# of docs a term appears in) | ||
|
||
This is a common term weighting scheme in information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "family of schemes". Really you mean that the product of a monotonic increasing function of TF and a monotonic increasing function of IDF constitutes a family of term weighting schemes. You miss the fact that we care about their product.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps adding tfidf as the next definition makes sense. It may also be appropriate to format these definitions as a definition list.
Issue - docstring for TfidfVectorizer doesn't track TfidfTransformer
#6766
What does this implement/fix? Explain your changes.
I made the TfIdfVectorizer docstring a little bit more verbose about tf-idf, referred to TfidfTransformer, and edited both docstrings for clarity.