Skip to content

[MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings #9369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

[MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings #9369

wants to merge 4 commits into from

Conversation

musiciancodes
Copy link

@musiciancodes musiciancodes commented Jul 15, 2017

Issue - docstring for TfidfVectorizer doesn't track TfidfTransformer

#6766

What does this implement/fix? Explain your changes.

I made the TfIdfVectorizer docstring a little bit more verbose about tf-idf, referred to TfidfTransformer, and edited both docstrings for clarity.

@jmschrei
Copy link
Member

This seems like an improvement to me, but @amueller will have to speak as to if it addresses all the concerns he brought up.

that contain term t. The effect of adding "1" to the idf in the equation
above is that terms with zero idf, i.e., terms that occur in all documents

tf-idf(d, t, D) = tf(t, d) * idf(t, D)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe reproduce the formula also in the TfidfVectorizer docs? Otherwise looks good :) I feel like the two should largely have the same docs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did that!

@@ -1008,6 +1026,8 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
Normalization is "c" (cosine) when ``norm='l2'``, "n" (none)
when ``norm=None``.

By default, the l2 norm is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be at the explanation of the norm parameter, the first line should end with something like (default='l2')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't decide whether or not to go down a rabbit hole of l2 vs. l1 norm, partly because my first pass at a google search didn't reveal the most cogent explanations; feels out of scope for a docstring, though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I mostly wanted the clarification at the explanation of the parameter.

@@ -1005,6 +1023,9 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
Tf is "n" (natural) by default, "l" (logarithmic) when
``sublinear_tf=True``.
Idf is "t" when use_idf is given, "n" (none) otherwise.
The "norm" parameter provides the "how" of how each input vector is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this is redundant here as it is explained below, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is - I was trying to address it as it seemed like it was one of the points of discussion in the previous issue. Since norm discussed as a parameter, though, it makes more sense to discuss the default there.

@jmschrei
Copy link
Member

LGTM, maybe you should do default='l2' instead of default = 'l2', as a nitpick?

@jmschrei jmschrei changed the title [MRG] Edited TfidfVectorizer and TfidfTransformer docstrings [MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings Jul 16, 2017
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like is too verose and detailed for a docstring. We have the user guide links to provide such detail, and that particularly helps us avoid maintaining duplicate documentation!

Rather this should be limited to saying that tfidf reweights vectors of term counts to emphasise terms that are frequent within a document and infrequent across the collection if documents. The rest in the user guide


Inverse-document-frequency (idf) = 1 / (# of docs a term appears in)

This is a common term weighting scheme in information
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "family of schemes". Really you mean that the product of a monotonic increasing function of TF and a monotonic increasing function of IDF constitutes a family of term weighting schemes. You miss the fact that we care about their product.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps adding tfidf as the next definition makes sense. It may also be appropriate to format these definitions as a definition list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants