[MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings #9369

musiciancodes · 2017-07-15T20:31:07Z

Issue - docstring for TfidfVectorizer doesn't track TfidfTransformer

What does this implement/fix? Explain your changes.

I made the TfidfVectorizer docstring a little more verbose about tf-idf, referred to TfidfTransformer, and edited both docstrings for clarity.

jmschrei · 2017-07-15T20:40:02Z

This seems like an improvement to me, but @amueller will have to say whether it addresses all the concerns he brought up.

amueller · 2017-07-15T20:46:43Z

sklearn/feature_extraction/text.py

-    that contain term t. The effect of adding "1" to the idf in the equation
-    above is that terms with zero idf, i.e., terms  that occur in all documents
+
+    tf-idf(d, t, D) = tf(t, d) * idf(t, D)


Can you maybe reproduce the formula also in the TfidfVectorizer docs? Otherwise looks good :) I feel like the two should largely have the same docs.

Just did that!
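
For readers following the formula in the hunk above, here is a minimal pure-Python sketch of tf-idf(d, t, D) = tf(t, d) * idf(t, D). It assumes the smoothed idf with an added 1, ln((1 + n) / (1 + df)) + 1, which to my understanding is what `smooth_idf=True` computes. The function names and the toy corpus are hypothetical and not part of scikit-learn.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed idf with +1 (assumed form): ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df)) + 1


def tfidf(term, doc, docs):
    """tf-idf(d, t, D) = tf(t, d) * idf(t, D), with tf as the raw count."""
    tf = Counter(doc)[term]
    return tf * idf(term, docs)


# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

# "the" occurs in every document: ln(4 / 4) + 1 = 1, so it is kept, not zeroed.
weight_the = tfidf("the", docs[0], docs)
# "cat" occurs in one document only: ln(4 / 2) + 1, so it gets a larger weight.
weight_cat = tfidf("cat", docs[0], docs)
```

This also illustrates the sentence removed in the hunk: adding 1 to the idf means terms that occur in all documents still receive a non-zero weight.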

amueller · 2017-07-15T20:47:32Z

sklearn/feature_extraction/text.py

@@ -1008,6 +1026,8 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
    Normalization is "c" (cosine) when ``norm='l2'``, "n" (none)
    when ``norm=None``.

+    By default, the l2 norm is used.


This should go in the explanation of the norm parameter; the first line should end with something like (default='l2').

I couldn't decide whether or not to go down a rabbit hole of l2 vs. l1 norm, partly because my first pass at a google search didn't reveal the most cogent explanations; feels out of scope for a docstring, though?

Yeah, I mostly wanted the clarification at the explanation of the parameter.
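
On the l2 vs. l1 question, a small pure-Python sketch may be enough to show the difference without going down the rabbit hole. It assumes the usual definitions: l2 scales a row to unit Euclidean length, l1 scales it so the absolute values sum to 1, and `None` leaves it unchanged. The `normalize` helper is hypothetical and written for illustration only; it is not scikit-learn's implementation.

```python
import math


def normalize(vector, norm="l2"):
    """Scale a vector to unit norm; norm=None returns it unchanged."""
    if norm is None:
        return list(vector)
    if norm == "l2":
        length = math.sqrt(sum(x * x for x in vector))
    elif norm == "l1":
        length = sum(abs(x) for x in vector)
    else:
        raise ValueError("norm must be 'l1', 'l2' or None")
    if length == 0:
        return list(vector)  # all-zero rows are left as they are
    return [x / length for x in vector]


row = [3.0, 4.0]  # hypothetical tf-idf weights for one document
l2_row = normalize(row, norm="l2")  # Euclidean length becomes 1
l1_row = normalize(row, norm="l1")  # absolute values sum to 1
```

With l2-normalized rows, the dot product of two rows equals their cosine similarity, which is why the docstring calls this the "c" (cosine) normalization.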


amueller · 2017-07-15T21:30:26Z

sklearn/feature_extraction/text.py

@@ -1005,6 +1023,9 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
    Tf is "n" (natural) by default, "l" (logarithmic) when
    ``sublinear_tf=True``.
    Idf is "t" when use_idf is given, "n" (none) otherwise.
+    The "norm" parameter provides the "how" of how each input vector is


I feel this is redundant here as it is explained below, right?

It is. I was trying to address it because it seemed to be one of the points of discussion in the previous issue. Since norm is discussed as a parameter, though, it makes more sense to discuss the default there.

jmschrei · 2017-07-16T15:59:11Z

LGTM, maybe you should do default='l2' instead of default = 'l2', as a nitpick?

jnothman

I feel like this is too verbose and detailed for a docstring. We have the user guide links to provide such detail, and that particularly helps us avoid maintaining duplicate documentation!

Rather, this should be limited to saying that tfidf reweights vectors of term counts to emphasise terms that are frequent within a document and infrequent across the collection of documents. The rest belongs in the user guide.

jnothman · 2017-07-23T23:45:33Z

sklearn/feature_extraction/text.py

+
+    Inverse-document-frequency (idf) = 1 / (# of docs a term appears in)
+
+    This is a common term weighting scheme in information


Maybe "family of schemes". Really you mean that the product of a monotonic increasing function of TF and a monotonic increasing function of IDF constitutes a family of term weighting schemes. You miss the fact that we care about their product.

Perhaps adding tfidf as the next definition makes sense. It may also be appropriate to format these definitions as a definition list.
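
The "family of schemes" point can be sketched in pure Python: each scheme is the product of a tf function and an idf function, and swapping either one gives a different member of the family. The variants below are assumptions chosen for illustration (raw count vs. 1 + ln(count) for tf; ln(n / df) + 1 vs. the smoothed ln((1 + n) / (1 + df)) + 1 for idf), and all function names are hypothetical.

```python
import math


# tf variants: "n" (natural) and "l" (logarithmic), both increasing in the count.
def tf_natural(count):
    return float(count)


def tf_log(count):
    return 1.0 + math.log(count) if count > 0 else 0.0


# idf variants: both grow as the document frequency df shrinks (n_docs fixed).
def idf_plain(n_docs, df):
    return math.log(n_docs / df) + 1.0


def idf_smooth(n_docs, df):
    return math.log((1 + n_docs) / (1 + df)) + 1.0


def weight(count, n_docs, df, tf_fn=tf_natural, idf_fn=idf_smooth):
    """One member of the family: the product tf_fn(count) * idf_fn(n_docs, df)."""
    return tf_fn(count) * idf_fn(n_docs, df)


# Same in-document count, but the rarer term (df=5) outweighs the common one (df=50).
rare = weight(10, n_docs=100, df=5)
common = weight(10, n_docs=100, df=50)
# Sublinear tf dampens the effect of a high in-document count.
dampened = weight(10, n_docs=100, df=5, tf_fn=tf_log)
```

The product is what matters: a term needs both a high in-document count and a low document frequency to end up with a large weight.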

edited tfidfvectorizer and tfidftransformer docstrings

f0314ae

amueller reviewed Jul 15, 2017


Added a brief explanation of the normed parameter, expanded TfidfVect…

c9428f4

…orizer docstring

amueller reviewed Jul 15, 2017


moved norm default comment to the parameters section of the docstring

e3fae5f

jmschrei changed the title ~~[MRG] Edited TfidfVectorizer and TfidfTransformer docstrings~~ [MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings Jul 16, 2017

TomDLT added the Documentation label Jul 18, 2017

minor edit to default value in docstring

1d224a5

jnothman reviewed Jul 23, 2017


qinhanmin2014 mentioned this pull request Sep 30, 2018

[MRG] tfidfvectorizer documentation #12204

Closed

This was referenced Dec 18, 2018

improve TfidfVectorizer documentation #12811

Closed

[MRG] improve tfidfvectorizer documentation #12822

Merged

jnothman closed this in #12822 Jan 16, 2019
