Correct TF-IDF formula in TfidfTransformer comments. #13054

vishaalkapoor · 2019-01-27T22:43:49Z

What does this implement/fix? Explain your changes.

The documentation for TfidfTransformer included some typos in the definition of tf-idf. They have been corrected as per the Wikipedia article on tf-idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

vishaalkapoor · 2019-01-27T22:44:16Z

Please take a look, @rasbt

Thanks!

The existing formula has some typos involving the arguments of tf and idf. Corrected the formula as per the Wikipedia article for TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

rasbt · 2019-01-27T22:55:44Z

Apart from the minor language corrections and the use of set cardinality notation instead of defining 'n' (which is maybe more concise), the equations look equivalent to me (I don't see that anything has changed, but maybe I missed something).

https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

If you prefer df(t, D) over df(d, t), you may also want to change that below:

If ``smooth_idf=True`` (the default), the constant "1" is added to the
numerator and denominator of the idf as if an extra document was seen
containing every term in the collection exactly once, which prevents
zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.

and https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

but either way is fine since we don't use actual numbers.

EDIT: Oh, I just realized the diff is in lowercase d versus capital D, where d refers to a particular doc and D is the set of docs. Looks good to me.
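
To make the smoothing remark above concrete, here is a minimal pure-Python sketch of the quoted idf formula. It is not scikit-learn's implementation; the helper names `document_frequency` and `idf`, and the toy token lists, are assumptions made for illustration only.

```python
import math

def document_frequency(term, docs):
    """Number of documents in `docs` that contain `term` at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(term, docs, smooth=True):
    """Inverse document frequency of `term` over the collection `docs`.

    Hypothetical helper: with smooth=True it follows
    log[(1 + n) / (1 + df)] + 1, otherwise log[n / df] + 1.
    """
    n = len(docs)
    df = document_frequency(term, docs)
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1  # raises ZeroDivisionError when df == 0

# Toy collection: each document is a list of tokens (illustrative data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

idf_the = idf("the", docs)      # term in every document: log(4/4) + 1 = 1.0
idf_dog = idf("dog", docs)      # term in one document: log(4/2) + 1
idf_unseen = idf("bird", docs)  # df == 0 is fine with smoothing: log(4/1) + 1
```

Note that `idf` takes the whole collection and a term but no individual document, which is the point of the d versus D distinction discussed here.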

rth · 2019-01-28T06:58:26Z

Thanks for this update @vishaalkapoor

Just checked the formulation in reference textbooks,

Introduction to Information Retrieval by Manning et al., see https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
Modern Information Retrieval

both define TF-IDF as,

    tfidf(t, d) = tf(t, d) * idf(t)

so the difference is indeed that idf and df do not depend on the document (and this should be corrected in the docstring). The definition in the documentation is correct though https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

In general, I find this definition simpler than adding the document collection set D as an index everywhere.
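
A short sketch of that definition, assuming raw counts for tf and the smoothed idf variant: the idf table is built once per collection and reused for every document, which shows why idf needs no document argument. The names `idf_table` and `tfidf` and the toy data are hypothetical. TfidfTransformer additionally normalizes each row (`norm='l2'` by default), which is omitted here.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to idf(t); depends only on the term and the collection."""
    n = len(docs)
    # Count each term once per document to get the document frequency df(t).
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    """tfidf(t, d) = tf(t, d) * idf(t), with tf taken as the raw term count."""
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items()}

# Toy collection of tokenized documents (illustrative data).
docs = [["cat", "cat", "sat"], ["dog", "sat"], ["cat", "ran"]]

idf = idf_table(docs)                        # computed once for the collection
weights = [tfidf(doc, idf) for doc in docs]  # one weight dict per document
```

Here `weights[0]["cat"]` is twice `weights[2]["cat"]` because only tf(t, d) differs between the two documents; idf("cat") is shared.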

vishaalkapoor · 2019-01-28T19:22:14Z

Thanks for taking a look @rasbt and @rth!

Just confirming things - the main issue I wanted to fix was the occurrences of df(d, t) and tf(t). Since there is only one 'd', I understood it as standing for the document, rather than the document set. In this case, the two formulas for df and tf should be df(t) and tf(t, d) leaving the document set D implicit. The second issue as mentioned was the conflict between the dummy 'd' and the d in the df(d, t).

The Wikipedia entry makes everything explicit by including both a d and D but I agree it's pedantic and I prefer keeping D implicit as well. I also agree using n is cleaner instead of |D|.

I will make the above corrections to match the Stanford nlp course link, as I agree they will be the simplest to read.

However, the documentation does seem to have a similar issue. If you follow the text from the tf-idf formula, you'll notice df(d, t) somehow acquires a 'd' which conflicts with the 'd' in tf(t, d) (https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting). I can make a follow up change there.

rasbt · 2019-01-28T19:54:40Z

> In general, I find this definition simpler than adding the document collection set D as an index everywhere.

Yeah, adding it everywhere may be overkill. We can just simply let d \in D somewhere in the beginning, that would be clear enough.

rth · 2019-01-28T20:05:21Z

> the documentation does seem to have a similar issue. If you follow the text from the tf-idf formula, you'll notice df(d, t) somehow acquires a 'd' which conflicts with the 'd' in tf(t, d)

Agreed, a fix would be welcome.

vishaalkapoor · 2019-01-29T01:02:41Z

Thanks for the helpful comments.

I've pushed changes to the Python docstring and included changes to the feature_extraction documentation that I believe address all the comments. Regarding 'd \in D': I just mention the document set in the introductory definitions of n and df, and believe that should be enough to establish the document set as the scope/context.
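
For reference, the notation the thread converges on can be written out as follows. This is a sketch assembled from the comments above, not a quotation of the merged docstring; here n is the number of documents in the document set, df(t) is the number of documents in that set containing t, and the second idf line is the `smooth_idf=True` variant.

```latex
\begin{align*}
\operatorname{tfidf}(t, d) &= \operatorname{tf}(t, d) \cdot \operatorname{idf}(t) \\
\operatorname{idf}(t) &= \log\frac{n}{\operatorname{df}(t)} + 1 \\
\operatorname{idf}_{\text{smooth}}(t) &= \log\frac{1 + n}{1 + \operatorname{df}(t)} + 1
\end{align*}
```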

rth

LGTM. Thanks @vishaalkapoor !

jnothman

Good!

Correct TF-IDF formula in TfidfTransformer comments.

60635b9

The existing formula has some typos involving the arguments of tf and idf. Corrected the formula as per the Wikipedia article for TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Remove document set D as it is pedantic.

9424115

vishaalkapoor force-pushed the tfidf_typo branch from 78609c6 to 9424115 Compare January 28, 2019 19:16

Language to introduce document set as the context.

b3aa841

vishaalkapoor force-pushed the tfidf_typo branch from cf42aa8 to b3aa841 Compare January 29, 2019 01:01

Corrections to the feature extraction documentation.

566cf77

rth approved these changes Jan 29, 2019

jnothman approved these changes Jan 29, 2019

jnothman merged commit fdf2f38 into scikit-learn:master Jan 29, 2019

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 30, 2019

DOC Correct TF-IDF formula in TfidfTransformer comments. (scikit-lear…

de9630f

…n#13054)

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019

DOC Correct TF-IDF formula in TfidfTransformer comments. (scikit-lear…

a6bfec6

…n#13054)

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019

DOC Correct TF-IDF formula in TfidfTransformer comments. (scikit-lear…

8270c8d

…n#13054)

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019

DOC Correct TF-IDF formula in TfidfTransformer comments. (scikit-lear…

16d1ef6

…n#13054)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

DOC Correct TF-IDF formula in TfidfTransformer comments. (scikit-lear…

0eabcd0

…n#13054)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "DOC Correct TF-IDF formula in TfidfTransformer comments. (sci…

2b520fb

…kit-learn#13054)" This reverts commit 0eabcd0.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "DOC Correct TF-IDF formula in TfidfTransformer comments. (sci…

6521912

…kit-learn#13054)" This reverts commit 0eabcd0.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

DOC Correct TF-IDF formula in TfidfTransformer comments. (scikit-lear…

50efe7c

…n#13054)
