Skip to content

Correct TF-IDF formula in TfidfTransformer comments. #13054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 29, 2019

Conversation

vishaalkapoor
Copy link
Contributor

@vishaalkapoor vishaalkapoor commented Jan 27, 2019

What does this implement/fix? Explain your changes.

The documentation for TfidfTransformer included some typos around the definition of tf-idf. They have been corrected as per Wikipedia article for TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

@vishaalkapoor
Copy link
Contributor Author

Please take a look, @rasbt

Thanks!

The existing formula has some typos involving the arguments of tf and
idf. Corrected the formula as per the Wikipedia article for TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
@rasbt
Copy link
Contributor

rasbt commented Jan 27, 2019

Except for the minor language corrections and using set cardinality notation instead of defining 'n' is maybe more concise. The equations look equivalent to me (I don't see that anything has changed, but maybe I missed sth).

https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

If you prefer df(t, D) over you may also want to change that below

If ``smooth_idf=True`` (the default), the constant "1" is added to the
numerator and denominator of the idf as if an extra document was seen
containing every term in the collection exactly once, which prevents
zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.

and https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

but either way is fine since we don't use actual numbers.

EDIT: Oh I just realized the diff is in the capital d and D whereas d refers to a particular doc and D is the set of docs. Looks good to me.

@rth
Copy link
Member

rth commented Jan 28, 2019

Thanks for this update @vishaalkapoor

Just checked the formulation in reference textbooks,

both define TDFIDF as,

    tfidf(t, d) = tf(t, d) * idf (t)

so the difference is indeed that idf and df do not depend on the document (and this should be corrected in the docstring). The definition in the documentation is correct though https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

In general, I find this definition simpler than adding the document collection set D as an index everywhere.

@vishaalkapoor
Copy link
Contributor Author

Thanks for taking a look @rasbt and @rth!

Just confirming things - the main issue I wanted to fix was the occurrences of df(d, t) and tf(t). Since there is only one 'd', I understood it as standing for the document, rather than the document set. In this case, the two formulas for df and tf should be df(t) and tf(t, d) leaving the document set D implicit. The second issue as mentioned was the conflict between the dummy 'd' and the d in the df(d, t).

The Wikipedia entry makes everything explicit by including both a d and D but I agree it's pedantic and I prefer keeping D implicit as well. I also agree using n is cleaner instead of |D|.

I will make the above corrections to match the Stanford nlp course link, as I agree they will be the simplest to read.

However, the documentation does seem to have a similar issue. If you follow the text from the tf-idf formula, you'll notice df(d, t) somehow acquires a 'd' which conflicts with the 'd' in tf(t, d) (https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting). I can make a follow up change there.

@rasbt
Copy link
Contributor

rasbt commented Jan 28, 2019

In general, I find this definition simpler than adding the document collection set D as an index everywhere.

Yeah, adding it everywhere may be overkill. We can just simply let d \in D somewhere in the beginning, that would be clear enough.

@rth
Copy link
Member

rth commented Jan 28, 2019

the documentation does seem to have a similar issue. If you follow the text from the tf-idf formula, you'll notice df(d, t) somehow acquires a 'd' which conflicts with the 'd' in tf(t, d)

Agreed, a fix would be welcome.

@vishaalkapoor
Copy link
Contributor Author

vishaalkapoor commented Jan 29, 2019

Thanks for the helpful comments.

I've pushed changes to the python docstring and included changes to the feature_extraction documentation that I believe include all the comments. Note as per 'd \in D', I just mention the document set in the introductory definitions of n and df and believe that should be enough to set the scope/context as being the document set.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @vishaalkapoor !

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good!

@jnothman jnothman merged commit fdf2f38 into scikit-learn:master Jan 29, 2019
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 30, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants