-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Correct TF-IDF formula in TfidfTransformer comments. #13054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please take a look, @rasbt Thanks! |
The existing formula has some typos involving the arguments of tf and idf. Corrected the formula as per the Wikipedia article for TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Except for the minor language corrections and using set cardinality notation instead of defining 'n' is maybe more concise. The equations look equivalent to me (I don't see that anything has changed, but maybe I missed sth). https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting If you prefer
and https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting but either way is fine since we don't use actual numbers. EDIT: Oh I just realized the diff is in the capital d and D whereas |
Thanks for this update @vishaalkapoor Just checked the formulation in reference textbooks,
both define TDFIDF as,
so the difference is indeed that In general, I find this definition simpler than adding the document collection set |
78609c6
to
9424115
Compare
Thanks for taking a look @rasbt and @rth! Just confirming things - the main issue I wanted to fix was the occurrences of df(d, t) and tf(t). Since there is only one 'd', I understood it as standing for the document, rather than the document set. In this case, the two formulas for df and tf should be df(t) and tf(t, d) leaving the document set D implicit. The second issue as mentioned was the conflict between the dummy 'd' and the d in the df(d, t). The Wikipedia entry makes everything explicit by including both a d and D but I agree it's pedantic and I prefer keeping D implicit as well. I also agree using n is cleaner instead of |D|. I will make the above corrections to match the Stanford nlp course link, as I agree they will be the simplest to read. However, the documentation does seem to have a similar issue. If you follow the text from the tf-idf formula, you'll notice df(d, t) somehow acquires a 'd' which conflicts with the 'd' in tf(t, d) (https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting). I can make a follow up change there. |
Yeah, adding it everywhere may be overkill. We can just simply let d \in D somewhere in the beginning, that would be clear enough. |
Agreed, a fix would be welcome. |
cf42aa8
to
b3aa841
Compare
Thanks for the helpful comments. I've pushed changes to the python docstring and included changes to the feature_extraction documentation that I believe include all the comments. Note as per 'd \in D', I just mention the document set in the introductory definitions of n and df and believe that should be enough to set the scope/context as being the document set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @vishaalkapoor !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good!
…kit-learn#13054)" This reverts commit 0eabcd0.
…kit-learn#13054)" This reverts commit 0eabcd0.
What does this implement/fix? Explain your changes.
The documentation for TfidfTransformer included some typos around the definition of tf-idf. They have been corrected as per Wikipedia article for TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf