[MRG] tfidfvectorizer documentation #12204
sklearn/feature_extraction/text.py
Outdated
CountVectorizer converts a collection of text documents to a matrix of token counts.
TfidfTransformer then converts the count matrix from CountVectorizer to a normalized tf-idf representation. Tf is term frequency, and idf is inverse document frequency. This is a common way to calculate the count of a word relative to the appearance of a document.
Can you make sure to break the lines please?
The formula that is used to compute the tf-idf of term t is
tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
I think the other documentation has some formatting for these, can you make sure to copy the code, not the rendering?
done!
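For readers following the formula discussed in this thread, here is a minimal pure-Python sketch of tf-idf(d, t) = tf(t) * idf(d, t). The excerpt above cuts off before the idf definition, so the smoothed form ln((1 + n) / (1 + df(t))) + 1 used below is an assumption based on scikit-learn's documented `smooth_idf=True` default, as is the L2 normalization. The toy corpus, whitespace tokenization, and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace for illustration only.
docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Assumed smoothed idf: ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    # Raw term counts in this document play the role of tf(t).
    tf = Counter(tokens)
    weights = {term: count * idf(term) for term, count in tf.items()}
    # Assumed L2 normalization, so each document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first = tfidf(tokenized[0])
```

In this sketch, "the" appears in every document and so receives the smallest idf, which is why "cat" and "sat" end up with larger weights in the first document.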
Can you please reference the issues and PRs this is addressing in the description? Then merging this will close these.
looks good.
@blooraspberry I've restarted Travis for you and there are flake8 errors. Please correct them according to https://travis-ci.org/scikit-learn/scikit-learn/jobs/435054659
Travis output is unreadable, so here is what needs to be fixed in text.py:
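Hello @blooraspberry, thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation. Any questions:
- see workflow <https://github.com/WiMLDS/nyc-2018-scikit-sprint/blob/master/2_contributing_workflow.md> for reference
- ask on this PR conversation or the issue tracker
- ask on wimlds gitter <https://gitter.im/scikit-learn/wimlds> with a reference to this PR
cc: @reshamas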
Hi Sergul,
Sorry I just saw this email -- didn't realize my github is connected to
another email account. I'll take a look soon.
Sharon
On Sun, Nov 11, 2018 at 11:29 AM Sergul Aydore wrote:
Hello @blooraspberry <https://github.com/blooraspberry> ,
Thank you for participating in the WiMLDS/scikit sprint. We would love to
merge all the PRs that were submitted. It would be great if you could
follow up on the work that you started! For the PR you submitted, would you
please update and re-submit? Please include #wimlds in your PR conversation.
Any questions:
- see workflow
<https://github.com/WiMLDS/nyc-2018-scikit-sprint/blob/master/2_contributing_workflow.md>
for reference
- ask on this PR conversation or the issue tracker
- ask on wimlds gitter <https://gitter.im/scikit-learn/wimlds> with a
reference to this PR
cc: @reshamas <https://github.com/reshamas>
@blooraspberry
I am working on this PR.
Reference Issues/PRs
closes #6766 and closes #9369
What does this implement/fix? Explain your changes.
This adds more information in the TfidfVectorizer documentation. It now includes comments about CountVectorizer and TfidfTransformer.
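As a rough illustration of the relationship the new docstring describes, the sketch below checks that `TfidfVectorizer` gives the same matrix as `CountVectorizer` followed by `TfidfTransformer`. It assumes default parameters for all three classes; the sample documents and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical sample documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Two-step route: raw token counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer performs both stages internally.
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings on both routes, the two matrices should agree.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is mainly useful when the count matrix is needed on its own, for example to reuse it with a different weighting.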
Any other comments?