[MRG] tfidfvectorizer documentation #12204

Closed

@blooraspberry blooraspberry commented Sep 29, 2018

Reference Issues/PRs

closes #6766 and closes #9369

What does this implement/fix? Explain your changes.

This adds more information to the TfidfVectorizer documentation. It now includes comments about CountVectorizer and TfidfTransformer.

Any other comments?


CountVectorizer converts a collection of text documents to a matrix of token counts.

TfidfTransformer then converts the count matrix from CountVectorizer to a normalized tf-idf representation. Tf is term frequency, and idf is inverse document frequency. This is a common way to weight the count of a word in a document by how rare that word is across documents.
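
For readers following the description above, here is a minimal sketch of the two steps in plain Python. It does not call scikit-learn: `count_matrix` and `tfidf` are hypothetical helper names, the tokenizer is a plain lowercase whitespace split (an assumption, not CountVectorizer's token pattern), and the weighting assumes the smoothed idf, ln((1 + n) / (1 + df(t))) + 1, followed by L2 row normalization, which I believe matches TfidfTransformer's defaults.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tfidf(rows, smooth_idf=True):
    """Weight raw counts by idf, then scale each row to unit L2 norm."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    if smooth_idf:
        idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    else:
        idf = [math.log(n_docs / d) + 1 for d in df]
    weighted_rows = []
    for row in rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        weighted_rows.append([x / norm if norm else 0.0 for x in weighted])
    return weighted_rows


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, rows = count_matrix(docs)
weights = tfidf(rows)
```

In this toy corpus "the" appears in every document, so it gets the smallest idf, while "dog" appears in only one document and ends up with the largest weight in its row.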
Member

Can you make sure to break the lines please?

The formula that is used to compute the tf-idf of term t is
tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
Member

I think the other documentation has some formatting for these, can you make sure to copy the code, not the rendering?

Author

done!

@amueller
Member

Can you please reference the issues and PRs this is addressing in the description? Then merging this will close these.

Member

@amueller amueller left a comment

looks good.

Member

@qinhanmin2014 qinhanmin2014 left a comment

@blooraspberry I've restarted Travis for you and there are flake8 errors. Please correct them according to https://travis-ci.org/scikit-learn/scikit-learn/jobs/435054659

@qinhanmin2014
Member

And I guess this closes #6766 and #9369

@NicolasHug
Member

NicolasHug commented Oct 2, 2018

Travis output is unreadable, so here is what needs to be fixed in text.py:

~/dev/sklearn(branch:pr/12212*) » flake8 sklearn/feature_extraction/text.py
sklearn/feature_extraction/text.py:240:9: E731 do not assign a lambda expression, use a def
sklearn/feature_extraction/text.py:1289:64: W291 trailing whitespace
sklearn/feature_extraction/text.py:1291:75: W291 trailing whitespace
sklearn/feature_extraction/text.py:1292:18: W291 trailing whitespace
sklearn/feature_extraction/text.py:1294:78: W291 trailing whitespace
sklearn/feature_extraction/text.py:1295:79: W291 trailing whitespace
sklearn/feature_extraction/text.py:1296:78: W291 trailing whitespace
sklearn/feature_extraction/text.py:1297:46: W291 trailing whitespace
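
For anyone picking this up, the two kinds of errors listed above are usually fixed as sketched below. The code is illustrative only: `tokenize` and `strip_trailing_whitespace` are hypothetical names, and the lambda shown in the comment is not the actual expression at line 240 of text.py.

```python
# E731: flake8 flags a lambda that is bound to a name, for example
#     tokenize = lambda text: text.lower().split()
# The usual fix is an equivalent def, which also gives the function a real name.
def tokenize(text):
    """Hypothetical helper: lowercase the text and split on whitespace."""
    return text.lower().split()


# W291: trailing whitespace. Most editors can strip it on save; this
# hypothetical helper does the same for a block of source text.
def strip_trailing_whitespace(source):
    """Remove trailing spaces and tabs from every line of source."""
    return "\n".join(line.rstrip(" \t") for line in source.split("\n"))
```

Rerunning `flake8 sklearn/feature_extraction/text.py` locally after the changes is the quickest way to confirm the warnings are gone.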

@sergulaydore
Contributor

Hello @blooraspberry,

Thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation.

Any questions:

  • see workflow for reference
  • ask on this PR conversation or the issue tracker
  • ask on wimlds gitter with a reference to this PR

cc: @reshamas

@blooraspberry
Author

blooraspberry commented Nov 29, 2018 via email

@reshamas
Member

@blooraspberry
Will you be completing this PR?

@reshamas
Member

I am working on this PR.

Development

Successfully merging this pull request may close these issues.

TfidfVectorizer documentation doesn't track TfidfTransformer
7 participants