
[MRG+2-1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. #7839


Closed
wants to merge 11 commits

Conversation

@Mxbonn (Author) commented Nov 7, 2016

I added an option to CountVectorizer and TfidfVectorizer to use a floating point value for the max_features parameter to express a percentage of the features instead of an absolute value.

I also changed the default from None to 1.0 and added tests.

This is my first contribution to this project so let me know if I did anything wrong :).
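
A minimal usage sketch of the proposed semantics (the float behaviour shown here is what this PR proposes, not released API; the corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the pizza pizza beer",
    "the pizza burger beer",
    "the the pizza beer beer",
]

# Existing behaviour: an int keeps the top-k features by corpus-wide count.
CountVectorizer(max_features=3).fit(docs)

# Proposed behaviour (this PR): a float in (0, 1] keeps that proportion
# of the features, e.g. 0.5 keeps the top half of the vocabulary.
CountVectorizer(max_features=0.5).fit(docs)  # works with this PR applied
```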

Added a float option for max_features in CountVectorizer and
TfidfVectorizer. Float values represent a percentage.
Also cast the computed max_features * n_features to an int to avoid
DeprecationWarnings
@Mxbonn (Author) commented Nov 7, 2016

I'm not entirely sure how to fix the Travis error. I think it just requires changing max_features=None to max_features=1.0 in doc/modules/feature_extraction.rst, but I fail to reproduce the Travis steps locally.

```python
        raise ValueError(
            "max_features=%r, neither a positive integer nor None"
            % max_features)
    if max_features < 0:
```

Member:

We still need to support the legacy max_features=None as users may have been doing this. That setting will break here in Py3k. Please add a test.
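
(For context, this is why the quoted check breaks in Python 3: ordering comparisons with None now raise instead of silently succeeding.)

```python
# In Python 3, `max_features < 0` fails for the legacy max_features=None:
None < 0
# TypeError: '<' not supported between instances of 'NoneType' and 'int'
```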

@jnothman (Member) commented Nov 7, 2016

You just need to update the doctests to reflect the new parametrisation

@Mxbonn (Author) commented Nov 8, 2016

@jnothman I added the legacy max_features=None as you requested

@jnothman (Member) commented Nov 8, 2016

thanks


@@ -854,6 +852,10 @@ def fit_transform(self, raw_documents, y=None):

```python
        if max_doc_count < min_doc_count:
            raise ValueError(
                "max_df corresponds to < documents than min_df")
        max_features = (max_features
```

Member:

I find this hard to read. I'd rather an explicit:

```python
if max_features is None:
    max_features = n_features
elif ...
```
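
Spelled out fully, the suggested branch might read (a sketch; `n_features` as used in the quoted hunk, and the truncation semantics are an assumption):

```python
if max_features is None:
    max_features = n_features
elif isinstance(max_features, float):
    # A float is interpreted as a proportion of the features.
    max_features = int(max_features * n_features)
```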

Author:

Yes, I agree that this was not the best solution, but I wanted to keep it consistent with the lines above. I changed it now.

```
Build a vocabulary that only consider the top max_features ordered
by term frequency across the corpus.
If float the parameter represents a proportion of the features, integer
absolute counts.
```

Member:

Technically we should add a versionchanged note

Author:

So I add:

```rst
.. versionchanged:: 0.19
   Added percentage option.
```

Or which version will this be in?

Member:

Yes, that's fine.

```
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
max_features : float in range [0.0, 1.0] or int, default=1.0
Build a vocabulary that only consider the top max_features ordered
```

Member:

while we're here, change "consider" to "considers"

```python
counts_None = cv_None.fit_transform(JUNK_FOOD_DOCS).sum(axis=0)

features_1 = cv_1.get_feature_names()
features_3 = cv_3.get_feature_names()
features_all = cv_all.get_feature_names()
```

Member:

Can you please check for equality and subset relations between the different sets of feature names? (i.e. the 1-feature set is a subset of the 3-feature set, which is a subset of the None set; 3 = 0.5, None = 1)

Author:

I'm not sure I understand what you mean. I already check that the new percentage option is consistent with manually specifying the number of features (e.g. 3 = 0.5 and 1 = None) in the assert statements a few lines below this. Which check do you want me to add?

Member:

It's not so important, but I thought it worth checking that the actual 1-feature set is a subset of the 3-feature set, etc. No big deal.

Author:

Okay I understand now. I added a check for this ;).
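
A sketch of such a subset check, using the variable names from the quoted test (the exact assertions added in the PR may differ):

```python
# The 1-feature vocabulary should be contained in the 3-feature one,
# which in turn should be contained in the full vocabulary.
assert set(features_1).issubset(set(features_3))
assert set(features_3).issubset(set(features_all))
```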

@jnothman (Member) commented Nov 8, 2016

Otherwise LGTM

@jnothman jnothman changed the title Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. [MRG+1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. Nov 8, 2016
@amueller (Member) commented Nov 17, 2016

That seems like an odd way to parametrize a vocabulary to me... is that something NLP people do? I feel min_df is much more natural?

@Mxbonn (Author) commented Nov 19, 2016

I am relatively new to NLP, so I can only speak for myself: I find this feature more useful than setting min_df.

@jnothman (Member):
Thinking about this again, it is a bit fiddly. We at least should clarify that this is the count of distinct tokens ("types" in some literature) after stopwords are removed.
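
For example, with the built-in English stop word list the proportion would apply to the reduced vocabulary (an illustrative sketch, not code from the PR):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# 'the' and 'on' are dropped before max_features is applied, so a float
# max_features would be a proportion of the *remaining* distinct tokens.
cv = CountVectorizer(stop_words='english')
cv.fit(docs)
print(len(cv.vocabulary_))  # number of types after stop word removal
```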

@Mxbonn (Author) commented Nov 21, 2016

@jnothman Do you want me to change this in the description, or are you still unsure whether the feature is useful at all?

@jnothman (Member):
Document it, please.


Updated documentation to reflect that the max_features value is applied
after stop word removal
@Mxbonn (Author) commented Nov 23, 2016

@jnothman Documented it :)

@jnothman (Member) left a comment:

still LGTM

```
Build a vocabulary that only considers the top max_features ordered
by term frequency across the corpus.
If float, the parameter represents a proportion of the features after
stop words removal, integer absolute counts.
```

Member:

Grammar here is a bit hard to read. How about a new sentence: "If an integer, it represents absolute counts"?

```
Build a vocabulary that only considers the top max_features ordered
by term frequency across the corpus.
If float, the parameter represents a proportion of the features after
stop words removal, integer absolute counts.
```

Member:

ditto

@amueller (Member):
LGTM, why not... needs whatsnew

@amueller amueller changed the title [MRG+1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. [MRG+2] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. Nov 30, 2016
```python
        raise ValueError(
            "max_features=%r, neither a positive integer nor None"
            % max_features)
    if max_features is not None and max_features < 0:
```

@lesteve (Member):

Why would max_features be None?

I think you need better type checking to make sure that max_features is an integer or a float. A good start would be isinstance(max_features, numbers.Number) but then you need to check 0 < max_features <= 1 in the float case and max_features > 0 in the int case.
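
A sketch of the suggested validation (hypothetical helper; not the code that ended up in the PR):

```python
import numbers

def _check_max_features(max_features):
    """Validate max_features per the suggestion above (illustrative only)."""
    if isinstance(max_features, numbers.Integral):
        if max_features <= 0:
            raise ValueError("max_features=%r must be a positive integer"
                             % max_features)
    elif isinstance(max_features, numbers.Number):
        if not 0 < max_features <= 1:
            raise ValueError("max_features=%r must be in (0, 1] when a float"
                             % max_features)
    else:
        raise ValueError("max_features=%r is neither an int nor a float"
                         % max_features)
```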

Author:

@lesteve
None is the current default. I changed this default to 1.0, which has the same meaning, but to keep code compatible, None is still supported.

I chose not to check specifically for both floats and ints because I based it on the checks for min_df and max_df, which both support float and integer values in the same manner and just check for min_df < 0. But I could change it if you wish :)

Member:

> to keep code compatible, None is still supported.

Deprecate max_features=None at the very least.
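
A deprecation along these lines would fit the usual pattern (a sketch under that assumption, not the PR's code):

```python
import warnings

if max_features is None:
    warnings.warn("max_features=None is deprecated and will be removed; "
                  "use max_features=1.0, which selects all features.",
                  DeprecationWarning)
    max_features = 1.0
```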

"max_features=%r, neither a positive integer nor None"
% max_features)
if max_features is not None and max_features < 0:
raise ValueError("negative value for max_features")
@lesteve (Member):

Need a more informative error message; it is always a good idea to print the value that was provided, i.e. something like:

```python
raise ValueError("'max_features' needs to be a float in [0, 1] or a positive "
                 "integer. You provided 'max_features={}'".format(max_features))
```

@Mxbonn (Author) commented Dec 1, 2016

@lesteve Do you want me to update the ValueErrors for min_df and max_df as well, then? I based my error message on those current error messages.

@lesteve (Member) commented Dec 5, 2016

Fine, keep it like this for now, but it would be great to improve the type checking of min_df, max_df and max_features, as well as the error messages, in a separate PR.

@GaelVaroquaux GaelVaroquaux changed the title [MRG+2] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. [MRG+2-1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. Dec 2, 2016
@GaelVaroquaux (Member) commented Dec 2, 2016 via email

@lesteve (Member) commented Dec 2, 2016

> However, having a different behavior based on the type of the entry is a very bad idea: people will have the behavior switch unintentionally.

max_df and min_df already have this semantic (absolute count if an int, relative frequency if a float) in CountVectorizer and TfidfVectorizer. I think this PR was just proposing a similar feature for max_features. Are you saying max_df and min_df should be changed too?
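
For reference, the existing dual semantics of those parameters (this matches the released CountVectorizer API):

```python
from sklearn.feature_extraction.text import CountVectorizer

# int: ignore terms that appear in fewer than 2 documents (absolute count).
cv_abs = CountVectorizer(min_df=2)

# float: ignore terms that appear in less than 50% of documents (proportion).
cv_rel = CountVectorizer(min_df=0.5)
```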

@jnothman (Member) commented Dec 3, 2016 via email

@GaelVaroquaux (Member) commented Dec 3, 2016 via email

@GaelVaroquaux (Member) commented Dec 3, 2016 via email

@jnothman (Member) commented Dec 3, 2016 via email

@jnothman (Member) commented Dec 3, 2016 via email

@amueller (Member) commented Dec 3, 2016

I think we have seen bug reports based on the ambiguity (because the meaning of 1.0 changes between versions when the float is introduced), but fewer than I would expect. I don't think this PR is the right place to discuss this interface. We should either keep this one consistent with the rest, or change it and deprecate it everywhere. But that deprecation is much more major surgery than this PR.

About the "%" API: what happens if we add an integer option to a parameter that used to accept only floats? Will the float be deprecated and replaced by a "%" string? Or will we use "%" for all floats in 0-1 from now on? It would not only require people to add the string but also to change the number. I find max_features=.1 a lot more concise than max_features="10%".

I'm -1 on changing our syntax in this PR.

Issue to discuss here: #7973

@jnothman (Member) commented Dec 4, 2016 via email

@amueller added the Needs Decision label Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:49
@Mxbonn Mxbonn closed this Sep 20, 2023