
[MRG+2-1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. #7839


Closed
wants to merge 11 commits

Conversation

@Mxbonn (Author) commented Nov 7, 2016

I added an option to CountVectorizer and TfidfVectorizer to use a floating point value for the max_features parameter to express a percentage of the features instead of an absolute value.

I also changed the default from None to 1.0 and added tests.

This is my first contribution to this project so let me know if I did anything wrong :).
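
A minimal usage sketch of the proposed semantics (the float behaviour shown here is what this PR proposes, not released API; the corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the pizza pizza beer",
    "the pizza burger beer",
    "the the pizza beer beer",
]

# Existing behaviour: an int keeps the top-k features by corpus-wide count.
CountVectorizer(max_features=3).fit(docs)

# Proposed behaviour (this PR): a float in (0, 1] keeps that proportion
# of the features, e.g. 0.5 keeps the top half of the vocabulary.
CountVectorizer(max_features=0.5).fit(docs)  # works with this PR applied
```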

Added a float option for max_features in CountVectorizer and
TfidfVectorizer. Float values represent a percentage.
Also cast the computed max_features * n_features to an int to avoid
DeprecationWarnings
@Mxbonn (Author) commented Nov 7, 2016

I'm not entirely sure how to fix the Travis error. I think it just requires changing max_features=None to max_features=1.0 in doc/modules/feature_extraction.rst, but I fail to reproduce the Travis steps locally.

```python
        raise ValueError(
            "max_features=%r, neither a positive integer nor None"
            % max_features)
    if max_features < 0:
```

Member:

We still need to support the legacy max_features=None as users may have been doing this. That setting will break here in Py3k. Please add a test.
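
(For context, this is why the quoted check breaks in Python 3: ordering comparisons with None now raise instead of silently succeeding.)

```python
# In Python 3, `max_features < 0` fails for the legacy max_features=None:
None < 0
# TypeError: '<' not supported between instances of 'NoneType' and 'int'
```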

@jnothman (Member) commented Nov 7, 2016

You just need to update the doctests to reflect the new parametrisation

@Mxbonn (Author) commented Nov 8, 2016

@jnothman I added the legacy max_features=None as you requested

@jnothman (Member) commented Nov 8, 2016

thanks


@@ -854,6 +852,10 @@ def fit_transform(self, raw_documents, y=None):

```python
        if max_doc_count < min_doc_count:
            raise ValueError(
                "max_df corresponds to < documents than min_df")
        max_features = (max_features
```

Member:

I find this hard to read. I'd rather an explicit:

```python
if max_features is None:
    max_features = n_features
elif ...
```
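
Spelled out fully, the suggested branch might read (a sketch; `n_features` as used in the quoted hunk, and the truncation semantics are an assumption):

```python
if max_features is None:
    max_features = n_features
elif isinstance(max_features, float):
    # A float is interpreted as a proportion of the features.
    max_features = int(max_features * n_features)
```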

Author:

Yes, I agree that this was not the best solution, but I wanted to keep it consistent with the lines above. I changed it now.

```
Build a vocabulary that only consider the top max_features ordered
by term frequency across the corpus.
If float the parameter represents a proportion of the features, integer
absolute counts.
```

Member:

Technically we should add a versionchanged note

Author:

So I add:

```rst
.. versionchanged:: 0.19
   Added percentage option.
```

Or which version will this be in?

Member:

Yes, that's fine.

```
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
max_features : float in range [0.0, 1.0] or int, default=1.0
Build a vocabulary that only consider the top max_features ordered
```

Member:

while we're here, change "consider" to "considers"

```python
counts_None = cv_None.fit_transform(JUNK_FOOD_DOCS).sum(axis=0)

features_1 = cv_1.get_feature_names()
features_3 = cv_3.get_feature_names()
features_all = cv_all.get_feature_names()
```

Member:

Can you please check for equality and subset relations between the different sets of feature names? (i.e. the 1-feature set is a subset of the 3-feature set, which is a subset of the None set; 3 = 0.5, None = 1)

Author:

I'm not sure I understand what you mean. I already check that the new percentage option is consistent with manually specifying the number of features (e.g. 3 = 0.5 and 1 = None) in the assert statements a few lines below this. Which check do you want me to add?

Member:

It's not so important, but I thought it worth checking that the actual 1-feature set is a subset of the 3-feature set, etc. No big deal.

Author:

Okay I understand now. I added a check for this ;).
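
A sketch of such a subset check, using the variable names from the quoted test (the exact assertions added in the PR may differ):

```python
# The 1-feature vocabulary should be contained in the 3-feature one,
# which in turn should be contained in the full vocabulary.
assert set(features_1).issubset(set(features_3))
assert set(features_3).issubset(set(features_all))
```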

@jnothman (Member) commented Nov 8, 2016

Otherwise LGTM

@jnothman jnothman changed the title Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. [MRG+1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. Nov 8, 2016
@amueller (Member) commented Nov 17, 2016

That seems like an odd way to parametrize a vocabulary to me... is that something NLP people do? I feel min_df is much more natural?

@Mxbonn (Author) commented Nov 19, 2016

I am relatively new to NLP, so I can only speak for myself: I find this feature more useful than setting min_df.

@jnothman (Member):
Thinking about this again, it is a bit fiddly. We at least should clarify that this is the count of distinct tokens ("types" in some literature) after stopwords are removed.
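
For example, with the built-in English stop word list the proportion would apply to the reduced vocabulary (an illustrative sketch, not code from the PR):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# 'the' and 'on' are dropped before max_features is applied, so a float
# max_features would be a proportion of the *remaining* distinct tokens.
cv = CountVectorizer(stop_words='english')
cv.fit(docs)
print(len(cv.vocabulary_))  # number of types after stop word removal
```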

@Mxbonn (Author) commented Nov 21, 2016

@jnothman Do you want me to change this in the description, or are you still unsure whether the feature is useful at all?

@jnothman (Member):
Document it, please.


Updated documentation to reflect that the max_features value is applied
after stop word removal
@Mxbonn (Author) commented Nov 23, 2016

@jnothman Documented it :)

@jnothman (Member) left a comment:

still LGTM

```
Build a vocabulary that only considers the top max_features ordered
by term frequency across the corpus.
If float, the parameter represents a proportion of the features after
stop words removal, integer absolute counts.
```

Member:

Grammar here is a bit hard to read. How about a new sentence: "If an integer, it represents absolute counts"?

```
Build a vocabulary that only considers the top max_features ordered
by term frequency across the corpus.
If float, the parameter represents a proportion of the features after
stop words removal, integer absolute counts.
```

Member:

ditto

@amueller (Member):
LGTM, why not... needs whatsnew

@amueller amueller changed the title [MRG+1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. [MRG+2] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. Nov 30, 2016
```python
        raise ValueError(
            "max_features=%r, neither a positive integer nor None"
            % max_features)
    if max_features is not None and max_features < 0:
```

@lesteve (Member):

Why would max_features be None?

I think you need better type checking to make sure that max_features is an integer or a float. A good start would be isinstance(max_features, numbers.Number) but then you need to check 0 < max_features <= 1 in the float case and max_features > 0 in the int case.
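
A sketch of the suggested validation (hypothetical helper; not the code that ended up in the PR):

```python
import numbers

def _check_max_features(max_features):
    """Validate max_features per the suggestion above (illustrative only)."""
    if isinstance(max_features, numbers.Integral):
        if max_features <= 0:
            raise ValueError("max_features=%r must be a positive integer"
                             % max_features)
    elif isinstance(max_features, numbers.Number):
        if not 0 < max_features <= 1:
            raise ValueError("max_features=%r must be in (0, 1] when a float"
                             % max_features)
    else:
        raise ValueError("max_features=%r is neither an int nor a float"
                         % max_features)
```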

Author:

@lesteve
None is the current default. I changed this default to 1.0, which has the same meaning, but to keep code compatible, None is still supported.

I chose not to check specifically for both floats and ints because I based it on the checks for min_df and max_df, which both support float and integer values in the same manner and just check for min_df < 0. But I could change it if you wish :)

Member:

> to keep code compatible, None is still supported.

Deprecate max_features=None at the very least.
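
A deprecation along these lines would fit the usual pattern (a sketch under that assumption, not the PR's code):

```python
import warnings

if max_features is None:
    warnings.warn("max_features=None is deprecated and will be removed; "
                  "use max_features=1.0, which selects all features.",
                  DeprecationWarning)
    max_features = 1.0
```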

"max_features=%r, neither a positive integer nor None"
% max_features)
if max_features is not None and max_features < 0:
raise ValueError("negative value for max_features")
@lesteve (Member):

Need a more informative error message; it is always a good idea to print the value that was provided, i.e. something like:

```python
raise ValueError("'max_features' needs to be a float in [0, 1] or a positive "
                 "integer. You provided 'max_features={}'".format(max_features))
```

@Mxbonn (Author) commented Dec 1, 2016

@lesteve Do you want me to update the ValueErrors for min_df and max_df as well, then? I based my error message on those current error messages.

@lesteve (Member) commented Dec 5, 2016

Fine, keep it like this for now, but it would be great to improve the type checking of min_df, max_df and max_features, as well as the error messages, in a separate PR.

@GaelVaroquaux GaelVaroquaux changed the title [MRG+2] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. [MRG+2-1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. Dec 2, 2016
@GaelVaroquaux (Member) commented Dec 2, 2016 via email

@lesteve (Member) commented Dec 2, 2016

> However, having a different behavior based on the type of the entry is a very bad idea: people will have the behavior switch unintentionally.

max_df and min_df already have this semantic (absolute count if an int, relative frequency if a float) in CountVectorizer and TfidfVectorizer. I think this PR was just proposing a similar feature for max_features. Are you saying max_df and min_df should be changed too?
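
For reference, the existing dual semantics of those parameters (this matches the released CountVectorizer API):

```python
from sklearn.feature_extraction.text import CountVectorizer

# int: ignore terms that appear in fewer than 2 documents (absolute count).
cv_abs = CountVectorizer(min_df=2)

# float: ignore terms that appear in less than 50% of documents (proportion).
cv_rel = CountVectorizer(min_df=0.5)
```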

@jnothman (Member) commented Dec 3, 2016 via email

@GaelVaroquaux (Member) commented Dec 3, 2016 via email

@GaelVaroquaux (Member) commented Dec 3, 2016 via email

@jnothman (Member) commented Dec 3, 2016 via email

@jnothman (Member) commented Dec 3, 2016 via email

@amueller (Member) commented Dec 3, 2016

I think we have seen bug reports based on the ambiguity (because the meaning of 1.0 changes between versions when the float is introduced), but fewer than I would expect. I don't think this PR is the right place to discuss this interface. We should either keep this one consistent with the rest, or change it and deprecate it everywhere. But that deprecation is much more major surgery than this PR.

About the "%" API: what happens if we add an integer option to a parameter that used to accept only floats? Will the float be deprecated and replaced by a "%" string? Or will we use "%" for all floats in 0-1 from now on? It would not only require people to add the string but also to change the number. I find max_features=.1 a lot more concise than max_features="10%".

I'm -1 on changing our syntax in this PR.

Issue to discuss here: #7973

@jnothman (Member) commented Dec 4, 2016 via email

@amueller added the Needs Decision label Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:49
@Mxbonn Mxbonn closed this Sep 20, 2023