[MRG+2-1] Add floating point option to max_features in CountVectorizer and TfidfVectorizer to use a percentage of the features. #7839
Conversation
Added a float option for max_features in CountVectorizer and TfidfVectorizer. Float values represent a proportion of the features.
Also cast the computed max_features * n_features to an int to avoid DeprecationWarnings.
I'm not entirely sure how to fix the Travis error.
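A hedged usage sketch of the proposed behaviour (this PR's branch, not a released scikit-learn); the documents and variable names here are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["pizza pizza beer copyright",
            "pizza burger beer copyright",
            "coke burger burger copyright"]

    # float: keep the top half of the distinct terms by corpus frequency
    cv_half = CountVectorizer(max_features=0.5).fit(docs)

    # int: keep at most 3 terms, exactly as before
    cv_three = CountVectorizer(max_features=3).fit(docs)

    print(cv_half.get_feature_names())
    print(cv_three.get_feature_names())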
sklearn/feature_extraction/text.py
Outdated
raise ValueError(
    "max_features=%r, neither a positive integer nor None"
    % max_features)
if max_features < 0:
We still need to support the legacy max_features=None, as users may have been doing this. That setting will break here in Py3k. Please add a test.
You just need to update the doctests to reflect the new parametrisation.
@jnothman I added the legacy max_features=None support.
thanks
sklearn/feature_extraction/text.py
Outdated
@@ -854,6 +852,10 @@ def fit_transform(self, raw_documents, y=None):
    if max_doc_count < min_doc_count:
        raise ValueError(
            "max_df corresponds to < documents than min_df")
    max_features = (max_features
I find this hard to read. I'd rather an explicit:
if max_features is None:
    max_features = n_features
elif ...
Yes, I agree that this was not the best solution, but I wanted to keep it consistent with the lines above. I've changed it now.
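For reference, a hedged sketch of the explicit form being asked for; the variable names follow the suggestion above, and the float branch is my assumption of how the proportion would be converted on this PR's branch:

    if max_features is None:
        max_features = n_features
    elif isinstance(max_features, float):
        # a float is read as a proportion of the vocabulary size
        max_features = int(max_features * n_features)
    # otherwise max_features is already an absolute (integer) count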
sklearn/feature_extraction/text.py
Outdated
Build a vocabulary that only consider the top max_features ordered
by term frequency across the corpus.
If float the parameter represents a proportion of the features, integer
absolute counts.
Technically we should add a versionchanged note.
So I add:

.. versionchanged:: 0.19
   Added percentage option.

Or which version will this be in?
Yes, that's fine.
sklearn/feature_extraction/text.py
Outdated
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
max_features : float in range [0.0, 1.0] or int, default=1.0
Build a vocabulary that only consider the top max_features ordered
while we're here, change "consider" to "considers"
counts_None = cv_None.fit_transform(JUNK_FOOD_DOCS).sum(axis=0)

features_1 = cv_1.get_feature_names()
features_3 = cv_3.get_feature_names()
features_all = cv_all.get_feature_names()
Can you please check for equality and subset relations between the different sets of feature names? (i.e. 1 is a subset of 3, which is a subset of None; 3 = .5, None = 1)
I'm not sure I understand what you mean. I already check that the new proportion option is consistent with manually specifying the number of features (e.g. 3 = 0.5 and 1 = None) in the assert statements a few lines below this. Which check do you want me to add?
It's not so important, but I thought it worth checking that the actual 1-feature set is a subset of the 3-feature set, etc. No big deal.
Okay, I understand now. I added a check for this ;).
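A hedged sketch of what the added checks might look like; cv_1, cv_3 and cv_all follow the diff above, while cv_0p5 and cv_1p0 (vectorizers built with max_features=0.5 and max_features=1.0) are hypothetical names of mine:

    features_1 = set(cv_1.get_feature_names())
    features_3 = set(cv_3.get_feature_names())
    features_all = set(cv_all.get_feature_names())

    # the 1-feature vocabulary is contained in the 3-feature one, which in
    # turn is contained in the unrestricted vocabulary
    assert features_1 <= features_3 <= features_all

    # the float settings should reproduce the equivalent int/None settings
    # (JUNK_FOOD_DOCS has 6 distinct terms, so 0.5 corresponds to 3)
    assert set(cv_0p5.get_feature_names()) == features_3
    assert set(cv_1p0.get_feature_names()) == features_all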
Otherwise LGTM
That seems like an odd way to parametrize a vocabulary to me... is that something NLP people do? I feel
I am relatively new to NLP, so I can only speak for myself: I find this feature more useful than setting
Thinking about this again, it is a bit fiddly. We should at least clarify that this is the count of distinct tokens ("types" in some literature) after stopwords are removed.
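A hedged illustration of that point (again assuming this PR's branch): the proportion would apply to the vocabulary left after stop word removal, not to every distinct token in the raw text:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the pizza and the beer", "the burger and the beer"]

    # With English stop words removed, the remaining types are pizza, beer
    # and burger, so max_features=0.5 means half of those three types, not
    # half of the five distinct tokens in the raw documents.
    cv = CountVectorizer(stop_words="english", max_features=0.5).fit(docs)
    print(cv.get_feature_names())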
@jnothman Do you want me to change this in the description, or are you still unsure whether the feature is useful at all?
Document it, please.
Updated documentation to reflect that the max_features value is applied after stop word removal.
@jnothman Documented it :)
still LGTM
sklearn/feature_extraction/text.py
Outdated
Build a vocabulary that only considers the top max_features ordered
by term frequency across the corpus.
If float, the parameter represents a proportion of the features after
stop words removal, integer absolute counts.
Grammar here is a bit hard to read. How about a new sentence: "If an integer, it represents absolute counts"?
sklearn/feature_extraction/text.py
Outdated
Build a vocabulary that only considers the top max_features ordered
by term frequency across the corpus.
If float, the parameter represents a proportion of the features after
stop words removal, integer absolute counts.
ditto
LGTM, why not... needs whatsnew
raise ValueError(
    "max_features=%r, neither a positive integer nor None"
    % max_features)
if max_features is not None and max_features < 0:
Why would max_features be None?
I think you need better type checking to make sure that max_features is an integer or a float. A good start would be isinstance(max_features, numbers.Number), but then you need to check 0 < max_features <= 1 in the float case and max_features > 0 in the int case.
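A hedged sketch of the kind of validation being described; this is not the code in the PR, just one way the checks could be written:

    import numbers

    if max_features is not None:
        if not isinstance(max_features, numbers.Number):
            raise ValueError("max_features=%r, expected a float in (0, 1] "
                             "or a positive integer" % max_features)
        if isinstance(max_features, numbers.Integral):
            if max_features <= 0:
                raise ValueError("max_features=%r must be a positive "
                                 "integer" % max_features)
        elif not 0 < max_features <= 1:
            raise ValueError("max_features=%r must be in (0, 1] when given "
                             "as a float" % max_features)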
@lesteve None is the current default. I changed this default to 1.0, which now has the same meaning, but to keep code compatible, None is still supported.
I chose not to check specifically for both floats and ints because I based it on the checks in min_df and max_df, which both support float and integer values in the same manner and just check for min_df < 0. But I could change it if you wish :)
> to keep code compatible, None is still supported.

Deprecate max_features=None at the very least.
"max_features=%r, neither a positive integer nor None" | ||
% max_features) | ||
if max_features is not None and max_features < 0: | ||
raise ValueError("negative value for max_features") |
Need a more informative error message; it is always a good idea to print the value that was provided, i.e. something like:

raise ValueError("'max_features' needs to be a float in [0, 1] or a "
                 "positive integer. You provided 'max_features={}'"
                 .format(max_features))
@lesteve Do you want me to update the ValueErrors for min_df and max_df as well then, because I based my error message on those current error messages?
Fine, keep it like this for now, but it would be great to improve the type checking of min_df, max_df and max_features, as well as the error messages, in a separate PR.
> I think you need better type checking to make sure that max_features is an
> integer or a float. A good start would be isinstance(max_features,
> numbers.Number) but then you need to check 0 < max_features <= 1 in the float
> case and max_features > 0 in the int case.

I haven't had time to look at this PR, so I cannot comment in detail.
However, having a different behavior based on the type of the entry is a
very bad idea: people will have the behavior switch unintentionally.

I would be -1 for the PR as it is (I've edited the title accordingly). I
think that these are bad semantics. Feel free to override my downvote if
there is a strong consensus the other way.

I would prefer a more explicit difference, like a string "50%". I know
that this is not the way that we have been doing things so far in
scikit-learn, but I think that the way we have been doing things is a bad
one.

If we don't go that way, I strongly support the comments of @lesteve that
I have quoted above.
I've thought the same about the brittleness of distinguishing float from
int... but it is certainly something we have repeatedly in the code base.

It's a much safer feature where floats >= 1 are forbidden, and blocking the
>= 1 case removes backwards compatibility concerns where users had
previously passed floats.

Given that this is a case where we can constrain the float to be strictly
less than 1, I'm not persuaded that this is the right place to change to
using a "%" representation.
…On 2 December 2016 at 21:06, Loïc Estève ***@***.***> wrote:
> > However, having a different behavior based on the type of the entry is a
> > very bad idea: people will have the behavior switch unintentionally.
>
> max_df and min_df already have this semantic (absolute count if an int,
> relative frequency if a float) in CountVectorizer and TfidfVectorizer.
> I think this PR was just proposing to have a similar feature for
> max_features. Are you saying it should be changed too?
> max_df and min_df already have this semantic (absolute count if an int,
> relative frequency if a float) in CountVectorizer and TfidfVectorizer.

I know.

> Are you saying it should be changed too?

I don't know about backward-incompatible changes. Maybe; I think that
they would be good.

I do think that I would prefer not repeating this ambiguous design.
> I've thought the same about the brittleness of distinguishing float from
> int... but it is certainly something we have repeatedly in the code base.

And I don't like it. Everywhere in the codebase.
I agree. But you also don't like deprecation :)
*flippant/gratuitous deprecation
I think we have seen bug reports based on the ambiguity (because the meaning of 1.0 changes between versions when the float is introduced), but less than what I would expect.

I don't think this PR is the right place to discuss this interface. We should either keep this one consistent with the rest, or change it and deprecate it everywhere. But that deprecation is much more major surgery than this PR.

About the "%" API: what happens if we add an integer to a feature that used to accept only floats? Will the float be deprecated and replaced by a %? Or will we use % for all floats 0-1 from now on?

I'm -1 on changing our syntax in this PR. Issue to discuss here: #7973
"because the meaning of 1.0 changes between versions when the float is
introduced" is why I'd prefer it be used only when there is an exclusive
upper bound at 1 for floats.
I added an option to CountVectorizer and TfidfVectorizer to use a floating point value for the max_features parameter, expressing a proportion of the features instead of an absolute count.
I also changed the default from None to 1.0 and added tests.
This is my first contribution to this project, so let me know if I did anything wrong :).
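A hedged sketch of the backward-compatibility claim (on this PR's branch): the new default max_features=1.0 keeps every feature, just as the legacy max_features=None did; the documents are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["pizza pizza beer copyright", "coke burger burger copyright"]

    names_new = CountVectorizer(max_features=1.0).fit(docs).get_feature_names()
    names_old = CountVectorizer(max_features=None).fit(docs).get_feature_names()
    assert names_new == names_old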