Determining parameter behavior based on float / int type #7973

amueller · 2016-12-03T21:37:08Z

There are many places in scikit-learn where ints and floats (usually between 0 and 1) are allowed, and have different semantics.
Most of us think this is brittle. This is an issue to discuss whether we should deprecate and use a different interface, and what that interface should be.

I think the main issue arises when people pass 1.0 vs 1 and 0.0 vs 0 (though 0.0 and 0 are likely to have the same semantics).
I think we can keep the current semantics (as they are very concise) as long as we are careful about 1.0 and 1.
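
To make the 1.0 vs 1 problem concrete, here is a minimal sketch of the int/float convention. `resolve_count` is a hypothetical helper written for this illustration, not scikit-learn's actual validation code, and the rounding rule is an assumption (estimators differ on it).

```python
import math
from numbers import Integral, Real


def resolve_count(value, n_total):
    """Hypothetical helper: turn an int-or-float parameter into a count.

    ints are read as absolute counts, floats in (0, 1] as a fraction of
    n_total. Rounding up is an assumption made for this sketch.
    """
    if isinstance(value, Integral):
        return int(value)
    if isinstance(value, Real):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"float must be in (0, 1], got {value!r}")
        return math.ceil(value * n_total)
    raise TypeError(f"expected int or float, got {type(value).__name__}")


# Same number, different type, very different meaning:
print(resolve_count(1, 100))    # 1 (a single item)
print(resolve_count(1.0, 100))  # 100 (all items)
```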

@GaelVaroquaux proposed to use a more explicit syntax, like "10%", which is a non-backward-compatible change and would need a full deprecation of all floats between 0 and 1. There is a question of whether this applies to all floats between 0 and 1 or only to those parameters where an integer is also possible.

My half-baked idea would be to replace 1.0 by "all", which is quite intuitive. I'm not sure what to do about 1, though. We could require something like "one", which would be a bit inconsistent, but solve all use-cases (grid-searching from 1 to 10 will be odd, though).

However, if we do the %, then grid-searching fractions becomes pretty ugly.
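
For comparison, a rough sketch of what the explicit syntax could look like. Everything here is hypothetical: `parse_size`, the accepted spellings, and the rounding are assumptions for illustration, not an agreed interface.

```python
import math
from numbers import Integral


def parse_size(value, n_total):
    """Hypothetical parser: ints are counts, "N%" is a fraction, "all" is everything."""
    if isinstance(value, Integral) and not isinstance(value, bool):
        return int(value)
    if isinstance(value, str):
        if value == "all":
            return n_total
        if value.endswith("%"):
            percent = float(value[:-1])
            if not 0.0 < percent <= 100.0:
                raise ValueError(f"percentage must be in (0, 100], got {value!r}")
            # Rounding up is an assumption made for this sketch.
            return math.ceil(percent * n_total / 100)
        raise ValueError(f"unrecognised size string: {value!r}")
    # Bare floats are rejected on purpose, so 1.0 can no longer be confused with 1.
    raise TypeError(f"expected int or str, got {type(value).__name__}")


# Grids for a search over this parameter:
count_grid = list(range(1, 11))                           # unchanged
fraction_grid_now = [0.1, 0.2, 0.5]                       # current float style
fraction_grid_percent = [f"{p}%" for p in (10, 20, 50)]   # proposed string style
```

Under this sketch a grid over counts stays `range(1, 11)`, while a grid over fractions has to be assembled from strings, which is the ugliness in question.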

amueller · 2016-12-03T21:50:39Z

OK, so I think the criteria for a solution should be:

How many current use-cases will be deprecated?
How concisely can common usage patterns be expressed?
How concisely can grid-search be expressed?
Do we need deprecations when we allow the other type for a parameter that only allowed one type before?

Hm and clearly also whether the approach "solves" the ambiguity. Though I'm actually not sure what it means to "solve the ambiguity". My definition was "Every usage that doesn't raise an error has 'obvious' behavior".

amueller · 2016-12-03T21:54:27Z

Maybe another good definition of "solving the ambiguity" would be "typecasting a valid parameter never yields a valid parameter with changed semantics".

amueller · 2016-12-03T22:00:00Z

Wow I didn't realize but all the options in the trees actually accept both 1 and 1.0 with different semantics. I know we discussed that before but ... hm...
We could satisfy the second definition by just deprecating 1.0 (replacing it with "all") and still allowing 1.
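
A small sketch of how the "typecasting never changes semantics" definition could be checked. Both rules below are simplified stand-ins invented for this illustration (they are not the tree code), and `cast_changes_meaning` is a hypothetical helper.

```python
from numbers import Integral, Real


def current_rule(value, n_total):
    """Stand-in for today's convention: int = count, float in (0, 1] = fraction."""
    if isinstance(value, Integral):
        if value < 1:
            raise ValueError("count must be >= 1")
        return int(value)
    if isinstance(value, Real) and 0.0 < value <= 1.0:
        return int(value * n_total)
    raise ValueError(f"invalid value: {value!r}")


def stricter_rule(value, n_total):
    """Same as current_rule, except that the float 1.0 is no longer accepted."""
    if not isinstance(value, Integral) and value == 1.0:
        raise ValueError('float 1.0 is rejected; use "all" instead')
    return current_rule(value, n_total)


def cast_changes_meaning(rule, whole, n_total):
    """True if int(whole) and float(whole) are both valid but resolve differently.

    `whole` must be a whole number, so the two casts denote the same value.
    """
    results = []
    for cast in (int, float):
        try:
            results.append(rule(cast(whole), n_total))
        except ValueError:
            return False  # one spelling is rejected, so nothing changes silently
    return results[0] != results[1]


print(cast_changes_meaning(current_rule, 1, 100))   # True: 1 -> 1, 1.0 -> 100
print(cast_changes_meaning(stricter_rule, 1, 100))  # False: 1.0 now raises
```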

jnothman · 2016-12-04T01:46:13Z

Yes, though I think there are other cases where float values are multipliers > 1, and those are more problematic.

amueller · 2016-12-05T20:29:50Z

@jnothman do you have an example?

GaelVaroquaux · 2016-12-05T20:33:43Z

> My definition was "Every usage that doesn't raise an error has 'obvious' behavior".

I like that. And "No two legit usages are written almost the same".

jnothman · 2016-12-05T20:51:32Z

Don't have time to look for an example right now, but there might've been something in #1454 where you choose to reduce or expand the sample size...?

jnothman · 2016-12-05T20:51:52Z

I could imagine it in n_components for random projections.

amueller · 2016-12-05T21:02:49Z

@jnothman as something that we already have or something we might want to consider in the future? Maybe the resampling has that... hum

jnothman · 2016-12-05T23:22:27Z

Just things from off my head that might fit that spec. No time to look now.

adrinjalali · 2021-08-22T09:08:50Z

@amueller do you still think we're gonna fix this? :D Removing from the milestone, please add to 2.0 if you think this should be prioritized.

amueller added this to the 1.0 milestone Dec 3, 2016

amueller mentioned this issue Dec 3, 2016

[MRG+2-1] Add floating point option to max_feature option in CountVectorizer and TfidfVectorizer to use a percentage of the features. #7839

Closed

rth mentioned this issue Apr 18, 2018

Support n_components being a float for TruncatedSVD #10988

Closed

adrinjalali removed this from the 1.0 milestone Aug 22, 2021

cmarmo added the Needs Decision Requires decision label Dec 15, 2021
