[MRG+2] Merge discrete branch into master #9342
Conversation
@jnothman it seems that the pep8 error introduced in 1674412 is only corrected in branch 0.19, but not in master. What's more, strange test failures still exist which block my pull request. (https://travis-ci.org/scikit-learn/scikit-learn/jobs/270333690, https://travis-ci.org/scikit-learn/scikit-learn/jobs/270333725) |
I need this feature |
Great, 'cause we need compelling use cases before we will merge. Could you
give us one, preferably demonstrated on a publicly available dataset?
|
Yes, but I don't find it particularly convincing. If someone turns up
claiming "I need this feature" then they're better placed to provide a
compelling example than you or I.
…On 13 September 2017 at 17:19, Hanmin Qin wrote:
@jnothman I have already submitted a pull request in #9713 several days ago. Here is the current result from Circle CI: https://13449-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/auto_examples/preprocessing/plot_iris_discretization.html
|
We are using sklearn to train models and export them to PMML; in many of our cases we use binning. |
How does binning benefit you?
|
We train CTR models, where there are lots of user-profile and shop-profile features to bin, and binning makes it easy to build cross features. |
It would also be great if JPMML supported it. |
So if I understand correctly: you're talking about feature combination
(more or less a polynomial expansion, or a Cartesian product of two
complementary feature spaces) without making assumptions about the
distribution of the real-valued inputs. Yes, that sounds reasonable. Not
simple to demonstrate, perhaps. Thanks
|
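To make the cross-feature idea concrete, here is a minimal sketch. It assumes the KBinsDiscretizer API proposed in this PR plus the existing PolynomialFeatures; the parameters are illustrative, not part of the PR itself.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))  # two continuous features, no distributional assumptions

crosser = make_pipeline(
    # 4 bins per feature -> 8 indicator columns after one-hot encoding
    KBinsDiscretizer(n_bins=4, encode="onehot-dense"),
    # degree-2 interactions of the indicators include every cross feature
    # "in bin i of feature 0 AND in bin j of feature 1" (same-feature
    # pairs also appear but are always zero)
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)
X_crossed = crosser.fit_transform(X)
print(X_crossed.shape)  # (100, 36): 8 indicators plus C(8, 2) pairwise products
```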
ping @jnothman Sorry to disturb. |
Sounds like a clear use case to me, if one has external reasons to use a linear model (as one often does in highly regulated applications, such as finance/law, where the model needs to be simple enough to explain its decisions)... |
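As a rough illustration of that use case (again assuming the discretizer proposed in this PR): binning lets a plain, easily explained linear model capture a non-linear relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # non-linear target

plain = LinearRegression().fit(X, y)
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot"),  # piecewise-constant basis
    LinearRegression(),
).fit(X, y)
# The binned pipeline is still a linear model (one coefficient per bin,
# straightforward to explain to a regulator) but fits far better.
print(plain.score(X, y), binned.score(X, y))
```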
Conflicts: doc/modules/preprocessing.rst
ping @jnothman something to confirm |
Yes, that sounds like an improperly resolved merge (git can't always get
it right).
|
I removed the duplicate section and went through the PR; it seems fine from my side. |
I've marked this as MRG, although someone might declare a proposed optional feature (#9338 = alternative (non-uniform bin size) strategies; #9337 = automatic number of bins; #9341 = NaN support) as mandatory. Please give a brief review to confirm that this is altogether something we want, with correct interface and naming, especially @ogrisel and @agramfort (and @vene?), who withheld approval on #7668. |
I've gone through the PR and checked the result from Circle. LGTM from my side.
Should we be discarding the |
Once #11272 is merged in, I think we should just be bold and merge the discretizer into master for release. Its value still hasn't been altogether proven, but I think, especially with the non-uniform support from #11272, it has several features that are beyond the average user's ability/consideration to implement, and is likely to prove helpful. |
While thinking about automatic determination of n_bins, I had the realisation that a default unary encoding would be more robust to changes in number of bins than a one-hot encoding. Doubling the number of bins with unary would produce a strict superset of features. |
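For readers who have not met the term, here is a small sketch of unary (thermometer) encoding; unary_encode is a hypothetical helper for illustration, not something this PR implements.

```python
import numpy as np

def unary_encode(bin_indices, n_bins):
    """Thermometer-encode ordinal bin indices: bin k maps to k ones
    followed by zeros, one column per threshold (index >= t)."""
    thresholds = np.arange(1, n_bins)
    return (bin_indices[:, None] >= thresholds).astype(int)

codes = np.array([0, 1, 3])
print(unary_encode(codes, n_bins=4))
# [[0 0 0]
#  [1 0 0]
#  [1 1 1]]
```

Doubling the number of bins only refines the thresholds, so (with nested bin edges) every original "index >= t" column reappears among the new columns, which is the strict-superset property described above.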
Should we now remove the ignored_features support, given ColumnTransformer??
Seems that the decision has been made. Also, the PR cannot be merged at the moment, since it's using some deprecated parameters of OneHotEncoder. Further modification requires the final decision here.
|
I think we should go ahead and remove ignored_features. |
Given the plan to support inverse_transform in ColumnTransformer, I agree that we have no reason to support ignored_features. |
Thank you. The plan to support inverse_transform in CT requires some API
decisions so it's unlikely to be instantaneously resolved!
|
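A hedged sketch of that point: with ColumnTransformer, discretizing only selected columns no longer needs an ignored_features parameter (the column indices here are purely illustrative).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.1, 5.0, -1.0],
              [0.4, 7.0, 2.0],
              [0.8, 6.0, 3.0]])

ct = ColumnTransformer(
    [("bin", KBinsDiscretizer(n_bins=2, encode="ordinal"), [0, 2])],
    remainder="passthrough",  # column 1 is "ignored": passed through untouched
)
print(ct.fit_transform(X))  # two binned columns followed by the raw column 1
```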
I might still prefer to leave onehot as the default since I don't think unary encoding is widely known. It might be more friendly to make sure that the default is easy to understand. Users can always change the encode parameter if they prefer other ways. |
The main advantage of one-hot is that it's sparse. Unary encoding is not
that uncommon, it's just not always known by a name.
|
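A quick sketch contrasting the encode options under discussion (again assuming this PR's API):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3.0], [0.5], [4.0], [10.0]])

onehot = KBinsDiscretizer(n_bins=4, encode="onehot").fit_transform(X)
print(type(onehot))      # a scipy.sparse matrix: one-hot stays memory-friendly

ordinal = KBinsDiscretizer(n_bins=4, encode="ordinal").fit_transform(X)
print(ordinal.ravel())   # dense ordinal bin indices, one column per input feature
```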
There are definitely lots of things for me to learn before becoming a good maintainer :) |
Well, you have to have ordered categories for unary encoding to be relevant. One
common way for ordinal categories to come about is binning...
|
I think we should merge when green, and create a merge commit to retain commit attribution as far as possible here. |
Oh, I did not have time to put in my 2 cents on @TomDLT's PR. |
No rush, I think, @glemaitre. I'm going to merge, and tweak those issues so that they're open if someone wants to tackle them at the sprint. |
Closed by 7636606. Well done everyone! |
Do people think this has a place being included in |
In practice, what is the use case in which we can use this strategy for normalization? If there is one, we should probably explain it in the example. |
Fixes #5778, adding discretization functionality to preprocessing, with thanks to @hlin117.
This should be considered for merge after resolving: