[MRG+2] Merge discrete branch into master #9342

Merged
jnothman merged 36 commits into master from discrete on Jul 12, 2018

Conversation

jnothman
Member

@jnothman jnothman commented Jul 12, 2017

Fixes #5778, adding discretization functionality to preprocessing, with thanks to @hlin117.

This should be considered for merge after resolving:

@qinhanmin2014
Member

qinhanmin2014 commented Aug 31, 2017

@jnothman it seems that the pep8 error introduced in 1674412 was only corrected in the 0.19 branch, not in master. What's more, strange test failures still exist which block my pull request. (https://travis-ci.org/scikit-learn/scikit-learn/jobs/270333690, https://travis-ci.org/scikit-learn/scikit-learn/jobs/270333725)

@hopeztm7500

I need this feature

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@qinhanmin2014
Member

@jnothman I already submitted a pull request, #9713, several days ago. Here is the current result from Circle CI.

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@hopeztm7500

We are using sklearn to train models and export them to PMML; in a lot of cases we use binning.

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@hopeztm7500

We train a CTR model, where there are lots of user-profile and shop-profile features to bin, and it is easy to build cross features once we have binning, as in the sketch below.
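A hypothetical sketch of that kind of feature crossing (made-up data; PolynomialFeatures stands in for a dedicated crossing step):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Two raw features, e.g. one user attribute and one shop attribute.
X = np.random.RandomState(0).rand(50, 2)

# Bin each feature into 4 one-hot bins, then form pairwise products;
# each interaction column fires for one (user-bin, shop-bin) combination.
cross = make_pipeline(
    KBinsDiscretizer(n_bins=4, encode="onehot-dense"),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)
print(cross.fit_transform(X).shape)  # (50, 36): 8 indicators + 28 crosses
```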

@hopeztm7500

It would also be great if JPMML supported it.

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@qinhanmin2014
Member

ping @jnothman Sorry to disturb you.
I came across an example about discretization in amueller's latest book "Introduction to Machine Learning with Python" (Chapter 4, Section 2). The example is written with a user-defined function, but it is easy to reproduce with KBinsDiscretizer. The results are as follows (slightly different from the original book since I'm using KBinsDiscretizer):
[plots: model fits on the original feature vs. the binned feature]
Some insights from the example:
(1) the linear regression model and the decision tree model each predict a constant value within each bin
(2) linear models become more flexible after discretization, but discretization generally has no beneficial effect on tree-based models
If you are interested, I'll submit a PR for further review.
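A minimal sketch of the comparison, assuming a toy sine dataset in place of the book's wave dataset (exact scores will differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem standing in for the book's dataset.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Bin the single feature into 10 one-hot-encoded intervals.
X_binned = KBinsDiscretizer(n_bins=10, encode="onehot").fit_transform(X)

# Both models now predict a constant value per bin: the linear model
# gains flexibility, while the tree gains nothing it couldn't already do.
for model in (LinearRegression(), DecisionTreeRegressor(min_samples_split=3)):
    model.fit(X_binned, y)
    print(type(model).__name__, model.score(X_binned, y))
```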

@jnothman
Member Author

Sounds like a clear use case to me, if one has external reasons to use a linear model (as one often does in highly regulated applications, such as finance/law, where the model needs to be simple enough to explain its decisions)...

@qinhanmin2014
Member

ping @jnothman, something to confirm:
Currently, there are two different "Encoding categorical features" sections here. This is because we moved the section in #7668 (KBinsDiscretizer) and modified the section in #9151 (CategoricalEncoder). I think we should move the modified section (see #9151) to the new place (see #7668). WDYT? Thanks a lot :)

@jnothman
Member Author

jnothman commented Nov 23, 2017 via email

@qinhanmin2014
Member

I removed the duplicate section and went through the PR; it seems fine from my side.

@jnothman jnothman changed the title [WIP] Merge discrete branch into master [MRG] Merge discrete branch into master Nov 27, 2017
@jnothman
Member Author

I've marked this as MRG, although someone might declare a proposed optional feature (#9338 = alternative (non-uniform bin size) strategies; #9337 = automatic number of bins; #9341 = NaN support) as mandatory.

Please give a brief review confirming that this is altogether something we want, with the correct interface and naming, especially @ogrisel and @agramfort (and @vene?), who withheld approval on #7668.

@qinhanmin2014 qinhanmin2014 left a comment (Member)

I've gone through the PR and checked the result from Circle. LGTM from my side.

@jnothman
Member Author

Should we be discarding the ignored_features parameter here, seeing as we've deprecated a similar thing from OneHotEncoder?

@jnothman
Member Author

jnothman commented Jul 1, 2018

Once #11272 is merged in, I think we should just be bold and merge the discretizer into master for release. Its value still hasn't been altogether proven, but I think, especially with the non-uniform support from #11272, it has several features that are beyond the average user's ability/consideration to implement, and is likely to prove helpful.
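The non-uniform support from #11272 is exposed through the strategy parameter; a quick sketch on toy exponential data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy data, where uniform-width bins are a poor fit.
X = np.random.RandomState(0).exponential(size=(100, 1))

# 'uniform' gives equal-width bins; 'quantile' and 'kmeans' place the
# bin edges according to the data distribution.
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy).fit(X)
    print(strategy, est.bin_edges_[0])
```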

@jnothman
Member Author

jnothman commented Jul 9, 2018

While thinking about automatic determination of n_bins, I had the realisation that a default unary encoding would be more robust to changes in number of bins than a one-hot encoding. Doubling the number of bins with unary would produce a strict superset of features.
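A self-contained sketch of that superset property (unary_encode is a hypothetical helper, not scikit-learn API): each unary column is a threshold indicator "bin >= k", so halving the bin width only adds thresholds, whereas one-hot columns change identity entirely.

```python
import numpy as np

def unary_encode(bin_idx, n_bins):
    # Hypothetical thermometer/unary encoder: bin k maps to k leading
    # ones followed by zeros, i.e. column j means "bin_idx >= j + 1".
    return (np.arange(1, n_bins)[None, :] <= bin_idx[:, None]).astype(int)

x = np.array([0.1, 0.4, 0.9])
coarse = np.floor(x * 4).astype(int)  # 4 equal-width bins on [0, 1)
fine = np.floor(x * 8).astype(int)    # 8 bins: each coarse bin split in two

print(unary_encode(coarse, 4))
print(unary_encode(fine, 8))
# The coarse column "bin >= k" equals the fine column "bin >= 2k", so the
# 4-bin features are a strict subset of the 8-bin features, as described.
```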

@jnothman jnothman left a comment (Member Author)

Should we now remove the ignored_features support, given ColumnTransformer??

@qinhanmin2014
Member

Should we now remove the ignored_features support, given ColumnTransformer??

It seems that the decision has already been made in OneHotEncoder (deprecating categorical_features), so we should follow it here (and in UnaryEncoder)? I'm fine with the decision. Note that it will make inverse_transform more difficult when we only transform part of the features, but I tend to believe that's not a common application scenario.

Also, the PR cannot be merged at the moment since it uses some deprecated parameters of OneHotEncoder. Further modification depends on the final decision here.
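A small sketch of the scenario in question (hypothetical data; only column 0 is discretized):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0]])

# Discretize only column 0; column 1 passes through untouched.
ct = ColumnTransformer(
    [("kbins", KBinsDiscretizer(n_bins=2, encode="ordinal"), [0])],
    remainder="passthrough",
)
Xt = ct.fit_transform(X)

# The fitted discretizer can invert its own column...
kbins = ct.named_transformers_["kbins"]
print(kbins.inverse_transform(Xt[:, [0]]))
# ...but ColumnTransformer itself has no inverse_transform, so recovering
# the full original matrix needs manual bookkeeping of column positions.
```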

@jnothman
Member Author

inverse_transform is an interesting case indeed. ColumnTransformer is a bit lacking without inverse_transform. It would require something like #1952.

@jnothman
Member Author

I think we should go ahead and remove ignored_features. But you've raised a very interesting point, towards which I've opened #11463.

@qinhanmin2014
Member

Given the plan to support inverse_transform in ColumnTransformer, I agree that we have no reason to keep ignored_features. I will submit a PR ASAP.

@jnothman
Member Author

jnothman commented Jul 10, 2018 via email

@qinhanmin2014
Member

I had the realisation that a default unary encoding would be more robust to changes in number of bins than a one-hot encoding.

I might still prefer to leave onehot as the default, since I don't think unary encoding is widely known. It might be friendlier to make sure that the default is easy to understand. Users can always change the encode parameter if they prefer another encoding.
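For concreteness, the encode values KBinsDiscretizer shipped with, 'onehot' being the default:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [3.0]])

# 'onehot' (sparse, the default), 'onehot-dense', and 'ordinal'.
for encode in ("onehot", "onehot-dense", "ordinal"):
    Xt = KBinsDiscretizer(n_bins=2, encode=encode).fit_transform(X)
    print(encode, Xt.shape, type(Xt).__name__)
```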

@jnothman
Member Author

jnothman commented Jul 10, 2018 via email

@qinhanmin2014
Member

Unary encoding is not that uncommon; it's just not always known by a name.

There are definitely lots of things for me to learn before becoming a good maintainer :)

@jnothman
Member Author

jnothman commented Jul 10, 2018 via email

@jnothman
Member Author

I think we should merge when green, and create a merge commit to retain commit attribution as far as possible here.

@glemaitre
Member

Oh, I did not have time to put in my 2 cents on @TomDLT's PR.
Is it worth supporting missing values right now?

@jnothman
Member Author

No rush, I think, @glemaitre. I'm going to merge, and tweak those issues so that they're open if someone wants to tackle them at the sprint.

@jnothman
Member Author

Closed by 7636606. Well done everyone!

@jnothman jnothman mentioned this pull request Jul 12, 2018
@jnothman jnothman merged commit e4089d1 into master Jul 12, 2018
jnothman added a commit that referenced this pull request Jul 12, 2018
@jnothman
Member Author

Do people think this deserves a place in examples/preprocessing/plot_all_scaling.py?

@jnothman jnothman deleted the discrete branch July 12, 2018 00:32
@glemaitre
Member

In practice, what is the use case in which we can use this strategy for normalization? If there is one, we should probably explain it inside the example.
