[MRG+2] Merge discrete branch into master #9342

Merged
jnothman merged 36 commits into master from discrete on Jul 12, 2018

Conversation

jnothman
Member

@jnothman jnothman commented Jul 12, 2017

Fixes #5778, adding discretization functionality to preprocessing, with thanks to @hlin117.

This should be considered for merge after resolving:

@qinhanmin2014
Member

qinhanmin2014 commented Aug 31, 2017

@jnothman it seems that the pep8 error introduced in 1674412 was only corrected in the 0.19 branch, not in master. What's more, strange test failures still exist which block my pull request. (https://travis-ci.org/scikit-learn/scikit-learn/jobs/270333690, https://travis-ci.org/scikit-learn/scikit-learn/jobs/270333725)

@hopeztm7500

I need this feature

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@qinhanmin2014
Member

@jnothman I already submitted a pull request, #9713, several days ago. Here is the current result from Circle CI.

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@hopeztm7500

We are using sklearn to train models and export them to PMML; in a lot of cases we use binning.

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@hopeztm7500

We train a CTR model, where there are lots of user-profile and shop-profile features to bin, and it is easy to build cross features once we have binning, as in the sketch below.
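A hypothetical sketch of that kind of feature crossing (made-up data; PolynomialFeatures stands in for a dedicated crossing step):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Two raw features, e.g. one user attribute and one shop attribute.
X = np.random.RandomState(0).rand(50, 2)

# Bin each feature into 4 one-hot bins, then form pairwise products;
# each interaction column fires for one (user-bin, shop-bin) combination.
cross = make_pipeline(
    KBinsDiscretizer(n_bins=4, encode="onehot-dense"),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)
print(cross.fit_transform(X).shape)  # (50, 36): 8 indicators + 28 crosses
```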

@hopeztm7500

It would also be great if JPMML supported it.

@jnothman
Member Author

jnothman commented Sep 13, 2017 via email

@qinhanmin2014
Member

ping @jnothman Sorry to disturb you.
I came across an example about discretization in amueller's latest book "Introduction to Machine Learning with Python" (Chapter 4, Section 2). The example is written with a user-defined function, but it is easy to reproduce with KBinsDiscretizer. The results are as follows (slightly different from the original book since I'm using KBinsDiscretizer):
[plots: model fits on the original feature vs. the binned feature]
Some insights from the example:
(1) the linear regression model and the decision tree model each predict a constant value within each bin
(2) linear models become more flexible after discretization, but discretization generally has no beneficial effect on tree-based models
If you are interested, I'll submit a PR for further review.
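A minimal sketch of the comparison, assuming a toy sine dataset in place of the book's wave dataset (exact scores will differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem standing in for the book's dataset.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Bin the single feature into 10 one-hot-encoded intervals.
X_binned = KBinsDiscretizer(n_bins=10, encode="onehot").fit_transform(X)

# Both models now predict a constant value per bin: the linear model
# gains flexibility, while the tree gains nothing it couldn't already do.
for model in (LinearRegression(), DecisionTreeRegressor(min_samples_split=3)):
    model.fit(X_binned, y)
    print(type(model).__name__, model.score(X_binned, y))
```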

@jnothman
Member Author

Sounds like a clear use case to me, if one has external reasons to use a linear model (as one often does in highly regulated applications, such as finance/law, where the model needs to be simple enough to explain its decisions)...

@qinhanmin2014
Member

ping @jnothman, something to confirm:
Currently, there are two different "Encoding categorical features" sections here. This is because we moved the section in #7668 (KBinsDiscretizer) and modified the section in #9151 (CategoricalEncoder). I think we should move the modified section (see #9151) to the new place (see #7668). WDYT? Thanks a lot :)

@jnothman
Member Author

jnothman commented Nov 23, 2017 via email

@qinhanmin2014
Member

I removed the duplicate section and went through the PR; it seems fine from my side.

@jnothman jnothman changed the title [WIP] Merge discrete branch into master [MRG] Merge discrete branch into master Nov 27, 2017
@jnothman
Member Author

I've marked this as MRG, although someone might declare a proposed optional feature (#9338 = alternative (non-uniform bin size) strategies; #9337 = automatic number of bins; #9341 = NaN support) as mandatory.

Please give a brief review confirming that this is altogether something we want, with the correct interface and naming, especially @ogrisel and @agramfort (and @vene?), who withheld approval on #7668.

@qinhanmin2014 qinhanmin2014 left a comment (Member)

I've gone through the PR and checked the result from Circle. LGTM from my side.

@jnothman
Member Author

Should we be discarding the ignored_features parameter here, seeing as we've deprecated a similar thing from OneHotEncoder?

@jnothman
Member Author

jnothman commented Jul 1, 2018

Once #11272 is merged in, I think we should just be bold and merge the discretizer into master for release. Its value still hasn't been altogether proven, but I think, especially with the non-uniform support from #11272, it has several features that are beyond the average user's ability/consideration to implement, and is likely to prove helpful.
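The non-uniform support from #11272 is exposed through the strategy parameter; a quick sketch on toy exponential data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy data, where uniform-width bins are a poor fit.
X = np.random.RandomState(0).exponential(size=(100, 1))

# 'uniform' gives equal-width bins; 'quantile' and 'kmeans' place the
# bin edges according to the data distribution.
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy).fit(X)
    print(strategy, est.bin_edges_[0])
```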

@jnothman
Member Author

jnothman commented Jul 9, 2018

While thinking about automatic determination of n_bins, I had the realisation that a default unary encoding would be more robust to changes in number of bins than a one-hot encoding. Doubling the number of bins with unary would produce a strict superset of features.
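A self-contained sketch of that superset property (unary_encode is a hypothetical helper, not scikit-learn API): each unary column is a threshold indicator "bin >= k", so halving the bin width only adds thresholds, whereas one-hot columns change identity entirely.

```python
import numpy as np

def unary_encode(bin_idx, n_bins):
    # Hypothetical thermometer/unary encoder: bin k maps to k leading
    # ones followed by zeros, i.e. column j means "bin_idx >= j + 1".
    return (np.arange(1, n_bins)[None, :] <= bin_idx[:, None]).astype(int)

x = np.array([0.1, 0.4, 0.9])
coarse = np.floor(x * 4).astype(int)  # 4 equal-width bins on [0, 1)
fine = np.floor(x * 8).astype(int)    # 8 bins: each coarse bin split in two

print(unary_encode(coarse, 4))
print(unary_encode(fine, 8))
# The coarse column "bin >= k" equals the fine column "bin >= 2k", so the
# 4-bin features are a strict subset of the 8-bin features, as described.
```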

@jnothman jnothman left a comment (Member Author)

Should we now remove the ignored_features support, given ColumnTransformer??

@qinhanmin2014
Member

Should we now remove the ignored_features support, given ColumnTransformer??

It seems that the decision has already been made in OneHotEncoder (deprecating categorical_features), so we should follow it here (and in UnaryEncoder)? I'm fine with the decision. Note that it will make inverse_transform more difficult when we only transform part of the features, but I tend to believe that's not a common application scenario.

Also, the PR cannot be merged at the moment since it uses some deprecated parameters of OneHotEncoder. Further modification depends on the final decision here.
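A small sketch of the scenario in question (hypothetical data; only column 0 is discretized):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0]])

# Discretize only column 0; column 1 passes through untouched.
ct = ColumnTransformer(
    [("kbins", KBinsDiscretizer(n_bins=2, encode="ordinal"), [0])],
    remainder="passthrough",
)
Xt = ct.fit_transform(X)

# The fitted discretizer can invert its own column...
kbins = ct.named_transformers_["kbins"]
print(kbins.inverse_transform(Xt[:, [0]]))
# ...but ColumnTransformer itself has no inverse_transform, so recovering
# the full original matrix needs manual bookkeeping of column positions.
```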

@jnothman
Member Author

inverse_transform is an interesting case indeed. ColumnTransformer is a bit lacking without inverse_transform. It would require something like #1952.

@jnothman
Member Author

I think we should go ahead and remove ignored_features. But you've raised a very interesting point, towards which I've opened #11463.

@qinhanmin2014
Member

Given the plan to support inverse_transform in ColumnTransformer, I agree that we have no reason to keep ignored_features. I will submit a PR ASAP.

@jnothman
Member Author

jnothman commented Jul 10, 2018 via email

@qinhanmin2014
Member

I had the realisation that a default unary encoding would be more robust to changes in number of bins than a one-hot encoding.

I might still prefer to leave onehot as the default, since I don't think unary encoding is widely known. It might be friendlier to make sure that the default is easy to understand. Users can always change the encode parameter if they prefer another encoding.
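For concreteness, the encode values KBinsDiscretizer shipped with, 'onehot' being the default:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [3.0]])

# 'onehot' (sparse, the default), 'onehot-dense', and 'ordinal'.
for encode in ("onehot", "onehot-dense", "ordinal"):
    Xt = KBinsDiscretizer(n_bins=2, encode=encode).fit_transform(X)
    print(encode, Xt.shape, type(Xt).__name__)
```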

@jnothman
Member Author

jnothman commented Jul 10, 2018 via email

@qinhanmin2014
Member

Unary encoding is not that uncommon; it's just not always known by a name.

There are definitely lots of things for me to learn before becoming a good maintainer :)

@jnothman
Member Author

jnothman commented Jul 10, 2018 via email

@jnothman
Member Author

I think we should merge when green, and create a merge commit to retain commit attribution as far as possible here.

@glemaitre
Member

Oh, I did not have time to put in my 2 cents on @TomDLT's PR.
Is it worth supporting missing values right now?

@jnothman
Member Author

No rush, I think, @glemaitre. I'm going to merge, and tweak those issues so that they're open if someone wants to tackle them at the sprint.

@jnothman
Member Author

Closed by 7636606. Well done everyone!

@jnothman jnothman mentioned this pull request Jul 12, 2018
@jnothman jnothman merged commit e4089d1 into master Jul 12, 2018
jnothman added a commit that referenced this pull request Jul 12, 2018
@jnothman
Member Author

Do people think this deserves a place in examples/preprocessing/plot_all_scaling.py?

@jnothman jnothman deleted the discrete branch July 12, 2018 00:32
@glemaitre
Member

In practice, what is the use case in which we can use this strategy for normalization? If there is one, we should probably explain it inside the example.
