[WIP] Categorical split for decision tree #3346

MatthieuBizien · 2014-07-05T21:27:44Z

Unlike many algorithms, which can only handle categorical data through dummy variables, decision trees can treat categorical features natively: the children of a node partition the categories. We can expect better accuracy in some cases, and fewer dummy columns. This is the default behavior of the R randomForest package.

I am currently implementing that in sklearn, using the Cython classes. I propose adding a categorical_features option to the decision tree classes (DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor). Like in other modules of sklearn, this option could be 'None', 'all', a mask or a list of features.
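As a rough illustration of how such an option could be normalized internally, here is a minimal sketch. The helper name `categorical_mask` is hypothetical and is not part of this PR or of scikit-learn; it also assumes that "None" means the Python value `None` (no categorical features).

```python
import numpy as np


def categorical_mask(categorical_features, n_features):
    """Turn None, 'all', a boolean mask or a list of indices into a boolean mask.

    Hypothetical helper, for illustration only.
    """
    if categorical_features is None:
        # No feature is treated as categorical.
        return np.zeros(n_features, dtype=bool)
    if isinstance(categorical_features, str):
        if categorical_features == "all":
            return np.ones(n_features, dtype=bool)
        raise ValueError("unknown value: %r" % categorical_features)
    arr = np.asarray(categorical_features)
    if arr.dtype == bool:
        # Already a mask: only check its length.
        if arr.shape != (n_features,):
            raise ValueError("mask must have one entry per feature")
        return arr.copy()
    # Otherwise interpret the values as feature indices.
    mask = np.zeros(n_features, dtype=bool)
    mask[arr.astype(int)] = True
    return mask


print(categorical_mask([0, 2], 3))
```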

Each feature could have up to 32 classes, because we will have to test all the combinations, so 2**31 cases. This limit allows us to store a split as a bitmask, with one bit per class. The same limit exists in R, I think for the same reason.
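A minimal sketch of the binary representation described above, assuming the categories are already encoded as integers 0..31. The names `make_partition` and `goes_left` are illustrative and are not taken from this PR.

```python
def make_partition(left_categories):
    """Encode the set of categories sent to the left child as an integer bitmask."""
    mask = 0
    for c in left_categories:
        if not 0 <= c < 32:
            raise ValueError("category codes must be in [0, 32)")
        mask |= 1 << c  # set bit c
    return mask


def goes_left(partition, category):
    """Return True if bit `category` is set, i.e. the category goes left."""
    return ((partition >> category) & 1) == 1


p = make_partition([0, 3, 5])
print(bin(p), goes_left(p, 3), goes_left(p, 1))
```

With this encoding, routing a sample at a categorical node costs one shift and one mask, which is why a fixed-width integer is attractive.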

This is a work in progress and is not ready to be merged. I prefer to release it early so that I can get feedback.

coveralls · 2014-07-05T21:40:12Z

Coverage decreased (-0.04%) when pulling cc3eb48 on MatthieuBizien:categorical_split into 1775095 on scikit-learn:master.

gallamine · 2014-07-09T20:45:49Z

> Each feature could have up to 32 classes, because we will have to test all the combinations, so 2**31 cases.

Can you explain this part a bit more? You're testing all permutations of subsets of the data?

MatthieuBizien · 2014-07-10T00:15:37Z

Yes, I have to test all permutations. Because of the symmetry of the problem, without any loss of generality, we can assume the first class is in the left leaf, so we have to test 2**31 cases in the worst case (i.e. 32 classes), and not 2**32.
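To make the symmetry argument concrete, here is a small sketch that enumerates the candidate partitions as bitmasks with category 0 pinned to the left child. The function name is hypothetical. Note that one of the 2**(k-1) masks puts every category on the left and is not a real split, so the sketch skips it.

```python
def candidate_partitions(n_categories):
    """Yield bitmasks of the left child; category 0 is always on the left."""
    full = (1 << n_categories) - 1  # every category on the left: not a split
    for rest in range(1 << (n_categories - 1)):
        mask = (rest << 1) | 1  # bit 0 is forced to 1
        if mask != full:
            yield mask


print(list(candidate_partitions(3)))
print(sum(1 for _ in candidate_partitions(10)))
```

For k categories this yields 2**(k-1) - 1 candidates, so the count doubles with every additional category.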

2**31 is a lot, but it is still computable, and it is the worst case, when the user provides 32 classes. If there are fewer classes, or if the tree has already been split on this feature, the cost is lower. I assume that in most real-world cases the number of classes will be small.

We can imagine some heuristics if we have many classes (it is "just" discrete optimization), but it is, I think, too soon.

gallamine · 2014-07-10T16:32:58Z

Do you have any thoughts on how you'd handle the case when the user provides more than 32 categories? I'm thinking of my own work, where almost everything has more than 32 categories (e.g. country or postal codes).

MatthieuBizien · 2014-07-10T20:57:35Z

To begin with, I think it is easier not to handle that case: raise an exception and ask the user to use dummy variables. Once this pull request is working and merged, it will be possible to start working on heuristics for finding the best split without testing all combinations. I am not a specialist in discrete optimization, but I am sure there are efficient algorithms for that. The underlying structure will also need to change, because we will no longer be able to store a split in an int32.
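The "raise an exception" behaviour could look roughly like the sketch below. The function name, the constant and the error message are assumptions made for illustration; they are not taken from the PR.

```python
import numpy as np

MAX_CATEGORIES = 32  # assumed limit: one bit per category in a 32-bit split


def check_n_categories(column, max_categories=MAX_CATEGORIES):
    """Return the number of distinct categories, or raise if there are too many."""
    n_categories = np.unique(np.asarray(column)).size
    if n_categories > max_categories:
        raise ValueError(
            "Found %d categories, but at most %d are supported; "
            "consider encoding this feature with dummy variables."
            % (n_categories, max_categories)
        )
    return n_categories


print(check_n_categories([0, 1, 1, 2]))
```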

jnothman · 2014-07-10T22:47:23Z

> Contrary to many algorithms, that can only use dummy variables, decision trees can behave differently for categorical data.

The expressive power of the tree is identical whether or not these are handled specially. As far as I can tell, the difference made by introducing such a feature is that it can drastically affect max_depth criteria, etc.

glouppe · 2014-07-11T09:52:05Z

> The expressive power of the tree is identical whether or not these are handled specially.

Not exactly. By assuming numerical features, we assume that categorical features are ordered, which restricts the sets of candidate splits and therefore the expressive power of the tree (for a finite learning set).
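A small sketch of the point about candidate splits, assuming categories are encoded as integer codes 0..k-1. The function names are illustrative. With threshold tests of the form `x <= t`, a single node can only separate a prefix of the codes from the rest, whereas a categorical split can send any subset to the left.

```python
def threshold_left_sets(k):
    """Left-child sets reachable with a single test `x <= t` on codes 0..k-1."""
    return [frozenset(range(t + 1)) for t in range(k - 1)]


def n_subset_splits(k):
    """Number of distinct two-way partitions of k unordered categories."""
    return 2 ** (k - 1) - 1


k = 3
ordered = threshold_left_sets(k)
print(len(ordered), n_subset_splits(k))

# The split {0, 2} versus {1} cannot be produced by one threshold test:
target, complement = frozenset({0, 2}), frozenset({1})
print(target in ordered or complement in ordered)
```

For k = 3 there are 2 threshold splits but 3 subset splits; the gap grows exponentially with k.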

glouppe · 2014-07-11T09:59:52Z

Thanks for your contribution @MatthieuBizien !

A few comments though before you proceed further:

- The API for this has already been subject to debate. We have never settled on something that pleases everyone. I would like to hear some core developers' opinions on the proposed API. As I understand it, the interface here is similar to what we already have for OneHotEncoder. CC: @ogrisel @larsmans @jnothman @GaelVaroquaux
- In terms of algorithms:
  i) 2**31 is way too large. In R, they restrict the number of combinations to 2**8. If the number of categories is larger, then 2**8 combinations are sampled at random.
  ii) In binary classification or in regression, there exists an optimal linear algorithm for finding the best split. It basically boils down to replacing the categories by their probabilities, using these probabilities as a new ordered feature, and applying the usual algorithm for finding the best split. You can find details about this in Section 3.6.3.2 of http://orbi.ulg.ac.be/handle/2268/170309
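Point ii) can be sketched as follows, under stated assumptions: the criterion is squared error, the categories are integer codes, and the function name `best_categorical_split` is made up for this example (it is not the PR's implementation). Each category is replaced by its mean target value, which for 0/1 labels is the probability of the positive class; the categories are then sorted by that value and only the k-1 prefix splits are scanned.

```python
import numpy as np


def best_categorical_split(codes, y):
    """Best two-way split of the categories under squared error.

    Sort categories by mean target, then scan the k-1 prefix splits.
    Returns (set of categories sent left, score); a higher score is better.
    """
    codes = np.asarray(codes)
    y = np.asarray(y, dtype=float)
    cats = np.unique(codes)
    sums = np.array([y[codes == c].sum() for c in cats])
    counts = np.array([(codes == c).sum() for c in cats])
    order = np.argsort(sums / counts)  # categories ordered by mean target

    total_sum, total_n = y.sum(), y.size
    left_sum, left_n = 0.0, 0
    best_score, best_left = -np.inf, None
    for pos, i in enumerate(order[:-1]):
        left_sum += sums[i]
        left_n += counts[i]
        right_sum, right_n = total_sum - left_sum, total_n - left_n
        # Maximizing this quantity is equivalent to minimizing the
        # within-child sum of squared errors.
        score = left_sum ** 2 / left_n + right_sum ** 2 / right_n
        if score > best_score:
            best_score = score
            best_left = set(cats[order[: pos + 1]].tolist())
    return best_left, best_score


codes = [0, 0, 1, 1, 2, 2, 3, 3]
y = [1.0, 1.2, 5.0, 5.1, 0.9, 1.1, 4.8, 5.2]
print(best_categorical_split(codes, y)[0])
```

The scan itself visits k-1 candidates instead of 2**(k-1) - 1; the only extra cost is sorting the k category means.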

glouppe · 2014-07-11T10:01:53Z

In terms of internal interface, this may also be the opportunity to try to factor out code from Splitters. What is your opinion on this @arjoly ?

MatthieuBizien · 2014-07-11T10:51:38Z

@glouppe You're welcome. Thanks for your advice on the algorithms; I will use it.

arjoly · 2014-07-11T11:20:23Z

> In terms of internal interface, this may also be the opportunity to try to factor out code from Splitters. What is your opinion on this @arjoly ?

Yeah, this would be a great opportunity. This could already be done outside this pull request.

jnothman · 2014-07-12T09:26:24Z

> Not exactly. By assuming numerical features, we assume that categorical features are ordered, which restrict the sets of candidate splits and therefore the expressive power of the tree.

(But assuming infinite depth is allowed, the expressiveness is identical.)
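The remark about unlimited depth can be illustrated with a short sketch, assuming integer codes 0..k-1: any subset test can be emulated by a nested tree that only uses `x <= t` tests, at the price of extra depth. The tuple-based tree and the function names below are illustrative only.

```python
def threshold_tree(subset, n_categories):
    """Build a nested (threshold, left, right) tree that decides `x in subset`
    using only tests of the form `x <= threshold`."""
    def build(lo, hi):
        members = [c in subset for c in range(lo, hi + 1)]
        if all(members):
            return True   # leaf: every code in [lo, hi] is in the subset
        if not any(members):
            return False  # leaf: no code in [lo, hi] is in the subset
        mid = (lo + hi) // 2
        return (mid, build(lo, mid), build(mid + 1, hi))
    return build(0, n_categories - 1)


def predict(tree, x):
    """Walk the tree until a boolean leaf is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def depth(tree):
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


tree = threshold_tree({0, 2}, 3)
print([predict(tree, x) for x in range(3)], depth(tree))
```

Here the single categorical split {0, 2} versus {1} needs two levels of threshold tests, which is why depth-related stopping criteria behave differently in the two settings.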

amueller · 2014-07-12T21:15:59Z

On 07/12/2014 11:26 AM, jnothman wrote:

> > Not exactly. By assuming numerical features, we assume that categorical features are ordered, which restrict the sets of candidate splits and therefore the expressive power of the tree.
>
> (But assuming infinite depth is allowed, the expressiveness is identical.)

Yes. But even then, the resulting decision surface would most likely not be the same.

mblondel · 2014-07-14T09:56:34Z

I'm enthusiastic about this feature. One use case is hyper-parameter optimization (as in hyperopt) over categorical hyper-parameters.

ogrisel · 2014-08-13T14:24:10Z

Note that pandas 0.15 will have a native data type for encoding categories:

http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#categoricals-in-series-dataframe

We could make the decision trees able to deal with dataframe features natively. That would be more natural for the user: no need to pass a feature mask.

However, that would require some refactoring to support lazy, per-column __array__ conversion instead of doing it globally for the whole dataframe in the check_X_y call.

ogrisel · 2014-08-13T14:25:55Z

> Yes. But even then, the resulting decision surface would most likely not be the same.

Also, it would make the graphical export of a single decision tree much easier to understand. Many users are interested in the structure of the learned trees when applied to categorical data.

pprett · 2014-08-13T14:36:01Z

totally - the same applies to partial dependence plots as well

spitz-dan-l · 2015-04-28T14:52:03Z

Hello,

It seems like there hasn't been development on this PR in a while. Is there any idea of how far it is from completion? I would love to use it.

mjbommar · 2015-04-28T16:23:25Z

^ +1

amueller · 2015-04-28T20:45:04Z

I think this is a somewhat significant addition, and it doesn't look like anyone worked on it recently. I think most sklearn people are excited about it, but no-one had the time to work on it. Help welcome.

MatthieuBizien · 2015-04-29T08:07:49Z

I don't have time to work on it for the moment. It wasn't so far from completion, but there have been some major changes in the master code.

dedan · 2015-05-05T00:53:24Z

+1 for this. Especially in combination with the pandas categorical data type.

amueller · 2015-05-05T22:11:25Z

I am 90% certain that the input will not be the pandas categorical data type, at least in the first iteration. I'm sure @GaelVaroquaux has opinions about this ^^

GaelVaroquaux · 2015-05-07T05:47:01Z

> I am 90% certain that the input will not be the pandas categorical data type, at least in the first iteration. I'm sure @GaelVaroquaux has opinions about this ^^

Yup! Input should be a common denominator across libraries and applications. Therefore, the only thing that we have is numpy arrays.

amueller · 2015-05-07T16:55:03Z

Some people might argue that a dataframe is a much better common denominator, as mixed datatypes are the norm, and homogeneous datatypes are a special case that only appears in some obscure imaging techniques ;)

elzurdo · 2016-03-31T11:06:48Z

Hi,
I was wondering if there was any progress on the issue of telling a Decision Tree (or Ensemble) which features are categorical, so that it can split them differently from numerical ones?

jnothman · 2016-03-31T23:48:00Z

#4899 is the latest news

denson · 2017-08-05T12:52:13Z

I have been working on fairly large (millions of rows) problems with features that have large numbers (1000's) of categories. H2O is working pretty well but I would greatly prefer to stick with scikit-learn only.

I believe that H2O is using the algorithm described in A Streaming Parallel Decision Tree Algorithm. I am thinking it would probably be easier to add SPDT rather than modify random forest.

jimmywan · 2017-09-11T17:07:32Z

Should this be closed in favor of continuing the discussion in #4899 ?

jnothman · 2017-09-11T21:54:24Z

I suppose so.

jcharite-via · 2017-11-02T21:55:19Z

Just checking in to see if there is any progress on this issue?

fcoclavero · 2019-08-20T15:12:33Z

Any update on this issue?

MatthieuBizien · 2020-06-04T10:37:41Z

Hi everyone, I was not able to work on this PR for a long time, and when I came back a lot of things had changed in the scikit-learn codebase. I also have less time available than when I started the PR, so I have to stop here. You can use catboost, wait for #12866 (or try to continue where I left off 😉).

MatthieuBizien added 7 commits July 5, 2014 04:06

First draft of categorical split

004ac5a

Revert "First draft of categorical split"

adb3087

This reverts commit 004ac5a.

Generate Cpp from Cython for _tree and _utils (BROKEN)

08ab279

We could use std::map and std::set. Compilation fail.

Revert "Generate Cpp from Cython for _tree and _utils (BROKEN)"

e62c442

This reverts commit 08ab279.

Revert "Revert "First draft of categorical split""

1b61617

This reverts commit adb3087.

Removed dependancies to std

f31eec6

Added categorical_features argument to trees

cc3eb48

Added TODO after discussion in pull request

a49f802

scikit-learn#3346

Added option is_categorical to Splitter

aa9f94c

GaelVaroquaux changed the title ~~Categorical split for decision tree~~ [WIP] Categorical split for decision tree Jul 15, 2014

MatthieuBizien added 6 commits July 15, 2014 22:06

Use criterion in categorical BestSplitter

5948786

Remove unnecessary TODO

ce79f76

Finished remove outcome_by_cat

532e152

Added type PARTITION_t as unit64

a709d2d

Notation : switch categorical splits to partition

f4db780

changed incoherent notations categorical_split and split_categories to partition

Implemented update_factors

d3a1a19

MatthieuBizien added 3 commits July 18, 2014 18:00

Linear time algorithm for binary classification and regression

e6fdb1f

Changed Xf to Xi in the categorical case

eefdd6d

Create functions _categorical_feature_split and _continuous_feature_s…

b84530e

…plit BestSplitter.node_split is now "just" 100 lines long

larsmans force-pushed the master branch from 58a55ad to 4b82379 Compare August 25, 2014 21:50

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee Compare November 3, 2014 12:36

trevorstephens mentioned this pull request May 13, 2015

[API] Consistent API for attaching properties to samples #4497

Closed

jblackburne mentioned this pull request Jun 25, 2015

NOCATS: Categorical splits for tree-based learners #4899

Closed

ogrisel mentioned this pull request Oct 19, 2015

Categorical feature in Tree-based classifiers #5442

Closed

jnothman closed this Sep 11, 2017

scikit-learn deleted a comment from woodrujm Mar 13, 2019

adam2392 mentioned this pull request Jul 8, 2024

FEA Categorical split support for DecisionTree*, ExtraTree*, RandomForest* and `ExtraTrees* #29437

Draft
