[WIP] Categorical split for decision tree #3346
Conversation
Can you explain this part a bit more? You're testing all permutations of subsets of the data?
Yes, I have to test all possible partitions. Because of the symmetry of the problem, without any loss of generality, we can assume the first category is in the left leaf, so we have to test about 2^(k-1) candidate partitions for a feature with k categories.
We could imagine some heuristics if we have many classes (it is "just" discrete optimization), but I think it is too soon.
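(A minimal Python sketch of the brute-force enumeration described above; the function name and the example categories are made up for illustration, this is not code from the PR.)

```python
from itertools import compress

def candidate_partitions(categories):
    """Yield every binary partition (left, right) of `categories`.

    By symmetry the first category is pinned to the left child, which
    leaves 2**(k - 1) - 1 non-trivial partitions for k categories.
    """
    k = len(categories)
    # Each mask assigns the remaining k - 1 categories to the left child
    # (bit set) or the right child (bit clear); the all-set mask is
    # excluded because it would leave the right child empty.
    for mask in range(2 ** (k - 1) - 1):
        go_left = [True] + [(mask >> i) & 1 == 1 for i in range(k - 1)]
        left = list(compress(categories, go_left))
        right = [c for c, flag in zip(categories, go_left) if not flag]
        yield left, right

for left, right in candidate_partitions(["red", "green", "blue"]):
    print(left, "|", right)
# ['red'] | ['green', 'blue']
# ['red', 'green'] | ['blue']
# ['red', 'blue'] | ['green']
```

A splitter would then score each candidate partition with the usual impurity criterion; the symmetry argument simply halves the search space.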
Do you have any thoughts on how you'd handle the case when the user provides more than 32 categories? I'm thinking of my own work where almost everything has more than 32 categories (e.g. country or postal codes).
At the beginning, I think it is easier not to handle that case, to raise an exception and to ask the user to use dummy variables. Once this pull request is working and merged, it will be possible to start working on heuristics for finding the best split without testing all combinations. I am not a specialist in discrete optimization, but I am sure there are efficient algorithms for that. The underlying structure will also need to be different, because we will no longer be able to store a split in an int32.
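(In the meantime, the dummy-variable workaround mentioned above is straightforward with pandas; this is a sketch with made-up data and column names, not part of this PR.)

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with a categorical column.
df = pd.DataFrame({"country": ["FR", "DE", "US", "FR", "DE"],
                   "age": [31, 45, 27, 52, 38]})
y = [0, 1, 1, 0, 1]

# One 0/1 indicator column per category; the tree then splits on these
# columns as ordinary numerical features.
X = pd.get_dummies(df, columns=["country"])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
```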
The expressive power of the tree is identical whether or not these are treated as categorical.
Not exactly. By assuming numerical features, we assume that categorical features are ordered, which restricts the set of candidate splits and therefore the expressive power of the tree (for a finite learning set).
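(A small runnable illustration of this point, with made-up data: a depth-1 tree on integer-coded categories cannot isolate the middle category, while the same stump on dummy columns can.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Three categories encoded as ordered integers 0, 1, 2; the target is
# positive only for the *middle* category.
X_ordinal = np.array([[0], [1], [2]])
y = np.array([0, 1, 0])

stump = DecisionTreeClassifier(max_depth=1).fit(X_ordinal, y)
print(stump.score(X_ordinal, y))   # < 1.0: no single threshold separates class 1

# With one-hot (dummy) columns, a single split on the indicator of
# category 1 is enough.
X_onehot = np.eye(3)[X_ordinal.ravel()]
stump = DecisionTreeClassifier(max_depth=1).fit(X_onehot, y)
print(stump.score(X_onehot, y))    # 1.0
```

With unlimited depth the ordered encoding eventually recovers the same partition, which is the point made in the next comment.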
Thanks for your contribution @MatthieuBizien ! A few comments though before you proceed further:
In terms of internal interface, this may also be the opportunity to try to factor out code from Splitters. What is your opinion on this @arjoly?
@glouppe You're welcome. Thanks for your advice on the algorithm, I will use it.
Yeah, this would be a great opportunity. This could already be done outside this pull request.
(But assuming infinite depth is allowed, the expressiveness is identical.)
I'm enthusiastic about this feature. One use case is to do hyper-parameter optimization (as in hyperopt) over categorical hyper-parameters.
Changed inconsistent names categorical_split and split_categories to partition
…plit BestSplitter.node_split is now "just" 100 lines long
Note that pandas 0.15 will have a native data type for encoding categories: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#categoricals-in-series-dataframe We could make the decision trees able to deal with dataframe features natively. That would make it more natural for the user: no need to pass a feature mask. However, that would require some refactoring to support lazy, per-column …
Also it would make the graphical export of a single decision tree much easier to understand. Many users are interested in the structure of the learned trees when applied to categorical data.
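(For reference, a minimal sketch of what the pandas categorical dtype exposes; today the integer codes would still have to be extracted by hand before being passed to a tree.)

```python
import pandas as pd

# A categorical Series stores a category index plus per-row integer codes.
s = pd.Series(["low", "high", "medium", "low"], dtype="category")

print(list(s.cat.categories))  # ['high', 'low', 'medium']
print(list(s.cat.codes))       # [1, 0, 2, 1]
```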
Totally, the same applies to partial dependence plots as well.
Peter Prettenhofer
Hello, it seems like there hasn't been development on this PR in a while. Is there any idea of how far it is from completion? I would love to use it.
^ +1 Thanks!
I think this is a somewhat significant addition, and it doesn't look like anyone has worked on it recently. I think most sklearn people are excited about it, but no one has had the time to work on it. Help welcome.
I don't have time to work on it for the moment. It wasn't so far away from completion, but there have been some major changes in the master code.
+1 for this. Especially in combination with the pandas categorical data type.
I am 90% certain that the input will not be the pandas categorical data type, at least in the first iteration. I'm sure @GaelVaroquaux has opinions about this ^^
Yup! Input should be a common denominator to libraries and applications. Therefore the only thing that we have is numpy arrays.
Some people might argue that a dataframe is a much better common denominator, as mixed datatypes are the norm, and homogeneous datatypes are a special case that only appears in some obscure imaging techniques ;)
Hi,
#4899 is the latest news.
I have been working on fairly large (millions of rows) problems with features that have large numbers (thousands) of categories. H2O is working pretty well, but I would greatly prefer to stick with scikit-learn only. I believe that H2O is using the algorithm described in "A Streaming Parallel Decision Tree Algorithm". I am thinking it would probably be easier to add SPDT rather than modify random forest.
Should this be closed in favor of continuing the discussion in #4899?
I suppose so.
Just checking in to see if there is any progress on this issue?
Any update on this issue?
Hi everyone, I was not able to work on this PR for a long time, and when I went back, a lot of things had changed in the scikit-learn codebase. I also have less available time than when I started the PR, so I have to stop here. You can use catboost, wait for #12866 (or try to continue where I left off 😉).
Contrary to many algorithms that can only use dummy variables, decision trees can behave differently for categorical data: the two children of a split node partition the categories. We can expect better accuracy in some cases, and it limits the number of dummy columns. This is the default behavior of R's randomForest package.
I am currently implementing that in sklearn, using the Cython classes. I propose to add a categorical_features option to the decision tree classes (DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor). This option could be, like in other modules of sklearn, None, 'all', a mask or a list of features.
Each feature could have up to 32 categories, because we will have to test all the combinations, i.e. 2**31 cases. This limit allows us to use a binary representation of a split. The same limit exists in R, I think for the same reasons.
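(A sketch of the binary representation mentioned above, assuming bit i of an int32 encodes whether category i goes to the left child; the exact layout used in the PR may differ.)

```python
# Bit i set means category i goes to the left child; with at most 32
# categories the whole partition fits in a single int32.
split = 0b1011  # categories 0, 1 and 3 go left, everything else goes right

def goes_left(category_code, split):
    return (split >> category_code) & 1 == 1

print(goes_left(3, split))  # True
print(goes_left(2, split))  # False
```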
This is a work in progress and not ready to be merged. I prefer to release it early so that I can get feedback.