Skip to content

Add streaming option to decision trees #18888

Open
@PSSF23

Description

@PSSF23

Collaborative effort of the @neurodata lab, under the directions of @jovo.

Describe the workflow you want to enable

The current sklearn.tree module is designed for batch learning only and thus not capable of handling streaming data. I hope to make the tree builders capable of updating the existing tree and make streaming classifications possible.

Describe your proposed solution

Stream decision tree and forest classifiers that use the partial_fit function to update existing tree structures. The function would build upon existing branches and incorporate new batches of data into the decision trees. Combined into simple ensembles, updated trees could achieve higher accuracy and produce consistent results.

We compared our algorithms with existing stream estimators: Hoeffding tree [1] from river and Mondrian forest [2] from scikit-garden, in three classification tasks:

Random forest and batch decision tree classifiers were also included. The benchmarks can be found here.

Although our work was initially inspired by Hoeffding tree and Mondrian forest, they performed very inconsistently and caused severe memory issues. The disadvantages made the implementations unsuitable for high-dimensional classifications and large sample sizes. Thus, our simple approach turned out to be the optimal choice and could serve as a good starting point for exploring streaming trees.

First step - stream decision tree (#18889)

Implement the partial_fit function in DecisionTreeClassifier for modifying the existing tree with new batches of training samples. This step would check consistent shapes (features and classes) of training batches.

Potential changes needed on:

Second step - stream decision forest

Implement forest ensemble(s) that rely on the partial_fit function or add the function to existing classifiers like RandomForestClassifier.

Potential changes needed on:

Link for my fork & branch: https://github.com/PSSF23/scikit-learn-stream/tree/stream

References

[1] Pedro Domingos and Geoff Hulten. 2000. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). Association for Computing Machinery, New York, NY, USA, 71–80. https://doi.org/10.1145/347090.347107

[2] Lakshminarayanan, Balaji, Daniel M. Roy, and Yee Whye Teh. "Mondrian forests: Efficient online random forests." Advances in neural information processing systems 27 (2014): 3140-3148.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions