Description
Collaborative effort of the @neurodata lab, under the directions of @jovo.
Describe the workflow you want to enable
The current sklearn.tree
module is designed for batch learning only and thus not capable of handling streaming data. I hope to make the tree builders capable of updating the existing tree and make streaming classifications possible.
Describe your proposed solution
Stream decision tree and forest classifiers that use the partial_fit
function to update existing tree structures. The function would build upon existing branches and incorporate new batches of data into the decision trees. Combined into simple ensembles, updated trees could achieve higher accuracy and produce consistent results.
We compared our algorithms with existing stream estimators: Hoeffding tree [1] from river
and Mondrian forest [2] from scikit-garden
, in three classification tasks:
Random forest and batch decision tree classifiers were also included. The benchmarks can be found here.
Although our work was initially inspired by Hoeffding tree and Mondrian forest, they performed very inconsistently and caused severe memory issues. The disadvantages made the implementations unsuitable for high-dimensional classifications and large sample sizes. Thus, our simple approach turned out to be the optimal choice and could serve as a good starting point for exploring streaming trees.
First step - stream decision tree (#18889)
Implement the partial_fit
function in DecisionTreeClassifier
for modifying the existing tree with new batches of training samples. This step would check consistent shapes (features and classes) of training batches.
Potential changes needed on:
Second step - stream decision forest
Implement forest ensemble(s) that rely on the partial_fit
function or add the function to existing classifiers like RandomForestClassifier
.
Potential changes needed on:
Link for my fork & branch: https://github.com/PSSF23/scikit-learn-stream/tree/stream
References
[1] Pedro Domingos and Geoff Hulten. 2000. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). Association for Computing Machinery, New York, NY, USA, 71–80. https://doi.org/10.1145/347090.347107
[2] Lakshminarayanan, Balaji, Daniel M. Roy, and Yee Whye Teh. "Mondrian forests: Efficient online random forests." Advances in neural information processing systems 27 (2014): 3140-3148.