Description
I encounter a memory leak in decision trees, both on my own computer (Ubuntu 16.04, Anaconda Python 2.7, with a fresh scikit-learn from pip) and in a Kaggle kernel with Python 3.5 (https://www.kaggle.com/ppallesen/titanic/notebook388ea683bf/). It seems to be related to the number of jobs, since the leak is strongly reduced by setting the number of jobs to 1, so I suspect the leak is in the parallel code. The amount of memory retained also varies a lot from run to run, which seems odd.
import gc
import os
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
import psutil

p = psutil.Process(os.getpid())
X = np.random.normal(size=(10000, 50))
Y = np.random.binomial(1, 0.5, size=(10000, ))

def print_mem():
    # Resident set size (RSS) of the current process.
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()
for i in range(5):
    # Fit, delete the estimator, and force garbage collection;
    # the resident memory should return close to the baseline.
    et = ExtraTreesClassifier(n_estimators=1000, max_features=1,
                              n_jobs=4).fit(X, Y)
    del et
    gc.collect()
    print_mem()
Output:
115MB
402MB
387MB
715MB
703MB
879MB
So the memory consumption before fitting the ExtraTreesClassifier is about 115 MB, and after fitting and deleting it the consumption is several hundred megabytes higher, though not necessarily increasing from run to run.
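For comparison, here is the same loop with n_jobs=1 (a minimal sketch reusing the setup above); as noted, in my runs the growth in resident memory is strongly reduced in this single-process case:

import gc
import os
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
import psutil

p = psutil.Process(os.getpid())
X = np.random.normal(size=(10000, 50))
Y = np.random.binomial(1, 0.5, size=(10000, ))

def print_mem():
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()
for i in range(5):
    # Identical fit, but single-process: no parallel workers are
    # spawned, so any leak in the parallel path is avoided.
    et = ExtraTreesClassifier(n_estimators=1000, max_features=1,
                              n_jobs=1).fit(X, Y)
    del et
    gc.collect()
    print_mem()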