Segfault when passing Criterion object to Forest ensembles with n_jobs>1 #12623

Closed
@wyegelwel

Description

When a Criterion object, rather than a criterion string, is passed to RandomForest or ExtraTrees, I've observed segfaults during fitting when n_jobs > 1. In my case I've written a custom Criterion, but the problem can be reproduced with one of sklearn's built-in criteria by passing the Criterion object instead of the string.

I believe the problem is that the parameters aren't copied when the list of estimators for the ensemble is created, so the same Criterion object is used for all the trees. When n_jobs=1 this is OK, because the criterion is re-initialized at each split. When n_jobs>1, however, the same criterion is modified by multiple threads, which results in pointers being freed and then accessed.
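To illustrate the suspected sharing, here is a minimal pure-Python sketch. `ToyCriterion`, `ToyTree`, and `ToyForest` are hypothetical stand-ins, not sklearn classes, and the loop only mimics the "read the parameter with getattr, hand it to each estimator without copying" pattern described above.

```python
class ToyCriterion:
    """Hypothetical stand-in for a stateful Criterion object."""

    def __init__(self):
        self.buffer = []


class ToyTree:
    """Hypothetical stand-in for one tree of the ensemble."""

    def __init__(self):
        self.criterion = None


class ToyForest:
    """Hypothetical ensemble that forwards its parameters to each tree."""

    def __init__(self, criterion, n_estimators):
        self.criterion = criterion
        self.n_estimators = n_estimators

    def make_estimators(self):
        trees = []
        for _ in range(self.n_estimators):
            tree = ToyTree()
            # The parameter is read with getattr and passed on without a copy,
            # so every tree receives a reference to the same object.
            tree.criterion = getattr(self, "criterion")
            trees.append(tree)
        return trees


forest = ToyForest(ToyCriterion(), n_estimators=4)
trees = forest.make_estimators()

# True when every tree holds the very same criterion instance as the forest
shared = all(tree.criterion is forest.criterion for tree in trees)
print(shared)  # True
```

With a single worker, sharing one object is harmless as long as it is reset before each use. With several workers, each tree would be mutating the same instance concurrently.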

Steps/Code to Reproduce

The following code reproduces the segfault:

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree.tree import CRITERIA_REG
import numpy as np

X = np.random.random((1000, 3))
y = np.random.random((1000, 1))

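# Build a Criterion object instead of passing the string 'mse'
n_samples, n_outputs = y.shape
mse_criterion = CRITERIA_REG['mse'](n_outputs, n_samples)
rf = ExtraTreesRegressor(n_estimators=400, n_jobs=-1, criterion=mse_criterion)

# Fitting with n_jobs=-1 triggers the segfault
rf.fit(X, y)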

Versions

System

python: 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 21:41:56)  [GCC 7.3.0]

Python deps

sklearn: 0.20.0
setuptools: 40.2.0
pip: 10.0.1
Cython: 0.28.5
numpy: 1.13.3
pandas: 0.23.4
scipy: 1.1.0

Discussion

I've tried wrapping the getattr call in copy.deepcopy() for all the parameters accessed when making the estimators to fit, and that seems to fix the problem. Would that be an acceptable fix, or are you interested in a deeper fix?
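The idea behind the proposed change can be sketched as follows. This is not the actual sklearn code: `copy_params`, `ToyCriterion`, and `ToyForest` are hypothetical names used only to show `copy.deepcopy()` wrapped around `getattr`, so that each estimator gets its own parameter objects.

```python
import copy


class ToyCriterion:
    """Hypothetical stand-in for a stateful Criterion object."""

    def __init__(self):
        self.buffer = []


class ToyForest:
    """Hypothetical ensemble holding the user-supplied criterion."""

    def __init__(self, criterion):
        self.criterion = criterion


def copy_params(owner, param_names):
    """Return a dict of parameters where every value is a deep copy."""
    return {name: copy.deepcopy(getattr(owner, name)) for name in param_names}


forest = ToyForest(ToyCriterion())

# Each call simulates building the parameters for one estimator
params_a = copy_params(forest, ["criterion"])
params_b = copy_params(forest, ["criterion"])

# Mutating one copy leaves the other copy and the original untouched
params_a["criterion"].buffer.append(1)

independent = (
    params_a["criterion"] is not params_b["criterion"]
    and params_a["criterion"] is not forest.criterion
)
print(independent)  # True
```

The trade-off is one extra copy per parameter per estimator, and this only works for parameter objects that support deep copying.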
