RFECV race condition on estimator

In RFECV, at
https://github.com/scikit-learn/scikit-learn/blob/fb3ed90fb501a755ce2938fb566bd0f6e2235054/sklearn/feature_selection/_rfe.py#L723-L726
the estimator is passed as-is to the fit function. Since `fit` modifies the object without copying, this is prone to race conditions (see example below).

Contrast this to BaseSearchCV, where the estimator is properly cloned:
https://github.com/scikit-learn/scikit-learn/blob/fb3ed90fb501a755ce2938fb566bd0f6e2235054/sklearn/model_selection/_search.py#L823-L833

---

<details>
<summary>On my system, with parameter `n_jobs=-1`, I got the following error:</summary>

```
5 fits failed with the following error:
Traceback (most recent call last):
  File ".../site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File ".../site-packages/sklearn/feature_selection/_rfe.py", line 723, in fit
    scores = parallel(
  File ".../site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File ".../site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File ".../site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File ".../site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File ".../site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File ".../site-packages/sklearn/utils/fixes.py", line 117, in __call__
    return self.function(*args, **kwargs)
  File ".../site-packages/sklearn/feature_selection/_rfe.py", line 37, in _rfe_single_fit
    return rfe._fit(
  File ".../site-packages/sklearn/feature_selection/_rfe.py", line 327, in _fit
    self.scores_.append(step_score(self.estimator_, features))
  File ".../site-packages/sklearn/feature_selection/_rfe.py", line 40, in <lambda>
    lambda estimator, features: _score(
  File ".../site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File ".../site-packages/sklearn/metrics/_scorer.py", line 219, in __call__
    return self._score(
  File ".../site-packages/sklearn/metrics/_scorer.py", line 261, in _score
    y_pred = method_caller(estimator, "predict", X)
  File ".../site-packages/sklearn/metrics/_scorer.py", line 71, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
  File ".../site-packages/sklearn/ensemble/_forest.py", line 835, in predict
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
AttributeError: 'list' object has no attribute 'take'
```
</details>

It is generated from the following snippet:
```py
  rf = RandomForestClassifier()
  rfecv = RFECV(rf, scoring='accuracy', n_jobs=-1)
  rfecv.fit(X_train, y_train)
```

The error appears to happen because `n_outputs_` is not constant between runs. The error does not happen without parallelism.

	delayed(_fit_and_score)(
	clone(base_estimator),
	X,
	y,
	train=train,
	test=test,
	parameters=parameters,
	split_progress=(split_idx, n_splits),
	candidate_progress=(cand_idx, n_candidates),
	**fit_and_score_kwargs,
	)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

RFECV race condition on estimator #23533

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	scores = parallel(
	func(rfe, self.estimator, X, y, train, test, scorer)
	for train, test in cv.split(X, y, groups)
	)

Uh oh!

RFECV race condition on estimator #23533

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions