Closed
Description
Describe the bug
I trigger a segfault in HistGradientBoostingClassifier. ~~I could trigger it during cross-validation with n_jobs=-1 and n_jobs=1.~~ Actually, I can no longer trigger it with n_jobs=1, but I could before (in a run without a random_state set).
I am using both missing-value handling and categorical-feature handling at the same time. I don't know whether that could be part of the issue.
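One way to check whether the combination of missing values and categorical features is involved would be to look at what the encoded categorical columns actually contain. The sketch below is only a diagnostic idea, not a confirmed cause: it counts NaN and negative entries in a single encoded column. The helper name `summarize_codes` and the example values are hypothetical.

```python
import math


def summarize_codes(values):
    """Count missing (NaN) and negative entries in one encoded column."""
    n_missing = 0
    n_negative = 0
    max_code = None
    for value in values:
        value = float(value)
        if math.isnan(value):
            n_missing += 1
            continue
        if value < 0:
            n_negative += 1
        if max_code is None or value > max_code:
            max_code = value
    return {"n_missing": n_missing, "n_negative": n_negative, "max_code": max_code}


# Hypothetical encoded column: 0..2 stand for known categories, -1 is the
# unknown_value passed to OrdinalEncoder in the reproducer, NaN is a missing value.
example = [0.0, 2.0, float("nan"), -1.0, 1.0]
print(summarize_codes(example))
```

With the reproducer below, the same helper could be applied to each of the first `len(categorical_columns)` columns of `preprocessing.fit_transform(X)`, assuming the transformer returns a dense array with the encoded columns first.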
Steps/Code to Reproduce
# %%
import pandas as pd
target_name = "RainTomorrow"
data = pd.read_csv("./weather.csv", parse_dates=["Date"])
data = data.dropna(axis="index", subset=[target_name])
X, y = data.drop(columns=["Date", target_name]), data[target_name]
# %%
X.info()
# %%
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector
categorical_columns = make_column_selector(dtype_include=object)(X)
preprocessing = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical_columns,
    ),
    remainder="passthrough",
)
# %%
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingClassifier
model = make_pipeline(
    preprocessing,
    HistGradientBoostingClassifier(
        categorical_features=range(len(categorical_columns)),
        random_state=0,
    ),
)
# %%
from sklearn.model_selection import cross_validate
cross_validate(model, X, y, n_jobs=-1)
I am also attaching the dataset that I used to trigger the problem.
I tried to reproduce with a random dataset containing both categorical and missing values, but it did not segfault.
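For reference, a random dataset mixing categorical and missing values could be generated roughly as follows. This is a minimal sketch using only the standard library. The column names (`wind_dir`, `humidity`, `target`), the category labels, and the missing rate are made up for illustration and are not taken from the attached weather dataset.

```python
import random


def make_synthetic_columns(n_rows=1000, missing_rate=0.1, seed=0):
    """Build one categorical and one numeric column, each with missing values."""
    rng = random.Random(seed)
    categories = ["N", "S", "E", "W"]  # hypothetical category labels

    def maybe_missing(value, missing):
        # Replace the value by a missing marker with probability missing_rate.
        return missing if rng.random() < missing_rate else value

    columns = {"wind_dir": [], "humidity": [], "target": []}
    for _ in range(n_rows):
        columns["wind_dir"].append(maybe_missing(rng.choice(categories), None))
        columns["humidity"].append(
            maybe_missing(rng.uniform(0.0, 100.0), float("nan"))
        )
        columns["target"].append(rng.choice(["No", "Yes"]))
    return columns


columns = make_synthetic_columns(n_rows=200, missing_rate=0.2, seed=0)
print({name: len(values) for name, values in columns.items()})
```

The result can be turned into a dataframe with `pd.DataFrame(columns)`, split into `X` and `y` on the `target` column, and passed to the same pipeline after recomputing `categorical_columns` for the new `X`.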
Expected Results
At least it should not segfault.
Actual Results
---------------------------------------------------------------------------
TerminatedWorkerError Traceback (most recent call last)
~/Documents/scratch/bug_hist_gradient_boosting.py in <module>
40 from sklearn.model_selection import cross_validate
41
----> 42 cross_validate(model, X, y, n_jobs=-1)
~/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
265 # independent, and that it is pickle-able.
266 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 267 results = parallel(
268 delayed(_fit_and_score)(
269 clone(estimator),
~/Documents/packages/joblib/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time
~/Documents/packages/joblib/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())
~/Documents/packages/joblib/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
442 raise CancelledError()
443 elif self._state == FINISHED:
--> 444 return self.__get_result()
445 else:
446 raise TimeoutError()
~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
387 if self._exception:
388 try:
--> 389 raise self._exception
390 finally:
391 # Break a reference cycle with the exception in self._exception
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11)}
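The traceback above only shows that a worker died with SIGSEGV, not where. One way to get more detail might be to rerun the script with Python's `faulthandler` enabled through the `PYTHONFAULTHANDLER` environment variable. The assumption here is that the joblib worker processes inherit the parent's environment, so a crashing worker would dump its Python-level stack to stderr. The helper name `run_with_faulthandler` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_faulthandler(script_path):
    """Run a script in a child interpreter with faulthandler enabled at startup."""
    # PYTHONFAULTHANDLER=1 has the same effect as calling faulthandler.enable()
    # at interpreter startup; child processes that inherit this environment
    # get it enabled as well.
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )
```

For example, `result = run_with_faulthandler("bug_hist_gradient_boosting.py")` followed by `print(result.stderr)` would show any stack dump that reached the captured stderr.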
Versions
System:
python: 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:38:21) [Clang 11.1.0 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python
machine: macOS-11.6-arm64-arm-64bit
Python dependencies:
pip: 21.2.4
setuptools: 58.2.0
sklearn: 1.1.dev0
numpy: 1.21.2
scipy: 1.7.1
Cython: 0.29.24
pandas: 1.3.3
matplotlib: 3.4.3
joblib: 1.0.1
threadpoolctl: 3.0.0
Built with OpenMP: True