Segfault in HistGradientBoostingClassifier #21283

@glemaitre

Describe the bug

I can trigger a segfault in HistGradientBoostingClassifier. ~~I could trigger it during cross-validation with n_jobs=-1 and n_jobs=1.~~ Actually, I can no longer trigger it with n_jobs=1, although I could before (in a run without random_state set).

I am using missing values and categorical feature handling at the same time. I don't know whether that could be one of the causes.

Steps/Code to Reproduce

# %%
import pandas as pd

target_name = "RainTomorrow"
data = pd.read_csv("./weather.csv", parse_dates=["Date"])
data = data.dropna(axis="index", subset=[target_name])
X, y = data.drop(columns=["Date", target_name]), data[target_name]

# %%
X.info()

# %%
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

categorical_columns = make_column_selector(dtype_include=object)(X)
preprocessing = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical_columns,
    ),
    remainder="passthrough",
)

# %%
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingClassifier

model = make_pipeline(
    preprocessing,
    HistGradientBoostingClassifier(
        categorical_features=range(len(categorical_columns)),
        random_state=0,
    ),
)

# %%
from sklearn.model_selection import cross_validate

cross_validate(model, X, y, n_jobs=-1)

I am also attaching the dataset that I used to trigger the problem.

weather.csv

I tried to reproduce the problem with a random dataset containing both categorical features and missing values, but it did not segfault.
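The synthetic attempt itself is not included in this report. As a minimal sketch of how such a dataset could be built, the helper name `make_synthetic_frame`, the column names, and the 10% missing rate below are all assumptions and not what was actually run:

```python
import numpy as np
import pandas as pd


def make_synthetic_frame(n_samples=1_000, n_categories=5, missing_rate=0.1, seed=0):
    """Hypothetical helper: one categorical and one numerical column, both with NaN."""
    rng = np.random.default_rng(seed)
    categories = np.array([f"cat_{i}" for i in range(n_categories)], dtype=object)
    X = pd.DataFrame(
        {
            "categorical": rng.choice(categories, size=n_samples),
            "numerical": rng.normal(size=n_samples),
        }
    )
    # Force object dtype so make_column_selector(dtype_include=object) selects it.
    X["categorical"] = X["categorical"].astype(object)
    # Blank out a random subset of cells, independently for each column.
    for column in X.columns:
        mask = rng.random(n_samples) < missing_rate
        X.loc[mask, column] = np.nan
    # Binary target with no missing values.
    y = pd.Series(rng.integers(0, 2, size=n_samples), name="target")
    return X, y


X_synthetic, y_synthetic = make_synthetic_frame()
```

`X_synthetic` and `y_synthetic` can then be passed to the same `model` and `cross_validate` call as above, in place of the weather data.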

Expected Results

At least it should not segfault.

Actual Results

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
~/Documents/scratch/bug_hist_gradient_boosting.py in <module>
      40 from sklearn.model_selection import cross_validate
      41 
----> 42 cross_validate(model, X, y, n_jobs=-1)

~/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    265     # independent, and that it is pickle-able.
    266     parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 267     results = parallel(
    268         delayed(_fit_and_score)(
    269             clone(estimator),

~/Documents/packages/joblib/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~/Documents/packages/joblib/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~/Documents/packages/joblib/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    442                     raise CancelledError()
    443                 elif self._state == FINISHED:
--> 444                     return self.__get_result()
    445                 else:
    446                     raise TimeoutError()

~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    387         if self._exception:
    388             try:
--> 389                 raise self._exception
    390             finally:
    391                 # Break a reference cycle with the exception in self._exception

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}
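The joblib traceback only shows that a worker died, not where. One way to try to get a Python-level traceback at the point of the crash is the standard-library `faulthandler` module. The sketch below was not part of the original run, and it assumes that worker processes started afterwards inherit the environment variable:

```python
import faulthandler
import os
import sys

# Dump the Python traceback of all threads if this process receives a fatal
# signal such as SIGSEGV. sys.__stderr__ is the original stderr, which still
# has a real file descriptor when sys.stderr was replaced (e.g. in a notebook).
faulthandler.enable(file=sys.__stderr__, all_threads=True)

# Assumption: worker interpreters started after this point inherit the
# environment and therefore enable faulthandler at startup.
os.environ["PYTHONFAULTHANDLER"] = "1"
```

Running this before the `cross_validate` call should make a crashing process print the Python frames that were active when the segfault happened.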

Versions

System:
    python: 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:38:21)  [Clang 11.1.0 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python
   machine: macOS-11.6-arm64-arm-64bit

Python dependencies:
          pip: 21.2.4
   setuptools: 58.2.0
      sklearn: 1.1.dev0
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.3.3
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 3.0.0

Built with OpenMP: True
