Random Forest Prediction and np.nan #19767

Closed
cap-jmk opened this issue Mar 25, 2021 · 4 comments

@cap-jmk

cap-jmk commented Mar 25, 2021

Describe the bug

When I run my processing workflow on a "dirty" dataset containing np.nan values, the algorithm does not handle them.

Steps/Code to Reproduce

Example:

import numpy as np

# random_forest is a previously fitted random forest estimator (fitting code not shown)
list_with_nan_values = [np.nan] * 10
prediction = random_forest.predict(list_with_nan_values)

Expected Results

No error is thrown, or at least a warning is raised that an np.nan value is present.

Actual Results

setting an array element with a sequence. The requested array 
has an inhomogeneous shape after 1 dimensions. 
The detected shape was (10,) + inhomogeneous part.

Versions


System:
    python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:59:12)  [Clang 11.0.1 ]
executable: /Users/julian/opt/anaconda3/envs/cadd-course/bin/python
   machine: Darwin-20.3.0-x86_64-i386-64bit

Python dependencies:
          pip: 21.0.1
   setuptools: 49.6.0.post20210108
      sklearn: 0.24.1
        numpy: 1.20.1
        scipy: 1.6.1
       Cython: None
       pandas: 1.2.3
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

@NicolasHug
Member

Random forests don't support NaNs, in fit or in predict. Is this a feature request or a bug report? The HistGradientBoosting estimators are the only ones to natively support NaNs (I think). You'll need to impute the data if you want to use RFs.
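As a minimal sketch of the imputation route mentioned above: the snippet below puts a `SimpleImputer` in front of a random forest in a pipeline, so the same imputation is applied in both fit and predict. The toy data, the mean strategy, and the choice of `RandomForestClassifier` are assumptions made for illustration; they are not taken from the reporter's workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration): 6 samples, 2 features, some missing.
X_train = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [2.0, np.nan],
    [8.0, 9.0],
    [9.0, np.nan],
    [np.nan, 8.0],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Replace each missing entry with its column mean, then fit the forest.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)

# The pipeline applies the fitted imputer again before predicting.
X_new = np.array([[np.nan, 2.5], [8.5, np.nan]])
predictions = model.predict(X_new)
```

Whether mean imputation is appropriate depends on the data; `strategy="median"` or `"most_frequent"` are drop-in alternatives.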

The error message you get is strange. How did you fit the estimator? You should get something like

ValueError: Expected 2D array, got 1D array instead:
array=[nan nan nan nan nan nan nan nan nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
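To illustrate the two reshape options named in that message, here is a small NumPy-only sketch. The variable names are hypothetical; the point is only that a flat sequence of 10 values can be read either as 10 samples with one feature or as one sample with 10 features.

```python
import numpy as np

# A flat array of 10 NaN values, like the list in the report.
values = np.full(10, np.nan)

# Interpretation 1: 10 samples, each with a single feature.
as_one_feature = values.reshape(-1, 1)

# Interpretation 2: a single sample with 10 features.
as_one_sample = values.reshape(1, -1)

print(as_one_feature.shape)
print(as_one_sample.shape)
```

Reshaping only fixes the dimensionality; the NaN values themselves still have to be imputed or dropped.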

@glemaitre
Member

X is expected to have shape (n_samples, n_features), and there is no support for missing values in random forests. We have some work going on to support missing values natively in decision trees in scikit-learn: #5974

I am closing this issue because it looks like a duplicate.

@glemaitre
Member

@MQSchleich Feel free to comment in case we missed something in your original report, in which case we would reopen this issue.

@cap-jmk
Author

cap-jmk commented Mar 27, 2021

Random forests don't support NaNs, in fit or in predict. Is this a feature request or a bug report? The HistGradientBoosting estimators are the only ones to natively support NaNs (I think). You'll need to impute the data if you want to use RFs.

The error message you get is strange. How did you fit the estimator? You should get something like

ValueError: Expected 2D array, got 1D array instead:
array=[nan nan nan nan nan nan nan nan nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Honestly, I was just reproducing a case without giving you the whole code base. The bug occurred when getting input from a data frame after converting it to a list, having 1 np.nan value. However, it is not bad at all that the forest throws an error. Only the message was so confusing, I needed some time to figure out what was going on. Therefore, I suggest it is actually better to not throw an error but continue processing and to throw a warning.
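One way to get a clearer message on the user side is to check the input before it reaches `predict`. The sketch below is a hypothetical helper (`warn_if_nan` is not part of scikit-learn) that converts the input to a 2D float array and emits a warning when NaN values are present; treating a flat list as a single sample is an assumption.

```python
import warnings

import numpy as np


def warn_if_nan(X):
    """Hypothetical helper: return X as a 2D float array, warning on NaN."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        # Assumption: a flat list describes one sample with several features.
        X = X.reshape(1, -1)
    n_missing = int(np.isnan(X).sum())
    if n_missing:
        warnings.warn(
            f"Input contains {n_missing} NaN value(s); "
            "impute or drop them before calling predict."
        )
    return X


# A row taken from a data frame and converted to a list, with one missing value.
X_checked = warn_if_nan([1.0, np.nan, 3.0])
```

The helper only reports the problem; the returned array still contains the NaN and needs imputation before it is passed to an estimator that does not accept missing values.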
