-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Uses binned values from training to find missing values #16883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Uses binned values from training to find missing values #16883
Conversation
The proposed changes seem equivalent (potentially slower?)
Correct. When there are no missing values, we only do one scan (left to right).
How? |
Ah yes, I was mistaken, this PR is not strictly needed.
Although this change is faster: from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
import numpy as np
rng = np.random.RandomState(42)
X = rng.rand(100_000, 200)
mask = rng.binomial(1, 0.1, size=X.shape).astype(np.bool)
X[mask] = np.nan
mapper = _BinMapper()
X_trans = mapper.fit_transform(X) %timeit (X_trans == mapper.missing_values_bin_idx_).any(axis=0).astype(np.uint8)
# 2.31 ms ± 69.6 µs per loop
%timeit np.isnan(X).any(axis=0).astype(np.uint8)
# 14.4 ms ± 394 µs per loop Secretly, I am using this change for categorical support and this PR may be an improvement (at least in terms of speed). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK nice, thanks for the quick benchmark.
I guess we can merge without a +2?
Looks okay to me. |
Reverted this from master in c9ca1d7 since it broke master |
It was broken because of #13022 which does not work for new versions of sphinx. |
Will be fixed with #17724 |
Ah okay I see. |
…earn#16883) * ENH Uses training data to find missing values * CLN Uses the binned data to find missing value
…cikit-learn#16883)" This reverts commit e5cc2b0.
…earn#16883) * ENH Uses training data to find missing values * CLN Uses the binned data to find missing value
…cikit-learn#16883)" This reverts commit e5cc2b0.
From my understanding: If the training data,
X_binned_train
, has no missing values, we would not need to do the right to left sweep when splitting. Secondly, this seems leaks data from the validation set,X_binned_val
, into training.@NicolasHug What do you think?