Skip to content

ENH Uses binned values from training to find missing values #16883

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

thomasjpfan
Copy link
Member

From my understanding: If the training data, X_binned_train, has no missing values, we would not need to do the right to left sweep when splitting. Secondly, this seems leaks data from the validation set, X_binned_val, into training.

@NicolasHug What do you think?

@NicolasHug
Copy link
Member

The proposed changes seem equivalent (potentially slower?)

From my understanding: If the training data, X_binned_train, has no missing values, we would not need to do the right to left sweep when splitting.

Correct. When there are no missing values, we only do one scan (left to right).

Secondly, this seems leaks data from the validation set, X_binned_val, into training.

How? has_missing_values is computed with X_train, not X.

@thomasjpfan
Copy link
Member Author

thomasjpfan commented Apr 10, 2020

How? has_missing_values is computed with X_train, not X.

Ah yes, I was mistaken, this PR is not strictly needed.

The proposed changes seem equivalent (potentially slower?)

Although this change is faster:

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(100_000, 200)
mask = rng.binomial(1, 0.1, size=X.shape).astype(np.bool)
X[mask] = np.nan

mapper = _BinMapper()
X_trans = mapper.fit_transform(X)
%timeit (X_trans == mapper.missing_values_bin_idx_).any(axis=0).astype(np.uint8)
# 2.31 ms ± 69.6 µs per loop

%timeit np.isnan(X).any(axis=0).astype(np.uint8)
# 14.4 ms ± 394 µs per loop

Secretly, I am using this change for categorical support and this PR may be an improvement (at least in terms of speed).

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK nice, thanks for the quick benchmark.

I guess we can merge without a +2?

@thomasjpfan
Copy link
Member Author

I guess we can merge without a +2?

Looks okay to me.

@adrinjalali adrinjalali merged commit e5cc2b0 into scikit-learn:master Jun 25, 2020
@adrinjalali
Copy link
Member

Reverted this from master in c9ca1d7 since it broke master

@thomasjpfan
Copy link
Member Author

It was broken because of #13022 which does not work for new versions of sphinx.

@thomasjpfan
Copy link
Member Author

Will be fixed with #17724

@thomasjpfan
Copy link
Member Author

thomasjpfan commented Jun 25, 2020

Ah okay I see.

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
…earn#16883)

* ENH Uses training data to find missing values

* CLN Uses the binned data to find missing value
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…earn#16883)

* ENH Uses training data to find missing values

* CLN Uses the binned data to find missing value
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants