increment_mean_and_var can now handle NaN values #10618
Conversation
sklearn/utils/extmath.py
Outdated
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)
this and other similar lines do not have test coverage. Make sure all the cases you intend to handle are tested.
@jnothman I am constantly getting these mismatches in the calculated array values. Since I get no error on my local system, it looks like the only way to figure out which line of code is creating this error is to undo all the new code and implement one step at a time, checking whether each gets a green tick. So in short: please let me know your views before I start doing that, as this kind of approach is going to take up a lot of waiting time since Travis and AppVeyor are extremely slow.
I must admit that it appears quite perplexing for something like this. If you really need to keep pushing to test your changes, you can limit the tests to the relevant modules by modifying the CI configuration.
(Then again, it seems AppVeyor is failing with the most recent Cython.)
Nah, it looks like Cython should have nothing to do with it. Perhaps the numpy version. Not sure... :\
Behaviour pertaining to numerical stability could have changed across numpy versions. Are you sure that when you do …
The docs for nansum may also give a relevant clue:
"""
In NumPy versions <= 1.8.0 Nan is returned for slices that are all-NaN or empty. In later versions zero is returned.
"""
Though that's not going to be the problem for numpy 1.10 :|
@jnothman I made np.ones() float because I suspected that the division was somehow becoming an integer division, i.e. 9/2 = 4 and not 4.5. Maybe because of that (though it was quite illogical), but even after making dtype=float the same errors are repeating, hence that is not the source of the problem. Probably I should move one step at a time; that would easily locate the source of the error.
As long as `__future__.division` is imported, that should not be an issue. I'll see if I have a moment to try to replicate the test failures.
Yes, downgrading numpy to 1.10.4 is sufficient to trigger the errors. I've not investigated the cause.
sklearn/utils/extmath.py
Outdated
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)

new_sample_count = np.count_nonzero(~np.isnan(X), axis=0)
This is the problem line: numpy < 1.12 does not support axis in count_nonzero. Unfortunately, it does not trigger a TypeError either, and just silently counts the total number of nonzeros. However, np.sum(~np.isnan(X), axis=0) will give the same result, as will len(X) - np.sum(np.isnan(X), axis=0).
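A minimal sketch of the portable alternative (toy array, not from the PR), counting non-NaN entries per column without the axis argument of count_nonzero:

import numpy as np

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])
# A boolean array sums as 0/1, so this counts non-NaN values per column
# on any numpy version, old or new.
new_sample_count = np.sum(~np.isnan(X), axis=0)
print(new_sample_count)  # [2 2]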
sklearn/utils/extmath.py
Outdated
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1])
This is not covered by tests.
sklearn/utils/extmath.py
Outdated
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
    new_sample_count *= np.ones(X.shape[1])
This is not covered by tests
sklearn/utils/extmath.py
Outdated
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
this does not match the condition.
X will be 2D
So remove that
@jnothman can you please review this?
Update the docstring of updated_sample_count and last_sample_count to reflect the internal change.
For the moment, they are my comments. But frankly, I am getting lost with what is happening.
sklearn/utils/extmath.py
Outdated
@@ -688,21 +688,41 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
    # old = stats until now
    # new = the current increment
    # updated = the aggregated stats
    if not isinstance(last_sample_count, np.ndarray):
Since we don't have much code that calls _incremental_mean_and_var in the code base, I would change last_sample_count and updated_sample_count to always be an ndarray. So remove this statement.
@glemaitre do you mean that I should change last_sample_count and updated_sample_count to an ndarray? This will require changes in various files, such as:
sklearn/decomposition/incremental_pca.py
sklearn/decomposition/tests/test_incremental_pca.py
sklearn/preprocessing/tests/test_data.py
sklearn/utils/tests/test_extmath.py
sklearn/utils/extmath.py
Outdated
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[-1])
We don't need that. X will always be 2D and the sum should always be an ndarray, shouldn't it?
sklearn/utils/extmath.py
Outdated
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
X will be 2D
sklearn/utils/extmath.py
Outdated
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
So remove that
sklearn/utils/extmath.py
Outdated
updated_sample_count = last_sample_count + new_sample_count

updated_mean = (last_sum + new_sum) / updated_sample_count
updated_mean[np.isinf(updated_mean)] = 0
Why do we care about inf here? It should fail with inf, shouldn't it?
Oh, is it because of a division by zero? You need to comment it then.
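A toy illustration of that division by zero (not the PR's code): a feature whose updated_sample_count is zero turns the mean into inf (x/0) or nan (0/0), which is why those entries get reset afterwards.

import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    print(np.array([5.0, 0.0]) / np.array([0.0, 0.0]))  # [inf nan]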
sklearn/utils/extmath.py
Outdated
@@ -711,7 +731,12 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
        last_over_new_count / updated_sample_count *
        (last_sum / last_over_new_count - new_sum) ** 2)
    updated_variance = updated_unnormalized_variance / updated_sample_count
    updated_variance[np.isnan(updated_variance)] = 0
    updated_variance[np.isinf(updated_variance)] = 0
add a comment that this is due to the division by zero
sklearn/utils/extmath.py
Outdated
# return vector only when required
if (updated_sample_count[0] == updated_sample_count).all():
I don't get what this statement is about.
sklearn/utils/tests/test_extmath.py
Outdated
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
Don't use \
sklearn/utils/tests/test_extmath.py
Outdated
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
    _incremental_mean_and_var(X1, [0, 0, 0, 0, 0], [0, 0, 0, 0, 0],
Directly use a numpy array; we should not accept lists.
sklearn/utils/tests/test_extmath.py
Outdated
              [np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X cannot be 1D. You can use a fully 2D matrix with only NaN.
Yes, but I considered those changes minimal, since it is one estimator and mainly tests.
@jnothman @glemaitre can you please review the code now? I have made all the changes according to your previous review.
sklearn/utils/tests/test_extmath.py
Outdated
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
X_means, X_variances, X_count = _incremental_mean_and_var(
X -> X1
sklearn/utils/tests/test_extmath.py
Outdated
              [600, np.nan, 170, 430, 300],
              [np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
It would be better if you just wrote out the relevant portion of A here
sklearn/utils/tests/test_extmath.py
Outdated
              [np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
Your X2 is all NaN. While this is a good test case to have, we really need to test whether it works with a succession of not-all-NaN data as well.
Or perhaps you just need the corresponding test of Scaler.partial_fit, which I suspect does not currently accumulate the total count correctly.
        self.n_samples_seen_)
_incremental_mean_and_var(
    X, self.mean_, self.var_,
    self.n_samples_seen_ * np.ones(X.shape[1]))
You need to keep n_samples_seen_ for each feature from iteration to iteration. I don't see how this could work atm. And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where n_samples_seen_ is constant even if there were NaNs).
For example, you might compress the updated count to a scalar if not np.any(np.diff(n_samples_seen_)).
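A sketch of that compression (the helper name and scope are assumed, not from the PR):

import numpy as np

def compress_sample_count(n_samples_seen):
    # Collapse the per-feature count back to a scalar when every feature
    # saw the same number of samples, preserving the old scalar behaviour.
    if not np.any(np.diff(n_samples_seen)):
        return int(n_samples_seen[0])
    return n_samples_seen

print(compress_sample_count(np.array([5, 5, 5])))  # 5
print(compress_sample_count(np.array([5, 4, 5])))  # [5 4 5]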
I don't get it. From the changes, I thought that self.n_samples_seen_ should always be an array.

And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where n_samples_seen_ is constant even if there were NaNs).

I thought that it would be easier to only use an array from now on. Only incremental_pca is affected apart from the StandardScaler, and those functions are private, so the end-user should not care.
n_samples_seen_ is not private IMO
@jnothman @glemaitre Sorry for being inactive for the past week. Shall I keep self.n_samples_seen_ a vector or a scalar?

PS: As of now, self.n_samples_seen_ is a vector in both StandardScaler and incremental_pca.
IMO it would be best for backwards compatibility to keep a scalar in the case when there are no NaNs or - for simplicity - in the case when all n_samples_seen are equal
Fair enough
@jnothman @glemaitre I'll make them generalised, i.e. if all n_samples_seen are equal, it will return a scalar instead of a vector.
        last_sample_count=self.n_samples_seen_)
_incremental_mean_and_var(
    X, last_mean=self.mean_, last_variance=self.var_,
    last_sample_count=self.n_samples_seen_ * np.ones(n_features))
We should not be supporting NaNs here. I think maybe we should still allow passing in a scalar n_samples_seen_, and _incremental_mean_and_var can broadcast it to n_features wide if appropriate.
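A sketch of that broadcasting (the helper name is assumed, not from the PR):

import numpy as np

def broadcast_sample_count(last_sample_count, n_features):
    # Accept either the legacy scalar or a per-feature array, and widen
    # the scalar to one entry per feature inside the helper.
    if np.ndim(last_sample_count) == 0:
        return np.full(n_features, last_sample_count, dtype=float)
    return np.asarray(last_sample_count, dtype=float)

print(broadcast_sample_count(10, 3))         # [10. 10. 10.]
print(broadcast_sample_count([1, 2, 3], 3))  # [1. 2. 3.]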
@@ -515,7 +517,7 @@ def test_standard_scaler_trasform_with_partial_fit():
    assert_array_less(zero, scaler_incr.var_ + epsilon)  # as less or equal
    assert_array_less(zero, scaler_incr.scale_ + epsilon)
    # (i+1) because the Scaler has been already fitted
-   assert_equal((i + 1), scaler_incr.n_samples_seen_)
+   assert_almost_equal((i + 1), scaler_incr.n_samples_seen_)
I think you mean assert_array_equal, not almost_equal if you are now trying to compare integer arrays rather than integer scalars
sklearn/utils/extmath.py
Outdated
@@ -643,7 +643,7 @@ def make_nonnegative(X, min_value=0):


 def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
-                              last_sample_count=0):
+                              last_sample_count=0, ignore_nan=True):
maybe to be strict this should be False by default
sklearn/utils/extmath.py
Outdated
updated_sample_count = last_sample_count + new_sample_count

warnings.filterwarnings('ignore')  # as division by 0 might happen
this needs to be in a catch_warnings context or else it changes the setting globally henceforth
In fact, I think you should be using with np.errstate
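A minimal sketch of that suggestion (toy values): np.errstate scopes the silenced floating-point warnings to the block, instead of mutating the process-wide warnings filter.

import numpy as np

sums = np.array([6.0, 0.0])
counts = np.array([3.0, 0.0])
with np.errstate(divide='ignore', invalid='ignore'):
    means = sums / counts    # 0/0 produces nan without a RuntimeWarning
means[np.isnan(means)] = 0   # reset the empty-feature slot afterwards
print(means)                 # [2. 0.]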
@jnothman I can do it, but the rest of the function only has calculations, which are mostly divisions. Hence if I use with np.errstate, then practically the rest of the code (until just before the return statement) will come under it.
But you need a context manager in any case unless you modify your operands
@@ -689,28 +689,41 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
    # new = the current increment
    # updated = the aggregated stats
    last_sum = last_mean * last_sample_count
-   new_sum = X.sum(axis=0)
+   sum_func = np.nansum if ignore_nan else np.sum
I wonder if we should start with if not isnan(X.sum()): ignore_nan = False, and then use fast paths that don't involve triplicating the memory like ~isnan(new_sum) does.
Most of the lines in this function need to be modified to be able to ignore NaN values. Now, if we do not want to run expensive isnan-style computations, then there are 2 options (see the sketch below):
- create a separate part of the code which computes without ignoring NaN, and a separate part of the code which computes ignoring NaN values
- for each line of the code which includes an isnan function, check if ignore_nan is true or false and write the code accordingly, like: sumvar = np.nansum if ignore_nan else np.sum
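A sketch of the first option combined with the cheap detection (assumes X is a 2D float array; the names are illustrative, not the PR's code):

import numpy as np

def column_sums_and_counts(X, ignore_nan=True):
    # NaN propagates through a plain sum, so one scalar check decides
    # whether the per-column isnan masks are needed at all.
    if ignore_nan and not np.isnan(X.sum()):
        ignore_nan = False
    if ignore_nan:
        new_sum = np.nansum(X, axis=0)
        new_sample_count = np.sum(~np.isnan(X), axis=0)
    else:
        new_sum = X.sum(axis=0)
        new_sample_count = np.full(X.shape[1], X.shape[0])
    return new_sum, new_sample_count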
sklearn/utils/extmath.py
Outdated
    last_over_new_count / updated_sample_count *
    (last_sum / last_over_new_count - new_sum) ** 2)
# updated_unnormalized_variance can be both NaN or Inf
updated_unnormalized_variance[np.isnan(
you can use ~np.isfinite
can we just use updated_unnormalized_variance[np.logical_not(new_sample_count)] = 0 or something similar?
@jnothman yeah sure, but not exactly new_sample_count; rather updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0, because consider the case:
- last_sample_count = [1, 3, 5, 7, 9]
- new_sample_count = [3, 1, 0, 5, 9]
- updated_unnormalized_variance[2] should not be zeroed just because new_sample_count[2] == 0
Fine. Seems simpler than isnan and isinf to me
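A toy illustration of the agreed masking (values are illustrative): only a feature that has never been observed at all gets its variance reset.

import numpy as np

last_sample_count = np.array([0, 3, 5])
new_sample_count = np.array([0, 1, 0])
updated_sample_count = last_sample_count + new_sample_count  # [0 4 5]
updated_unnormalized_variance = np.array([np.nan, 2.0, 3.0])
# np.logical_not is True only where the updated count is zero.
updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0
print(updated_unnormalized_variance)  # [0. 2. 3.]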
@@ -1024,6 +1025,8 @@ def check_estimators_nan_inf(name, estimator_orig):
    error_string_transform = ("Estimator doesn't check for NaN and inf in"
                              " transform.")
    for X_train in [X_train_nan, X_train_inf]:
+       if np.any(np.isnan(X_train)) and name in ALLOW_NAN:
Same comment as: #10437 (comment)
…ikit-learn into sparseMatrix-test
Sparse matrix test
@pinakinathc OK, ping me when you have addressed all the points to be reviewed.
Reference Issues/PRs
#10457 check if incremental_mean_and_var gives a green tick without failing in numerical_stability
What does this implement/fix? Explain your changes.
Any other comments?