increment_mean_and_var can now handle NaN values #10618


Closed · wants to merge 26 commits

Conversation

pinakinathc
Contributor

Reference Issues/PRs

#10457: check that incremental_mean_and_var passes the numerical-stability tests in CI (gets a green tick without failing).

What does this implement/fix? Explain your changes.

Any other comments?

sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)
Member

this and other similar lines do not have test coverage. Make sure all the cases you intend to handle are tested.
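
For instance, a minimal test along these lines could exercise the NaN-handling branch (a sketch only, assuming the signature on this PR's branch, where last_sample_count may be an array and NaNs are ignored by default):

    import numpy as np
    from numpy.testing import assert_allclose
    from sklearn.utils.extmath import _incremental_mean_and_var

    def test_incremental_mean_and_var_with_nan():
        # Column 1 contains a NaN that should be skipped, not propagated.
        X = np.array([[600., 180., 170.],
                      [600., np.nan, 430.]])
        mean, var, count = _incremental_mean_and_var(
            X, last_mean=np.zeros(3), last_variance=np.zeros(3),
            last_sample_count=np.zeros(3))
        assert_allclose(count, [2, 1, 2])             # per-feature counts
        assert_allclose(mean, np.nanmean(X, axis=0))  # NaN ignored in means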

@pinakinathc
Contributor Author

@jnothman I keep getting these mismatches in the computed array values. Since I get no error on my local system, it looks like the only way to figure out which line of code is causing the failure is to revert all the new code and reapply it one step at a time, checking whether it gets a green tick.

So in short:

  • revert all changes
  • implement each step, like np.nansum() and np.nanvar(), one at a time and see whether it keeps getting a green tick

Please let me know your views before I start doing that, as this approach will involve a lot of waiting time (Travis and AppVeyor are extremely slow).

@jnothman
Member

I must admit that it appears quite perplexing for something like test_incremental_variance_ddof to succeed on one platform and fail drastically on others. Could I suggest you try installing an old version of cython (0.25.2 is failing) to see if this affects local test runs...?

If you really need to keep pushing to test your changes, you can limit the tests to relevant modules by modifying appveyor.yml and build_tools/travis/test_script.sh.

@jnothman
Member

(Then again, it seems appveyor is failing with the most recent cython)

@jnothman
Member

Nah, it looks like cython should have nothing to do with it. Perhaps numpy version. Not sure... :\

@jnothman
Member

Behaviour could have changed across numpy versions that pertains to numerical stability. Are you sure that when you do np.ones(...) you want them to be floats, not ints?

Member

@jnothman left a comment

The docs for nansum may also give a relevant clue:
"""
In NumPy versions <= 1.8.0 Nan is returned for slices that are all-NaN or empty. In later versions zero is returned.
"""

@jnothman
Member

Though that's not going to be the problem for numpy 1.10 :|

@pinakinathc
Contributor Author

@jnothman I made np.ones() float because I suspected the division was somehow becoming integer division, i.e. 9/2 = 4 rather than 4.5. But even after setting dtype=float the same errors keep appearing, so that is not the source of the problem.

I should probably move one step at a time; that would make it easy to locate the source of the error.

@jnothman
Member

jnothman commented Feb 12, 2018 via email

@jnothman
Member

jnothman commented Feb 12, 2018 via email

if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)

new_sample_count = np.count_nonzero(~np.isnan(X), axis=0)
Member

This is the problem line: numpy < 1.12 does not support axis in count_nonzero. Unfortunately, it does not trigger a TypeError either, and just silently counts the total number of nonzeros. However, np.sum(~np.isnan(X), axis=0) will give the same result, as will len(X) - np.sum(np.isnan(X), axis=0).
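
A sketch of the portable alternative, counting non-NaN entries per column without count_nonzero's axis argument:

    import numpy as np

    X = np.array([[1., np.nan, 3.],
                  [4., 5., np.nan]])

    # Both forms also work on numpy < 1.12:
    new_sample_count = np.sum(~np.isnan(X), axis=0)          # array([2, 1, 1])
    new_sample_count = len(X) - np.sum(np.isnan(X), axis=0)  # same result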

sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1])
Member

This is not covered by tests.

new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
    new_sample_count *= np.ones(X.shape[1])
Member

This is not covered by tests.


new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
Member

This does not match the condition.

Member

X will be 2D

Member

So remove that

@pinakinathc
Contributor Author

@jnothman can you please review this?

Member

@glemaitre left a comment

Update the docstring of updated_sample_count and last_sample_count to reflect the internal change.

For the moment, these are my comments, but frankly I am getting lost in what is happening.

@@ -688,21 +688,41 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
# old = stats until now
# new = the current increment
# updated = the aggregated stats
if not isinstance(last_sample_count, np.ndarray):
Member

Since we don't have much code that calls _incremental_mean_and_var in the code base, I would change last_sample_count and updated_sample_count to always be an ndarray. So remove this statement.

Contributor Author

@glemaitre do you mean that I should change last_sample_count and updated_sample_count to an ndarray? This will require changes in various files, such as:

  • sklearn/decomposition/incremental_pca.py
  • sklearn/decomposition/tests/test_incremental_pca.py
  • sklearn/preprocessing/tests/test_data.py
  • sklearn/utils/tests/test_extmath.py

sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[-1])
Member

We don't need that. X will always be 2D, and the sum should always be an ndarray, shouldn't it?


new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
Member

X will be 2D


new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
Member

So remove that

updated_sample_count = last_sample_count + new_sample_count

updated_mean = (last_sum + new_sum) / updated_sample_count
updated_mean[np.isinf(updated_mean)] = 0
Member

Why do we care about inf here? It should fail with inf, shouldn't it?

Member

Oh, is it because of a division by zero? You need to comment it then.

@@ -711,7 +731,12 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
last_over_new_count / updated_sample_count *
(last_sum / last_over_new_count - new_sum) ** 2)
updated_variance = updated_unnormalized_variance / updated_sample_count
updated_variance[np.isnan(updated_variance)] = 0
updated_variance[np.isinf(updated_variance)] = 0
Member

Add a comment that this is due to the division by zero.


# return vector only when required
if (updated_sample_count[0] == updated_sample_count).all():
Member

I don't get what this statement is about.

[np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
Member

Don't use \

X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
    _incremental_mean_and_var(X1, [0, 0, 0, 0, 0], [0, 0, 0, 0, 0],
Member

Use a numpy array directly; we should not accept lists.

[np.nan, np.nan, np.nan, np.nan, np.nan],
[np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
Member

X cannot be 1D. You can use a fully 2D matrix containing only NaN.

@glemaitre
Member

glemaitre commented Feb 15, 2018 via email

@pinakinathc
Contributor Author

pinakinathc commented Feb 16, 2018

@jnothman @glemaitre can you please review the code now? I have made all the changes according to your previous review.

[np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
X_means, X_variances, X_count = _incremental_mean_and_var(
Member

X -> X1

[600, np.nan, 170, 430, 300],
[np.nan, np.nan, np.nan, np.nan, np.nan],
[np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
Member

It would be better if you just wrote out the relevant portion of A here

[np.nan, np.nan, np.nan, np.nan, np.nan],
[np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
Member

Your X2 is all NaN. While this is a good test case to have, we really need to test whether it also works with a succession of not-all-NaN data.

Member

Or perhaps you just need the corresponding test of Scaler.partial_fit, which I suspect does not currently accumulate the total count correctly.

self.n_samples_seen_)
_incremental_mean_and_var(
    X, self.mean_, self.var_,
    self.n_samples_seen_ * np.ones(X.shape[1]))
Member

@jnothman Feb 28, 2018

You need to keep n_samples_seen_ for each feature from iteration to iteration. I don't see how this could work atm. And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where n_samples_seen_ is constant even if there were NaNs).

For example, you might compress the updated count to a scalar if not np.any(np.diff(n_samples_seen_))
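
A minimal sketch of that suggestion (the helper name is hypothetical, not code from this PR):

    import numpy as np

    def maybe_scalar_count(updated_sample_count):
        # Collapse to a scalar when every feature saw the same number of
        # samples (the backwards-compatible, NaN-free case).
        if not np.any(np.diff(updated_sample_count)):
            return int(updated_sample_count[0])
        return updated_sample_count

    maybe_scalar_count(np.array([3, 3, 3]))  # -> 3
    maybe_scalar_count(np.array([3, 2, 3]))  # -> array([3, 2, 3])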

Member

I don't get it. From the changes, I thought that self.n_samples_seen_ should always be an array.

And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where n_samples_seen_ is constant even if there were NaNs).

I thought it would be easier to always use an array from now on. Apart from StandardScaler, only incremental_pca is affected, and those functions are private, so the end user should not care.

Member

n_samples_seen_ is not private IMO

Contributor Author

@jnothman @glemaitre Sorry for being inactive for the past week. Shall I keep self.n_samples_seen_ a vector or a scalar?
PS: As of now, self.n_samples_seen_ is a vector in both StandardScaler and incremental_pca.

Member

IMO it would be best for backwards compatibility to keep a scalar in the case when there are no NaNs or - for simplicity - in the case when all n_samples_seen are equal

Member

Fair enough

Contributor Author

@jnothman @glemaitre I'll make them generalised, i.e. if all n_samples_seen are equal, it will return a scalar instead of a vector.

last_sample_count=self.n_samples_seen_)
_incremental_mean_and_var(
    X, last_mean=self.mean_, last_variance=self.var_,
    last_sample_count=self.n_samples_seen_ * np.ones(n_features))
Member

We should not be supporting NaNs here. I think maybe we should still allow passing in a scalar n_samples_seen_, and _incremental_mean_and_var can broadcast it to n_features wide if appropriate.
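
A possible broadcast, sketched under the assumption that the function normalises its input early (the helper name is hypothetical):

    import numpy as np

    def as_count_array(last_sample_count, n_features):
        # Accept either a scalar or a per-feature array and always
        # return an array of length n_features.
        last_sample_count = np.asarray(last_sample_count, dtype=np.float64)
        if last_sample_count.ndim == 0:
            last_sample_count = np.full(n_features, last_sample_count)
        return last_sample_count

    as_count_array(5, 3)  # -> array([5., 5., 5.])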

@@ -515,7 +517,7 @@ def test_standard_scaler_trasform_with_partial_fit():
assert_array_less(zero, scaler_incr.var_ + epsilon) # as less or equal
assert_array_less(zero, scaler_incr.scale_ + epsilon)
# (i+1) because the Scaler has been already fitted
assert_equal((i + 1), scaler_incr.n_samples_seen_)
assert_almost_equal((i + 1), scaler_incr.n_samples_seen_)
Member

I think you mean assert_array_equal, not almost_equal if you are now trying to compare integer arrays rather than integer scalars

@@ -643,7 +643,7 @@ def make_nonnegative(X, min_value=0):


def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
last_sample_count=0):
last_sample_count=0, ignore_nan=True):
Member

Maybe, to be strict, this should be False by default.

updated_sample_count = last_sample_count + new_sample_count

warnings.filterwarnings('ignore') # as division by 0 might happen
Member

this needs to be in a catch_warnings context or else it changes the setting globally henceforth

Member

In fact, I think you should be using with np.errstate
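
For illustration, np.errstate scopes the suppression to the block, unlike a bare warnings.filterwarnings('ignore'):

    import numpy as np

    sums = np.array([4., 0., 9.])
    counts = np.array([2., 0., 3.])

    with np.errstate(divide='ignore', invalid='ignore'):
        means = sums / counts          # 0./0. -> nan, x/0. -> inf, silently
    means[~np.isfinite(means)] = 0     # repair the empty columns afterwards
    # means is now array([2., 0., 3.])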

Contributor Author

@jnothman I can do that, but the rest of the function consists almost entirely of calculations, mostly divisions. Hence if I use with np.errstate, then practically all of the remaining code (up to the return statement) will fall under it.

Member

But you need a context manager in any case unless you modify your operands

@@ -689,28 +689,41 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
# new = the current increment
# updated = the aggregated stats
last_sum = last_mean * last_sample_count
new_sum = X.sum(axis=0)
sum_func = np.nansum if ignore_nan else np.sum
Member

I wonder if we should start with if not isnan(X.sum()): ignore_nan = False, and then use fast paths that don't involve triplicating the memory like ~isnan(new_sum) does.
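
A sketch of that fast path; it relies on the fact that a plain sum is NaN iff the array contains a NaN (barring inf - inf), so one cheap reduction decides the route:

    import numpy as np

    def contains_nan(X):
        # One O(n) reduction instead of materialising ~np.isnan(X).
        return np.isnan(X.sum())

    X = np.array([[1., 2.], [3., 4.]])
    ignore_nan = contains_nan(X)  # False: take the plain np.sum/np.var path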

Contributor Author

Most of the lines in this function need to be modified to be able to ignore NaN values. Now, if we want to avoid the expensive isnan-style operations, there are two options:

  • create a separate code path that computes without ignoring NaN, and a separate one that computes while ignoring NaN values
  • for each line that uses an isnan-style function, check whether ignore_nan is true or false and write the code accordingly, like: sum_func = np.nansum if ignore_nan else np.sum

last_over_new_count / updated_sample_count *
(last_sum / last_over_new_count - new_sum) ** 2)
# updated_unnormalized_variance can be both NaN or Inf
updated_unnormalized_variance[np.isnan(
Member

you can use ~np.isfinite

Member

can we just use updated_unnormalized_variance[np.logical_not(new_sample_count)] = 0 or something similar?

Contributor Author

@jnothman yeah sure, but not exactly new_sample_count; rather updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0, because consider the case (see the sketch below):

  • last_sample_count = [1, 3, 5, 7, 9]
  • new_sample_count = [3, 1, 0, 5, 9]
  • updated_unnormalized_variance[2] should not be zeroed just because new_sample_count[2] == 0
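
A sketch of the agreed fix on a toy example:

    import numpy as np

    updated_sample_count = np.array([4., 4., 0., 12., 18.])
    updated_unnormalized_variance = np.array([1.5, 2.0, np.nan, 3.0, 0.5])

    # Zero exactly the columns that have seen no samples at all;
    # np.logical_not(count) is True only where count == 0.
    updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0
    # -> array([1.5, 2. , 0. , 3. , 0.5])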

Member

Fine. Seems simpler than isnan and isinf to me

@@ -1024,6 +1025,8 @@ def check_estimators_nan_inf(name, estimator_orig):
error_string_transform = ("Estimator doesn't check for NaN and inf in"
" transform.")
for X_train in [X_train_nan, X_train_inf]:
if np.any(np.isnan(X_train)) and name in ALLOW_NAN:
Member

Same comment as: #10437 (comment)

@glemaitre
Member

I'll make them generalised i.e. if all n_samples_seen are equal, it will return a scalar instead of a vector.

@pinakinathc OK, ping me when you have addressed all the points to be reviewed.
