
[WIP] Disregard nan fix #10457


Closed
wants to merge 11 commits into from

Conversation

pinakinathc
Contributor

@pinakinathc pinakinathc commented Jan 11, 2018

Reference Issues/PRs

Partially addresses: #10404

What does this implement/fix? Explain your changes.

  • StandardScaler -> sparse matrix path ignores NaN
  • Add more thorough tests in test_sparsefuncs.py
  • StandardScaler -> _incremental_mean_and_var ignores NaN
  • Add thorough tests in test_extmath.py

Any other comments?

Various functions which previously took last_n (the number of samples seen so far) when computing the updated mean and variance now take an additional argument: last_n_feat.

last_n is a scalar. It is not optional.
last_n_feat is optional and takes a vector. It is useful when different features have seen different numbers of samples, because some of them had NaN values which needed to be ignored.

To use last_n_feat instead of last_n:
function_name(................., last_n=0, last_n_feat=self.n_samples_seen_)
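As a minimal numpy illustration of why a per-feature vector is needed (not code from this PR): each feature can end up with a different effective sample count once NaNs are ignored.

import numpy as np

X = np.array([[1., np.nan],
              [2., 5.],
              [3., np.nan]])
# per-feature number of non-NaN samples
print(np.count_nonzero(~np.isnan(X), axis=0))  # -> [3 1]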

@jnothman
Member

I'm curious what you mean by "with increment".

@pinakinathc
Contributor Author

Sorry, there are a lot of errors; I am fixing them.

@pinakinathc
Contributor Author

@jnothman By increment I meant increment_mean_variance, i.e. the increment with respect to old_mean and old_variance.

@jnothman
Member

jnothman commented Jan 11, 2018 via email

@pinakinathc
Contributor Author

pinakinathc commented Jan 12, 2018

@jnothman Yes. Actually, the sparse matrix in partial_fit is processed by a lot of functions, so what I did in the previous commit (though it has errors which I am fixing now) was create similar modules with the capability of processing NaNs.

The good thing is that only the sparse-matrix case is difficult, since I cannot use any nanmean-style function there; so basically it is all implemented in sparsefuncs_fast.pyx.
For non-sparse matrices, I can use nanmean() and things are easier.

Also, let me try a bit more on this one (I love challenges), and then I shall try MinMaxScaler if that is okay :)

@pinakinathc
Contributor Author

pinakinathc commented Jan 12, 2018

@jnothman @glemaitre and others, can you please review the code and give me some insight into this codecov/patch error?

@@ -619,8 +632,12 @@ def partial_fit(self, X, y=None):
Ignored
"""
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
force_all_finite=False,
Member

You can put in your TODO to change this when there is an agreement on #10459

@glemaitre
Member

Before a detailed review, you can check which parts of the code are not covered and try to cover them:
https://codecov.io/gh/scikit-learn/scikit-learn/compare/b94876a3a20e1d629698ea1c7b358169b97bfcce...eda1f57b761da7c8cd1c361c0ecbcc87a191389d/diff

Since they are cpdef functions, you can test them without any issue.

Member

@glemaitre glemaitre left a comment

As a general comment, I would try to factorize your code as much as possible. I don't think we need separate NaN-specific functions; rather, propagate an argument so that the common part of the code is reused by each function.

scaler = StandardScaler(with_mean=False)
scaler.partial_fit(X_csr)
assert_almost_equal(scaler.mean_, 1.0)

Member

I would try to have a test with only NaN values in X, or at least a complete NaN column.
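For instance, a sketch of the kind of test being asked for, assuming the NaN-ignoring behaviour this PR adds (the expected statistics for an all-NaN column are discussed later in this thread):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

X = np.array([[1., np.nan],
              [2., np.nan],
              [3., np.nan]])  # second column is entirely NaN
scaler = StandardScaler(with_mean=False)
scaler.partial_fit(sp.csr_matrix(X))
# assert on scaler.mean_ / scaler.var_ for the all-NaN column here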

@@ -715,6 +715,41 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
return updated_mean, updated_variance, updated_sample_count


def _incremental_mean_and_var_nan(X, last_mean=.0, last_variance=None,
Member

I would factorize _incremental_mean_and_var_nan and _incremental_mean_and_var to avoid code duplication.
I would consider a new argument in _incremental_mean_and_var (e.g. allow_nan=False) and modify a single line:

if allow_nan:
    var_func = np.nanvar
else:
    var_func = np.var
...
new_unnormalized_variance = var_func(X, axis=0) * new_sample_count

"""This is identical to incr_mean_variance_axis0_nan
but can process NaN values
"""
if X.dtype != np.float32:
Member

@jnothman do we convert here because we are worried about overflow, or is this unnecessary?

@@ -126,6 +135,58 @@ def _csr_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
return means, variances


def _csr_mean_variance_axis0_nan(np.ndarray[floating, ndim=1, mode="c"] X_data,
Member

Where do you handle NaN?

@pinakinathc
Contributor Author

Thanks @glemaitre, I'll start working on these. And yes, there is a lot of code duplication; I'll pass an argument instead.

@pinakinathc
Contributor Author

@jnothman @glemaitre please check this. As @glemaitre pointed out regarding check_array() -> force_all_finite, I am waiting for #10459 to get merged.
Apart from this, are there any other issues?

@@ -28,8 +28,9 @@
from ..utils.sparsefuncs_fast import (inplace_csr_row_normalize_l1,
inplace_csr_row_normalize_l2)
from ..utils.sparsefuncs import (inplace_column_scale,
mean_variance_axis, incr_mean_variance_axis,
min_max_axis)
mean_variance_axis,
Member

Revert these changes. We tend to change only the code which is necessary (even if some parts of the existing code are not PEP8-compliant).

@@ -581,7 +582,7 @@ def _reset(self):
del self.mean_
del self.var_

def fit(self, X, y=None):
def fit(self, X, y=None, accepting_nan=True):
Member

I don't think that we allow changing the signature of fit.
In #10437, I added a class argument missing_values to accept NaN but also integers.
However, since this transformer returns a floating-point X, replacing the integers by NaN after converting the input should lead to the same underlying processing.
I think that you should go in that direction.

Contributor Author

@glemaitre should I add two class arguments, missing_values and ignore_nan, or only one of them?

Member

@glemaitre, your comment was unclear. Are you suggesting that we support non-NaN missing values as input here, or not?

I think for preprocessing that outputs floats, we might as well just support NaN missing values, at least for now.

I also don't think we need an ignore_nan param. I consider it fairly safe to have NaN supported by default. Others might disagree, though.

Contributor Author

@jnothman If we do not use ignore_nan, then any preprocessing class using functions like those in extmath.py or csr/csc_mean_var_axis0 will automatically ignore NaN.

So as of now I am not using ignore_nan, but if others want it at a later stage I will add it.

Member

@glemaitre, your comment was unclear. Are you suggesting that we support non-NaN missing values as input here, or not?

Sorry, that was 8 days ago, before the discussion in the issue.
I agree with you and @ogrisel to go for NaN support only for the moment. So we don't need any keyword for the class.

I also don't think we need an ignore_nan param.

@jnothman
_incremental_mean_and_var is also used in IncrementalPCA. There we don't want to deal with NaN, so we will need the parameter, I think, won't we?

Member

_incremental_mean_and_var is also used in IncrementalPCA. There we don't want to deal with NaN, so we will need the parameter, I think, won't we?

Actually, I think you are right that the NaN-ignoring computation will also work in the general case, and we don't really have a check at that point. My only concern would be a performance regression, since we will check for NaN values knowing that there are none, due to the earlier call to check_array. Would it have a significant impact?
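A rough way to gauge that overhead (illustrative only) is to time the plain and NaN-aware variants on data known to contain no NaN:

import numpy as np
from timeit import timeit

X = np.random.RandomState(0).rand(10000, 50)
print(timeit(lambda: np.var(X, axis=0), number=100))
print(timeit(lambda: np.nanvar(X, axis=0), number=100))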

self.mean_, self.var_ = mean_variance_axis(X, axis=0)
self.mean_, self.var_ = \
mean_variance_axis(X, axis=0,
accepting_nan=accepting_nan)
Member

I would rename accepting_nan to ignore_nan for these Cython functions.
In fact, they internally ignore NaN.

self.n_samples_seen_ = X.shape[0]
# Next passes
else:
self.mean_, self.var_, self.n_samples_seen_ = \
incr_mean_variance_axis(X, axis=0,
last_mean=self.mean_,
last_var=self.var_,
last_n=self.n_samples_seen_)
last_n=self.n_samples_seen_,
accepting_nan=accepting_nan)
Member

accepting_nan -> ignore_nan

@@ -592,13 +593,29 @@ def fit(self, X, y=None):

y
Ignored

accepting_nan : boolean
Member

Remove this.


for j in xrange(startptr, endptr):
means[i] += X_data[j]
means[i] /= n_samples
if X_data[j] != X_data[j] and accepting_nan == True:
Member

The same as before

@@ -244,7 +257,7 @@ def incr_mean_variance_axis0(X, last_mean, last_var, unsigned long last_n):
if X.dtype != np.float32:
X = X.astype(np.float64)
return _incr_mean_variance_axis0(X.data, X.shape, X.indices, X.indptr,
X.format, last_mean, last_var, last_n)
X.format, last_mean, last_var, last_n, accepting_nan)
Member

ignore_nan

@@ -254,7 +267,7 @@ def _incr_mean_variance_axis0(np.ndarray[floating, ndim=1] X_data,
X_format,
last_mean,
last_var,
unsigned long last_n):
unsigned long last_n, accepting_nan=False):
Member

ignore_nan

Missing docstring

@@ -288,11 +301,11 @@ def _incr_mean_variance_axis0(np.ndarray[floating, ndim=1] X_data,

if X_format == 'csr':
# X is a CSR matrix
new_mean, new_var = _csr_mean_variance_axis0(X_data, shape, X_indices)
new_mean, new_var = _csr_mean_variance_axis0(X_data, shape, X_indices, accepting_nan)
Member

ignore_nan

else:
# X is a CSC matrix
new_mean, new_var = _csc_mean_variance_axis0(X_data, shape, X_indices,
X_indptr)
X_indptr, accepting_nan)
Member

ignore_nan

@glemaitre
Member

So, after the discussion in #10404, you don't need to introduce a keyword (either in fit or in __init__).
By default you will call the functions with ignore_nan=True.

Member

@jnothman jnothman left a comment

You need to add tests to test_sparsefuncs and test_extmath.


warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)

# Accepting nan values but not infinite values like @glemaitre
Member

Please merge in master and use check_array with force_all_finite='allow-nan'
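That is, after merging master, the call above could become something like (a sketch):

X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                force_all_finite='allow-nan',
                warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)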

@@ -71,6 +71,8 @@
'RANSACRegressor', 'RadiusNeighborsRegressor',
'RandomForestRegressor', 'Ridge', 'RidgeCV']

# @glemaitre did the same thing, so basically things would be together
Member

Please don't add comments that are relevant to the PR but would not be relevant when merged. If you want to add comments about the PR, rather than about the code in its merged state, please do so in Github, not in the code.

if accepting_nan is True:
new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
else:
new_unnormalized_variance = X.var(axis=0) * new_sample_count
Member

The new sample count also needs updating, and will be different for different features.

Member

Given that tests are currently passing, I think we can safely say that the tests are not correct or are not strong enough yet.

else:
return _incr_mean_var_axis0(X.T, last_mean=last_mean,
last_var=last_var, last_n=last_n)
last_var=last_var, last_n=last_n,
Member

This is tricky, because last_n can now be different for each feature.

@pinakinathc
Contributor Author

@jnothman @glemaitre what should the mean of a column be where all the elements are NaN?
Should we report it as 0 or as NaN?
For cases where old_mean is incremented with new_mean, we should store the value as 0, not NaN.

But in other cases, say when there is no increment and we are simply finding the mean,
if someone prints the mean array to see the mean of every column, then the output should be NaN (when all the column's elements are NaN), not 0.

@jnothman
Member

We can have ignore_nan in utils without having a similar parameter in the preprocessing estimators.

@jnothman
Member

jnothman commented Feb 1, 2018

I'm not sure what the solution to that issue is, but perhaps start with just storing it as 0, because if you also store the number of elements that make up each feature's mean, you can determine that it's actually NaN...

@pinakinathc
Contributor Author

pinakinathc commented Feb 1, 2018 via email

@pinakinathc
Contributor Author

@jnothman @glemaitre this failed on AppVeyor but passed on Travis. I went to the link but could not understand the error.

Can you please help me understand this error?

Member

@jnothman jnothman left a comment

It seems to be implying that isnan is not available in libc on windows. Until recent versions of MSVC, it seems only the _isnan macro from float.h was available. See https://msdn.microsoft.com/en-us/library/tzthab44.aspx

I think you should be able to use something like from numpy.math cimport npy_isnan instead. See https://github.com/cython/cython/blob/master/Cython/Includes/numpy/math.pxd: "This exports the functionality of the NumPy core math library, aka npymath, which provides implementations of C99 math functions and macros for system with a C89 library (such as MSVC)"

@glemaitre
Member

What are you trying to do in your last commit?

@pinakinathc
Contributor Author

I was trying to make sure that _incremental_mean_and_var ignores NaN. For this I need to use n_samples_seen_ as an array, since each feature may have a different count. There are a lot of test cases which use it as a scalar, and this is creating conflicts.

@glemaitre
Member

I think that you messed up some rebasing or something like that. There are changes all over the place. Could you clean up?

@jnothman
Member

jnothman commented Feb 8, 2018 via email

Member

@jnothman jnothman left a comment

Hi @pinakinathc,

Please let us know when this is ready for a proper review.

I am a little worried that you are adding unnecessary complexity, and would appreciate it if you discussed some of your design.

new_unnormalized_variance[np.isnan(new_unnormalized_variance)] = 0
new_unnormalized_variance = (new_unnormalized_variance *
new_sample_count)
for i in xrange(n_features):
Member

I suspect you can do everything here with vectorized numpy operations and should avoid an explicit loop over each feature.
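For example, the per-feature sample counts and NaN-ignoring sums can be obtained without a Python loop (a sketch, not the PR's code):

import numpy as np

X = np.array([[1., np.nan], [2., 5.], [3., np.nan]])
new_sample_count = np.count_nonzero(~np.isnan(X), axis=0)  # per-feature counts
new_sum = np.nansum(X, axis=0)                             # NaN-ignoring sums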

@@ -688,29 +688,55 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
# old = stats until now
# new = the current increment
# updated = the aggregated stats
flag = 0 # if flag == 1 then last_sample_count was an array
Member

this is a bad variable name


# counts[j] contains the number of samples where feature j is non-zero
cdef np.ndarray[int, ndim=1] counts = np.zeros(n_features,
dtype=np.int32)

for i in xrange(non_zero):
col_ind = X_indices[i]
means[col_ind] += X_data[i]
x_i = X_data[i]
if isnan(x_i) and ignore_nan:
Member

put the faster ignore_nan condition first, please.

return n_samples_feat


def n_samples_count_csr(np.ndarray[floating, ndim=1, mode="c"] X_data,
Member

I don't think we should need these helpers. I don't think we should need cython to count the number of NaNs in a column of a sparse matrix. Something like:

X = copy.copy(X)
X.data = np.isnan(X)
return X.sum(axis=0)

should suffice.

Contributor Author

@jnothman I was working on sparse matrices, but np.isnan(X) does not work for sparse matrices. I would need to convert X into a dense numpy array using X.toarray() and then use np.isnan(X), or else make the NaN count happen inside the for loops.

I believe counting in the for loops is better, because these are sparse matrices, and converting them to dense numpy arrays and then counting would probably make the computation more expensive.


return updated_mean, updated_var, updated_n_feat

# Where updated_n is a scalar
else:
updated_n = last_n + new_n
Member

I am proposing to simplify the code in this case using vectorized numpy functions in #10615. You should use vectorized functions for the above case (or even share the implementation in both cases) too.

Contributor Author

I agree, there are a lot of places where I should use vectorized functions. Dealing with the sparse-matrix path and incremental_mean_and_var together was getting messed up with various kinds of errors, so I created a separate pull request to check that incremental_mean_and_var works without failing on numerical stability.

The basic idea/design is:

  • if there is a NaN, then avoid/ignore it and calculate the mean and var (using the formula present in the code)

While doing this, various cases come up, such as:

  • if the entire input matrix is NaN, then new_sample_count will be zero.
    If last_sample_count is also zero, then updated_sample_count will also be zero.

Cases like this were not possible before. These situations led me to use for loops a bit too much, but I shall replace them with vectorized functions.

In my last commit I am also getting a numerical-stability error, hence the separate pull request: deal with one problem at a time and see that each piece works perfectly.

I am planning to deal with incremental_mean_and_var and the sparse-matrix path separately, and then combine them to get the final solution.
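For the dense case, a minimal sketch of that design using vectorized numpy operations and array-valued counts (illustrative only, not the code in this PR):

import numpy as np

def nan_incremental_mean_and_var(X, last_mean, last_variance, last_count):
    # per-feature number of non-NaN samples in this batch
    new_count = np.count_nonzero(~np.isnan(X), axis=0)
    new_sum = np.nansum(X, axis=0)
    updated_count = last_count + new_count

    # all-NaN columns have count 0; store their mean/var as 0, as
    # discussed earlier in this thread
    safe_count = np.maximum(updated_count, 1)
    updated_mean = (last_count * last_mean + new_sum) / safe_count

    new_var = np.nan_to_num(np.nanvar(X, axis=0))  # all-NaN column -> 0
    # merge the two unnormalized variances (pairwise update)
    batch_mean = new_sum / np.maximum(new_count, 1)
    delta = batch_mean - last_mean
    cross = delta ** 2 * last_count * new_count / safe_count
    updated_variance = (last_variance * last_count +
                        new_var * new_count + cross) / safe_count
    return updated_mean, updated_variance, updated_count

X = np.array([[1., np.nan], [2., 5.], [3., np.nan]])
print(nan_incremental_mean_and_var(X, np.zeros(2), np.zeros(2), np.zeros(2)))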

@pinakinathc
Contributor Author

@jnothman Can you tell me why, when I build the repo using make, I get no errors, and the only way to check whether it is error-free is to push it here and see if it gets a green tick?

In short, is there any way to run tests like the ones Travis and AppVeyor run without pushing here?
Building the repo using make and running pytest is probably not enough.

final_means, final_variances, final_count = \
_incremental_mean_and_var(X2, X_means, X_variances, X_count)
assert_allclose(A_means, final_means, equal_nan=True)
print A_variances
Contributor

This doesn't work in python3, I guess.

@gxyd
Contributor

gxyd commented Feb 11, 2018 via email

@jnothman
Member

Yes, make test will test in your environment, but not in the whole range of environments we support.

@jnothman
Member

jnothman commented Feb 17, 2018 via email

@jnothman
Member

jnothman commented Feb 18, 2018 via email

@pinakinathc
Contributor Author

Actually, I need to count the number of NaN values present in the sparse matrix.
One way is to count the NaNs using a for-loop. I was trying to use some functions instead of a for-loop, hoping that would be faster.

There is an alternative way too.

csr_mean_variance_axis0 and csc_mean_variance_axis0 already add element by element when calculating the mean for each feature j.
I added a NaN-checking condition there to calculate the mean properly (by ignoring NaNs). Hence, the number of non-NaN elements in each column is already counted in those functions.
What if I return the already-counted n_samples from csr_mean_variance_axis0 and csc_mean_variance_axis0?
That way, we need not put in extra effort to calculate n_samples for each column.

@jnothman
Member

Counting number of NaNs, assuming a CSR matrix:

  • overall count: np.isnan(X.data).sum()
  • count over axis: csr_matrix((np.isnan(X.data), X.indices, X.indptr)).sum(axis=...)

But yes, I would welcome adjusting the sample count in csr_mean_variance_axis0 directly. I wouldn't modify the current return value, if only for backwards compatibility in case external libraries use this. Instead you could have an array of n_samples passed in and let the function modify it.
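A runnable illustration of that counting (assuming a CSR matrix whose NaNs are explicitly stored):

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1., np.nan],
                         [np.nan, np.nan],
                         [3., 4.]]))
print(np.isnan(X.data).sum())  # overall count: 3
mask = csr_matrix((np.isnan(X.data), X.indices, X.indptr), shape=X.shape)
print(mask.sum(axis=0))        # per-column counts: [[1 2]]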
