[WIP] Disregard nan fix #10457
Conversation
I'm curious what you mean by "with increment".
Sorry, there are a lot of errors; I am fixing them.
@jnothman By "increment" I meant increment_mean_variance, i.e. incrementing with respect to old_mean and old_variance.
To support partial_fit? I will admit this is a bigger task than I realised for someone without more experience, and maybe starting with MinMaxScaler makes more sense...?
@jnothman Yes. A sparse matrix in partial_fit is evaluated by a lot of functions, so what I did in the previous commit (though it has errors, which I am fixing now) is create similar modules, but with the capability of processing NaNs. The good thing is that only the sparse-matrix case is a bit difficult, as I cannot use the NaN-aware NumPy functions there. Also, let me try a bit more on this one (I love challenges), and then I shall try MinMaxScaler.
@jnothman @glemaitre and others, can you please review the code and give me some insights regarding this?
sklearn/preprocessing/data.py
Outdated
@@ -619,8 +632,12 @@ def partial_fit(self, X, y=None):
            Ignored
        """
        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                        force_all_finite=False,
You can put in your TODO to change this when there is an agreement on #10459.
Before a specific review, you can check which parts of the code are not covered and try to cover them.
As a general comment, I would try to factorize your code as much as possible. I don't think that we need specific functions; rather, propagate an argument so the common part of the code is reused by each function.
    scaler = StandardScaler(with_mean=False)
    scaler.partial_fit(X_csr)
    assert_almost_equal(scaler.mean_, 1.0)
I would try to have a test with only NaN values in X, or at least a complete column of NaN.
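A hedged sketch of such a test under the behaviour this PR proposes (the expected statistics for the all-NaN column are left open; that design question comes up again below):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# one complete column of NaN alongside ordinary data
X = np.ones((4, 3))
X[:, 0] = np.nan

scaler = StandardScaler(with_mean=False)
scaler.partial_fit(sparse.csr_matrix(X))
# the all-NaN column has no observed samples; whether mean_[0] should be
# 0.0 or NaN is exactly the open question discussed later in this thread
```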
sklearn/utils/extmath.py
Outdated
@@ -715,6 +715,41 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
    return updated_mean, updated_variance, updated_sample_count


def _incremental_mean_and_var_nan(X, last_mean=.0, last_variance=None,
I would factorize _incremental_mean_and_var_nan and _incremental_mean_and_var to avoid code duplication. I could think of a new argument in _incremental_mean_and_var (e.g. allow_nan=False) and modifying a single line:

if allow_nan:
    var_func = np.nanvar
else:
    var_func = np.var
...
new_unnormalized_variance = var_func(X, axis=0) * new_sample_count
sklearn/utils/sparsefuncs_fast.pyx
Outdated
    """This is identical to incr_mean_variance_axis0
    but can process NaN values
    """
    if X.dtype != np.float32:
@jnothman do we convert here because we are scared about overflow, or is this unnecessary?
sklearn/utils/sparsefuncs_fast.pyx
Outdated
@@ -126,6 +135,58 @@ def _csr_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
    return means, variances


def _csr_mean_variance_axis0_nan(np.ndarray[floating, ndim=1, mode="c"] X_data,
Where do you handle NaN?
Thanks @glemaitre, I'll start working on these. And yes, there is a lot of code duplication; I'll pass an argument instead.
@jnothman @glemaitre please check this. As @glemaitre pointed out about check_array() -> force_all_finite, I am waiting for #10459 to get merged.
sklearn/preprocessing/data.py
Outdated
@@ -28,8 +28,9 @@
from ..utils.sparsefuncs_fast import (inplace_csr_row_normalize_l1,
                                      inplace_csr_row_normalize_l2)
from ..utils.sparsefuncs import (inplace_column_scale,
-                                mean_variance_axis, incr_mean_variance_axis,
-                                min_max_axis)
+                                mean_variance_axis,
Revert those changes. We tend to change only the code which is necessary (even if some parts of the code are not PEP8).
sklearn/preprocessing/data.py
Outdated
@@ -581,7 +582,7 @@ def _reset(self):
        del self.mean_
        del self.var_

-    def fit(self, X, y=None):
+    def fit(self, X, y=None, accepting_nan=True):
I don't think that we allow changing the signature of fit. In #10437, I added an argument missing_values to the class to accept NaN but also integers. However, since this transformer returns a floating X, replacing the integers by NaN after converting the input should lead to the same underlying processing. I think that you should go in that direction.
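A minimal sketch of that direction (the missing_values parameter is borrowed from #10437; the helper name here is hypothetical):

```python
import numpy as np
from sklearn.utils import check_array

def _to_nan_missing(X, missing_values='NaN'):
    """Cast to float, then map an integer missing marker to NaN."""
    X = check_array(X, dtype=np.float64, force_all_finite=False)
    if missing_values != 'NaN':
        # after this, downstream code only has to handle the NaN case
        X[X == missing_values] = np.nan
    return X
```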
@glemaitre should I add two class arguments, missing_values and ignore_nan, or should I add only one of them?
@glemaitre, your comment was unclear. Are you suggesting that we support non-NaN missing values as input here, or not?
I think for preprocessing that outputs floats, we might as well just support NaN missing values, at least for now.
I also don't think we need an ignore_nan param. I consider it fairly safe to have NaN supported by default. Others might disagree, though.
@jnothman if we do not use ignore_nan, then any preprocessor class using functions like those in extmath.py or csr/csc_mean_var_axis0 will automatically ignore NaN. So as of now I am not using ignore_nan, but if others want ignore_nan at a later stage, I would add it.
> @glemaitre, your comment was unclear. Are you suggesting that we support non-nan missing values as input here, or not?

Sorry, that was 8 days ago, before the discussion in the issue. I agree with you and @ogrisel to go for NaN support only for the moment. So we don't need any keyword for the class.

> I also don't think we need an ignore_nan param.

@jnothman _incremental_mean_and_var is also used in incremental PCA. There we don't want to deal with NaN, so we will need the parameter, I think, won't we?
> _incremental_mean_and_var is also used in incremental PCA. There we don't want to deal with nan, therefore we will need the parameter I think, isn't it?

Actually, I think you are right that the computation ignoring NaN will also work in the general case, and we don't really have a check at that point. My only concern would be a performance regression, since we would check for NaN values knowing that there are none, due to the earlier call to check_array. Would it have a significant impact?
sklearn/preprocessing/data.py
Outdated
self.mean_, self.var_ = mean_variance_axis(X, axis=0) | ||
self.mean_, self.var_ = \ | ||
mean_variance_axis(X, axis=0, | ||
accepting_nan=accepting_nan) |
I would rename accepting_nan to ignore_nan for those Cython functions. In fact, they internally ignore NaN.
sklearn/preprocessing/data.py
Outdated
self.n_samples_seen_ = X.shape[0] | ||
# Next passes | ||
else: | ||
self.mean_, self.var_, self.n_samples_seen_ = \ | ||
incr_mean_variance_axis(X, axis=0, | ||
last_mean=self.mean_, | ||
last_var=self.var_, | ||
last_n=self.n_samples_seen_) | ||
last_n=self.n_samples_seen_, | ||
accepting_nan=accepting_nan) |
accepting_nan -> ignore_nan
sklearn/preprocessing/data.py
Outdated
@@ -592,13 +593,29 @@ def fit(self, X, y=None):

        y
            Ignored

        accepting_nan : boolean
Remove this.
sklearn/utils/sparsefuncs_fast.pyx
Outdated
    for j in xrange(startptr, endptr):
        means[i] += X_data[j]
    means[i] /= n_samples
    if X_data[j] != X_data[j] and accepting_nan == True:
The same as before
sklearn/utils/sparsefuncs_fast.pyx
Outdated
@@ -244,7 +257,7 @@ def incr_mean_variance_axis0(X, last_mean, last_var, unsigned long last_n):
    if X.dtype != np.float32:
        X = X.astype(np.float64)
    return _incr_mean_variance_axis0(X.data, X.shape, X.indices, X.indptr,
-                                    X.format, last_mean, last_var, last_n)
+                                    X.format, last_mean, last_var, last_n, accepting_nan)
ignore_nan
sklearn/utils/sparsefuncs_fast.pyx
Outdated
@@ -254,7 +267,7 @@ def _incr_mean_variance_axis0(np.ndarray[floating, ndim=1] X_data,
                              X_format,
                              last_mean,
                              last_var,
-                             unsigned long last_n):
+                             unsigned long last_n, accepting_nan=False):
ignore_nan
Missing docstring
sklearn/utils/sparsefuncs_fast.pyx
Outdated
@@ -288,11 +301,11 @@ def _incr_mean_variance_axis0(np.ndarray[floating, ndim=1] X_data,

    if X_format == 'csr':
        # X is a CSR matrix
-        new_mean, new_var = _csr_mean_variance_axis0(X_data, shape, X_indices)
+        new_mean, new_var = _csr_mean_variance_axis0(X_data, shape, X_indices, accepting_nan)
ignore_nan
sklearn/utils/sparsefuncs_fast.pyx
Outdated
else: | ||
# X is a CSC matrix | ||
new_mean, new_var = _csc_mean_variance_axis0(X_data, shape, X_indices, | ||
X_indptr) | ||
X_indptr, accepting_nan) |
ignore_nan
So after the discussion in #10404, you don't need to introduce a keyword (either in ...).
You need to add tests to test_sparsefuncs and test_extmath.
sklearn/preprocessing/data.py
Outdated
warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES) | ||
|
||
# Accepting nan values but not infinite values like @glemaitre |
Please merge in master and use check_array with force_all_finite='allow-nan'
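A minimal sketch of what the call from the diff above would then look like (force_all_finite='allow-nan' is the option introduced by #10459):

```python
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES,
                force_all_finite='allow-nan')  # NaN passes, inf still rejected
```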
sklearn/utils/estimator_checks.py
Outdated
@@ -71,6 +71,8 @@
                     'RANSACRegressor', 'RadiusNeighborsRegressor',
                     'RandomForestRegressor', 'Ridge', 'RidgeCV']

# @glemaitre did the same thing, so basically things would be together
Please don't add comments that are relevant to the PR but would not be relevant when merged. If you want to add comments about the PR, rather than about the code in its merged state, please do so in Github, not in the code.
sklearn/utils/extmath.py
Outdated
if accepting_nan is True: | ||
new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count | ||
else: | ||
new_unnormalized_variance = X.var(axis=0) * new_sample_count |
The new sample count also needs updating, and will be different for different features.
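A sketch of a per-feature update for dense X (the helper name is hypothetical; variable names follow _incremental_mean_and_var):

```python
import numpy as np

def _nan_sample_update(X, last_sample_count):
    # with NaNs ignored, each feature may see a different number of
    # samples, so the count becomes a vector rather than a scalar
    new_sample_count = np.sum(~np.isnan(X), axis=0)
    updated_sample_count = last_sample_count + new_sample_count
    # note: an all-NaN column yields NaN here; the thread below discusses
    # handling that case separately via a mask
    new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
    return updated_sample_count, new_unnormalized_variance
```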
Given that tests are currently passing, I think we can safely say that the tests are not correct or are not strong enough yet.
sklearn/utils/sparsefuncs.py
Outdated
    else:
        return _incr_mean_var_axis0(X.T, last_mean=last_mean,
                                    last_var=last_var, last_n=last_n,
This is tricky, because last_n can now be different for each feature.
@jnothman @glemaitre what should be the mean of a column where all the elements are NaN? But in other cases, suppose there is no increment and we are simply finding the ...
We can have ignore_nan in utils without having a similar parameter in the preprocessing estimators.
I'm not sure what the solution to that issue is, but perhaps start with just storing it as 0, because if you also have to store the number of elements that comprise each feature's mean, you can determine that it's actually NaN...
I started taking it as zero and later using a **mask** to convert the final values to NaN in the StandardScaler.
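Roughly, that mask step could look like this sketch (the helper and its signature are hypothetical):

```python
import numpy as np

def _mask_unseen(mean, var, sample_count):
    """Expose NaN for features that never saw a non-NaN value."""
    mask = sample_count == 0   # such features were accumulated as 0
    mean[mask] = np.nan
    var[mask] = np.nan
    return mean, var
```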
(force-pushed f77fcc9 to d0b7985)
@jnothman @glemaitre it failed in AppVeyor but passed in Travis. I went to the link but could not understand the error. Can you please help me understand it?
It seems to be implying that isnan is not available in libc on Windows. Until recent versions of MSVC, it seems only the _isnan macro from float.h was available. See https://msdn.microsoft.com/en-us/library/tzthab44.aspx
I think you should be able to use something like from numpy.math cimport npy_isnan instead. See https://github.com/cython/cython/blob/master/Cython/Includes/numpy/math.pxd: "This exports the functionality of the NumPy core math library, aka npymath, which provides implementations of C99 math functions and macros for system with a C89 library (such as MSVC)"
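For instance, a minimal sketch of an MSVC-safe NaN check in a .pyx file (the function itself is hypothetical):

```cython
cimport numpy as np
from numpy.math cimport npy_isnan   # from npymath, so it also works on MSVC
from cython cimport floating

def count_nan(np.ndarray[floating, ndim=1] data):
    """Count NaN entries without relying on libc's isnan."""
    cdef Py_ssize_t i
    cdef long n = 0
    for i in range(data.shape[0]):
        if npy_isnan(data[i]):
            n += 1
    return n
```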
What are you trying to do in your last commit?
I was trying to make sure that _incremental_mean_and_var ignores NaN. For this I need to use n_samples_seen_ as an array, as each feature may have a different value. There are a lot of test cases which use it as a scalar, and this is creating conflicts.
I think that you messed up some rebasing or something like that. There are changes all over the place. Could you clean up?
Consider reverting to a good commit with git reset --hard, then cherry-picking any good changes that came since. Please use git fetch upstream master; git merge FETCH_HEAD or similar to pull in changes from master without breaking stuff, where upstream is this repository's URL or remote name.
(force-pushed dad5b12 to 430231c)
This reverts commit 5b5f72d.
Hi @pinakinathc,
Please let us know when this is ready for a proper review. I am a little worried that you are adding unnecessary complexity, and would appreciate it if you discussed some of your design.
    new_unnormalized_variance[np.isnan(new_unnormalized_variance)] = 0
    new_unnormalized_variance = (new_unnormalized_variance *
                                 new_sample_count)
    for i in xrange(n_features):
I suspect you can do everything here with vectorized numpy operations and should avoid an explicit loop over each feature.
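For instance, a per-feature loop like the one above can usually collapse into masked array expressions; a sketch under the assumption that new_sample_count is a per-feature array:

```python
import numpy as np

# toy per-feature quantities standing in for the loop's inputs
new_unnormalized_variance = np.array([1.5, np.nan, 0.0])
new_sample_count = np.array([3, 0, 2])

# same effect as looping over features: zero out NaN variances, weight by
# the per-feature counts, and guard the division where the count is zero
new_unnormalized_variance = np.nan_to_num(new_unnormalized_variance)
new_unnormalized_variance *= new_sample_count
nonzero = new_sample_count > 0
variances = np.zeros_like(new_unnormalized_variance, dtype=float)
variances[nonzero] = (new_unnormalized_variance[nonzero] /
                      new_sample_count[nonzero])
```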
@@ -688,29 +688,55 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
    # old = stats until now
    # new = the current increment
    # updated = the aggregated stats
    flag = 0  # if flag == 1 then last_sample_count was an array
this is a bad variable name
    # counts[j] contains the number of samples where feature j is non-zero
    cdef np.ndarray[int, ndim=1] counts = np.zeros(n_features,
                                                   dtype=np.int32)

    for i in xrange(non_zero):
        col_ind = X_indices[i]
        means[col_ind] += X_data[i]
        x_i = X_data[i]
        if isnan(x_i) and ignore_nan:
put the faster ignore_nan condition first, please.
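I.e., a one-line sketch of the reordering, so the cheap flag short-circuits before the isnan call:

```cython
if ignore_nan and isnan(x_i):
```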
    return n_samples_feat


def n_samples_count_csr(np.ndarray[floating, ndim=1, mode="c"] X_data,
I don't think we should need these helpers. I don't think we should need Cython to count the number of NaNs in a column of a sparse matrix. Something like:

X = copy.copy(X)
X.data = np.isnan(X)
return X.sum(axis=0)

should suffice.
@jnothman I was working on sparse matrices, but np.isnan(X) does not work for sparse matrices. I need to convert X into a dense numpy array using X.toarray() and then use np.isnan(X), or else make the count of NaNs happen in the for loops.
I believe counting in the for loops is better, because these are sparse matrices, and converting them to a dense numpy array and then counting would probably make the computation more expensive.
    return updated_mean, updated_var, updated_n_feat

    # Where updated_n is a scalar
    else:
        updated_n = last_n + new_n
I am proposing to simplify the code in this case using vectorized numpy functions in #10615. You should use vectorized functions for the above case (or even share the implementation in both cases) too.
I agree, there are a lot of places where I should use vectorized functions. Dealing with sparse matrices and incremental_mean_and_var was getting messed up with various kinds of errors, so I created a separate pull request to check whether incremental_mean_and_var* works without failing on numerical stability.
The basic idea/design is:
- if there is a NaN, then avoid/ignore it and calculate the mean and var (using the formula present in the code)
While doing this, various cases come up, like:
- if the entire input matrix is NaN, then new_sample_count will be zero. Consider that last_sample_count is also zero; then updated_sample_count will also be zero.
Cases like this were not possible before. Situations like this led me to use for loops a bit too much, but I shall change them to use vectorized functions.
In my last commit I am also getting an error in numerical stability, hence the separate pull request, to deal with one problem at a time and see whether it works perfectly.
I am planning to deal with incremental_mean_and_var and sparse matrices separately, and then combine them to get the final solution.
@jnothman can you tell me why, when I build the repo using make, I get no errors, and the only way to check whether the code is error-free is to push it here and see if it gets a green tick? In short, is there any way to run tests like the ones Travis and AppVeyor run without pushing here?
    final_means, final_variances, final_count = \
        _incremental_mean_and_var(X2, X_means, X_variances, X_count)
    assert_allclose(A_means, final_means, equal_nan=True)
    print A_variances
This doesn't work in Python 3, I guess. Probably you are testing locally with Python 2, where it works just fine. But I think (at least) some errors are because it does not work with Python 3.
Yes.
Never do .toarray() with data provided by the user, or yes, it defeats the purpose of sparse data. You can do `X._with_data(np.isnan(X.data))` if you must.
Sorry, we shouldn't use _with_data, but should replicate what it does: reconstruct the sparse matrix with the data changed.
Actually, I need to count the number of NaN values present in the sparse matrix. There can be an alternative way too.
Counting number of NaNs, assuming a CSR matrix:
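The snippet itself did not survive extraction; a plausible reconstruction, following the discussion above (building a new CSR matrix from data/indices/indptr instead of calling the private _with_data):

```python
import numpy as np
import scipy.sparse as sp

def count_nan_per_column(X):
    """Count NaNs in each column of a CSR matrix without densifying it."""
    mask = sp.csr_matrix((np.isnan(X.data), X.indices, X.indptr),
                         shape=X.shape)
    return np.asarray(mask.sum(axis=0)).ravel()
```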
But yes, I would welcome adjusting the samples count in ...
Reference Issues/PRs
Partially Addressed: #10404
What does this implement/fix? Explain your changes.
Any other comments?
Now, various functions which previously took last_n to get the previous number of samples while computing the updated mean and variance take an additional argument: last_n_feat.
last_n is a scalar. It is not optional.
last_n_feat is optional and takes a vector. It is useful when different features have different counts because some of them had NaN values which needed to be ignored.
Instruction to use last_n_feat instead of last_n:
function_name(................., last_n=0, last_n_feat=self.n_samples_seen_)
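As a concrete sketch of that calling convention (reusing the incr_mean_variance_axis call from StandardScaler.partial_fit shown earlier; illustrative, not final):

```python
# the scalar last_n becomes a placeholder; the per-feature vector is
# passed via last_n_feat once n_samples_seen_ is stored as an array
self.mean_, self.var_, self.n_samples_seen_ = \
    incr_mean_variance_axis(X, axis=0,
                            last_mean=self.mean_,
                            last_var=self.var_,
                            last_n=0,
                            last_n_feat=self.n_samples_seen_)
```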