[MRG + 1] Fix preprocessing.scale for arrays with near zero ratio variance/max #3747
Conversation
Travis is not happy.
@maximsch2: For now, my version gives:

    print preprocessing.scale(np.arange(10) * 1e100)
    print preprocessing.scale(np.arange(10) * 1e-100)
This modification gives the alternative result (but still doesn't pass the Travis tests), i.e.:

This version verifies
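For context, a minimal sketch (not from the thread) of the numerical problem behind #3722 that these snippets probe: for a constant feature, floating-point rounding in the mean can leave tiny residuals after centering, and the near-zero std then amplifies them:

    import numpy as np

    X = np.zeros(22) + np.log(1e-5)  # constant array from the original report
    centered = X - X.mean()          # rounding may leave residuals ~1e-15
    std = centered.std()             # tiny but possibly nonzero
    print(centered / std)            # may print O(1) values instead of zeros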
@@ -129,13 +129,15 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
    else:
        X = np.asarray(X)
    warn_if_not_float(X, estimator='The scale function')
    preScale = abs(X).max()+1.*(abs(X).max() == 0.)
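As a sketch of the idea in this line, with the names from the diff: the data is divided by its largest absolute value before the statistics are computed, and the (abs(X).max() == 0.) term bumps the divisor to 1 for an all-zero array to avoid dividing by zero:

    import numpy as np

    X = np.arange(10.) * 1e-100
    pre_scale = abs(X).max() + 1. * (abs(X).max() == 0.)  # 1 if X is all zeros
    X_normalized = X / pre_scale  # entries now lie in [-1, 1]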
Variable names are never in CamelCase, so "pre_scale".
abs(X) -> np.abs(X)
And run the pep8 checker (spaces are missing around +).
Computing abs(X).max() twice should be avoided.
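A minimal rewrite along those lines, computing the maximum once:

    max_abs = np.abs(X).max()  # computed a single time
    pre_scale = max_abs + 1. * (max_abs == 0.)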
I don't understand why the Travis test fails, since the only change is abs() -> np.abs().
Look at the Travis log. The error is unrelated.
    mean_, std_ = _mean_and_std(
-       X, axis, with_mean=with_mean, with_std=with_std)
+       X/pre_scale, axis, with_mean=with_mean, with_std=with_std)
spaces missing around / operator.
What I don't like here is that this will make an extra copy of the data in memory. That can be a problem for large datasets.
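One way to avoid the temporary, sketched under the assumption that X may be modified at this point (e.g. because the copy=True branch has already made its own copy):

    X /= pre_scale  # in-place division allocates no extra array
    mean_, std_ = _mean_and_std(
        X, axis, with_mean=with_mean, with_std=with_std)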
    # concerned feature is efficient, for instance by its mean or
    # maximum.
    if isinstance(mean_1, np.ndarray):
        if not all([isclose(a, 0.) for a in mean_1]):
There exists np.allclose.
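i.e. the comprehension can become one vectorized call:

    # replaces all([isclose(a, 0.) for a in mean_1])
    near_zero = np.allclose(mean_1, 0.)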
    # maximum.
    if isinstance(mean_1, np.ndarray):
        if not np.allclose(mean_1, np.zeros(np.shape(mean_1)[0])):
            raise ValueError(
Should we fail? We could also warn.
Yes, but in this case we fail to return a centered result. The obtained X_r is wrong, and in my opinion it is then useless to keep running the program... isn't it?
    # concerned feature is efficient, for instance by its mean or
    # maximum.
    if isinstance(mean_1, np.ndarray):
        if not np.allclose(mean_1, np.zeros(np.shape(mean_1)[0])):
You can just use np.allclose(mean_1, 0).
And I don't think you need the if at all.
OK, I'll change the np.allclose. The if is needed if we don't want to change X_r by a useless subtraction (Xr -= mean_2 with a mean_2 very close to 0). But maybe the cost of testing np.allclose is greater than just subtracting; is that what you meant?
I don't see why you want to
Yes, we should warn in the second case. But in the first case, do you think we should keep running the program even if the intermediate result is wrong?
The main drawback of this PR, for me, is that it fixes bugs on very particular data (does such data exist in the real world? Can't the user notice it if one sample is very large, or if all the samples are almost equal?) and pays for it in complexity (by testing in all cases, or otherwise by subtracting a second time in all cases). Is it worth the price?
Thanks for the changes. Travis is failing, though.
    # maximum.
    if not np.allclose(mean_1, 0):
        raise ValueError(
            "Centering failed. Dataset may contain too"
There are spaces missing at both ends of the string lines.
I think it is worth computing the statistics again to make sure. If you add a warning that there were numerical issues in the case where you subtract the mean again, I think the fix is good.
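A sketch of that fix, using names from the diffs in this thread (Xr is the array being scaled in place; the data and warning text are illustrative):

    import warnings
    import numpy as np

    Xr = np.ones(10) * 1e100     # pathological data from the discussion
    mean_ = Xr.mean(axis=0)
    Xr -= mean_                  # first centering pass
    mean_1 = Xr.mean(axis=0)     # residual mean left by floating-point error
    if not np.allclose(mean_1, 0):
        warnings.warn("Numerical issues were encountered when centering the "
                      "data. Dataset may contain too large values.")
        Xr -= mean_1             # subtract the residual mean a second time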
The tests are currently failing...
Yes, I don't understand why. I suspect it is related to np.allclose, but I don't see where.
The problem comes from test_warning_scaling_integers when checking for a warning on uint8 data. In this case, scale() is unable to standardize the data (since 0 - 1 = 255 in uint8). Unlike the previous version, this version raises an error when it cannot scale the data correctly, so test_warning_scaling_integers does not see the warning, I think.
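The uint8 wraparound mentioned here, in isolation:

    import numpy as np

    X = np.zeros(2, dtype=np.uint8)
    X -= np.uint8(1)  # unsigned arithmetic wraps around: 0 - 1 = 255
    print(X)          # [255 255], so centering in the original dtype cannot work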
Ok. So you warn now? That is probably better given the previous behavior. Can you please rebase?
…0' (scikit-learn#3722)

- np.zeros(8) + np.log(1e-5)
- np.zeros(22) + np.log(1e-5)
- np.zeros(10) + 1e100
- np.ones(10) * 1e-100

In function scale():
- after a convenient normalization by 1/(max(abs(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725)
- a small change in the code to satisfy pep8 requirements

Now scale(np.arange(10.)*1e-100) = scale(np.arange(10.))
Remove isclose, which is now unnecessary
New test for extreme (1e100 and 1e-100) scaling
Modification on Xr instead of X
abs -> np.abs, pep8 checker, preScale -> pre_scale
max -> np.max()
Abandon the prescale method (problematic extra copy of the data)
- ValueError in the case of too-large values in X which yield (X - X.mean()).mean() > 0
- but still cover the case of std() close to 0, by subtracting the (new) mean again after scaling if needed
- isclose -> np.isclose
- all([isclose()]) -> np.allclose
np.isclose -> isclose to avoid a bug
np.allclose warning
Does scaling even make sense in the case of unsigned integers? Maybe it's better to just throw an exception in that case?

Currently we are warning. We could also convert to float64. I'd rather handle it than throw an exception.
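The float64 route would look roughly like this (a sketch, not the merged behavior):

    import numpy as np

    X = np.array([0, 1, 2], dtype=np.uint8)
    if X.dtype.kind in 'iu':       # signed or unsigned integer input
        X = X.astype(np.float64)   # centering is now safe, at the cost of a copy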
Could you also please wrap the lines that issue a warning with
added attempt to resolve when warning and corresponding tests
Ok, done. Tell me if it would be better to create a new test function instead.
@@ -100,6 +100,27 @@ def test_scaler_1d():
    X = np.ones(5)
    assert_array_equal(scale(X, with_mean=False), X)

    X = np.zeros(8) + np.log(1e-5)
I don't understand this number. Isn't this just a fancy way of saying -11? That is not what you intended, right?
Apart from the test adding -11, LGTM.

Yes, I took np.log(1e-5) and not -11 because of the size/precision of this float, in view of the original bug #3722.

Fair enough, let's keep it then.

@ogrisel review?
It's not trivial to understand when just reading the source code. We should add an inline comment to explain what's special about this value. |
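For instance, a hypothetical version of such a comment:

    # np.log(1e-5) is used rather than a round number like -11 because its
    # full float64 mantissa makes the constant-feature case numerically
    # harder, matching the original report in scikit-learn#3722
    X = np.zeros(8) + np.log(1e-5)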
Please put the new checks for numerical stability regressions in a new test function named "test_standard_scaler_numerical_stability" and make sure that all the warnings are checked. At the moment I get the following uncaught warnings:
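A sketch of the requested test, assuming the project's assert_warns_message helper; the exact warning strings are guesses based on the messages appearing in the diffs above:

    import numpy as np
    from sklearn.preprocessing import scale
    from sklearn.utils.testing import (assert_array_almost_equal,
                                       assert_warns_message)

    def test_standard_scaler_numerical_stability():
        # a constant feature should scale to exactly zero, with a warning
        # when its near-zero std triggers numerical problems
        x = np.zeros(10, dtype=np.float64) + np.log(1e-5)
        w = "standard deviation of the data is probably very close to 0"
        x_scaled = assert_warns_message(UserWarning, w, scale, x)
        assert_array_almost_equal(x_scaled, np.zeros(10))

        # very large values should warn about possible centering issues
        x_big = np.ones(10, dtype=np.float64) * 1e100
        w = "Dataset may contain too large values"
        x_big_scaled = assert_warns_message(UserWarning, w, scale, x_big)
        assert_array_almost_equal(x_big_scaled, np.zeros(10))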
    # concerned feature is efficient, for instance by its mean or
    # maximum.
    if not np.allclose(mean_1, 0):
        warnings.warn("Numerical issues were encountered "
"were encountered when centering the data"
test_standard_scaler_numerical_stability
Like this? I don't really understand why clean_warning_registry() is needed, but I put it in just in case.
No, I meant just using
Merged via #4436.
Fixes #3722, following up on #3725.