[MRG + 1] Fix preprocessing.scale for arrays with near zero ratio variance/max #3747
Conversation
Travis is not happy.
@maximsch2: For now, my version gives:

    print preprocessing.scale(np.arange(10) * 1e100)
    print preprocessing.scale(np.arange(10) * 1e-100)
This modification gives the alternative result (but still doesn't pass the Travis tests), i.e.:

This version verifies
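For context, a minimal sketch (not from the thread) of the numerical problem behind #3722 that these snippets probe: for a constant feature, floating-point rounding in the mean can leave tiny residuals after centering, and the near-zero std then amplifies them:

    import numpy as np

    X = np.zeros(22) + np.log(1e-5)  # constant array from the original report
    centered = X - X.mean()          # rounding may leave residuals ~1e-15
    std = centered.std()             # tiny but possibly nonzero
    print(centered / std)            # may print O(1) values instead of zeros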
@@ -129,13 +129,15 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
    else:
        X = np.asarray(X)
    warn_if_not_float(X, estimator='The scale function')
    preScale = abs(X).max()+1.*(abs(X).max() == 0.)
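As a sketch of the idea in this line, with the names from the diff: the data is divided by its largest absolute value before the statistics are computed, and the (abs(X).max() == 0.) term bumps the divisor to 1 for an all-zero array to avoid dividing by zero:

    import numpy as np

    X = np.arange(10.) * 1e-100
    pre_scale = abs(X).max() + 1. * (abs(X).max() == 0.)  # 1 if X is all zeros
    X_normalized = X / pre_scale  # entries now lie in [-1, 1]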
Variable names are never in CamelCase, so "pre_scale".
abs(X) -> np.abs(X)
And run the pep8 checker (spaces are missing around +).
Computing abs(X).max() twice should be avoided.
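A minimal rewrite along those lines, computing the maximum once:

    max_abs = np.abs(X).max()  # computed a single time
    pre_scale = max_abs + 1. * (max_abs == 0.)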
I don't understand why the Travis test fails, since the only change is abs() -> np.abs().
Look at the Travis log. The error is unrelated.
    mean_, std_ = _mean_and_std(
-       X, axis, with_mean=with_mean, with_std=with_std)
+       X/pre_scale, axis, with_mean=with_mean, with_std=with_std)
spaces missing around / operator.
What I don't like here is that this will make an extra copy of the data in memory. That can be a problem for large datasets.
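One way to avoid the temporary, sketched under the assumption that X may be modified at this point (e.g. because the copy=True branch has already made its own copy):

    X /= pre_scale  # in-place division allocates no extra array
    mean_, std_ = _mean_and_std(
        X, axis, with_mean=with_mean, with_std=with_std)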
    # concerned feature is efficient, for instance by its mean or
    # maximum.
    if isinstance(mean_1, np.ndarray):
        if not all([isclose(a, 0.) for a in mean_1]):
There exists np.allclose.
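i.e. the comprehension can become one vectorized call:

    # replaces all([isclose(a, 0.) for a in mean_1])
    near_zero = np.allclose(mean_1, 0.)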
    # maximum.
    if isinstance(mean_1, np.ndarray):
        if not np.allclose(mean_1, np.zeros(np.shape(mean_1)[0])):
            raise ValueError(
Should we fail? We could also warn.
Yes, but in this case we fail to return a centered result. The obtained X_r is wrong, and in my opinion it is then useless to keep running the program... isn't it?
    # concerned feature is efficient, for instance by its mean or
    # maximum.
    if isinstance(mean_1, np.ndarray):
        if not np.allclose(mean_1, np.zeros(np.shape(mean_1)[0])):
You can just use np.allclose(mean_1, 0).
And I don't think you need the if at all.
OK, I'll change the np.allclose. The if is needed if we don't want to change X_r by a useless subtraction (Xr -= mean_2 with a mean_2 very close to 0). But maybe the cost of testing np.allclose is greater than just subtracting; is that what you meant?
I don't see why you want to
Yes, we should warn in the second case. But in the first case, do you think we should keep running the program even if the intermediate result is wrong?
The main drawback of this PR, for me, is that it fixes bugs on very particular data (does such data exist in the real world? Can't the user notice it if one sample is very large, or if all the samples are almost equal?) and pays for it in complexity (by testing in all cases, or otherwise by subtracting a second time in all cases). Is it worth the price?
Thanks for the changes. Travis is failing, though.
    # maximum.
    if not np.allclose(mean_1, 0):
        raise ValueError(
            "Centering failed. Dataset may contain too"
There are spaces missing at both ends of the string lines.
I think it is worth computing the statistics again to make sure. If you add a warning that there were numerical issues in the case where you subtract the mean again, I think the fix is good.
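A sketch of that fix, using names from the diffs in this thread (Xr is the array being scaled in place; the data and warning text are illustrative):

    import warnings
    import numpy as np

    Xr = np.ones(10) * 1e100     # pathological data from the discussion
    mean_ = Xr.mean(axis=0)
    Xr -= mean_                  # first centering pass
    mean_1 = Xr.mean(axis=0)     # residual mean left by floating-point error
    if not np.allclose(mean_1, 0):
        warnings.warn("Numerical issues were encountered when centering the "
                      "data. Dataset may contain too large values.")
        Xr -= mean_1             # subtract the residual mean a second time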
The tests are currently failing...
Yes, I don't understand why. I suspect it is related to np.allclose, but I don't see where.
The problem comes from test_warning_scaling_integers when checking for a warning on uint8 data. In this case, scale() is unable to standardize the data (since 0 - 1 = 255 in uint8). Unlike the previous version, this version raises an error when it cannot scale the data correctly, so test_warning_scaling_integers does not see the warning, I think.
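The uint8 wraparound mentioned here, in isolation:

    import numpy as np

    X = np.zeros(2, dtype=np.uint8)
    X -= np.uint8(1)  # unsigned arithmetic wraps around: 0 - 1 = 255
    print(X)          # [255 255], so centering in the original dtype cannot work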
Ok. So you warn now? That is probably better given the previous behavior. Can you please rebase?
…0' (scikit-learn#3722)

- np.zeros(8) + np.log(1e-5)
- np.zeros(22) + np.log(1e-5)
- np.zeros(10) + 1e100
- np.ones(10) * 1e-100

In function scale():
- after a convenient normalization by 1/(max(abs(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725)
- a small change in the code to satisfy pep8 requirements

Now scale(np.arange(10.)*1e-100) = scale(np.arange(10.))
Remove isclose, which is now unnecessary
New test for extreme (1e100 and 1e-100) scaling
Modification on Xr instead of X
abs -> np.abs, pep8 checker, preScale -> pre_scale
max -> np.max()
Abandon the prescale method (problematic extra copy of the data)
- ValueError in the case of too-large values in X which yield (X - X.mean()).mean() > 0
- but still cover the case of std() close to 0, by subtracting the (new) mean again after scaling if needed
- isclose -> np.isclose
- all([isclose()]) -> np.allclose
np.isclose -> isclose to avoid a bug
np.allclose warning
Does scaling even make sense in the case of unsigned integers? Maybe it's better to just throw an exception in that case?

Currently we are warning. We could also convert to float64. I'd rather handle it than throw an exception.
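The float64 route would look roughly like this (a sketch, not the merged behavior):

    import numpy as np

    X = np.array([0, 1, 2], dtype=np.uint8)
    if X.dtype.kind in 'iu':       # signed or unsigned integer input
        X = X.astype(np.float64)   # centering is now safe, at the cost of a copy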
Could you also please wrap the lines that issue a warning with
added attempt to resolve when warning and corresponding tests
Ok, done. Tell me if it would be better to create a new test function instead.
@@ -100,6 +100,27 @@ def test_scaler_1d():
    X = np.ones(5)
    assert_array_equal(scale(X, with_mean=False), X)

    X = np.zeros(8) + np.log(1e-5)
I don't understand this number. Isn't this just a fancy way of saying -11? That is not what you intended, right?
Apart from the test adding -11, LGTM.

Yes, I took np.log(1e-5) and not -11 because of the size/precision of this float, in view of the original bug #3722.

Fair enough, let's keep it then.

@ogrisel review?
It's not trivial to understand when just reading the source code. We should add an inline comment to explain what's special about this value. |
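For instance, a hypothetical version of such a comment:

    # np.log(1e-5) is used rather than a round number like -11 because its
    # full float64 mantissa makes the constant-feature case numerically
    # harder, matching the original report in scikit-learn#3722
    X = np.zeros(8) + np.log(1e-5)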
Please put the new checks for numerical stability regressions in a new test function named "test_standard_scaler_numerical_stability" and make sure that all the warnings are checked. At the moment I get the following uncaught warnings:
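A sketch of the requested test, assuming the project's assert_warns_message helper; the exact warning strings are guesses based on the messages appearing in the diffs above:

    import numpy as np
    from sklearn.preprocessing import scale
    from sklearn.utils.testing import (assert_array_almost_equal,
                                       assert_warns_message)

    def test_standard_scaler_numerical_stability():
        # a constant feature should scale to exactly zero, with a warning
        # when its near-zero std triggers numerical problems
        x = np.zeros(10, dtype=np.float64) + np.log(1e-5)
        w = "standard deviation of the data is probably very close to 0"
        x_scaled = assert_warns_message(UserWarning, w, scale, x)
        assert_array_almost_equal(x_scaled, np.zeros(10))

        # very large values should warn about possible centering issues
        x_big = np.ones(10, dtype=np.float64) * 1e100
        w = "Dataset may contain too large values"
        x_big_scaled = assert_warns_message(UserWarning, w, scale, x_big)
        assert_array_almost_equal(x_big_scaled, np.zeros(10))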
    # concerned feature is efficient, for instance by its mean or
    # maximum.
    if not np.allclose(mean_1, 0):
        warnings.warn("Numerical issues were encountered "
"were encountered when centering the data"
test_standard_scaler_numerical_stability
Like this? I don't really understand why clean_warning_registry() is needed, but I put it in just in case.
No, I meant just using
Merged via #4436.
Fixes #3722, following up on #3725.