
[MRG + 1] Fix preprocessing.scale for arrays with near zero ratio variance/max #3747


Closed · wants to merge 6 commits

Conversation

@ngoix (Contributor, author) commented Oct 9, 2014

Fixes #3722, continuing the work started in #3725.

@agramfort (Member)

Travis is not happy.

@ngoix (Contributor, author) commented Oct 11, 2014

@maximsch2: I agree with you that scale(np.arange(10) * 1e100) == scale(np.arange(10)), but do you think the same should hold for scale(np.arange(10) * 1e-100) == scale(np.arange(10))?

For now, my version gives:

```
print preprocessing.scale(np.arange(10) * 1e100)
[-1.5666989 -1.21854359 -0.87038828 -0.52223297 -0.17407766 0.17407766
0.52223297 0.87038828 1.21854359 1.5666989 ]

print preprocessing.scale(np.arange(10) * 1e-100)
[ -4.50000000e-100 -3.50000000e-100 -2.50000000e-100 -1.50000000e-100
-5.00000000e-101 5.00000000e-101 1.50000000e-100 2.50000000e-100
3.50000000e-100 4.50000000e-100]
```

@ngoix (Contributor, author) commented Oct 11, 2014

This modification gives the alternative result (but still doesn't pass the Travis tests).

@ngoix (Contributor, author) commented Oct 11, 2014

i.e.:
```
print preprocessing.scale(np.arange(10) * 1e-100)
[-1.5666989 -1.21854359 -0.87038828 -0.52223297 -0.17407766 0.17407766
0.52223297 0.87038828 1.21854359 1.5666989 ]
```

@coveralls

Coverage Status: Coverage increased (+0.01%) when pulling d1373c8 on ngoix:preproc into 08d98f7 on scikit-learn:master.

@ngoix (Contributor, author) commented Oct 12, 2014

This version satisfies

scale(np.arange(10) * 1e-100) == scale(np.arange(10)) == scale(np.arange(10) * 1e100)

and addresses all the issues raised by @maximsch2 in #3722 and #3725.
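
In test form, the claimed invariance looks like this (a sketch, not code from the PR; it assumes a scikit-learn build that includes this fix):

```python
import numpy as np
from sklearn.preprocessing import scale

x = np.arange(10, dtype=np.float64)
# Rescaling the input by an extreme constant factor should not change
# the standardized output.
for factor in (1e-100, 1.0, 1e100):
    np.testing.assert_allclose(scale(x * factor), scale(x))
```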

```diff
@@ -129,13 +129,15 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
     else:
         X = np.asarray(X)
         warn_if_not_float(X, estimator='The scale function')
+        preScale = abs(X).max()+1.*(abs(X).max() == 0.)
```
Review comment (Member):

Variable names are never in CamelCase, so use pre_scale.

Review comment (Member):

abs(X) -> np.abs(X), and run the pep8 checker (spaces are missing around +).

Review comment (Member):

Computing abs(X).max() twice should be avoided.
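
For readers following the diff, a sketch of the pre-scaling trick with the reviewers' naming and pep8 fixes applied (an illustration, not the merged code):

```python
import numpy as np

# Dividing by max(|X|) keeps the values O(1) before mean/std are computed;
# the "+ 1.0 * (max_abs == 0.0)" term keeps the divisor at 1 for an
# all-zero input, avoiding division by zero.
X = np.arange(10.) * 1e100
max_abs = np.abs(X).max()                     # computed once, per the review above
pre_scale = max_abs + 1.0 * (max_abs == 0.0)
X_normalized = X / pre_scale                  # note: this creates an extra copy
```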

@ngoix (Contributor, author) commented Oct 13, 2014

I don't understand why the Travis test still fails after changing abs() to np.abs().

@agramfort (Member)

Look at the Travis log. The error is unrelated.

```diff
         mean_, std_ = _mean_and_std(
-            X, axis, with_mean=with_mean, with_std=with_std)
+            X/pre_scale, axis, with_mean=with_mean, with_std=with_std)
```
Review comment (Member):

Spaces are missing around the / operator.

Review comment (Member):

What I don't like here is that this will make an extra copy of the data in memory. That can be a problem for large datasets.

@coveralls

Coverage Status: Coverage increased (+0.0%) when pulling 4e270ec on ngoix:preproc into 08d98f7 on scikit-learn:master.

```python
        # concerned feature is efficient, for instance by its mean or
        # maximum.
        if isinstance(mean_1, np.ndarray):
            if not all([isclose(a, 0.) for a in mean_1]):
```
Review comment (Member):

There exists np.allclose.

@coveralls

Coverage Status: Coverage increased (+0.02%) when pulling e0cc360 on ngoix:preproc into 08d98f7 on scikit-learn:master.

@amueller added the Bug label Jan 16, 2015
@amueller added this to the 0.16 milestone Jan 16, 2015
```python
        # maximum.
        if isinstance(mean_1, np.ndarray):
            if not np.allclose(mean_1, np.zeros(np.shape(mean_1)[0])):
                raise ValueError(
```
Review comment (Member):

Should we fail? We could also warn.

Review comment (Contributor, author):

Yes, but in this case we fail to return a centered result. The obtained X_r is actually wrong, and in my opinion it is then pointless to keep running the program... isn't it?

@amueller changed the title Fix preprocessing.scale for arrays with near zero ratio variance/max [MRG] Fix preprocessing.scale for arrays with near zero ratio variance/max Jan 21, 2015
```python
        # concerned feature is efficient, for instance by its mean or
        # maximum.
        if isinstance(mean_1, np.ndarray):
            if not np.allclose(mean_1, np.zeros(np.shape(mean_1)[0])):
```
Review comment (Member):

You can just use np.allclose(mean_1, 0).

Review comment (Member):

And I don't think you need the if at all.

Review comment (Contributor, author):

OK, I'll change the np.allclose. The if is needed if we don't want to modify X_r with a useless subtraction (Xr -= mean_2 with a mean_2 very close to 0). But maybe the cost of testing np.allclose is greater than that of just subtracting; is that what you meant?

@amueller (Member)

I don't see why you want to raise in one case and silently correct in the other. Shouldn't we warn in both cases?

@ngoix (Contributor, author) commented Jan 22, 2015

Yes, we should warn in the second case. But in the first case, do you think we should keep running the program even if the intermediate result is wrong?

@ngoix (Contributor, author) commented Jan 22, 2015

For me, the main drawback of this PR is that it fixes bugs on very particular data (do such data exist in the real world? Can't the user notice when one sample is very large, or when all the samples are almost equal?) and pays for it in complexity (by testing in all cases or, failing that, by subtracting a second time in all cases). Is it worth the price?

@amueller (Member)

Thanks for the changes. Travis is failing, though.

```python
        # maximum.
        if not np.allclose(mean_1, 0):
            raise ValueError(
                "Centering failed. Dataset may contain too"
```
Review comment (Member):

There are spaces missing at both ends of the split string lines.

@amueller (Member)

I think it is worth computing the statistics again to make sure. If you add a warning that there were numerical issues in the case where you subtract the mean again, I think the fix is good.

@amueller (Member)

The tests are currently failing...

@ngoix (Contributor, author) commented Jan 27, 2015

Yes, I don't understand why. I suspect it is related to np.allclose, but I can't see where.

@ngoix (Contributor, author) commented Mar 2, 2015

The problem comes from test_warning_scaling_integers, which checks for a warning on uint8 data. In that case scale() is unable to standardize the data (since 0 - 1 = 255 in uint8). Unlike the previous version, this version raises an error when it cannot scale the data correctly, so test_warning_scaling_integers never sees the warning, I think.
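
A minimal illustration of that failure mode (not the PR's code): unsigned integer arithmetic wraps around, so centering cannot produce the negative values it needs.

```python
import numpy as np

a = np.array([0, 1, 2], dtype=np.uint8)
# Subtracting in uint8 wraps around instead of going negative:
print(a - np.uint8(1))            # [255   0   1], because 0 - 1 == 255 in uint8
# Upcasting to a float dtype first avoids the wraparound:
print(a.astype(np.float64) - 1)   # [-1.  0.  1.]
```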

@amueller (Member) commented Mar 2, 2015

OK. So you warn now? That is probably better given the previous behavior. Can you please rebase?

Nicolas and others added 2 commits March 2, 2015 18:26

…0' (scikit-learn#3722)

- np.zeros(8) + np.log(1e-5)
- np.zeros(22) + np.log(1e-5)
- np.zeros(10) + 1e100
- np.ones(10) * 1e-100

In function scale():
- after a convenient normalization by 1/(max(abs(X)) + 1), check whether std_ is "close to zero" instead of "exactly equal to 0" (scikit-learn#3722, scikit-learn#3725);
- a small change in the code to satisfy pep8 requirements.

Now, scale(np.arange(10.) * 1e-100) == scale(np.arange(10.)).

Remove isclose, which is now unnecessary.

New test for extreme (1e100 and 1e-100) scaling.

Modification on Xr instead of X.

abs -> np.abs, pep8 checker, preScale -> pre_scale.

max -> np.max()

Abandon the pre-scale method (problematic extra copy of the data):
- ValueError in the case of too-large values in X, which yield (X - X.mean()).mean() > 0;
- but still cover the case of std() close to 0 by subtracting the (new) mean again after scaling, if needed.

isclose -> np.isclose; all([isclose()]) -> np.allclose

np.isclose -> isclose to avoid a bug

np.allclose

warning
@maximsch2

Does scaling even make sense for unsigned integers? Maybe it's better to just throw an exception in that case?

@amueller (Member) commented Mar 2, 2015

Currently we are warning. We could also convert to float64. I'd rather handle it than throw an exception.

@ogrisel (Member) commented Mar 3, 2015

Could you also please wrap the lines that issue a warning with assert_warns_message or assert_warns?

Added an attempt to resolve the problem when warning, plus corresponding tests
@ngoix (Contributor, author) commented Mar 3, 2015

OK, done; tell me if it would be better to create a new test function instead.
And now that we warn instead of raising an error -- if not np.allclose(mean_1, 0) -- I think it is good to try to correct the problem by subtracting the mean again (and it works in most cases), as in step 2 ('if with_std:'). I also added the corresponding tests, since with this modification scale() works (with a warning) for vectors like np.ones(10) * 1e100 and np.arange(10.) * 1e100.
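
To make the correction concrete, a sketch of that step (the helper name _recenter_if_needed is hypothetical; in the PR the logic sits inline in scale()):

```python
import warnings

import numpy as np


def _recenter_if_needed(Xr):
    # After the first centering pass, the residual mean should be ~0.
    # If it is not, warn about the numerical issue and subtract the
    # residual mean once more, as described in the comment above.
    mean_1 = Xr.mean(axis=0)
    if not np.allclose(mean_1, 0):
        warnings.warn("Numerical issues were encountered when centering "
                      "the data and might not be solved.")
        Xr -= mean_1
    return Xr
```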

@coveralls

Coverage Status: Coverage increased (+0.0%) to 95.1% when pulling ae8f2e4 on ngoix:preproc into af1906f on scikit-learn:master.

```diff
@@ -100,6 +100,27 @@ def test_scaler_1d():
     X = np.ones(5)
     assert_array_equal(scale(X, with_mean=False), X)

+    X = np.zeros(8) + np.log(1e-5)
```
Review comment (Member):

I don't understand this number. Isn't this a fancy way of saying -11? That is not what you intended, right?

@amueller (Member) commented Mar 3, 2015

Apart from the test adding -11, LGTM.

@ngoix (Contributor, author) commented Mar 4, 2015

Yes, I used np.log(1e-5) rather than -11 because of the size/precision of this float, in view of the original bug #3722.
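
Concretely, what makes this constant a useful test value (my arithmetic, offered as context):

```python
import numpy as np

# np.log(1e-5) is approximately -11.512925464970229: a constant that uses
# the full float64 mantissa, unlike the exactly representable integer -11,
# so it is meant to exercise the rounding behavior behind bug #3722.
X = np.zeros(8) + np.log(1e-5)
```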

@amueller (Member) commented Mar 4, 2015

Fair enough, let's keep it then.

@amueller changed the title [MRG] Fix preprocessing.scale for arrays with near zero ratio variance/max [MRG + 1] Fix preprocessing.scale for arrays with near zero ratio variance/max Mar 4, 2015
@amueller (Member) commented Mar 5, 2015

@ogrisel review?

@ogrisel (Member) commented Mar 5, 2015

> Fair enough, let's keep it then.

It's not trivial to understand when just reading the source code. We should add an inline comment to explain what's special about this value.

@ogrisel (Member) commented Mar 6, 2015

Please put the new checks for numerical stability regressions in a new test function named "test_standard_scaler_numerical_stability" and make sure that all the warnings are checked.

At the moment I get the following uncaught warnings:

```
$ nosetests -v -s sklearn/preprocessing/tests/test_data.py
...
Test scaling of dataset along single axis ... /Users/ogrisel/code/scikit-learn/sklearn/preprocessing/data.py:167: UserWarning: Numerical issues were encountered and might not be solved. The standard deviation of the data is probably very close to 0.
  warnings.warn("Numerical issues were encountered "
/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/numpy/core/numeric.py:183: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  a = empty(shape, dtype, order)
/Users/ogrisel/code/scikit-learn/sklearn/preprocessing/data.py:152: UserWarning: Numerical issues were encountered and might not be solved. Dataset may contain too large values. You may need to prescale your features.
  warnings.warn("Numerical issues were encountered "
/Users/ogrisel/code/scikit-learn/sklearn/preprocessing/data.py:152: UserWarning: Numerical issues were encountered and might not be solved. Dataset may contain too large values. You may need to prescale your features.
  warnings.warn("Numerical issues were encountered "
...
```

```python
        # concerned feature is efficient, for instance by its mean or
        # maximum.
        if not np.allclose(mean_1, 0):
            warnings.warn("Numerical issues were encountered "
```
Review comment (Member):

Make it "were encountered when centering the data".

ngoix added 2 commits March 7, 2015 14:40
test_standard_scaler_numerical_stability
@ngoix (Contributor, author) commented Mar 7, 2015

Like this? I don't really understand why clean_warning_registry() is needed, but I put it in just in case.

@ogrisel (Member) commented Mar 23, 2015

No, I meant just using assert_warns_message and checking the return values instead of calling the functions twice. Let me squash, rebase, and fix this PR, as it's almost good to merge.
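
For illustration, the pattern ogrisel describes might look like this (a sketch: the import path matches sklearn's test utilities of that era, and the message substring is taken from the warning quoted in the log above):

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.utils.testing import assert_warns_message

# assert_warns_message checks that the expected warning is raised AND
# returns the function's result, so scale() only needs to be called once.
x = np.ones(10) * 1e100
x_scaled = assert_warns_message(UserWarning,
                                "Dataset may contain too large values",
                                scale, x)
```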

@ogrisel (Member) commented Mar 23, 2015

@ngoix I pushed my fixes into a new PR: #4436. Please let me know if you see anything that you did not originally intend. I will add a whats_new.rst entry prior to merging.

@amueller (Member)

Merged via #4436.

@amueller closed this Mar 23, 2015
Successfully merging this pull request may close these issues:

- preprocessing.scale provides consistent results on arrays with zero variance

6 participants