
FIX: normalizer l_inf should take maximum of absolute values #16633


Merged
merged 28 commits into from
Mar 10, 2020

Conversation

maurapintor
Contributor

@maurapintor maurapintor commented Mar 4, 2020

Reference Issues/PRs

Fixes #16632

What does this implement/fix? Explain your changes.

Takes the maximum of the absolute values as the norm before normalizing the vectors.
See the norm definition here.
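To illustrate the fix, here is a minimal sketch in plain NumPy of the before/after behavior for an all-negative sample (mirroring the relevant line in normalize):

```python
import numpy as np

X = np.array([[-1.0, -2.0, -4.0]])  # an all-negative sample

# Old behavior: divide by the (possibly negative) row maximum,
# which flips the signs of an all-negative row.
old_norms = np.max(X, axis=1)        # array([-1.])
old_result = X / old_norms[:, None]  # array([[1., 2., 4.]])

# Fixed behavior: divide by the maximum absolute value (the L-inf norm),
# which preserves the signs.
new_norms = np.max(np.abs(X), axis=1)  # array([4.])
new_result = X / new_norms[:, None]    # array([[-0.25, -0.5, -1.]])
```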

Member

@rth rth left a comment


Thanks!

Please add a non-regression test (e.g. adapting the example you provided in the issue) to sklearn/preprocessing/tests/test_preprocessing.py.

Also please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@maurapintor maurapintor requested a review from rth March 4, 2020 15:26
@maurapintor
Contributor Author

@rth I fixed what you suggested. To be honest, it is my first contribution, so let me know if I did something wrong! Any feedback is more than welcome 😉

@maurapintor maurapintor requested a review from rth March 4, 2020 16:40
Member

@rth rth left a comment


One remaining comment otherwise LGTM. Thanks @Maupin1991 !

@maurapintor maurapintor changed the title fix: normalizer l_inf now takes the absolute value of the maximum bef… fix: normalizer l_inf should take absolute value of the max before division Mar 5, 2020
Member

@jnothman jnothman left a comment


should it be max(abs(X)) or abs(max(X))?

@rth
Member

rth commented Mar 5, 2020

should it be max(abs(X)) or abs(max(X))?

abs(max(X)), as it is now, makes fewer copies and element-wise operations, I think, and is faster:

In [1]: import numpy as np                                                                                                             

In [2]: X = np.random.RandomState(0).rand(1000, 100)                                                                                   

In [3]: %timeit np.abs(np.max(X, axis=1))                                                                                              
117 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: %timeit np.max(np.abs(X), axis=1)                                                                                              
165 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@maurapintor
Contributor Author

By definition, it should be the maximum absolute value, not the abs value of the max! Good catch @jnothman ! I'm fixing that.

@maurapintor maurapintor changed the title fix: normalizer l_inf should take absolute value of the max before division fix: normalizer l_inf should take maxumum of absolute values Mar 5, 2020
Member

@jnothman jnothman left a comment


It's not clear to me that the intention of norm='max' is equivalent to norm='Linf', though I can see why having a negative norm is unhelpful. Do you think it would be better to raise an error if the norm is negative and provide the Linf option separately? Or, if we keep this as proposed, we need to update the parameter documentation to clarify that it's the maximum absolute value being used.

@@ -1728,7 +1728,7 @@ def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
     elif norm == 'l2':
         norms = row_norms(X)
     elif norm == 'max':
-        norms = np.max(X, axis=1)
+        norms = np.max(np.abs(X), axis=1)
Member


Should we instead avoid a copy with something like np.maximum(abs(np.min(X, axis=1)), np.max(X, axis=1))? Or is that too crazy?


As far as I can tell, for vectors, max norm and infinity norm are equal and correspond to taking the maximum of the absolute values: https://en.wikipedia.org/wiki/Norm_(mathematics)#Maximum_norm_(special_case_of:_infinity_norm,_uniform_norm,_or_supremum_norm)

Member


My point here is that abs(X) copies X and transforms it. For large data, np.maximum(abs(np.min(X, axis=1)), np.max(X, axis=1)) would be more memory efficient but give the same result.
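The copy-free variant can be checked against the straightforward one; a small sketch (the equivalence relies on max(|x|) == max(|min(x)|, max(x)) for any real vector x):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 100)

# Straightforward version: np.abs(X) materializes a full temporary copy of X.
norms_copy = np.max(np.abs(X), axis=1)

# Copy-free version: only two per-row reductions, no full-size temporary.
norms_nocopy = np.maximum(np.abs(np.min(X, axis=1)), np.max(X, axis=1))
```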


I agree with you that avoiding a copy is worthwhile here; I was just answering your question about norms on top of this one.

Member


But for now we have kept it copying?

Contributor Author


Sorry, my bad. Should be good now.

@rth
Member

rth commented Mar 5, 2020

It's not clear to me that the intention of norm='max' is equivalent to norm='Linf'.

Renaming this parameter to norm='inf' would be more correct but possibly a bit less explicit for an average user. I agree we should improve the documentation about the equivalence.

I can't think of a practical application where one would want to have norm='max' with a distinct functionality from norm='inf'. Maybe with some negative outliers but that sounds very specific... Implementing the same (or a subset of) norms from https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html would be more consistent IMO.

Also, I think a norm can't be negative by definition (https://math.stackexchange.com/a/1901893), while with the current norm='max' it can be, so it's not a valid norm.

norm='max' was introduced in 2015 in #4695

improvement: more efficient computation of the max norm

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@maurapintor maurapintor requested a review from jnothman March 5, 2020 12:33

@jnothman
Member

jnothman commented Mar 5, 2020

Can we repurpose the MaxAbsScaler here?

@maurapintor
Contributor Author

Can we repurpose the MaxAbsScaler here?

If I am not mistaken, MaxAbsScaler works per-feature, while the Normalizer works per-sample. I would keep these behaviors separate, even though I agree that some code would inevitably be replicated. What is your suggestion, @jnothman? Do you mean we should treat the cases as the same, applying the normalization to the transpose?

@jnothman
Member

jnothman commented Mar 7, 2020

I mean you can apply it to the transposed X
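The repurposing idea can be sketched as an equivalence check; a hypothetical example assuming scikit-learn >= 0.23 (where this fix landed), since MaxAbsScaler scales per column:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, normalize

X = np.array([[-1.0, -2.0, -4.0],
              [ 3.0,  1.0, -2.0]])

# Per-sample max-abs normalization, as done by normalize(norm='max')
# after this fix.
per_sample = normalize(X, norm='max')

# MaxAbsScaler works per feature (per column), so transposing makes each
# sample a column; transposing back recovers the same result.
via_scaler = MaxAbsScaler().fit_transform(X.T).T
```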

@jnothman
Member

jnothman commented Mar 8, 2020

I'm still unsure about whether we should consider making this more efficient, but let's merge as-is for now.

@jnothman jnothman changed the title fix: normalizer l_inf should take maxumum of absolute values FIX: normalizer l_inf should take maxumum of absolute values Mar 8, 2020
Member

@jnothman jnothman left a comment


Actually this isn't ready for merge yet. Please update the docstring of norm to make the behaviour clearer.

@maurapintor maurapintor changed the title FIX: normalizer l_inf should take maxumum of absolute values FIX: normalizer l_inf should take maximum of absolute values Mar 9, 2020
@jnothman jnothman merged commit b189bf6 into scikit-learn:master Mar 10, 2020
@jnothman
Member

Thanks!

ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Successfully merging this pull request may close these issues.

Preprocessing with L-inf Normalizer of all-negative elements vector returns all-positive elements vector