
FIX: normalizer l_inf should take maximum of absolute values #16633


Merged
merged 28 commits into from
Mar 10, 2020

Conversation

maurapintor
Contributor

@maurapintor maurapintor commented Mar 4, 2020

Reference Issues/PRs

Fixes #16632

What does this implement/fix? Explain your changes.

Takes the maximum of the absolute values as the norm before normalizing the vectors.
See the norm definition here.
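To illustrate the fix, here is a minimal sketch in plain NumPy of the before/after behavior for an all-negative sample (mirroring the relevant line in normalize):

```python
import numpy as np

X = np.array([[-1.0, -2.0, -4.0]])  # an all-negative sample

# Old behavior: divide by the (possibly negative) row maximum,
# which flips the signs of an all-negative row.
old_norms = np.max(X, axis=1)        # array([-1.])
old_result = X / old_norms[:, None]  # array([[1., 2., 4.]])

# Fixed behavior: divide by the maximum absolute value (the L-inf norm),
# which preserves the signs.
new_norms = np.max(np.abs(X), axis=1)  # array([4.])
new_result = X / new_norms[:, None]    # array([[-0.25, -0.5, -1.]])
```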

Member

@rth rth left a comment


Thanks!

Please add a non-regression test (e.g. adapting the example you provided in the issue) to sklearn/preprocessing/tests/test_preprocessing.py.

Also please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@maurapintor maurapintor requested a review from rth March 4, 2020 15:26
@maurapintor
Contributor Author

@rth I fixed what you suggested. To be honest, it is my first contribution, so let me know if I did something wrong! Any feedback is more than welcome 😉

@maurapintor maurapintor requested a review from rth March 4, 2020 16:40
Member

@rth rth left a comment


One remaining comment otherwise LGTM. Thanks @Maupin1991 !

@maurapintor maurapintor changed the title fix: normalizer l_inf now takes the absolute value of the maximum bef… fix: normalizer l_inf should take absolute value of the max before division Mar 5, 2020
Member

@jnothman jnothman left a comment


should it be max(abs(X)) or abs(max(X))?

@rth
Member

rth commented Mar 5, 2020

should it be max(abs(X)) or abs(max(X))?

abs(max(X)), as it is now, makes fewer copies and element-wise operations, I think, and is faster:

In [1]: import numpy as np                                                                                                             

In [2]: X = np.random.RandomState(0).rand(1000, 100)                                                                                   

In [3]: %timeit np.abs(np.max(X, axis=1))                                                                                              
117 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: %timeit np.max(np.abs(X), axis=1)                                                                                              
165 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@maurapintor
Contributor Author

By definition, it should be the maximum absolute value, not the abs value of the max! Good catch @jnothman ! I'm fixing that.

@maurapintor maurapintor changed the title fix: normalizer l_inf should take absolute value of the max before division fix: normalizer l_inf should take maxumum of absolute values Mar 5, 2020
Member

@jnothman jnothman left a comment


It's not clear to me that the intention of norm='max' is equivalent to norm='Linf', though I can see why having a negative norm is unhelpful. Do you think it would be better to raise an error if the norm is negative and provide the Linf option separately? Or, if we keep this as proposed, we need to update the parameter documentation to clarify that it's the maximum absolute value being used.

@@ -1728,7 +1728,7 @@ def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
     elif norm == 'l2':
         norms = row_norms(X)
     elif norm == 'max':
-        norms = np.max(X, axis=1)
+        norms = np.max(np.abs(X), axis=1)
Member


Should we instead avoid a copy with something like np.maximum(abs(np.min(X, axis=1)), np.max(X, axis=1))? Or is that too crazy?


As far as I can tell, for vectors, max norm and infinity norm are equal and correspond to taking the maximum of the absolute values: https://en.wikipedia.org/wiki/Norm_(mathematics)#Maximum_norm_(special_case_of:_infinity_norm,_uniform_norm,_or_supremum_norm)

Member


My point here is that abs(X) copies X and transforms it. For large data, np.maximum(abs(np.min(X, axis=1)), np.max(X, axis=1)) would be more memory efficient but give the same result.
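The copy-free variant can be checked against the straightforward one; a small sketch (the equivalence relies on max(|x|) == max(|min(x)|, max(x)) for any real vector x):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 100)

# Straightforward version: np.abs(X) materializes a full temporary copy of X.
norms_copy = np.max(np.abs(X), axis=1)

# Copy-free version: only two per-row reductions, no full-size temporary.
norms_nocopy = np.maximum(np.abs(np.min(X, axis=1)), np.max(X, axis=1))
```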


I agree with you that avoiding a copy is worthwhile here; I was just answering your question about norms on top of this one.

Member


But for now we have kept it copying?

Contributor Author


Sorry, my bad. Should be good now.

@rth
Member

rth commented Mar 5, 2020

It's not clear to me that the intention of norm='max' is equivalent to norm='Linf'.

Renaming this parameter to norm='inf' would be more correct but possibly a bit less explicit for an average user. I agree we should improve the documentation about the equivalence.

I can't think of a practical application where one would want to have norm='max' with a distinct functionality from norm='inf'. Maybe with some negative outliers but that sounds very specific... Implementing the same (or a subset of) norms from https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html would be more consistent IMO.

Also, I think a norm can't be negative by definition (https://math.stackexchange.com/a/1901893), while with the current norm='max' it can be, so it's not a valid norm.

norm='max' was introduced in 2015 in #4695

improvement: more efficient computation of the max norm

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@maurapintor maurapintor requested a review from jnothman March 5, 2020 12:33

@jnothman
Member

jnothman commented Mar 5, 2020

Can we repurpose the MaxAbsScaler here?

@maurapintor
Contributor Author

Can we repurpose the MaxAbsScaler here?

If I am not mistaken, MaxAbsScaler works per-feature, while the Normalizer works per-sample. I would keep these behaviors separate, even though I agree that some code would inevitably be replicated. What is your suggestion, @jnothman? Do you mean we should treat the cases as the same, applying the normalization to the transpose?

@jnothman
Member

jnothman commented Mar 7, 2020

I mean you can apply it to the transposed X
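The repurposing idea can be sketched as an equivalence check; a hypothetical example assuming scikit-learn >= 0.23 (where this fix landed), since MaxAbsScaler scales per column:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, normalize

X = np.array([[-1.0, -2.0, -4.0],
              [ 3.0,  1.0, -2.0]])

# Per-sample max-abs normalization, as done by normalize(norm='max')
# after this fix.
per_sample = normalize(X, norm='max')

# MaxAbsScaler works per feature (per column), so transposing makes each
# sample a column; transposing back recovers the same result.
via_scaler = MaxAbsScaler().fit_transform(X.T).T
```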

@jnothman
Member

jnothman commented Mar 8, 2020

I'm still unsure about whether we should consider making this more efficient, but let's merge as-is for now.

@jnothman jnothman changed the title fix: normalizer l_inf should take maxumum of absolute values FIX: normalizer l_inf should take maxumum of absolute values Mar 8, 2020
Member

@jnothman jnothman left a comment


Actually this isn't ready for merge yet. Please update the docstring of norm to make the behaviour clearer.

@maurapintor maurapintor changed the title FIX: normalizer l_inf should take maxumum of absolute values FIX: normalizer l_inf should take maximum of absolute values Mar 9, 2020
@jnothman jnothman merged commit b189bf6 into scikit-learn:master Mar 10, 2020
@jnothman
Member

Thanks!

ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Successfully merging this pull request may close these issues.

Preprocessing with L-inf Normalizer of all-negative elements vector returns all-positive elements vector