
[MRG] Added code for sklearn.preprocessing.RankScaler #2176


Closed
wants to merge 7 commits into from

Conversation


@turian turian commented Jul 21, 2013

I wrote code for rank-scaling. This scaling technique is more robust to outliers than StandardScaler (zero mean, unit variance).

I believe that "scale" is the wrong term for this operation; it's actually feature "normalization". That name conflicts with the existing "normalize" method, though.

I wrote documentation and tests. However, I was unable to get the doc-suite or test-suite to build for the current sklearn HEAD, so I couldn't double-check all my documentation and tests.

@larsmans
Member

We use "normalization" to refer to scaling sample vectors to unit norm. Standardizer would perhaps be more apt.

if X.ndim != 2:
    raise ValueError("Rank-standardization only tested on 2-D matrices.")
else:
    self.sort_X_ = np.sort(X, axis=0)
Member

Shouldn't this be an argsort if indices are supposed to come out? Also, do you need to store the full training set, or could a summary statistic over axis 0 suffice?

Author

This shouldn't be an argsort. In fit, I simply sort the feature values. In transform, I use np.searchsorted (which is like Python's bisect) to find the index at which a particular feature value would be inserted.

It is possible that you do not need to store the full training set. The number of distinct feature values that you store determines the granularity of the transformed feature values, e.g. if you store only 100 values, the resolution of the transformed values should be 0.01.

However, I am not sure of the best way to implement this in practice. Things get tricky when there are many repeated values. One naive approach would be to define a resolution parameter (e.g. 100) and compute the feature value for 0/resolution through resolution/resolution. This truncated table would be stored instead of sort_X_. Taking advantage of many repeated values would require larger changes to the code.
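The fit/transform idea described above can be sketched as follows (a minimal illustration with hypothetical helper names, not the PR's actual code):

```python
import numpy as np

def fit_rank(X):
    # Fit: store each feature column's values in sorted order.
    return np.sort(X, axis=0)

def transform_rank(sort_X, X):
    # Transform: map each value to its empirical rank in [0, 1].
    n = sort_X.shape[0]
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        # Averaging the 'left' and 'right' insertion points makes ties
        # map to the middle of their run of repeated values.
        lo = np.searchsorted(sort_X[:, j], X[:, j], side='left')
        hi = np.searchsorted(sort_X[:, j], X[:, j], side='right')
        out[:, j] = (lo + hi) / (2.0 * n)
    return out

X = np.array([[1.0], [2.0], [3.0], [100.0]])
Xt = transform_rank(fit_rank(X), X)
print(Xt.ravel())  # the outlier 100.0 maps to 0.875, not a huge z-score
```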

Author

I have pushed a change that implements the summary statistic.

@GaelVaroquaux
Member

Travis build failed on this branch:
https://travis-ci.org/scikit-learn/scikit-learn/builds/9334424
That's improper handling of sparse matrices (you need to fix the "sparse='csr'" argument)

@ogrisel
Member

ogrisel commented Jul 23, 2013

Storing a (sorted) copy of the original data in memory seems wasteful to me. Wouldn't it make more sense to compute percentile bin boundaries at fit time, store only the bin boundary values as an attribute on the scaler object, and reuse them at transform time to do the actual scaling (with linear interpolation)?

Also we should find a way to not call searchsorted twice if possible.
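In pure NumPy, the scheme suggested here might look like the following (a hedged sketch: the function names and the choice of np.percentile / np.interp are my assumptions, not the PR's code):

```python
import numpy as np

def fit_bins(X, n_bins=1000):
    # Fit: keep only the percentile bin boundaries, not the sorted data.
    probs = np.linspace(0, 100, n_bins)
    return np.percentile(X, probs, axis=0)  # shape (n_bins, n_features)

def transform_bins(bins, X):
    # Transform: a single np.interp call per feature (linear
    # interpolation between bin boundaries) instead of two
    # searchsorted calls per feature.
    ranks = np.linspace(0.0, 1.0, bins.shape[0])
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        out[:, j] = np.interp(X[:, j], bins[:, j], ranks)
    return out

rng = np.random.RandomState(0)
X = rng.lognormal(size=(10000, 2))
Xt = transform_bins(fit_bins(X), X)
print(Xt.min(), Xt.max())  # transformed values lie in [0, 1]
```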

@turian
Author

turian commented Jul 24, 2013

Points taken.

Could someone point me to documentation on how to do type-checking correctly?

I will think over how I can do the approximate fit correctly.

@ogrisel
Member

ogrisel commented Jul 25, 2013

Could someone point me to documentation on how to do type-checking correctly?

Unfortunately I don't think there is a good doc for this, and the current code base is not very consistent. Since you don't support sparse matrices (at least not in the current state of this PR), you should probably use X = array2d(X), which comes from sklearn.utils.array2d.

@turian
Author

turian commented Jul 27, 2013

  1. I have fixed PEP8 violations.
  2. I have improved type-checking, as suggested by @ogrisel.
  3. I have written an approximate transform that is less memory-intensive (suggested by @ogrisel and @larsmans). By default, the resolution is 1000. I have also added tests to make sure that the approximation does not differ too much from the exact version.

you want outliers to be given high importance (StandardScaler)
or not (RankScaler).

TODO: min and max parameters?
Member

maybe you can keep it for later.

To keep a note, add a commented line in the code, e.g.

# XXX add min max parameters

Author

As requested, I have added a commented line in the code. Will push soon.

Member

@jnothman jnothman left a comment

I think this hardly requires a sprint. All that's needed here is a bit of narrative documentation, more comparison to other scalers (particularly RobustScaler, among others that have appeared since this PR), and perhaps handling the minor code smells I've raised.

for i in range(self.n_ranks):
    for j in range(n_features):
        # Find the corresponding i in the original ranking
        iorig = i * 1. * n_samples / self.n_ranks
Member

No need for the 1. * here; just add from __future__ import division.

for j in range(X.shape[1]):
    lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
    ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
    v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
Member

do not need 1. *

# Approximate the stored sort_X_
self.sort_X_ = np.zeros((self.n_ranks, n_features))
for i in range(self.n_ranks):
    for j in range(n_features):
Member

The calculation of iorig, ioriglo, iorighi can be taken out of this loop. Moreover, for some rank i I think sort_X_[i, :] can be easily calculated in vectorized operations.
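The vectorization suggested here might look roughly like this (a sketch; downsample_sorted is a hypothetical name and uses a slightly different landmark convention than the PR's loop):

```python
import numpy as np

def downsample_sorted(full_sort_X, n_ranks):
    # Keep n_ranks landmark rows from the fully sorted data, linearly
    # interpolating between adjacent rows. The index arithmetic depends
    # only on i, so it is hoisted out of any per-feature loop, and each
    # landmark row is computed with vectorized array operations.
    n_samples = full_sort_X.shape[0]
    iorig = np.arange(n_ranks) * (n_samples - 1) / float(n_ranks - 1)
    ioriglo = np.floor(iorig).astype(int)
    iorighi = np.ceil(iorig).astype(int)
    whi = iorig - ioriglo          # weight on the upper neighbouring row
    wlo = 1.0 - whi                # weight on the lower neighbouring row
    return (wlo[:, None] * full_sort_X[ioriglo]
            + whi[:, None] * full_sort_X[iorighi])

X = np.sort(np.arange(12.0).reshape(6, 2), axis=0)
print(downsample_sorted(X, 3))  # rows [0, 1], [5, 6], [10, 11]
```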

@jnothman
Member

Also, implementing inverse_transform wouldn't hurt.

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@glemaitre
Member

glemaitre commented Feb 15, 2017

@jnothman Would it not be more readable to use numpy.percentile and scipy.interpolate.interp1d to find the quantiles and learn the interpolation function in fit? transform would then rely on the learned interpolation function.

@dengemann
Contributor

Would it not be more readable to use numpy.percentile and scipy.interpolate.interp1d to find the quantiles and learn the interpolation function in fit? transform would then rely on the learned interpolation function.

this sounds more elegant to me than the current implementation. But maybe I'm missing something.
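As a rough sketch of that suggestion (the class name, the n_quantiles parameter, and the fill_value handling are my assumptions, not what was eventually merged):

```python
import numpy as np
from scipy.interpolate import interp1d

class QuantileRankSketch(object):
    def __init__(self, n_quantiles=1000):
        self.n_quantiles = n_quantiles

    def fit(self, X):
        # Learn one interpolation function per feature from its quantiles.
        probs = np.linspace(0, 100, self.n_quantiles)
        quantiles = np.percentile(X, probs, axis=0)
        ranks = probs / 100.0
        self.funcs_ = [
            interp1d(quantiles[:, j], ranks,
                     bounds_error=False, fill_value=(0.0, 1.0))
            for j in range(X.shape[1])
        ]
        return self

    def transform(self, X):
        # Apply the interpolation functions learned in fit.
        return np.column_stack([f(X[:, j])
                                for j, f in enumerate(self.funcs_)])

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
Xt = QuantileRankSketch(n_quantiles=100).fit(X).transform(X)
print(Xt.min(), Xt.max())  # ranks lie in [0, 1]
```

Note that storing interp1d objects as fitted attributes would need the same care with pickling that is mentioned elsewhere in this review.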

@glemaitre
Member

By quantiles, I mean ranks for that PR.

@jnothman
Member

I'd forgotten numpy.percentile existed, and looked at the implementation of scipy.stats.scoreatpercentile, which was no more efficient than my proposal. One problem with np.percentile is that it does not support the interpolation keyword until numpy 1.9.

@jnothman
Member

Yes, interp1d seems advisable.

@jnothman
Member

jnothman commented Feb 15, 2017

Sorry, mistaken again; I think the default interpolation in numpy.percentile before 1.9 is probably what we need, so yes: what you propose should work nicely, @glemaitre.

@jnothman
Member

But the implementation is a minor issue here. The narrative docs and absent features like inverse_transform are more of an issue.

@glemaitre
Member

Also, @ogrisel was saying that for a large amount of data, we could subsample X to find the percentiles.

But the implementation is a minor issue here. The narrative docs and absent features like inverse_transform are more of an issue.

+1

@jnothman
Member

jnothman commented Feb 15, 2017 via email

@glemaitre
Member

You mean subsample X to avoid a full sort? I suppose so.

Yes, exactly.

@dengemann
Contributor

[@jnothman there you go, this is almost getting a mini-sprint ;-)]

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

Member

@ogrisel ogrisel left a comment

Here are a few inline comments that match some points already discussed previously.

@@ -294,6 +296,9 @@ class StandardScaler(BaseEstimator, TransformerMixin):
:func:`sklearn.preprocessing.scale` to perform centering and
scaling without using the ``Transformer`` object oriented API

:class:`sklearn.preprocessing.RankScaler` to perform standardization
that is more robust to outliers, but slower and more memory-intensive.
Member

I think this should not be called standardization. In statistics, standardization specifically refers to the Z-score transform (mean removal and standard deviation scaling), e.g.:

http://www.ats.ucla.edu/stat/stata/faq/standardize.htm

:class:`sklearn.preprocessing.RankScaler` is a feature-wise preprocessing operation that can be used as an alternative to standardization. :class:`sklearn.preprocessing.RankScaler` is more robust to outliers but slightly slower and more memory-intensive.

assert whi >= 0 and whi <= 1
assert_almost_equal(wlo + whi, 1.)
self.sort_X_[i, j] = wlo * full_sort_X_[ioriglo, j] \
    + whi * full_sort_X_[iorighi, j]
Member

@ogrisel ogrisel Feb 15, 2017

We should just use https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html but we need to be careful about pickling support (as done in the IsotonicRegression estimator class).

    ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
    v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
    X2[:, j] = v
return X2
Member

Same here we should just use interp1d.

"""
X = array2d(X)
n_samples, n_features = X.shape
full_sort_X_ = np.sort(X, axis=0)
Member

@ogrisel ogrisel Feb 15, 2017

We should add a hyperparameter subsample=int(1e5) to subsample without replacement (e.g. using np.random.RandomState(self.random_state).choice) when X.shape[0] is larger than subsample, as the CDF estimate won't change much. No need to suffer the O(n log n) sort complexity when n is very large.
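A sketch of that subsampling idea (subsample and random_state used as suggested; the function name is hypothetical):

```python
import numpy as np

def subsampled_quantiles(X, n_quantiles=1000, subsample=int(1e5),
                         random_state=None):
    # When n_samples exceeds the budget, estimate the quantiles from a
    # random subset drawn without replacement: the CDF estimate changes
    # little, and we avoid sorting all n samples.
    n_samples = X.shape[0]
    if n_samples > subsample:
        rng = np.random.RandomState(random_state)
        idx = rng.choice(n_samples, size=subsample, replace=False)
        X = X[idx]
    probs = np.linspace(0, 100, n_quantiles)
    return np.percentile(X, probs, axis=0)

rng = np.random.RandomState(0)
X = rng.normal(size=(200000, 1))
q = subsampled_quantiles(X, n_quantiles=5, subsample=10000, random_state=0)
print(q.ravel())  # five landmarks; the middle one is close to the true median 0
```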


Attributes
----------
`sort_X_` : array of ints, shape (n_samples, n_features)
Member

This would be better renamed landmarks_.

Member

Actually we should just store an attribute named quantiles_ with shape (n_quantiles, n_features) as suggested by @jnothman .

@@ -399,6 +404,120 @@ def __init__(self, copy=True, with_mean=True, with_std=True):
super(Scaler, self).__init__(copy, with_mean, with_std)


class RankScaler(BaseEstimator, TransformerMixin):
Member

I would rather rename this to QuantileTransformer or QuantileNormalizer.


Parameters
----------
n_ranks : int, 1000 by default
Member

n_landmarks ?

X : array-like, shape (n_samples, n_features)
The data used to scale along the features axis.
"""
X = array2d(X)
Member

I think we should also support sparse data, at least when the data is positive. The quantile transformation would not break the sparsity pattern in that case.
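A sketch of why this works for non-negative sparse data (hypothetical helper, not the PR's code): zero is the smallest possible value, so it can consistently map to rank 0, and only the stored nonzeros need transforming.

```python
import numpy as np
import scipy.sparse as sp

def rank_transform_sparse(X, quantiles, ranks):
    # X: sparse matrix with non-negative entries.
    # quantiles: (n_quantiles, n_features) landmarks fit on the full data.
    X = X.tocsc(copy=True)
    for j in range(X.shape[1]):
        start, end = X.indptr[j], X.indptr[j + 1]
        # Transform only the stored nonzeros of column j; since the data
        # is non-negative, they stay positive and the pattern is intact.
        X.data[start:end] = np.interp(X.data[start:end],
                                      quantiles[:, j], ranks)
    return X.tocsr()

rng = np.random.RandomState(0)
dense = rng.lognormal(size=(100, 3)) * (rng.rand(100, 3) > 0.7)
X = sp.csc_matrix(dense)
probs = np.linspace(0, 100, 50)
quantiles = np.percentile(dense, probs, axis=0)
ranks = probs / 100.0
Xt = rank_transform_sparse(X, quantiles, ranks)
print(Xt.nnz == X.nnz)  # True: the sparsity pattern is preserved
```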

@raghavrv
Member

We (@tguillemot @glemaitre and myself) are planning a mini sprint to resurrect this one...

With Guillaume fixing the code, Thierry writing the tests, and myself making an example comparing the min-max scaler with this PR... @GaelVaroquaux @ogrisel @jnothman @dengemann Sounds okay to you?

@dengemann
Contributor

We (@tguillemot @glemaitre and myself) are planning a mini sprint to resurrect this one...

Let me know when you plan to do it; I might join you.

@ogrisel
Member

ogrisel commented Feb 15, 2017

+1

@tguillemot
Contributor

@dengemann Certainly now ;)

@jnothman
Member

jnothman commented Feb 15, 2017 via email

@jnothman
Member

Well with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!

@ogrisel
Member

ogrisel commented Feb 15, 2017

For a large amount of data, turning off interpolation in np.percentile might also help (using e.g. the "middle" rule).

@GaelVaroquaux do you mean at transform time? At train time I think sorting is going to be the CPU bottleneck.

We could possibly improve the transform throughput by removing the linear interpolation, but I am not sure that np.searchsorted is that much faster than interp1d. The expensive step is the search, not the interpolation, IMHO (but this should be checked with a benchmark).

@ogrisel
Member

ogrisel commented Feb 15, 2017

Well with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!

The 5 of us did chat about it at the coffee machine after lunch :)

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@dengemann
Contributor

Well with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!

@jnothman I'm outing myself as the fire starter; I apologise for spiking the anomaly detection alerts and will take any complaints about email overflow :)

@amueller
Member

merged as #8363.

@amueller amueller closed this Jun 10, 2017