[MRG] Added code for sklearn.preprocessing.RankScaler #2176
Conversation
We use "normalization" to refer to normalized (unit) sample vectors. "Standardizer" would maybe be more apt.
if X.ndim != 2:
    raise ValueError("Rank-standardization only tested on 2-D matrices.")
else:
    self.sort_X_ = np.sort(X, axis=0)
Shouldn't this be an argsort if indices are supposed to come out? Also, do you need to store the full training set, or could a summary statistic over axis 0 suffice?
This shouldn't be an argsort. In fit, I simply sort the feature values. In transform, I use np.searchsorted (which is like Python's bisect) to find the index at which a particular feature value would be inserted.
It is possible that you do not need to store the full training set. The number of distinct feature values that you store determines the granularity of the final transformed feature values; e.g. if you store only 100 values, the resolution of the transformed values would be 0.01.
However, I am not sure of the best way to implement this in practice. Things get tricky when there is a large number of repeated values. One naive approach would be to define a resolution parameter (e.g. 100) and compute the feature value for 0/resolution through resolution/resolution. This truncated table would be stored instead of sort_X_. Taking advantage of many repeated values would require larger changes to the code.
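A minimal sketch of the resolution idea described above (not the PR's actual code; the helper names `fit_landmarks` and `rank_transform` are hypothetical): store only `n_ranks` percentile landmarks per feature at fit time, then rank new values against them with `np.searchsorted`.

```python
import numpy as np

def fit_landmarks(X, n_ranks=100):
    # Store only n_ranks percentile landmarks per feature instead of the
    # full sorted training set; resolution of the output is ~1 / n_ranks.
    return np.percentile(X, np.linspace(0, 100, n_ranks), axis=0)

def rank_transform(X, landmarks):
    n = landmarks.shape[0]
    X2 = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        # Averaging the left/right insertion points handles ties
        # symmetrically, mirroring the transform in this PR.
        lidx = np.searchsorted(landmarks[:, j], X[:, j], side='left')
        ridx = np.searchsorted(landmarks[:, j], X[:, j], side='right')
        X2[:, j] = (lidx + ridx) / (2.0 * n)
    return X2

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
landmarks = fit_landmarks(X, n_ranks=100)
Xt = rank_transform(X, landmarks)
```

With 100 landmarks the transformed values land on a grid of roughly 0.01, as described above.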
I have pushed a change that implements the summary statistic.
Travis build failed on this branch:
Storing a (sorted) copy of the original data in memory seems wasteful to me. Wouldn't it make more sense to compute percentile bin boundaries at fit time and only store the bin boundary values as an attribute on the scaler object, to be reused at transform time to do the actual scaling (with linear interpolation)? Also, we should find a way to not call searchsorted twice if possible.
Points taken. Could someone point me to documentation on how to do type-checking correctly? I will think over how I can do the approximate fit correctly.
Unfortunately I don't think there is a good doc for this and the current code base is not very consistent. As you don't support sparse matrices (at least not in the current state of this PR) you should probably use
you want outliers to be given high importance (StandardScaler)
or not (RankScaler).

TODO: min and max parameters?
Maybe you can keep it for later. To keep a note, add a commented line in the code, e.g. `# XXX add min max parameters`
As requested, I have added a commented line in the code. Will push soon.
I think this hardly requires a sprint. All that's needed here is a bit of narrative documentation and more comparison to other scalers (particularly RobustScaler, among others that have appeared since this PR), and perhaps handling the minor code smells I've raised.
for i in range(self.n_ranks):
    for j in range(n_features):
        # Find the corresponding i in the original ranking
        iorig = i * 1. * n_samples / self.n_ranks
You do not need the `1. *` here if you add `from __future__ import division`.
for j in range(X.shape[1]):
    lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
    ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
    v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
do not need 1. *
# Approximate the stored sort_X_
self.sort_X_ = np.zeros((self.n_ranks, n_features))
for i in range(self.n_ranks):
    for j in range(n_features):
The calculation of `iorig`, `ioriglo`, `iorighi` can be taken out of this loop. Moreover, for some rank `i`, I think `sort_X_[i, :]` can easily be calculated with vectorized operations.
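A sketch of the suggested vectorization (the names `iorig`/`ioriglo`/`iorighi` mirror the PR; `build_landmarks` is a hypothetical helper, assuming `full_sort_X_` is the fully sorted training data):

```python
import numpy as np

def build_landmarks(full_sort_X_, n_ranks):
    # Hoist iorig / ioriglo / iorighi out of the per-feature loop and
    # compute every row of sort_X_ with two fancy-indexed gathers.
    n_samples = full_sort_X_.shape[0]
    iorig = np.arange(n_ranks) * n_samples / float(n_ranks)
    ioriglo = np.floor(iorig).astype(int)
    iorighi = np.minimum(ioriglo + 1, n_samples - 1)
    whi = iorig - ioriglo            # linear interpolation weights
    wlo = 1.0 - whi
    return (wlo[:, None] * full_sort_X_[ioriglo, :]
            + whi[:, None] * full_sort_X_[iorighi, :])

full_sort_X_ = np.sort(np.random.RandomState(0).randn(50, 2), axis=0)
sort_X_ = build_landmarks(full_sort_X_, n_ranks=10)
```

The double Python loop collapses into a handful of array operations, which should be substantially faster for large `n_ranks`.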
Also, implementing inverse_transform wouldn't hurt.
Yes, that would be super useful.
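One plausible shape for an `inverse_transform`, sketched under the assumption that the fitted model stores sorted per-feature landmarks (the helper name `inverse_rank_transform` is hypothetical): interpolate from the rank grid back to feature values.

```python
import numpy as np

def inverse_rank_transform(R, landmarks):
    # Map ranks in [0, 1] back to feature values by piecewise-linear
    # interpolation over the stored sorted landmarks.
    n = landmarks.shape[0]
    grid = np.arange(n) / float(n - 1)   # rank positions of the landmarks
    X = np.empty(R.shape, dtype=float)
    for j in range(R.shape[1]):
        X[:, j] = np.interp(R[:, j], grid, landmarks[:, j])
    return X

# Toy check: landmarks evenly spaced on [0, 10], so rank 0.5 maps to 5.
landmarks = np.linspace(0.0, 10.0, 11).reshape(11, 1)
X_back = inverse_rank_transform(np.array([[0.0], [0.5], [1.0]]), landmarks)
```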
@jnothman Would it not be more readable to use
this sounds more elegant to me than the current implementation. But maybe I'm missing something.
By quantiles, I mean ranks for that PR.
I'd forgotten
Yes, interp1d seems advisable.
Sorry, mistaken again; I think the default interpolation in numpy.percentile before 1.9 is probably what we need, so yes: what you propose should work nicely, @glemaitre.
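A hedged sketch of the np.percentile-based fit being discussed (the attribute name `quantiles_` anticipates a suggestion made later in this thread and is hypothetical here):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.lognormal(size=(500, 2))        # skewed data, a typical use case

n_quantiles = 100
# One percentile call per fit replaces storing the full sorted data.
quantiles_ = np.percentile(X, np.linspace(0, 100, n_quantiles), axis=0)

# Transform = fraction of fitted quantiles lying below each value.
Xt0 = np.searchsorted(quantiles_[:, 0], X[:, 0]) / float(n_quantiles)
```

np.percentile interpolates linearly between order statistics by default, which matches the interpolation the PR does by hand.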
But the implementation is a minor issue here. The narrative docs and absent features like inverse_transform are more of an issue.
Also, @ogrisel was saying that for a large amount of data, we could subsample X to find the percentiles.
+1
You mean subsample X to avoid a full sort? I suppose so.
Yes, exactly.
[@jnothman there you go, this is almost getting a mini-sprint ;-)]
> Also, @ogrisel was telling that for a large amount of data, we could subsample X to find the percentiles.

For a large amount of data, turning off interpolation in np.percentile might also help (using e.g. the "middle" rule).
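The "middle" rule presumably corresponds to NumPy's 'midpoint' option, which picks the average of the two surrounding order statistics instead of interpolating (the keyword is `method` in NumPy >= 1.22; older releases spell it `interpolation`):

```python
import numpy as np

x = np.arange(1.0, 11.0)                       # samples 1..10
p_linear = np.percentile(x, 25)                # default linear interpolation
p_mid = np.percentile(x, 25, method='midpoint')  # no interpolation weights
```

For the 25th percentile of 1..10, the fractional position is 2.25, so linear gives 3.25 while midpoint gives (3 + 4) / 2 = 3.5; midpoint avoids the per-element weight computation.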
Here are a few inline comments that match some points already discussed previously.
@@ -294,6 +296,9 @@ class StandardScaler(BaseEstimator, TransformerMixin):
    :func:`sklearn.preprocessing.scale` to perform centering and
    scaling without using the ``Transformer`` object oriented API

    :class:`sklearn.preprocessing.RankScaler` to perform standardization
    that is more robust to outliers, but slower and more memory-intensive.
I think this should not be called standardization. In statistics, standardization refers specifically to the Z-score transform (mean removal and standard deviation scaling), e.g.:
http://www.ats.ucla.edu/stat/stata/faq/standardize.htm
Suggested wording: :class:`sklearn.preprocessing.RankScaler` is a feature-wise preprocessing operation that can be used as an alternative to standardization; it is more robust to outliers but slightly slower and more memory intensive.
assert whi >= 0 and whi <= 1
assert_almost_equal(wlo + whi, 1.)
self.sort_X_[i, j] = wlo * full_sort_X_[ioriglo, j] \
    + whi * full_sort_X_[iorighi, j]
We should just use https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html, but we need to be careful about pickling support (as done in the IsotonicRegression estimator class).
ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
X2[:, j] = v
return X2
Same here, we should just use interp1d.
""" | ||
X = array2d(X) | ||
n_samples, n_features = X.shape | ||
full_sort_X_ = np.sort(X, axis=0) |
We should add a hyperparameter subsample=int(1e5) to subsample without replacement (e.g. using np.random.RandomState(self.random_state).choice) when X.shape[0] is larger than subsample, as the CDF estimate won't change much. No need to suffer the n log(n) sorting complexity when n is very large.
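A sketch of that proposal (the helper name `fit_sorted` is hypothetical): draw at most `subsample` rows without replacement before sorting.

```python
import numpy as np

def fit_sorted(X, subsample=int(1e5), random_state=None):
    # The empirical CDF from a large subsample is nearly identical to the
    # full one, but the n log n sort runs on far fewer rows.
    n_samples = X.shape[0]
    if n_samples > subsample:
        idx = np.random.RandomState(random_state).choice(
            n_samples, size=subsample, replace=False)
        X = X[idx]
    return np.sort(X, axis=0)

X = np.random.RandomState(1).randn(2000, 2)
sorted_sub = fit_sorted(X, subsample=500, random_state=0)
```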
Attributes
----------
`sort_X_` : array of ints, shape (n_samples, n_features)
This would better be renamed landmarks_.
Actually, we should just store an attribute named quantiles_ with shape (n_quantiles, n_features), as suggested by @jnothman.
@@ -399,6 +404,120 @@ def __init__(self, copy=True, with_mean=True, with_std=True):
    super(Scaler, self).__init__(copy, with_mean, with_std)


class RankScaler(BaseEstimator, TransformerMixin):
I would rather rename this to QuantileTransformer or QuantileNormalizer.
Parameters
----------
n_ranks : int, 1000 by default
n_landmarks?
X : array-like, shape (n_samples, n_features)
    The data used to scale along the features axis.
"""
X = array2d(X)
I think we should also support sparse data, at least when the data is positive. The quantile transformation would not break the sparsity pattern in that case.
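A sketch of how the sparse, non-negative case could work (helper name `transform_sparse_nonneg` is hypothetical): rewrite only the stored nonzero values of a CSC matrix, so the sparsity pattern is untouched. This is only consistent for non-negative data, where zero is the minimum value.

```python
import numpy as np
import scipy.sparse as sp

def transform_sparse_nonneg(X_csc, quantiles_, n_quantiles):
    # Operate on the .data array column by column; implicit zeros are
    # never materialized, so memory stays proportional to nnz.
    X_csc = X_csc.copy()
    for j in range(X_csc.shape[1]):
        start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
        col = X_csc.data[start:end]
        X_csc.data[start:end] = (np.searchsorted(quantiles_[:, j], col)
                                 / float(n_quantiles))
    return X_csc

X = sp.random(100, 3, density=0.3, random_state=0, format='csc')
# Quantiles fitted on the dense view here only for the sake of the demo.
quantiles_ = np.percentile(X.toarray(), np.linspace(0, 100, 100), axis=0)
Xt = transform_sparse_nonneg(X, quantiles_, 100)
```

Every stored positive value maps to a positive rank, so no explicit zeros are introduced and the pattern survives the transform.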
We (@tguillemot, @glemaitre and myself) are planning a mini sprint to resurrect this one, with Guillaume fixing the code, Thierry writing the tests, and myself making an example comparing the min-max scaler with this PR. @GaelVaroquaux @ogrisel @jnothman @dengemann Sounds okay to you?
Let me know when you plan to do so, I might join you.
+1
@dengemann Certainly now ;)
I agree with all your comments, @ogrisel, except that if we're calling it `QuantileBlah`, then n_ranks should become n_quantiles.
Well, with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!
@GaelVaroquaux do you mean at transform time? At train time I think sorting is going to be the CPU bottleneck. We could possibly reduce the transform throughput by removing the linear interpolation but I am not sure that np.searchsorted is that much faster than interp1d. The expensive step is the search, not the interpolation IMHO (but this should be checked with a benchmark).
The 5 of us did chat about it at the coffee machine after lunch :)
> @GaelVaroquaux do you mean at transform time? At train time I think sorting is going to be the CPU bottleneck.

Yes, sorting is n log n.

> We could possibly reduce the transform throughput by removing the linear interpolation but I am not sure that np.searchsorted is that much faster than interp1d. The expensive step is the search, not the interpolation IMHO (but this should be checked with a benchmark).

Agreed.
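A quick way to run the benchmark mentioned above (timings vary by machine; this only shows how one would check the searchsorted-vs-interp1d question):

```python
import timeit
import numpy as np
from scipy.interpolate import interp1d

xp = np.sort(np.random.RandomState(0).randn(1000))   # fitted landmarks
fp = np.linspace(0.0, 1.0, 1000)                     # their ranks
x = np.random.RandomState(1).randn(100000)           # data to transform

t_search = timeit.timeit(lambda: np.searchsorted(xp, x), number=20)
f = interp1d(xp, fp, bounds_error=False, fill_value=(0.0, 1.0))
t_interp = timeit.timeit(lambda: f(x), number=20)
print('searchsorted: %.4fs  interp1d: %.4fs' % (t_search, t_interp))
```

interp1d itself performs a binary search internally before interpolating, which is why the search is expected to dominate either way.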
> Well with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!

Good night!
@jnothman I'm outing myself as the fire starter; apologies for spiking the anomaly detection alerts, and I'll take any complaints about email overflow :)
Merged as #8363.
I wrote code for doing rank-scaling. This scaling technique is more robust than StandardScaler (unit variance, zero mean).
I believe that "scale" is the wrong term for this operation; it's actually feature "normalization". That name conflicts with the "normalize" method, though.
I wrote documentation and tests. However, I was unable to get the doc-suite or test-suite to build for the current sklearn HEAD, so I couldn't double-check all my documentation and tests.
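A minimal illustration of the robustness claim in this PR description (not the PR's code; rank computed here via a double argsort for brevity): one extreme outlier dominates z-scoring but barely moves rank-based values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])   # one extreme outlier

# StandardScaler-style z-scores: the outlier inflates the std, squashing
# the differences among the other values to almost nothing.
z = (x - x.mean()) / x.std()

# Rank scaling: values are spread evenly over [0, 1] regardless of how
# extreme the outlier is.
ranks = np.argsort(np.argsort(x)) / float(len(x) - 1)
```

The first four z-scores end up within ~0.01 of each other, while their ranks remain evenly spaced at 0, 0.25, 0.5, 0.75.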