[MRG] Added code for sklearn.preprocessing.RankScaler #2176
Conversation
We use "normalization" to refer to normalized (unit) sample vectors. "Standardizer" would maybe be more apt.
if X.ndim != 2:
    raise ValueError("Rank-standardization only tested on 2-D matrices.")
else:
    self.sort_X_ = np.sort(X, axis=0)
Shouldn't this be an argsort if indices are supposed to come out? Also, do you need to store the full training set, or could a summary statistic over axis 0 suffice?
This shouldn't be an argsort. In fit, I simply sort the feature values. In transform, I use np.searchsorted (which is like Python's bisect) to find the index at which a particular feature value would be inserted.
It is possible that you do not need to store the full training set. The number of distinct feature values that you store determines the granularity of the final transformed feature values; e.g. if you store only 100 values, the resolution of the transformed values would be 0.01.
However, I am not sure of the best way to implement this in practice. Things get tricky when there is a large number of repeated values. One naive approach would be to define a resolution parameter (e.g. 100) and compute the feature value for 0/resolution through resolution/resolution. This truncated table would be stored instead of sort_X_. Taking advantage of many repeated values would require larger changes to the code.
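A minimal sketch of the resolution idea described above (not the PR's actual code; the helper names `fit_landmarks` and `rank_transform` are hypothetical): store only `n_ranks` percentile landmarks per feature at fit time, then rank new values against them with `np.searchsorted`.

```python
import numpy as np

def fit_landmarks(X, n_ranks=100):
    # Store only n_ranks percentile landmarks per feature instead of the
    # full sorted training set; resolution of the output is ~1 / n_ranks.
    return np.percentile(X, np.linspace(0, 100, n_ranks), axis=0)

def rank_transform(X, landmarks):
    n = landmarks.shape[0]
    X2 = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        # Averaging the left/right insertion points handles ties
        # symmetrically, mirroring the transform in this PR.
        lidx = np.searchsorted(landmarks[:, j], X[:, j], side='left')
        ridx = np.searchsorted(landmarks[:, j], X[:, j], side='right')
        X2[:, j] = (lidx + ridx) / (2.0 * n)
    return X2

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
landmarks = fit_landmarks(X, n_ranks=100)
Xt = rank_transform(X, landmarks)
```

With 100 landmarks the transformed values land on a grid of roughly 0.01, as described above.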
I have pushed a change that implements the summary statistic.
Travis build failed on this branch:
Storing a (sorted) copy of the original data in memory seems wasteful to me. Wouldn't it make more sense to compute percentile bin boundaries at fit time and only store the bin boundary values as an attribute on the scaler object, to be reused at transform time to do the actual scaling (with linear interpolation)? Also, we should find a way to not call searchsorted twice if possible.
Points taken. Could someone point me to documentation on how to do type-checking correctly? I will think over how I can do the approximate fit correctly.
Unfortunately I don't think there is a good doc for this and the current code base is not very consistent. As you don't support sparse matrices (at least not in the current state of this PR) you should probably use
you want outliers to be given high importance (StandardScaler)
or not (RankScaler).

TODO: min and max parameters?
Maybe you can keep it for later. To keep a note, add a commented line in the code, e.g. `# XXX add min max parameters`
As requested, I have added a commented line in the code. Will push soon.
I think this hardly requires a sprint. All that's needed here is a bit of narrative documentation and more comparison to other scalers (particularly RobustScaler, among others that have appeared since this PR), and perhaps handling the minor code smells I've raised.
for i in range(self.n_ranks):
    for j in range(n_features):
        # Find the corresponding i in the original ranking
        iorig = i * 1. * n_samples / self.n_ranks
You do not need the `1. *` here if you add `from __future__ import division`.
for j in range(X.shape[1]):
    lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
    ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
    v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
do not need 1. *
# Approximate the stored sort_X_
self.sort_X_ = np.zeros((self.n_ranks, n_features))
for i in range(self.n_ranks):
    for j in range(n_features):
The calculation of `iorig`, `ioriglo`, `iorighi` can be taken out of this loop. Moreover, for some rank `i`, I think `sort_X_[i, :]` can easily be calculated with vectorized operations.
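A sketch of the suggested vectorization (the names `iorig`/`ioriglo`/`iorighi` mirror the PR; `build_landmarks` is a hypothetical helper, assuming `full_sort_X_` is the fully sorted training data):

```python
import numpy as np

def build_landmarks(full_sort_X_, n_ranks):
    # Hoist iorig / ioriglo / iorighi out of the per-feature loop and
    # compute every row of sort_X_ with two fancy-indexed gathers.
    n_samples = full_sort_X_.shape[0]
    iorig = np.arange(n_ranks) * n_samples / float(n_ranks)
    ioriglo = np.floor(iorig).astype(int)
    iorighi = np.minimum(ioriglo + 1, n_samples - 1)
    whi = iorig - ioriglo            # linear interpolation weights
    wlo = 1.0 - whi
    return (wlo[:, None] * full_sort_X_[ioriglo, :]
            + whi[:, None] * full_sort_X_[iorighi, :])

full_sort_X_ = np.sort(np.random.RandomState(0).randn(50, 2), axis=0)
sort_X_ = build_landmarks(full_sort_X_, n_ranks=10)
```

The double Python loop collapses into a handful of array operations, which should be substantially faster for large `n_ranks`.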
Also, implementing inverse_transform wouldn't hurt.
Yes, that would be super useful.
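One plausible shape for an `inverse_transform`, sketched under the assumption that the fitted model stores sorted per-feature landmarks (the helper name `inverse_rank_transform` is hypothetical): interpolate from the rank grid back to feature values.

```python
import numpy as np

def inverse_rank_transform(R, landmarks):
    # Map ranks in [0, 1] back to feature values by piecewise-linear
    # interpolation over the stored sorted landmarks.
    n = landmarks.shape[0]
    grid = np.arange(n) / float(n - 1)   # rank positions of the landmarks
    X = np.empty(R.shape, dtype=float)
    for j in range(R.shape[1]):
        X[:, j] = np.interp(R[:, j], grid, landmarks[:, j])
    return X

# Toy check: landmarks evenly spaced on [0, 10], so rank 0.5 maps to 5.
landmarks = np.linspace(0.0, 10.0, 11).reshape(11, 1)
X_back = inverse_rank_transform(np.array([[0.0], [0.5], [1.0]]), landmarks)
```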
@jnothman Would it not be more readable to use
this sounds more elegant to me than the current implementation. But maybe I'm missing something.
By quantiles, I mean ranks for that PR.
I'd forgotten
Yes, interp1d seems advisable.
Sorry, mistaken again; I think the default interpolation in numpy.percentile before 1.9 is probably what we need, so yes: what you propose should work nicely, @glemaitre.
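A hedged sketch of the np.percentile-based fit being discussed (the attribute name `quantiles_` anticipates a suggestion made later in this thread and is hypothetical here):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.lognormal(size=(500, 2))        # skewed data, a typical use case

n_quantiles = 100
# One percentile call per fit replaces storing the full sorted data.
quantiles_ = np.percentile(X, np.linspace(0, 100, n_quantiles), axis=0)

# Transform = fraction of fitted quantiles lying below each value.
Xt0 = np.searchsorted(quantiles_[:, 0], X[:, 0]) / float(n_quantiles)
```

np.percentile interpolates linearly between order statistics by default, which matches the interpolation the PR does by hand.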
But the implementation is a minor issue here. The narrative docs and absent features like inverse_transform are more of an issue.
Also, @ogrisel was saying that for a large amount of data, we could subsample X to find the percentiles.
+1
You mean subsample X to avoid a full sort? I suppose so.
Yes, exactly.
[@jnothman there you go, this is almost getting a mini-sprint ;-)]
> Also, @ogrisel was telling that for a large amount of data, we could subsample X to find the percentiles.

For a large amount of data, turning off interpolation in np.percentile might also help (using e.g. the "middle" rule).
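The "middle" rule presumably corresponds to NumPy's 'midpoint' option, which picks the average of the two surrounding order statistics instead of interpolating (the keyword is `method` in NumPy >= 1.22; older releases spell it `interpolation`):

```python
import numpy as np

x = np.arange(1.0, 11.0)                       # samples 1..10
p_linear = np.percentile(x, 25)                # default linear interpolation
p_mid = np.percentile(x, 25, method='midpoint')  # no interpolation weights
```

For the 25th percentile of 1..10, the fractional position is 2.25, so linear gives 3.25 while midpoint gives (3 + 4) / 2 = 3.5; midpoint avoids the per-element weight computation.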
Here are a few inline comments that match some points already discussed previously.
@@ -294,6 +296,9 @@ class StandardScaler(BaseEstimator, TransformerMixin):
    :func:`sklearn.preprocessing.scale` to perform centering and
    scaling without using the ``Transformer`` object oriented API

    :class:`sklearn.preprocessing.RankScaler` to perform standardization
    that is more robust to outliers, but slower and more memory-intensive.
I think this should not be called standardization. In statistics, standardization refers specifically to the Z-score transform (mean removal and standard deviation scaling), e.g.:
http://www.ats.ucla.edu/stat/stata/faq/standardize.htm
Suggested wording: :class:`sklearn.preprocessing.RankScaler` is a feature-wise preprocessing operation that can be used as an alternative to standardization; it is more robust to outliers but slightly slower and more memory intensive.
assert whi >= 0 and whi <= 1
assert_almost_equal(wlo + whi, 1.)
self.sort_X_[i, j] = wlo * full_sort_X_[ioriglo, j] \
    + whi * full_sort_X_[iorighi, j]
We should just use https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html, but we need to be careful about pickling support (as done in the IsotonicRegression estimator class).
ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
X2[:, j] = v
return X2
Same here, we should just use interp1d.
""" | ||
X = array2d(X) | ||
n_samples, n_features = X.shape | ||
full_sort_X_ = np.sort(X, axis=0) |
We should add a hyperparameter subsample=int(1e5) to subsample without replacement (e.g. using np.random.RandomState(self.random_state).choice) when X.shape[0] is larger than subsample, as the CDF estimate won't change much. No need to suffer the n log(n) sorting complexity when n is very large.
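A sketch of that proposal (the helper name `fit_sorted` is hypothetical): draw at most `subsample` rows without replacement before sorting.

```python
import numpy as np

def fit_sorted(X, subsample=int(1e5), random_state=None):
    # The empirical CDF from a large subsample is nearly identical to the
    # full one, but the n log n sort runs on far fewer rows.
    n_samples = X.shape[0]
    if n_samples > subsample:
        idx = np.random.RandomState(random_state).choice(
            n_samples, size=subsample, replace=False)
        X = X[idx]
    return np.sort(X, axis=0)

X = np.random.RandomState(1).randn(2000, 2)
sorted_sub = fit_sorted(X, subsample=500, random_state=0)
```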
Attributes
----------
`sort_X_` : array of ints, shape (n_samples, n_features)
This would better be renamed landmarks_.
Actually, we should just store an attribute named quantiles_ with shape (n_quantiles, n_features), as suggested by @jnothman.
@@ -399,6 +404,120 @@ def __init__(self, copy=True, with_mean=True, with_std=True):
    super(Scaler, self).__init__(copy, with_mean, with_std)


class RankScaler(BaseEstimator, TransformerMixin):
I would rather rename this to QuantileTransformer or QuantileNormalizer.
Parameters
----------
n_ranks : int, 1000 by default
n_landmarks?
X : array-like, shape (n_samples, n_features)
    The data used to scale along the features axis.
"""
X = array2d(X)
I think we should also support sparse data, at least when the data is positive. The quantile transformation would not break the sparsity pattern in that case.
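A sketch of how the sparse, non-negative case could work (helper name `transform_sparse_nonneg` is hypothetical): rewrite only the stored nonzero values of a CSC matrix, so the sparsity pattern is untouched. This is only consistent for non-negative data, where zero is the minimum value.

```python
import numpy as np
import scipy.sparse as sp

def transform_sparse_nonneg(X_csc, quantiles_, n_quantiles):
    # Operate on the .data array column by column; implicit zeros are
    # never materialized, so memory stays proportional to nnz.
    X_csc = X_csc.copy()
    for j in range(X_csc.shape[1]):
        start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
        col = X_csc.data[start:end]
        X_csc.data[start:end] = (np.searchsorted(quantiles_[:, j], col)
                                 / float(n_quantiles))
    return X_csc

X = sp.random(100, 3, density=0.3, random_state=0, format='csc')
# Quantiles fitted on the dense view here only for the sake of the demo.
quantiles_ = np.percentile(X.toarray(), np.linspace(0, 100, 100), axis=0)
Xt = transform_sparse_nonneg(X, quantiles_, 100)
```

Every stored positive value maps to a positive rank, so no explicit zeros are introduced and the pattern survives the transform.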
We (@tguillemot, @glemaitre and myself) are planning a mini sprint to resurrect this one, with Guillaume fixing the code, Thierry writing the tests, and myself making an example comparing the min-max scaler with this PR. @GaelVaroquaux @ogrisel @jnothman @dengemann Sounds okay to you?
Let me know when you plan to do so, I might join you.
+1
@dengemann Certainly now ;)
I agree with all your comments, @ogrisel, except that if we're calling it `QuantileBlah`, then n_ranks should become n_quantiles.
Well, with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!
@GaelVaroquaux do you mean at transform time? At train time I think sorting is going to be the CPU bottleneck. We could possibly reduce the transform throughput by removing the linear interpolation but I am not sure that np.searchsorted is that much faster than interp1d. The expensive step is the search, not the interpolation IMHO (but this should be checked with a benchmark).
The 5 of us did chat about it at the coffee machine after lunch :)
> @GaelVaroquaux do you mean at transform time? At train time I think sorting is going to be the CPU bottleneck.

Yes, sorting is n log n.

> We could possibly reduce the transform throughput by removing the linear interpolation but I am not sure that np.searchsorted is that much faster than interp1d. The expensive step is the search, not the interpolation IMHO (but this should be checked with a benchmark).

Agreed.
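A quick way to run the benchmark mentioned above (timings vary by machine; this only shows how one would check the searchsorted-vs-interp1d question):

```python
import timeit
import numpy as np
from scipy.interpolate import interp1d

xp = np.sort(np.random.RandomState(0).randn(1000))   # fitted landmarks
fp = np.linspace(0.0, 1.0, 1000)                     # their ranks
x = np.random.RandomState(1).randn(100000)           # data to transform

t_search = timeit.timeit(lambda: np.searchsorted(xp, x), number=20)
f = interp1d(xp, fp, bounds_error=False, fill_value=(0.0, 1.0))
t_interp = timeit.timeit(lambda: f(x), number=20)
print('searchsorted: %.4fs  interp1d: %.4fs' % (t_search, t_interp))
```

interp1d itself performs a binary search internally before interpolating, which is why the search is expected to dominate either way.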
> Well with this sudden (and somewhat alarming) groundswell, I'm going to sleep and am curious to see what I find when I awake!

Good night!
@jnothman I'm outing myself as the fire starter; apologies for spiking the anomaly detection alerts, and I'll take any complaints about email overflow :)
Merged as #8363.
I wrote code for doing rank-scaling. This scaling technique is more robust than StandardScaler (unit variance, zero mean).
I believe that "scale" is the wrong term for this operation; it's actually feature "normalization". That name conflicts with the "normalize" method, though.
I wrote documentation and tests. However, I was unable to get the doc-suite or test-suite to build for the current sklearn HEAD, so I couldn't double-check all my documentation and tests.
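A minimal illustration of the robustness claim in this PR description (not the PR's code; rank computed here via a double argsort for brevity): one extreme outlier dominates z-scoring but barely moves rank-based values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])   # one extreme outlier

# StandardScaler-style z-scores: the outlier inflates the std, squashing
# the differences among the other values to almost nothing.
z = (x - x.mean()) / x.std()

# Rank scaling: values are spread evenly over [0, 1] regardless of how
# extreme the outlier is.
ranks = np.argsort(np.argsort(x)) / float(len(x) - 1)
```

The first four z-scores end up within ~0.01 of each other, while their ranks remain evenly spaced at 0, 0.25, 0.5, 0.75.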