1d array support in MinMaxScaler #1549

mblondel · 2013-01-09T09:47:15Z

MinMaxScaler is useful for regression to scale y. For this reason, it would be nice to support 1d arrays in fit and transform.

kyleabeauchamp · 2013-01-10T21:40:31Z

So to me, the best way to do this is to:

1. Check if np.rank(X) == 1
2. Reshape X to have rank 2
3. Apply fit
4. Reshape back to rank 1

Is there a more elegant way to deal with rank 1 vs rank n issues in Numpy?
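The four steps above can be sketched with plain NumPy. This is only an illustration under stated assumptions: `minmax_scale_any_rank` is a hypothetical helper (not scikit-learn code), the scaling is written by hand instead of calling MinMaxScaler, and a 1d input is assumed to hold n samples of a single feature, so it is reshaped to a column. Because `np.rank` is not available in recent NumPy releases, the sketch checks `X.ndim` instead.

```python
import numpy as np


def minmax_scale_any_rank(X):
    """Hypothetical helper: min-max scale a 1d or 2d array to [0, 1]."""
    X = np.asarray(X, dtype=float)
    was_1d = X.ndim == 1
    if was_1d:
        # Assumption: a 1d input is n samples of a single feature.
        X = X.reshape(-1, 1)

    # "Fit": per-feature (column) minimum and range.
    data_min = X.min(axis=0)
    data_range = X.max(axis=0) - data_min
    data_range[data_range == 0.0] = 1.0  # avoid dividing by zero

    # "Transform": scale each column to [0, 1].
    X_scaled = (X - data_min) / data_range

    # Restore the caller's original rank.
    return X_scaled.ravel() if was_1d else X_scaled


y_scaled = minmax_scale_any_rank([1.0, 2.0, 0.0])
print(y_scaled.shape, y_scaled)  # shape (3,), values 0.5, 1.0, 0.0
```

The round trip keeps the caller's rank, but it hard-codes one interpretation of a 1d array, which is the design question the rest of the thread is about.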

kyleabeauchamp · 2013-01-15T04:43:06Z

Another related question: does it make sense to give MinMaxScaler a Numpy array with ndim > 2?

Right now, this doesn't raise any error. I wonder if there might be "hidden logic" that assumes ndim = 2 in various preprocessing routines.
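A small NumPy-only sketch of why a 3d input might pass silently. It does not reproduce MinMaxScaler's internals; it only assumes that the statistics are computed with reductions along axis 0, which NumPy accepts for any ndim.

```python
import numpy as np

X3 = np.arange(24, dtype=float).reshape(2, 3, 4)  # ndim == 3

# Reductions along axis 0 run without complaint on a 3d array...
data_min = X3.min(axis=0)
data_range = X3.max(axis=0) - data_min

# ...so the "per-feature" statistics silently become a (3, 4) grid.
print(data_min.shape, data_range.shape)
```

Nothing fails here, so only an explicit ndim check would catch such input.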

amueller · 2013-01-15T08:45:03Z

@kyleabeauchamp that is a good comment. I would have imagined that check_arrays tests whether ndim=2, but I am not actually sure it does.

mblondel · 2013-01-15T09:08:36Z

One thing that worries me in retrospect about this issue is that a 1d X and a 1d y are not reshaped the same way (reshape(1, -1) for one and reshape(-1, 1) for the other). So if we do decide to implement this feature, we need to document clearly that 1d array support is for y, not X (or we could add a constructor option to specify how to reshape).

kyleabeauchamp · 2013-01-15T18:17:10Z

To me, I would think that 1D X and 1D y would be the same. My thinking is that if you reshape a 1D y vector, you are actually treating y as a single feature from a hypothetical feature matrix.

If you have a 1D X, my first guess would be the case of linear regression, where you have a single feature with several observations.

I think regardless, we might want to add tighter checking of ndim.

mblondel · 2013-01-15T18:54:53Z

Our utility function array2d (used throughout the scikit on X) behaves like this:

>>> from sklearn.utils.validation import array2d
>>> array2d([1,2,3]).shape
(1, 3)

So [1,2,3] is treated as a single 3-dimensional sample. It's useful when you want to call predict on a single sample, for instance.

For y, if we want to apply it to MinMaxScaler, I would think that we need to treat it as n samples from the same feature:

>>> np.array([1,2,3]).reshape(-1,1).shape
(3, 1)

Or maybe I'm just confused.
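The difference between the two conventions can be checked with NumPy alone. This sketch uses `np.atleast_2d` as a stand-in that yields the same (1, 3) shape as the array2d output shown above, and computes the per-feature range by hand; the variable names are illustrative only.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])

as_row = np.atleast_2d(y)     # shape (1, 3): one sample, three features
as_column = y.reshape(-1, 1)  # shape (3, 1): three samples, one feature

# Per-feature range, computed along axis 0 (over samples).
row_range = as_row.max(axis=0) - as_row.min(axis=0)
column_range = as_column.max(axis=0) - as_column.min(axis=0)

print(as_row.shape, row_range)        # every feature has zero range
print(as_column.shape, column_range)  # the single feature has range 2.0
```

With the row interpretation there is only one sample per feature, so there is nothing to scale; the column interpretation is the one that makes min-max scaling of y meaningful.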

amueller · 2013-01-15T19:22:05Z

@mblondel I think you are right. Though the docs and tests on this issue (treating single column X) are not so great :-/

mblondel · 2013-01-16T08:43:49Z

I think that a single sample (1, n_features) is more useful for predict and n_samples one-dimensional samples (n_samples, 1) is more useful for fit (fitting only one sample is not very useful except for partial_fit). However, in terms of API, that would be a bit confusing.

amueller · 2013-01-16T09:24:13Z

We need to add tests that the treatment of 1D arrays is consistent everywhere, preferably before merging any code that changes the behavior.

larsmans · 2013-01-18T16:57:50Z

-1 on overloading the meaning of 1-d arrays; I'd rather issue a warning (or even an exception) with an explanatory message when one is handed to fit.
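A minimal sketch of the "exception with an explanatory message" option. `check_2d` is a hypothetical validator written for this illustration, not an existing scikit-learn function, and the message wording is only a suggestion.

```python
import numpy as np


def check_2d(X):
    """Hypothetical validator: reject anything that is not 2d."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(
            "Expected a 2d array of shape (n_samples, n_features), got "
            "ndim=%d. Use X.reshape(-1, 1) for a single feature or "
            "X.reshape(1, -1) for a single sample." % X.ndim
        )
    return X


try:
    check_2d([1.0, 2.0, 0.0])
except ValueError as exc:
    message = str(exc)
    print(message)
```

The message names both possible reshapes, so the caller has to state which meaning is intended instead of the library guessing.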

kyleabeauchamp · 2013-01-18T18:22:18Z

So utils.check_arrays() doesn't actually check the shapes of arrays. The docstring is somewhat misleading: "Checks whether all objects in arrays have the same shape or length."

To me, I think what we need is to break check_arrays() into several sub-functions or maybe a class:

1. Check sparse / dense
2. Check dtype
3. Check first dimension
4. Check all dimensions

It's not clear how we would go about checking y versus x--in that case, we want to basically ignore point 4.
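One way the "first dimension" and "all dimensions" checks could be separated, as a rough sketch. Both function names are hypothetical and are not part of sklearn.utils; the sparse/dense and dtype checks are left out for brevity.

```python
import numpy as np


def check_first_dim(*arrays):
    """All arrays must agree on n_samples (the first dimension)."""
    lengths = {np.asarray(a).shape[0] for a in arrays}
    if len(lengths) > 1:
        raise ValueError("Inconsistent first dimension: %s" % sorted(lengths))


def check_all_dims(*arrays):
    """All arrays must have exactly the same shape."""
    shapes = {np.asarray(a).shape for a in arrays}
    if len(shapes) > 1:
        raise ValueError("Inconsistent shapes: %s" % sorted(shapes))


X = np.zeros((5, 3))
y = np.zeros(5)

check_first_dim(X, y)  # passes: both have 5 samples

try:
    check_all_dims(X, y)  # fails: (5, 3) versus (5,)
    shapes_match = True
except ValueError:
    shapes_match = False

print(shapes_match)
```

With separate functions, an X versus y comparison can call only the first-dimension check, while comparisons between feature matrices can call both.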

amueller · 2013-01-18T18:42:54Z

Number 4. is not done by check_arrays, and we don't do this anywhere, do we?

Interestingly, when you start a line with 4. github changes it to 1.. Way to go?!

kyleabeauchamp · 2013-01-19T22:51:29Z

To avoid taking over the original intent of this issue (MinMaxScaler), I created a new issue for us to discuss the ndim / shape stuff: #1597 .

amueller · 2013-01-23T23:14:56Z

thanks @kyleabeauchamp, sorry I have been super busy the last two days.

linbianxiaocao · 2013-11-13T19:53:50Z

Hi, I tried passing a 1D array and found that MinMaxScaler.fit does not support arrays of shape (N,). However, it works if the array has shape (N, 1) or (1, N).

For instance, the following code doesn't work:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    x1d = np.array([1., 2., 0.])  # shape (3,)
    MinMaxScaler().fit(x1d)       # raises an error

The error occurs because data_range in MinMaxScaler.fit is then a scalar float, which does not support indexing such as data_range[].

However, if we reshape x1d with either

    x1d = x1d.reshape(1, -1)

or

    x1d = x1d.reshape(-1, 1)

then MinMaxScaler().fit(x1d) raises no error.
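The scalar-versus-array difference described here can be reproduced with NumPy alone. In this sketch `axis0_range` is a hypothetical stand-in for how a per-feature data range might be computed, and the masked assignment is only an assumed example of an indexing step, since the exact expression is not given in the comment.

```python
import numpy as np

x1d = np.array([1.0, 2.0, 0.0])


def axis0_range(X):
    """Range along axis 0, as a stand-in for a per-feature data range."""
    return X.max(axis=0) - X.min(axis=0)


range_1d = axis0_range(x1d)                  # NumPy scalar, ndim 0
range_row = axis0_range(x1d.reshape(1, -1))  # array of shape (3,)
range_col = axis0_range(x1d.reshape(-1, 1))  # array of shape (1,)

# Hypothetical stand-in for the failing indexing step.
try:
    range_1d[range_1d == 0.0] = 1.0
    scalar_assignment_ok = True
except TypeError:
    scalar_assignment_ok = False

# The same masked assignment works on the arrays from 2d inputs.
range_row[range_row == 0.0] = 1.0

print(np.ndim(range_1d), range_row.shape, range_col.shape)
print(scalar_assignment_ok)
```

Both reshapes avoid the error, but they produce different statistics: three zero-range features for the row, one feature with range 2.0 for the column.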

So I think we could simply reshape the input if len(x1d.shape) == 1. I also think we should reshape it to a column vector of shape (N, 1) with x1d = x1d.reshape(-1, 1), so that the 1D input array is treated as N samples of a single feature. This makes more sense than reshaping to a row vector of shape (1, N), because we want to fit data that have a number of samples, not data that have only one sample.

I am new here and just wanted to start working on some easy issues :). I could be wrong, and I'd appreciate hearing your opinions.

GaelVaroquaux · 2013-11-13T21:48:57Z

> Hi, I tried inputting 1D array and found that function MinMax.fit does not support arrays if its shape is (N, ). However, it works for 1D array if its shape is (N, 1) or (1, N).

That's expected. X should be a 2D array of shape (n_samples, n_features).

mblondel · 2013-11-14T01:32:31Z

I'm closing the issue since reshaping X and y is not consistent (#1549 (comment)).

kyleabeauchamp mentioned this issue Jan 19, 2013

Check ndim of input arrays. (utils.check_arrays) #1597

Closed

amueller mentioned this issue Feb 13, 2013

Check consistency / correctness of 1d input and n-d input for all estimators #1678

Closed

mblondel closed this as completed Nov 14, 2013
