
1d array support in MinMaxScaler #1549

Closed
mblondel opened this issue Jan 9, 2013 · 17 comments
Labels
Easy Well-defined and straightforward way to resolve Enhancement

Comments

@mblondel
Member

mblondel commented Jan 9, 2013

MinMaxScaler is useful for regression to scale y. For this reason, it would be nice to support 1d arrays in fit and transform.

@kyleabeauchamp
Contributor

So to me, the best way to do this is to:

  1. Check if np.rank(X) == 1
  2. Reshape X to have rank 2
  3. Apply fit
  4. Reshape the result back to rank 1.

Is there a more elegant way to deal with rank 1 vs rank n issues in Numpy?
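
A minimal sketch of the four steps above, using only NumPy. The helper names `apply_as_2d` and `minmax_columns` are hypothetical, `minmax_columns` is just a stand-in for a scaler, and the choice to reshape a 1-D input into a single column is an assumption. It uses `ndim` for the rank check.

```python
import numpy as np


def apply_as_2d(func, x):
    """Run a 2-D-only function on x, restoring a 1-D shape afterwards."""
    x = np.asarray(x, dtype=float)
    was_1d = x.ndim == 1      # step 1: detect a rank-1 input
    if was_1d:
        x = x.reshape(-1, 1)  # step 2: treat it as one column (assumption)
    out = func(x)             # step 3: apply the 2-D routine
    return out.ravel() if was_1d else out  # step 4: back to rank 1


def minmax_columns(X):
    """Stand-in for a scaler: rescale each column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (X - lo) / span


scaled = apply_as_2d(minmax_columns, [1.0, 2.0, 0.0])
```

Here `scaled` keeps the 1-D shape of the input, and 2-D inputs pass through unchanged.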

@kyleabeauchamp
Contributor

Another related question: does it make sense to give MinMaxScaler a Numpy array with ndim > 2?

Right now, this doesn't raise any error. I wonder if there might be "hidden logic" that assumes ndim = 2 in various preprocessing routines.
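
To illustrate how an array with ndim > 2 could slip through, here is a sketch that assumes the scaler computes its per-feature statistics with axis-0 reductions (the thread does not show the actual implementation). `require_2d` is a hypothetical guard, not a scikit-learn function.

```python
import numpy as np

X3 = np.arange(24, dtype=float).reshape(2, 3, 4)

# Reductions along axis 0 succeed on a 3-D array without complaint;
# the result simply has shape (3, 4) instead of (n_features,).
data_min = X3.min(axis=0)
data_range = X3.max(axis=0) - data_min


def require_2d(X, name="X"):
    """Hypothetical guard: accept only (n_samples, n_features) arrays."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(
            f"{name} must be 2-D (n_samples, n_features); got ndim={X.ndim}"
        )
    return X
```

An explicit check like this turns the silent case into an immediate, readable error.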

@amueller
Member

@kyleabeauchamp that is a good comment. I would have imagined that check_arrays tests whether ndim=2, but I am not actually sure it does.

@mblondel
Member Author

One thing that worries me in retrospect about this issue is that a 1d X and a 1d y need different reshapes (reshape(1, -1) for one and reshape(-1, 1) for the other). So if we do decide to implement this feature, we need to document clearly that 1d array support is for y, not X (or we could add a constructor option to specify the way to reshape).

@kyleabeauchamp
Contributor

I would have thought that a 1D X and a 1D y should be treated the same way. My thinking is that if you reshape a 1D y vector, you are effectively treating y as a single feature from a hypothetical feature matrix.

If you have a 1D X, my first guess would be the case of linear regression, where you have a single feature with several observations.

I think regardless, we might want to add tighter checking of ndim.

@mblondel
Member Author

Our utility function array2d (used throughout the scikit on X) behaves like this:

>>> from sklearn.utils.validation import array2d
>>> array2d([1,2,3]).shape
(1, 3)

So [1,2,3] is treated as a single 3-dimensional sample. It's useful when you want to call predict on a single sample, for instance.

For y, if we want to apply it to MinMaxScaler, I would think that we need to treat it as n samples from the same feature:

>>> np.array([1,2,3]).reshape(-1,1).shape
(3, 1)

Or maybe I'm just confused.
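
The difference between the two reshapes can be sketched with plain NumPy. `np.atleast_2d` is used here because it gives the same (1, 3) shape as the `array2d` call above, and `minmax_columns` is a hypothetical stand-in for column-wise min-max scaling, not the library's code.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])

as_one_sample = np.atleast_2d(y)   # shape (1, 3): one sample, three features
as_one_feature = y.reshape(-1, 1)  # shape (3, 1): three samples, one feature


def minmax_columns(X):
    """Rescale each column to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (X - lo) / span


row_scaled = minmax_columns(as_one_sample)   # each column holds a single value
col_scaled = minmax_columns(as_one_feature)  # the one column spans 1..3
```

With the row layout every column is constant, so nothing meaningful is scaled; only the column layout rescales the three values relative to each other.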

@amueller
Member

@mblondel I think you are right. Though the docs and tests on this issue (treating single column X) are not so great :-/

@mblondel
Member Author

I think that a single sample (1, n_features) is more useful for predict and n_samples one-dimensional samples (n_samples, 1) is more useful for fit (fitting only one sample is not very useful except for partial_fit). However, in terms of API, that would be a bit confusing.

@amueller
Member

We need to add tests that the treatment of 1D arrays is consistent everywhere, preferably before merging any code that changes the behavior.

@larsmans
Member

-1 on overloading the meaning of 1-d arrays; I'd rather issue a warning (or even an exception) with an explanatory message when one is handed to fit.
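
A sketch of what such a fit-time check might look like. `check_fit_input` and its `strict` flag are hypothetical names for illustration, and the message wording is an assumption.

```python
import warnings

import numpy as np


def check_fit_input(X, strict=True):
    """Hypothetical check that refuses to guess what a 1-D array means."""
    X = np.asarray(X)
    if X.ndim == 1:
        msg = (
            "Got a 1-D array of shape %r. Reshape it explicitly: "
            "X.reshape(-1, 1) for a single feature, or "
            "X.reshape(1, -1) for a single sample." % (X.shape,)
        )
        if strict:
            raise ValueError(msg)
        # Lenient mode: warn and hand the array back untouched.
        warnings.warn(msg, UserWarning)
    return X
```

Either way the caller is told which two interpretations exist and has to pick one explicitly.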

@kyleabeauchamp
Contributor

So utils.check_arrays() doesn't actually check the shapes of arrays. The docstring is somewhat misleading: "Checks whether all objects in arrays have the same shape or length."

I think we need to break check_arrays() into several sub-functions, or maybe a class:

  1. Check sparse / dense
  2. Check dtype
  3. Check first dimension
  4. Check all dimensions

It's not clear how we would go about checking y versus X; in that case, we basically want to ignore point 4.
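
One way the split might look, as a rough sketch. All four function names are hypothetical, and the sparse check is a duck-typed stand-in that assumes sparse matrices expose a `tocsr` method.

```python
import numpy as np


def is_sparse(a):
    # Point 1: duck-typed sparse check (assumption: sparse inputs have `tocsr`).
    return hasattr(a, "tocsr")


def check_dtype(a, dtype=np.float64):
    # Point 2: convert to a dense array of the requested dtype.
    return np.asarray(a, dtype=dtype)


def check_first_dimension(*arrays):
    # Point 3: every array must have the same number of samples.
    lengths = {np.shape(a)[0] for a in arrays}
    if len(lengths) > 1:
        raise ValueError("inconsistent first dimension: %s" % sorted(lengths))


def check_all_dimensions(*arrays):
    # Point 4: every array must have exactly the same shape.
    shapes = {np.shape(a) for a in arrays}
    if len(shapes) > 1:
        raise ValueError("inconsistent shapes: %s" % sorted(shapes))
```

For a pair such as X of shape (3, 2) and y of shape (3,), `check_first_dimension(X, y)` passes while `check_all_dimensions(X, y)` raises, which is why point 4 would be skipped when comparing X with y.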

@amueller
Member

Number 4 is not done by check_arrays, and we don't do this anywhere, do we?

Interestingly, when you start a line with "4.", GitHub changes it to "1.". Way to go?!

@kyleabeauchamp
Contributor

To avoid taking over the original intent of this issue (MinMaxScaler), I created a new issue for us to discuss the ndim / shape stuff: #1597.

@amueller
Member

Thanks @kyleabeauchamp, and sorry, I have been super busy the last two days.

@linbianxiaocao

Hi, I tried passing a 1D array and found that MinMaxScaler.fit does not support an array of shape (N, ). However, it works if the array has shape (N, 1) or (1, N).

For instance, the following code won't work:


    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    x1d = np.array([1., 2., 0.])
    MinMaxScaler().fit(x1d)  # fails: x1d has shape (3,)
                   
The error occurs because data_range in MinMaxScaler.fit ends up as a scalar float, which does not support indexing as data_range[].

However, if we reshape x1d with either
x1d = x1d.reshape(1, -1)
or x1d = x1d.reshape(-1, 1)
then MinMaxScaler().fit(x1d) raises no error.
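
The scalar-versus-array behaviour described above can be reproduced with NumPy alone. The exact statement inside fit is not quoted in this thread, so the masked item assignment below is only an illustrative guess at the kind of operation that fails.

```python
import numpy as np

x1d = np.array([1.0, 2.0, 0.0])

# Reducing a 1-D array along axis 0 leaves a 0-d NumPy scalar...
range_1d = x1d.max(axis=0) - x1d.min(axis=0)

# ...whereas the same reduction on a column vector keeps one entry per feature.
x_col = x1d.reshape(-1, 1)
range_col = x_col.max(axis=0) - x_col.min(axis=0)

try:
    range_1d[range_1d == 0.0] = 1.0  # item assignment on a scalar
    scalar_assignment_failed = False
except (TypeError, IndexError):
    scalar_assignment_failed = True

range_col[range_col == 0.0] = 1.0    # fine on a 1-D array
```

The reshaped input gives `range_col` a shape of (1,), so per-feature indexing works as the 2-D code path expects.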

So I think we could simply do a reshape if len(x1d.shape) == 1. I also think we should reshape it to a column vector (N, 1) with x1d = x1d.reshape(-1, 1). This way, the 1D input array is treated as N samples of one single feature. That makes more sense than reshaping to a row vector of shape (1, N), because we want to fit data that have a number of samples, not data that have only one sample.

I am new here and just wanted to start working on some easy issues :). I could be wrong, and I'd appreciate hearing your opinions.

@GaelVaroquaux
Member

Hi, I tried inputting 1D array and found that function MinMax.fit does not
support arrays if its shape is (N, ). However, it works for 1D array if its
shape is (N, 1) or (1, N).

That's expected. X should be a 2D array of shape (n_samples, n_features).

@mblondel
Member Author

I'm closing the issue since reshaping X and y is not consistent (#1549 (comment)).


6 participants