
Check ndim of input arrays. (utils.check_arrays) #1597


Closed
kyleabeauchamp opened this issue Jan 19, 2013 · 10 comments


@kyleabeauchamp (Contributor)

As discussed in #1549, utils.check_arrays() does not check the shape or ndim of input arrays. However, the docstring suggests that it should: "Checks whether all objects in arrays have the same shape or length."

I think what we need is to break check_arrays() into several sub-functions, or maybe a class (a rough sketch follows the list):

  1. Check sparse / dense
  2. Check dtype
  3. Check first dimension
  4. Check all dimensions
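A hypothetical sketch of what that split could look like; none of these function names exist in sklearn.utils, and the error messages are illustrative:

```python
import numpy as np
import scipy.sparse as sp

def check_sparse_dense(X, allow_sparse=True):
    """1. Check sparse / dense."""
    if sp.issparse(X) and not allow_sparse:
        raise TypeError("This estimator does not support sparse input.")
    return X

def check_dtype(X, dtype=np.float64):
    """2. Check the dtype, converting if necessary."""
    return X if X.dtype == dtype else X.astype(dtype)

def check_first_dim(*arrays):
    """3. Check that all arrays agree on shape[0]."""
    lengths = set(a.shape[0] for a in arrays)
    if len(lengths) > 1:
        raise ValueError("Inconsistent first dimensions: %r" % sorted(lengths))

def check_all_dims(*arrays):
    """4. Check that all arrays have identical shapes."""
    shapes = set(a.shape for a in arrays)
    if len(shapes) > 1:
        raise ValueError("Inconsistent shapes: %r" % sorted(shapes))
```

check_arrays() itself could then compose whichever subset of these checks a given estimator needs.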
@amueller (Member)

The docstring is just a bit off: what it means is that they all have the same shape[0] or length.
It converts 1d input to 2d, but we don't check for ndim > 2, and we should.

What would be the benefit of having several functions?

@kyleabeauchamp (Contributor, Author)

It's not strictly necessary to split things into separate functions, but I was thinking one might want to check only certain subsets of (1, 2, 3, and 4). For example, some functions might not care whether you pass in a 1D or a 2D array, but they might still require that the first dimension be the same in both inputs. Another reason is readability / code reuse. Again, not strictly necessary.

As you said, I think the main point is to check for ndim.

@larsmans (Member)

+1 for breaking check_arrays up. I've been avoiding that function because I found its use complicated, esp. in estimators accepting sparse matrices. E.g. in LinearClassifierMixin I used atleast2d_or_csr and a custom check for n_features. (That's bad, because it produces a different error message than other estimators.)

@amueller (Member)

Why did you find it complicated to use?

@larsmans (Member)

It has a lot of options and the code is quite complex. It also doesn't do all the input validation required at predict time (n_features checking), so I use a simpler function there instead. atleast2d_or_csr does what its name suggests.
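For context, a minimal sketch of the kind of predict-time validation described here, assuming this era's sklearn.utils.atleast2d_or_csr helper (since removed from scikit-learn); the wrapper name and error wording are illustrative:

```python
from sklearn.utils import atleast2d_or_csr  # era-specific helper, now removed

def _validate_for_predict(X, n_features):
    # Convert X to a 2-d ndarray or CSR matrix, as the helper's name suggests.
    X = atleast2d_or_csr(X)
    if X.shape[1] != n_features:
        raise ValueError("X has %d features per sample; expecting %d"
                         % (X.shape[1], n_features))
    return X
```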

@amueller (Member)

... but it doesn't check for finite numbers, and it always enforces the same sparsity type...

I think if we split up the function, it will lead to code duplication and make it harder to keep the checks consistent. Feel free to try, though ;)

@hamsal (Contributor) commented Mar 11, 2014

Hi, I am new and trying to make a contribution, and I would like to work on a PR for this issue. Since the handling of ndim > 2 should be consistent across all estimators, I think check_arrays should raise an error along the lines of "max array dimension has been exceeded". Guidance would be much appreciated.

@larsmans (Member)

Hi @hamsal,

Indeed, you'd want to insert a check for ndim > 2 in check_arrays and raise a ValueError. Then add an appropriate unit test in sklearn/utils/tests/test_validation.py (or adapt an existing one), run the full test suite on your local clone to see whether anything fails, and send a PR if it doesn't (or even if it does, but you got close). A PR would be much appreciated!
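A hedged sketch of what that check and test might look like; the helper name, message wording, and test body are hypothetical (in the real change the check would live inside check_arrays' validation loop):

```python
import numpy as np

def _assert_at_most_2d(array):
    # Reject inputs with more than two dimensions.
    if hasattr(array, "ndim") and array.ndim > 2:
        raise ValueError("Found array with %d dimensions. Expected <= 2."
                         % array.ndim)

# In sklearn/utils/tests/test_validation.py, a test along these lines
# (scikit-learn used nose-style asserts at the time):
from nose.tools import assert_raises

def test_rejects_more_than_2d():
    X = np.zeros((3, 4, 5))  # a 3-d input should be refused
    assert_raises(ValueError, _assert_at_most_2d, X)
```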

@hamsal (Contributor) commented Mar 11, 2014

@larsmans Thanks for the note to include a unit test.

@larsmans (Member)

Closing as solved by #2958.
