[MRG] Remove _check_1d_array and _is_1d private function #2002
Conversation
    y1 : array-like,
        y1 must be a "vector".
    y_true : array-like,
        y_true must be a "vector".
What do you mean by vector? A 1D-array?
I don't like the terminology "vector": a matrix is a vector of a matrix vector-space, same thing for an image. In the parameter description, we usually specify the characteristics (shape, possibly type of values, e.g. integers or floats). Also, we give the intuitive meaning of the parameter, if there is any.
@arjoly : I am not too enthusiastic about what I am seeing in the metrics.py file. I can only blame myself, as I should have done the reviews of the previous PRs. I don't see the use case of the _is_1d function. If it is applied on ndarrays, all there is to do is test 'a.ndim == 1'. The value might be to operate on lists, but I don't really see why this is useful, as the internals of the _is_1d function will create temporary arrays from these lists (for instance in the call to np.shape).

Ah! I think that I got it!! You want to cater for n-D arrays that are 'flat'. OK, you cannot call a function like this, as in numpy speak a 1D array is an array that has only one dimension. Such naming will induce a lot of confusion in hard-core numpy users like me (also known as 'old farts'?). Do we really need to cater for this? It adds a lot of complexity for not much. Can we not just use np.squeeze on the input array at some point, and stop worrying about all this?

The code is probably designed like this for a reason, but it is hard to follow: the comments say 'handle mix 1d representation', but I don't see where this mixed representation is defined, and as it is not in the docstrings, the user won't find it. It seems to me that you could get rid of _check_1d_array and replace it with a call to np.squeeze on the arrays. What do you think?
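As a minimal illustration of the np.squeeze approach suggested here (this is not the PR's actual code, just a sketch of the behaviour under discussion):

```python
import numpy as np

# "Flat" 2D arrays: one meaningful axis plus a length-1 axis.
col = np.array([[1], [2], [3]])   # column vector, shape (3, 1)
row = np.array([[1, 2, 3]])       # row vector, shape (1, 3)

# np.squeeze drops every length-1 axis, yielding a true 1D array.
assert np.squeeze(col).shape == (3,)
assert np.squeeze(row).shape == (3,)

# Corner case: squeezing shape (1, 1) gives a 0-d array;
# np.atleast_1d promotes it back to shape (1,).
one = np.array([[5]])
assert np.squeeze(one).shape == ()
assert np.atleast_1d(np.squeeze(one)).shape == (1,)
```

The 0-d corner case is why the squeeze + atleast_1d combination comes up later in the thread.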
Thanks, this seems exactly what I want!
Thanks a lot @GaelVaroquaux for your input.
I think that I would use the squeeze + atleast_1d code rather than the ravel, as it is more explicit and will probably give errors that are easier to diagnose if people pass in stupid things. Apart from this, I say 👍 to merge. The code seems much simplified to me. Thanks a lot for being so receptive to my remarks. It's an absolute pleasure to work with people like you.
I haven't taken a look in detail at the PR... I agree that I have wanted to clarify: what is the use case for accepting 2d row/column vectors in metrics? (I could easily understand it if it were the output of a slice over a sparse matrix, but we don't handle sparse here.) And are we only interested in 1-2d?
LGTM.
Actually, should we not be verifying that the output of squeeze is (precisely, not at least) 1d?
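The check being asked for could look something like this (a hypothetical helper named `check_vector` for illustration, not code from the PR):

```python
import numpy as np

def check_vector(y):
    # Hypothetical helper: squeeze the input, then insist on exactly one axis.
    y = np.atleast_1d(np.squeeze(np.asarray(y)))
    if y.ndim != 1:
        raise ValueError("expected a 1D array after squeezing, "
                         "got shape %r" % (y.shape,))
    return y

check_vector([[1], [2]])          # column vector is accepted: array([1, 2])
# check_vector(np.zeros((2, 2)))  # genuinely 2D input raises ValueError
```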
The mix of one 2d row vector and a column / list / "1d" vector raises an error, since it is considered that they have a different number of samples. From a user perspective, it would be an unpleasant experience to discover that using a column vector could give a different value. Many estimators work correctly with a 2d column vector, e.g. see
Everything works with 1-2d at the moment. I have seen a PR to work with 3-d numpy arrays. However, as far as I know, this hasn't been merged yet.
Okay. I don't then understand why we support row vectors as targets at all, rather than just banning their mix with other formats.
What I meant is whether a 3d array with two 1s in its shape might be meaningfully passed to a metric and squeezed to 1d. It still seems to me you need to confirm that
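The 3d case raised here is real: a bare np.squeeze flattens such inputs too, which is exactly why the precise-1d check matters (a sketch, not code from the PR):

```python
import numpy as np

# A 3D array whose shape contains two length-1 axes also squeezes to 1D,
# so np.squeeze alone would silently accept it; only an explicit ndim
# check after squeezing can reject inputs like this.
a = np.arange(4).reshape(1, 4, 1)
assert a.ndim == 3
assert np.squeeze(a).shape == (4,)
```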
I disagree there. I don't think that being transparent to a transposition
A column should be okay then, but not a row?
Ok, a mix of row vectors needs to raise a ValueError, except when it is meaningful with multilabel binary indicator format or multi-output regression.
A row vector at all should raise a
I think we should be explicitly validating for:
Yes, this looks right to me. Thanks
(And if in the future we find reason to extend it, that's better than having it over-permissive now.)
Oy. I get mentioned in a commit. I personally think that type of checking would be better refactored, which is why
The shape check is missing from a number of the metrics. Is that intentional?
With those metrics, a row vector means that it is either a multi-output output or a multilabel indicator output.
By the time you get to
Also,
@jnothman I revert the definition of
    y1, y2 = check_arrays(y1, y2, allow_lists=True)

    if not is_multilabel(y1):
        y1, y2 = check_arrays(y1, y2)
This is repeated.
ok good, I will use it!
@jnothman : happy with this version? If you are, I think that we can merge.
Quite :) Sorry, I won't be very active here for the next few weeks... poor timing, I know! Go ahead, merge already!
I am in airports, in between countries. I don't want to attempt a merge,
Rebase on top of master |
I'll be looking at this now.
cool!
Merged by rebase. Thanks! Also, removed the '^' notation for sets: explicit is better than implicit.
ok thanks!!
Can you point me to the commits?
Now it's good!!!
Improve the test coverage and formally test these utilities. I have also improved the ValueError message.
edit: Thanks to @GaelVaroquaux, I ended up removing those two.