
[MRG] Remove _check_1d_array and _is_1d private functions #2002


Closed
wants to merge 14 commits

Conversation

arjoly
Member

@arjoly arjoly commented May 26, 2013

Improve the test coverage and formally test these utilities.

I have also improved the ValueError message.

edit: Thanks to @GaelVaroquaux, I ended up removing those two.

y1 : array-like,
    y1 must be a "vector".
y_true : array-like,
    y_true must be a "vector".
Member

What do you mean by vector? A 1D-array?

I don't like the terminology "vector": a matrix is a vector in a matrix vector space, and the same goes for an image. In the parameter description, we usually specify the characteristics (shape, possibly the type of values, e.g. integers or floats). Also, we give the intuitive meaning of the parameter, if there is any.

@GaelVaroquaux
Member

@arjoly: I am not too enthusiastic about what I am seeing in the metrics.py file. I can only blame myself, as I should have reviewed the previous PRs.

I don't see the use case for the _is_1d function. If it is applied to ndarrays, all there is to do is test 'a.ndim == 1'. The value might be in operating on lists, but I don't really see why this is useful, as the internals of the _is_1d function will create temporary arrays from these lists (for instance in the call to np.shape).

Ah! I think that I got it!! You want to cater for n-D arrays that are 'flat'. OK, you cannot call a function like this, as in numpy speak a 1D array is an array that has only one dimension. Such naming will induce a lot of confusion for hard-core numpy users like me (also known as 'old farts'?).

Do we really need to cater for this? It adds a lot of complexity for not much. Can we not just use np.squeeze on the input array at some point, and stop worrying about all this? The code is probably designed like this for a reason, but it is hard to follow: the comments say 'handle mix 1d representation', but I don't see where this mixed representation is defined, and as it is not in the docstrings, the user won't find it. It seems to me that you could get rid of _check_1d_array and replace it by a call to np.squeeze on the arrays. What do you think?
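
To make the suggested simplification concrete, here is a minimal sketch of a squeeze-then-check helper (the function name and error message are illustrative, not the actual scikit-learn internals):

```python
import numpy as np

def _coerce_to_1d(y):
    # Illustrative sketch: squeeze away singleton dimensions, then make sure
    # the result really is one-dimensional, instead of keeping a dedicated
    # _is_1d / _check_1d_array pair.
    y = np.atleast_1d(np.squeeze(np.asarray(y)))  # atleast_1d guards against 0-d scalars
    if y.ndim != 1:
        raise ValueError("Expected a 1d array or a row/column vector, "
                         "got an array of shape %r" % (y.shape,))
    return y
```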

@arjoly
Member Author

arjoly commented May 26, 2013

Do we really need to cater for this? It adds a lot of complexity for not much. Can we not just use np.squeeze on the input array at some point, and stop worrying about all this? The code is probably designed like this for a reason, but it is hard to follow: the comments say 'handle mix 1d representation', but I don't see where this mixed representation is defined, and as it is not in the docstrings, the user won't find it. It seems to me that you could get rid of _check_1d_array and replace it by a call to np.squeeze on the arrays. What do you think?

Thanks, this seems exactly what I want!

@arjoly
Member Author

arjoly commented May 26, 2013

Thanks a lot @GaelVaroquaux for your input.
I highly appreciate that you took the time to review.

@GaelVaroquaux
Member

I think that I would use the squeeze + atleast_1d code rather than the ravel, as it is more explicit and will probably give errors that are easier to diagnose if people pass in stupid things.

Apart from this, I say 👍 to merge. The code seems much simplified to me. Thanks a lot for being so receptive to my remarks. It's an absolute pleasure to work with people like you.
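
A small numpy-only illustration of why squeeze-based checking gives errors that are easier to diagnose than ravel (nothing scikit-learn specific here):

```python
import numpy as np

y_bad = np.arange(6).reshape(2, 3)  # genuinely 2d: neither a row nor a column vector

# ravel silently flattens the mistake into a length-6 "target" vector:
print(np.ravel(y_bad).shape)                   # (6,)

# squeeze leaves the shape untouched, so a later ndim check can reject the
# input with a clear error instead of computing a score on the wrong data:
print(np.atleast_1d(np.squeeze(y_bad)).shape)  # (2, 3)
```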

@jnothman
Member

I haven't taken a look in detail at the PR... I agree that squeeze followed by asserting that ndim == 1 is more appropriate than ravel, but I regularly forget that the former exists.

I have wanted to clarify: what is the use case for accepting 2d row/column vectors in metrics? (I could easily understand it if it were the output of a slice over a sparse matrix, but we don't handle sparse here.) And are we only interested in 1-2d?

@jnothman
Member

LGTM.

@jnothman
Member

Actually, should we not be verifying that the output of squeeze is (precisely, not at least) 1d?

@arjoly
Member Author

arjoly commented May 27, 2013

I have wanted to clarify: what is the use case for accepting 2d row/column vectors in metrics? (I could easily understand it if it were the output of a slice over a sparse matrix, but we don't handle sparse here.)

Mixing a 2d row vector with a column vector / list / "1d" vector raises an error, since they are considered to have a different number of samples.

From a user perspective, it would be an unpleasant experience to discover that using a column vector could give a different value. Many estimators work correctly with a 2d column vector, e.g. see DecisionTreeClassifier.

And are we only interested in 1-2d?

Everything works with 1-2d at the moment. I have seen a PR that adds support for 3d numpy arrays. However, as far as I know, it hasn't been merged yet.
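
For illustration, this is the column-vector situation being discussed, in plain numpy (a hand-rolled accuracy rather than the actual metrics code): after squeezing, a column vector is indistinguishable from the 1d form, so a metric evaluated on the squeezed arrays gives the same value either way.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])
y_pred_col = np.array([[0], [1], [0], [0]])  # 2d column vector, shape (4, 1)

y_pred = np.squeeze(y_pred_col)
print(y_pred.shape)               # (4,)
print(np.mean(y_true == y_pred))  # plain accuracy: 0.75, same as with a 1d y_pred
```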

@jnothman
Member

Many estimators work correctly with a 2d column vector

Okay. Then I don't understand why we support row vectors as targets at all, rather than just banning mixing them with other formats.

Everything works with 1-2d at the moment. I have seen a PR that adds support for 3d numpy arrays. However, as far as I know, it hasn't been merged yet.

What I meant is whether a 3d array with two 1s in its shape might be meaningfully passed to a metric and squeezed to 1d.

It still seems to me you need to confirm that squeeze's output is 1d in metrics.

@GaelVaroquaux
Member

From a user perspective, it would be an unpleasant experience to discover that using a column vector would give a different value. Many estimators work correctly with a 2d column vector, e.g. see DecisionTreeClassifier.

I disagree there. I don't think that being transparent to a transposition is a good idea. The docs should be pretty clear that the first dimension is the number of samples. Trying to cater for this will make our codebase more complex, induce undefined behavior and make it harder to catch bugs both in our codebase and in the user's.

@jnothman
Member

I don't think that being transparent to a transposition is a good idea.

A column should be okay then, but not a row?

@arjoly
Member Author

arjoly commented May 27, 2013

OK, mixing row vectors with other formats needs to raise a ValueError.

Except when it is meaningful, as with the multilabel binary indicator format or multi-output regression.

@jnothman
Member

Any row vector at all should raise a ValueError, it seems.

I think we should be explicitly validating for: y.ndim == 1 or (y.ndim == 2 and y.shape[1] == 1) (if, as you say, we must handle the column vector case). Anything else should raise a ValueError, and the latter should be squeezed.
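
A minimal sketch of the validation proposed here (the helper name is illustrative, not the code that was merged): accept a 1d array or an explicit column vector, squeeze the latter, and raise a ValueError for everything else.

```python
import numpy as np

def _check_target(y):
    # Hypothetical helper implementing the check described above.
    y = np.asarray(y)
    if y.ndim == 1:
        return y
    if y.ndim == 2 and y.shape[1] == 1:
        return np.squeeze(y, axis=1)  # column vector -> 1d
    raise ValueError("Expected a 1d array or a column vector, "
                     "got an array of shape %r" % (y.shape,))
```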

@GaelVaroquaux
Member

I think we should be explicitly validating for: y.ndim == 1 or (y.ndim == 2 and y.shape[1] == 1)

Yes, this looks right to me. Thanks

@jnothman
Member

(And if in the future we find reason to extend it, that's better than having it over-permissive now.)

@jnothman
Member

Oy. I get mentioned in a commit. I personally think that type of checking would be better refactored, which is why _check_1d_array existed, but yes, that's functionally much better. But you shouldn't need np.atleast_1d after that check.

@jnothman
Member

The shape check is missing from a number of the metrics. Is that intentional?

@arjoly
Member Author

arjoly commented May 27, 2013

With those metrics, a row vector means that it is either a multioutput output or a multilabel indicator output.

@jnothman
Member

With those metrics, a row vector means that it is either a multioutput output or a multilabel indicator output.

By the time you get to squeeze you must have decided this is not the case, or it would not be safe to squeeze.

Also, is_label_indicator_matrix specifically requires y to have more than one sample. Whether that's right or not is a different matter.

@arjoly
Member Author

arjoly commented May 27, 2013

@jnothman I reverted the definition of is_label_indicator_matrix.

y1, y2 = check_arrays(y1, y2, allow_lists=True)

if not is_multilabel(y1):
    y1, y2 = check_arrays(y1, y2)
Member

This is repeated.

@arjoly
Member Author

arjoly commented Jul 10, 2013

ok good, I will use it!

@GaelVaroquaux
Member

@jnothman : happy with this version? If you are, I think that we can merge.

@jnothman
Member

happy with this version?

Quite :)

Sorry, I won't be very active here for the next few weeks... poor timing, I know!

Go ahead, merge already!

@GaelVaroquaux
Member

Go ahead, merge already!

I am in airports, in between countries. I don't want to attempt a merge, especially since master has gone haywire on 2.6. I'll get to it later, maybe in a few days, if no one beats me to it.

@arjoly
Member Author

arjoly commented Jul 15, 2013

Rebased on top of master.

@GaelVaroquaux
Member

I'll be looking at this now.

@arjoly
Member Author

arjoly commented Jul 22, 2013

Cool!

@GaelVaroquaux
Member

Merged by rebase. Thanks!

Also, removed the '^' notation for sets: explicit is better than implicit.
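
For context, the '^' notation presumably refers to Python's set operator for symmetric difference; the explicit method call spells out the same operation:

```python
a, b = {0, 1, 2}, {1, 2, 3}
print(a ^ b)                      # {0, 3}
print(a.symmetric_difference(b))  # {0, 3}, the explicit spelling of the same thing
```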

@arjoly
Member Author

arjoly commented Jul 22, 2013

OK, thanks!!

@arjoly arjoly closed this Jul 22, 2013
@arjoly
Member Author

arjoly commented Jul 22, 2013

Can you point me to the commits?

@arjoly arjoly reopened this Jul 22, 2013
@arjoly
Member Author

arjoly commented Jul 22, 2013

Now it's good!!!

@arjoly arjoly closed this Jul 22, 2013
@arjoly arjoly deleted the tst-metric-utilities branch July 22, 2013 14:18