
[MRG] Remove _check_1d_array and _is_1d private functions #2002


Closed
wants to merge 14 commits

Conversation

arjoly
Member

@arjoly arjoly commented May 26, 2013

Improve the test coverage and formally test these utilities.

I have also improved the ValueError message.

edit: Thanks to @GaelVaroquaux, I ended up removing those two.

y1 : array-like,
    y1 must be a "vector".
y_true : array-like,
    y_true must be a "vector".
Member

What do you mean by vector? A 1D-array?

I don't like the terminology "vector": a matrix is a vector in a matrix vector space, and the same goes for an image. In the parameter description, we usually specify the characteristics (shape, possibly the type of values, e.g. integers or floats). Also, we give the intuitive meaning of the parameter, if there is any.

@GaelVaroquaux
Member

@arjoly: I am not too enthusiastic about what I am seeing in the metrics.py file. I can only blame myself, as I should have reviewed the previous PRs.

I don't see the use case for the _is_1d function. If it is applied to ndarrays, all there is to do is test 'a.ndim == 1'. The value might be in operating on lists, but I don't really see why this is useful, as the internals of the _is_1d function will create temporary arrays from these lists (for instance in the call to np.shape).

Ah! I think that I got it!! You want to cater for n-D arrays that are 'flat'. OK, you cannot call a function like this, as in numpy speak a 1D array is an array that has only one dimension. Such naming will induce a lot of confusion for hard-core numpy users like me (also known as 'old farts'?).

Do we really need to cater for this? It adds a lot of complexity for not much. Can we not just use np.squeeze on the input array at some point, and stop worrying about all this? The code is probably designed like this for a reason, but it is hard to follow: the comments say 'handle mix 1d representation', but I don't see where this mixed representation is defined, and as it is not in the docstrings, the user won't find it. It seems to me that you could get rid of _check_1d_array and replace it by a call to np.squeeze on the arrays. What do you think?
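
To make the suggested simplification concrete, here is a minimal sketch of a squeeze-then-check helper (the function name and error message are illustrative, not the actual scikit-learn internals):

```python
import numpy as np

def _coerce_to_1d(y):
    # Illustrative sketch: squeeze away singleton dimensions, then make sure
    # the result really is one-dimensional, instead of keeping a dedicated
    # _is_1d / _check_1d_array pair.
    y = np.atleast_1d(np.squeeze(np.asarray(y)))  # atleast_1d guards against 0-d scalars
    if y.ndim != 1:
        raise ValueError("Expected a 1d array or a row/column vector, "
                         "got an array of shape %r" % (y.shape,))
    return y
```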

@arjoly
Member Author

arjoly commented May 26, 2013

Do we really need to cater for this? It adds a lot of complexity for not much. Can we not just use np.squeeze on the input array at some point, and stop worrying about all this? The code is probably designed like this for a reason, but it is hard to follow: the comments say 'handle mix 1d representation', but I don't see where this mixed representation is defined, and as it is not in the docstrings, the user won't find it. It seems to me that you could get rid of _check_1d_array and replace it by a call to np.squeeze on the arrays. What do you think?

Thanks, this seems exactly what I want!

@arjoly
Member Author

arjoly commented May 26, 2013

Thanks a lot @GaelVaroquaux for your input.
I highly appreciate that you took the time to review.

@GaelVaroquaux
Member

I think that I would use the squeeze + atleast_1d code rather than the ravel, as it is more explicit and will probably give errors that are easier to diagnose if people pass in stupid things.

Apart from this, I say 👍 to merge. The code seems much simplified to me. Thanks a lot for being so receptive to my remarks. It's an absolute pleasure to work with people like you.
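
A small numpy-only illustration of why squeeze-based checking gives errors that are easier to diagnose than ravel (nothing scikit-learn specific here):

```python
import numpy as np

y_bad = np.arange(6).reshape(2, 3)  # genuinely 2d: neither a row nor a column vector

# ravel silently flattens the mistake into a length-6 "target" vector:
print(np.ravel(y_bad).shape)                   # (6,)

# squeeze leaves the shape untouched, so a later ndim check can reject the
# input with a clear error instead of computing a score on the wrong data:
print(np.atleast_1d(np.squeeze(y_bad)).shape)  # (2, 3)
```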

@jnothman
Member

I haven't taken a look in detail at the PR... I agree that squeeze followed by asserting that ndim == 1 is more appropriate than ravel, but I regularly forget that the former exists.

I have wanted to clarify: what is the use case for accepting 2d row/column vectors in metrics? (I could easily understand it if it were the output of a slice over a sparse matrix, but we don't handle sparse here.) And are we only interested in 1-2d?

@jnothman
Member

LGTM.

@jnothman
Member

Actually, should we not be verifying that the output of squeeze is (precisely, not at least) 1d?

@arjoly
Member Author

arjoly commented May 27, 2013

I have wanted to clarify: what is the use case for accepting 2d row/column vectors in metrics? (I could easily understand it if it were the output of a slice over a sparse matrix, but we don't handle sparse here.)

Mixing a 2d row vector with a column vector / list / "1d" vector raises an error, since they are considered to have a different number of samples.

From a user perspective, it would be an unpleasant experience to discover that using a column vector could give a different value. Many estimators work correctly with a 2d column vector, e.g. see DecisionTreeClassifier.

And are we only interested in 1-2d?

Everything works with 1-2d at the moment. I have seen a PR that adds support for 3d numpy arrays. However, as far as I know, it hasn't been merged yet.
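
For illustration, this is the column-vector situation being discussed, in plain numpy (a hand-rolled accuracy rather than the actual metrics code): after squeezing, a column vector is indistinguishable from the 1d form, so a metric evaluated on the squeezed arrays gives the same value either way.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])
y_pred_col = np.array([[0], [1], [0], [0]])  # 2d column vector, shape (4, 1)

y_pred = np.squeeze(y_pred_col)
print(y_pred.shape)               # (4,)
print(np.mean(y_true == y_pred))  # plain accuracy: 0.75, same as with a 1d y_pred
```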

@jnothman
Member

Many estimators work correctly with a 2d column vector

Okay. Then I don't understand why we support row vectors as targets at all, rather than just banning mixing them with other formats.

Everything works with 1-2d at the moment. I have seen a PR that adds support for 3d numpy arrays. However, as far as I know, it hasn't been merged yet.

What I meant is whether a 3d array with two 1s in its shape might be meaningfully passed to a metric and squeezed to 1d.

It still seems to me you need to confirm that squeeze's output is 1d in metrics.

@GaelVaroquaux
Member

From a user perspective, it would be an unpleasant experience to discover that using a column vector would give a different value. Many estimators work correctly with a 2d column vector, e.g. see DecisionTreeClassifier.

I disagree there. I don't think that being transparent to a transposition is a good idea. The docs should be pretty clear that the first dimension is the number of samples. Trying to cater for this will make our codebase more complex, induce undefined behavior and make it harder to catch bugs both in our codebase and in the user's.

@jnothman
Member

I don't think that being transparent to a transposition is a good idea.

A column should be okay then, but not a row?

@arjoly
Member Author

arjoly commented May 27, 2013

OK, mixing row vectors with other formats needs to raise a ValueError.

Except when it is meaningful, as with the multilabel binary indicator format or multi-output regression.

@jnothman
Member

Any row vector at all should raise a ValueError, it seems.

I think we should be explicitly validating for: y.ndim == 1 or (y.ndim == 2 and y.shape[1] == 1) (if, as you say, we must handle the column vector case). Anything else should raise a ValueError, and the latter should be squeezed.
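
A minimal sketch of the validation proposed here (the helper name is illustrative, not the code that was merged): accept a 1d array or an explicit column vector, squeeze the latter, and raise a ValueError for everything else.

```python
import numpy as np

def _check_target(y):
    # Hypothetical helper implementing the check described above.
    y = np.asarray(y)
    if y.ndim == 1:
        return y
    if y.ndim == 2 and y.shape[1] == 1:
        return np.squeeze(y, axis=1)  # column vector -> 1d
    raise ValueError("Expected a 1d array or a column vector, "
                     "got an array of shape %r" % (y.shape,))
```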

@GaelVaroquaux
Member

I think we should be explicitly validating for: y.ndim == 1 or (y.ndim == 2 and y.shape[1] == 1)

Yes, this looks right to me. Thanks

@jnothman
Member

(And if in the future we find reason to extend it, that's better than having it over-permissive now.)

@jnothman
Member

Oy. I get mentioned in a commit. I personally think that type of checking would be better refactored, which is why _check_1d_array existed, but yes, that's functionally much better. But you shouldn't need np.atleast_1d after that check.

@jnothman
Member

The shape check is missing from a number of the metrics. Is that intentional?

@arjoly
Member Author

arjoly commented May 27, 2013

With those metrics, a row vector means that it is either a multioutput output or a multilabel indicator output.

@jnothman
Member

With those metrics, a row vector means that it is either a multioutput output or a multilabel indicator output.

By the time you get to squeeze you must have decided this is not the case, or it would not be safe to squeeze.

Also, is_label_indicator_matrix specifically requires y to have more than one sample. Whether that's right or not is a different matter.

@arjoly
Member Author

arjoly commented May 27, 2013

@jnothman I reverted the definition of is_label_indicator_matrix.

y1, y2 = check_arrays(y1, y2, allow_lists=True)

if not is_multilabel(y1):
    y1, y2 = check_arrays(y1, y2)
Member

This is repeated.

@arjoly
Member Author

arjoly commented Jul 10, 2013

ok good, I will use it!

@GaelVaroquaux
Member

@jnothman : happy with this version? If you are, I think that we can merge.

@jnothman
Member

happy with this version?

Quite :)

Sorry, I won't be very active here for the next few weeks... poor timing, I know!

Go ahead, merge already!

@GaelVaroquaux
Member

Go ahead, merge already!

I am in airports, in between countries. I don't want to attempt a merge, especially since master has gone haywire on 2.6. I'll get to it later, maybe in a few days, if no one beats me to it.

@arjoly
Member Author

arjoly commented Jul 15, 2013

Rebased on top of master.

@GaelVaroquaux
Member

I'll be looking at this now.

@arjoly
Member Author

arjoly commented Jul 22, 2013

Cool!

@GaelVaroquaux
Member

Merged by rebase. Thanks!

Also, removed the '^' notation for sets: explicit is better than implicit.
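
For context, the '^' notation presumably refers to Python's set operator for symmetric difference; the explicit method call spells out the same operation:

```python
a, b = {0, 1, 2}, {1, 2, 3}
print(a ^ b)                      # {0, 3}
print(a.symmetric_difference(b))  # {0, 3}, the explicit spelling of the same thing
```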

@arjoly
Member Author

arjoly commented Jul 22, 2013

OK, thanks!!

@arjoly arjoly closed this Jul 22, 2013
@arjoly
Member Author

arjoly commented Jul 22, 2013

Can you point me to the commits?

@arjoly arjoly reopened this Jul 22, 2013
@arjoly
Member Author

arjoly commented Jul 22, 2013

Now it's good!!!

@arjoly arjoly closed this Jul 22, 2013
@arjoly arjoly deleted the tst-metric-utilities branch July 22, 2013 14:18