[MRG] Refactor accuracy score + Fix bugs with 1d array #1750
Conversation
I have some
@@ -33,6 +35,29 @@
###############################################################################
# General utilities
###############################################################################
def _is_1d(y):
    return np.size(y) == np.max(np.shape(y))
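For context, a quick sketch of how this check behaves on a few shapes (restating the expression from the diff above; expected results shown as comments):

import numpy as np

def _is_1d(y):
    # True when the array has at most one non-trivial axis.
    return np.size(y) == np.max(np.shape(y))

_is_1d(np.array([1, 2, 3, 4]))      # True  -- shape (4,)
_is_1d(np.array([[1, 2, 3, 4]]))    # True  -- row vector, shape (1, 4)
_is_1d(np.array([[1], [2], [3]]))   # True  -- column vector, shape (3, 1)
_is_1d(np.array([[1, 2], [3, 4]]))  # False -- genuinely 2d, shape (2, 2)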
Couldn't a higher n-dim array with shape 1 lead to weird broadcasting? Maybe check np.shape(y)[0]?
With np.shape(y)[0], it will fail to recognize a row vector as a 1d vector.
By a row vector, I mean something like np.array([[1, 2, 3, 4]]).
I thought np.array([[1, 2, 3, 4]]) == np.array([1, 2, 3, 4]) would result in weird broadcasting, but it doesn't. However, np.array([[0, 1, 2]]) == np.array([[0], [1], [2]]) does broadcast, and both of these inputs pass the check. Is this case handled correctly?
With np.max(np.shape(x)), this case is properly handled, but not with np.shape(x)[0].
In [3]: np.shape(np.array([[1, 2, 3, 4]]))[0]
Out[3]: 1
In [4]: np.max(np.shape(np.array([[1, 2, 3, 4]])))
Out[4]: 4
In [5]: np.shape(np.array([[0, 1, 2]]))[0]
Out[5]: 1
In [6]: np.max(np.shape(np.array([[0, 1, 2]])))
Out[6]: 3
Sorry, I was a bit unclear. Is this case handled correctly in accuracy_score?
So your check is whether there is only one dimension with shape != 1.
Checking just that might lead to surprising results:
In [4]: np.mean(np.array([[0, 1, 2]]) == np.array([[0], [1], [2]]))
Out[4]: 0.33333333333333331
I think that this case is checked in test_format_invariance_with_1d_vectors() for all metrics.
For the moment, this raises an error. I will investigate further, but I think that it can lead to ambiguity issues.
In [4]: import numpy as np
In [5]: accuracy_score(np.array([[0, 1, 2]]) , np.array([[0], [1], [2]]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-c70daa59450a> in <module>()
----> 1 accuracy_score(np.array([[0, 1, 2]]) , np.array([[0], [1], [2]]))
/home/ajoly/git/scikit-learn/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize)
932
933 """
--> 934 y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
935
936 # Compute accuracy for each possible representation
/home/ajoly/git/scikit-learn/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
192 if size != n_samples:
193 raise ValueError("Found array with dim %d. Expected %d"
--> 194 % (size, n_samples))
195
196 if not allow_lists or hasattr(array, "shape"):
ValueError: Found array with dim 3. Expected 1
ok great :)
The reason why I don't fix those problems is that mixing a 1d row vector with other formats is not allowed by check_arrays. Furthermore, it can lead to ambiguity issues (when n_samples=1) with the binary indicator format and binary classification.
# At the moment, these mixed representations aren't allowed
assert_raises(ValueError, metric, y1_1d, y2_row)
assert_raises(ValueError, metric, y1_row, y2_1d)
assert_raises(ValueError, metric, y1_list, y2_row)
assert_raises(ValueError, metric, y1_row, y2_list)
assert_raises(ValueError, metric, y1_column, y2_row)
assert_raises(ValueError, metric, y1_row, y2_column)
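To illustrate the n_samples=1 ambiguity mentioned above (a hypothetical example, not taken from the PR): with a single row vector, the number of samples depends on how the array is interpreted.

import numpy as np

# Hypothetical illustration: one row vector, two possible readings.
y_row = np.array([[0, 1]])

np.shape(y_row)[0]        # 1 -> one sample in binary indicator (multilabel) format
np.ravel(y_row).shape[0]  # 2 -> two samples if treated as a flattened binary vector

# Raising a ValueError for such mixed/ambiguous inputs avoids silently
# choosing one interpretation over the other.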
Awesome, thanks. 👍 or merge!
Should I put _is_1d() and _check_1d() in the validation module?
I thought about that. The
@amueller I think that I have taken your comments into account. I have also added some documentation and doctests for the two private utility functions.
Thanks. They were only minor anyway. Let's wait until someone else has had a look :) (probably a lot of people are busy with PyCon).
+1 for the code, but I'm not too sure about the parameter name
The keyword
@@ -33,6 +35,124 @@
###############################################################################
# General utilities
###############################################################################
def _is_1d(x):
    """ Return True if x can be considered as a 1d vector.
Cosmit: no leading whitespace on the first line of the docstring.
Excuse me, I thought you just introduced it. (I never use that option :) Well, +1 for merge then.
y2 : array-like

ravel : boolean, optional (default=False)
    If ``y1`` and ``y2`` are vectors and ``ravel`` is set to ``True``,
"are vectors": don't you mean "2D arrays or more"?
What I call a vector is anything that returns True for _is_1d.
Alright, I read the examples and understood afterward. The purpose of this helper is really non-intuitive to me, though. I need to re-read where it is actually used to understand the motivation. Maybe it could be made more explicit in the docstring.
I will add a comment about it. The main motivation was to be able to reshape, if needed, mixed vector representations. It is also used to correctly infer the number of samples with "row" vectors.
The function is indeed very specific but it is private and it is in the file it is used in, so I think that should be fine.
If you have a better way to describe what I call a vector, don't hesitate!! :-)
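For readers following this thread, a minimal sketch of what such a reshaping helper could look like (the name _check_1d_array and the exact behavior are assumptions based on the discussion, not the PR's actual code):

import numpy as np

def _is_1d(y):
    # At most one non-trivial axis, as discussed above.
    return np.size(y) == np.max(np.shape(y))

def _check_1d_array(y1, y2, ravel=False):
    # Hypothetical sketch: verify both inputs are 1d-like vectors with the
    # same number of samples and, if requested, flatten them so that mixed
    # row/column representations compare element-wise without broadcasting.
    if not (_is_1d(y1) and _is_1d(y2)):
        raise ValueError("Expected 1d-like arrays.")
    if np.size(y1) != np.size(y2):
        raise ValueError("Inconsistent number of samples.")
    if ravel:
        return np.ravel(y1), np.ravel(y2)
    return y1, y2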
I have taken your remarks into account, which led to simplifying everything.
This behavior is now implemented. I have also tried to clarify the documentation.
Very nice. Nice docstrings too. All looks good to me. +1
Since people agree that it could be merged and @jaquesgrobler is OK with the new version, I would like to squash everything and push to master. I would like to have a last +1 and then end this PR.
Go ahead!
👍
Oops, hit the close button by accident. Go ahead 😊
Waiting for Travis, then I will merge into the trunk.
Merged by rebase! Thanks for the reviews :-)
This pull request intends to reduce the amount of redundant code between accuracy_score and zero_one_loss (see issue #1748). In order to achieve this, a normalize option has been added to accuracy_score. Furthermore, mixed representations of 1d vectors are now properly handled by all the metrics that compare y_true to y_pred.
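As an illustration of the normalize keyword described above (a small usage sketch; the arrays are just made-up examples):

from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

# Default behavior: fraction of correctly classified samples.
accuracy_score(y_true, y_pred)                   # 0.5

# normalize=False: raw count of correctly classified samples
# instead of the fraction.
accuracy_score(y_true, y_pred, normalize=False)  # 2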