[MRG] Refactor accuracy score + Fix bugs with 1d array #1750
Conversation
I have some
@@ -33,6 +35,29 @@
###############################################################################
# General utilities
###############################################################################
def _is_1d(y):
    return np.size(y) == np.max(np.shape(y))
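For context, a quick sketch of how this check behaves on a few shapes (restating the expression from the diff above; expected results shown as comments):

import numpy as np

def _is_1d(y):
    # True when the array has at most one non-trivial axis.
    return np.size(y) == np.max(np.shape(y))

_is_1d(np.array([1, 2, 3, 4]))      # True  -- shape (4,)
_is_1d(np.array([[1, 2, 3, 4]]))    # True  -- row vector, shape (1, 4)
_is_1d(np.array([[1], [2], [3]]))   # True  -- column vector, shape (3, 1)
_is_1d(np.array([[1, 2], [3, 4]]))  # False -- genuinely 2d, shape (2, 2)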
Couldn't a higher n-dim array with shape 1 lead to weird broadcasting? Maybe check np.shape(y)[0]?
With np.shape(y)[0], it will fail to recognize a row vector as a 1d vector.
By a row vector, I mean something like np.array([[1, 2, 3, 4]]).
I thought np.array([[1, 2, 3, 4]]) == np.array([1, 2, 3, 4]) would result in weird broadcasting, but it doesn't. However, np.array([[0, 1, 2]]) == np.array([[0], [1], [2]]) does broadcast, and both of these inputs pass the check. Is this case handled correctly?
With np.max(np.shape(x)), this case is properly handled, but not with np.shape(x)[0].
In [3]: np.shape(np.array([[1, 2, 3, 4]]))[0]
Out[3]: 1
In [4]: np.max(np.shape(np.array([[1, 2, 3, 4]])))
Out[4]: 4
In [5]: np.shape(np.array([[0, 1, 2]]))[0]
Out[5]: 1
In [6]: np.max(np.shape(np.array([[0, 1, 2]])))
Out[6]: 3
Sorry, I was a bit unclear. Is this case handled correctly in accuracy_score?
So your check is whether there is only one dimension with shape != 1.
Checking just that might lead to surprising results:
In [4]: np.mean(np.array([[0, 1, 2]]) == np.array([[0], [1], [2]]))
Out[4]: 0.33333333333333331
I think that this case is checked in test_format_invariance_with_1d_vectors() for all metrics.
For the moment, this raises an error. I will investigate further, but I think that it can lead to ambiguity issues.
In [4]: import numpy as np
In [5]: accuracy_score(np.array([[0, 1, 2]]) , np.array([[0], [1], [2]]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-c70daa59450a> in <module>()
----> 1 accuracy_score(np.array([[0, 1, 2]]) , np.array([[0], [1], [2]]))
/home/ajoly/git/scikit-learn/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize)
932
933 """
--> 934 y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
935
936 # Compute accuracy for each possible representation
/home/ajoly/git/scikit-learn/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
192 if size != n_samples:
193 raise ValueError("Found array with dim %d. Expected %d"
--> 194 % (size, n_samples))
195
196 if not allow_lists or hasattr(array, "shape"):
ValueError: Found array with dim 3. Expected 1
ok great :)
The reason why I don't fix those problems is that mixing a 1d row vector with other formats is not allowed by check_arrays. Furthermore, it can lead to ambiguity issues (when n_samples=1) with the binary indicator format and binary classification.
# At the moment, these mixed representations aren't allowed
assert_raises(ValueError, metric, y1_1d, y2_row)
assert_raises(ValueError, metric, y1_row, y2_1d)
assert_raises(ValueError, metric, y1_list, y2_row)
assert_raises(ValueError, metric, y1_row, y2_list)
assert_raises(ValueError, metric, y1_column, y2_row)
assert_raises(ValueError, metric, y1_row, y2_column)
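To illustrate the n_samples=1 ambiguity mentioned above (a hypothetical example, not taken from the PR): with a single row vector, the number of samples depends on how the array is interpreted.

import numpy as np

# Hypothetical illustration: one row vector, two possible readings.
y_row = np.array([[0, 1]])

np.shape(y_row)[0]        # 1 -> one sample in binary indicator (multilabel) format
np.ravel(y_row).shape[0]  # 2 -> two samples if treated as a flattened binary vector

# Raising a ValueError for such mixed/ambiguous inputs avoids silently
# choosing one interpretation over the other.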
Awesome, thanks. 👍 or merge!
Should I put _is_1d() and _check_1d() in the validation module?
I thought about that. The
@amueller I think that I have taken your comments into account. I have also added some documentation and doctests for the two private utility functions.
Thanks. They were only minor anyway. Let's wait until someone else has had a look :) (probably a lot of people are busy with PyCon).
+1 for the code, but I'm not too sure about the parameter name
The keyword
@@ -33,6 +35,124 @@
###############################################################################
# General utilities
###############################################################################
def _is_1d(x):
    """ Return True if x can be considered as a 1d vector.
Cosmit: no leading whitespace on the first line of the docstring.
Excuse me, I thought you just introduced it. (I never use that option :) Well, +1 for merge then.
y2 : array-like

ravel : boolean, optional (default=False)
    If ``y1`` and ``y2`` are vectors and ``ravel`` is set to ``True``,
"are vectors": don't you mean "2D arrays or more"?
What I call a vector is anything that returns True for _is_1d.
Alright, I read the examples and understood afterward. The purpose of this helper is really non-intuitive to me, though. I need to re-read where it is actually used to understand the motivation. Maybe it could be made more explicit in the docstring.
I will add a comment about it. The main motivation was to be able to reshape, if needed, mixed vector representations. It is also used to correctly infer the number of samples with "row" vectors.
The function is indeed very specific but it is private and it is in the file it is used in, so I think that should be fine.
If you have a better way to describe what I call a vector, don't hesitate!! :-)
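For readers following this thread, a minimal sketch of what such a reshaping helper could look like (the name _check_1d_array and the exact behavior are assumptions based on the discussion, not the PR's actual code):

import numpy as np

def _is_1d(y):
    # At most one non-trivial axis, as discussed above.
    return np.size(y) == np.max(np.shape(y))

def _check_1d_array(y1, y2, ravel=False):
    # Hypothetical sketch: verify both inputs are 1d-like vectors with the
    # same number of samples and, if requested, flatten them so that mixed
    # row/column representations compare element-wise without broadcasting.
    if not (_is_1d(y1) and _is_1d(y2)):
        raise ValueError("Expected 1d-like arrays.")
    if np.size(y1) != np.size(y2):
        raise ValueError("Inconsistent number of samples.")
    if ravel:
        return np.ravel(y1), np.ravel(y2)
    return y1, y2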
I have taken your remarks into account, which led to simplifying everything.
This behavior is now implemented. I have also tried to clarify the documentation.
Very nice. Nice docstrings too. All looks good to me. +1
Since people agree that it could be merged and @jaquesgrobler is OK with the new version, I would like to squash everything and push to master. I would like to have a last +1 and then end this PR.
Go ahead!
👍
Oops, hit the close button by accident. Go ahead 😊
Waiting for Travis, then I will merge into the trunk.
Merged by rebase! Thanks for the reviews :-)
This pull request intends to reduce the amount of redundant code between accuracy_score and zero_one_loss (see issue #1748). In order to achieve this, a normalize option has been added to accuracy_score. Furthermore, mixed representations of 1d vectors are now properly handled by all the metrics that compare y_true to y_pred.
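As an illustration of the normalize keyword described above (a small usage sketch; the arrays are just made-up examples):

from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

# Default behavior: fraction of correctly classified samples.
accuracy_score(y_true, y_pred)                   # 0.5

# normalize=False: raw count of correctly classified samples
# instead of the fraction.
accuracy_score(y_true, y_pred, normalize=False)  # 2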