[MRG] Refactor accuracy score + Fix bugs with 1d array #1750

Merged
merged 1 commit on Mar 20, 2013

Conversation


@arjoly arjoly commented Mar 7, 2013

This pull request intends to reduce the amount of redundant code between accuracy_score and zero_one_loss (see issue #1748). To achieve this, a normalize option has been added to accuracy_score.

Furthermore, mixed representations of 1d vectors are now properly handled by all the metrics that compare y_true to y_pred.
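For illustration, here is a minimal NumPy sketch of what the normalize option computes (not scikit-learn's actual implementation, which also validates its inputs with check_arrays):

```python
import numpy as np

def accuracy(y_true, y_pred, normalize=True):
    # Number of exactly matching predictions; with normalize=True,
    # return the fraction of correct predictions instead of the count.
    score = np.sum(np.asarray(y_true) == np.asarray(y_pred))
    return score / len(y_true) if normalize else int(score)

print(accuracy([0, 1, 2, 3], [0, 2, 2, 3]))                   # 0.75
print(accuracy([0, 1, 2, 3], [0, 2, 2, 3], normalize=False))  # 3
```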


arjoly commented Mar 7, 2013

I am getting ImportError: No module named fixes, but I don't think it comes from this patch.

@@ -33,6 +35,29 @@
###############################################################################
# General utilities
###############################################################################
def _is_1d(y):
return np.size(y) == np.max(np.shape(y))
Member

couldn't a higher-dimensional array with a size-1 axis lead to weird broadcasting?
Maybe check np.shape(y)[0]?

Member Author

With np.shape(y)[0], it will fail to recognize a row vector as a 1d vector.

Member Author

With a row vector, I mean something like np.array([[1, 2, 3, 4]]).

Member

I expected np.array([[1, 2, 3, 4]]) == np.array([1, 2, 3, 4]) to result in broadcasting, but it doesn't.
But np.array([[0, 1, 2]]) == np.array([[0], [1], [2]]) does, and both of these go through the check.

Is this case handled correctly?

Member Author

With np.max(np.shape(x)), this case is properly handled, but not with np.shape(x)[0].

In [3]: np.shape(np.array([[1, 2, 3, 4]]))[0]
Out[3]: 1

In [4]: np.max(np.shape(np.array([[1, 2, 3, 4]])))
Out[4]: 4

In [5]: np.shape(np.array([[0, 1, 2]]))[0]
Out[5]: 1

In [6]: np.max(np.shape(np.array([[0, 1, 2]])))
Out[6]: 3

Member

Sorry, I was a bit unclear. Is this handled correctly in the accuracy computation?
So your check is whether there is only one dimension with length != 1.
Checking just that might lead to surprising results:

In [4]: np.mean(np.array([[0, 1, 2]]) == np.array([[0], [1], [2]]))
Out[4]: 0.33333333333333331

Member Author

I think that this case is checked in test_format_invariance_with_1d_vectors() for all metrics.

Member Author

For the moment, this raises an error. I will investigate further.
But I think that it can lead to ambiguity issues.

In [4]: import numpy as np

In [5]: accuracy_score(np.array([[0, 1, 2]]) ,  np.array([[0], [1], [2]]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-c70daa59450a> in <module>()
----> 1 accuracy_score(np.array([[0, 1, 2]]) ,  np.array([[0], [1], [2]]))

/home/ajoly/git/scikit-learn/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize)
    932 
    933     """
--> 934     y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
    935 
    936     # Compute accuracy for each possible representation

/home/ajoly/git/scikit-learn/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
    192         if size != n_samples:
    193             raise ValueError("Found array with dim %d. Expected %d"
--> 194                              % (size, n_samples))
    195 
    196         if not allow_lists or hasattr(array, "shape"):

ValueError: Found array with dim 3. Expected 1

Member

ok great :)

Member Author

The reason I don't fix those problems is that mixing a 1d row vector with the other formats is not allowed by check_arrays.

Furthermore, it can lead to ambiguity issues (when n_samples=1) between the binary indicator format and binary classification.

# At the moment, these mix representations aren't allowed
assert_raises(ValueError, metric, y1_1d, y2_row)
assert_raises(ValueError, metric, y1_row, y2_1d)
assert_raises(ValueError, metric, y1_list, y2_row)
assert_raises(ValueError, metric, y1_row, y2_list)
assert_raises(ValueError, metric, y1_column, y2_row)
assert_raises(ValueError, metric, y1_row, y2_column)


amueller commented Mar 9, 2013

Awesome, thanks. 👍 or merge!


arjoly commented Mar 9, 2013

Should I put the _is_1d() and _check_1d in the validation module?
In that case, I would add some tests for those two functions.


amueller commented Mar 9, 2013

I thought about that. _check_1d might be a bit too specific. I think it is ok to leave it here for the moment.


arjoly commented Mar 10, 2013

@amueller I think that I have taken your comments into account. I have also added some docs and doctests for the two private utility functions.

@amueller

Thanks. They were only minor anyway. Let's wait until someone else has had a look :) (probably a lot of people are busy with PyCon).

@larsmans

+1 for the code, but I'm not too sure about the parameter name normalize. How about fraction?


arjoly commented Mar 10, 2013

> +1 for the code, but I'm not too sure about the parameter name normalize. How about fraction?

The keyword normalize has been introduced in the previous release.
Why do you want to rename it to fraction?

@@ -33,6 +35,124 @@
###############################################################################
# General utilities
###############################################################################
def _is_1d(x):
""" Return True if x can be considered as a 1d vector.
Member

Cosmit: no leading whitespace on the first line of the docstring.

@larsmans

Excuse me, I thought you just introduced it. (I never use that option :)

Well, +1 for merge then.

y2 : array-like

ravel : boolean, optional (default=False),
If ``y1`` and ``y2`` are vectors and ``ravel`` is set to ``True``,
Member

"are vectors": don't you mean "2D arrays or more"?

Member Author

What I call a vector is anything that returns True for _is_1d.

Member

Alright, I read the examples and understood afterward. The purpose of this helper is really non-intuitive to me, though. I need to re-read where it is actually used to understand the motivation. Maybe it could be made more explicit in the docstring.

Member Author

I will add a comment about it. The main motivation was to be able to reshape, if needed, mixes of vector representations. It is also used to correctly infer the number of samples with "row" vectors.

Member

The function is indeed very specific but it is private and it is in the file it is used in, so I think that should be fine.

Member Author

If you have a better way to describe what I call a vector, don't hesitate! :-)


arjoly commented Mar 12, 2013

I have taken your remarks into account, which led to simplifying everything.
Thanks :-)


arjoly commented Mar 12, 2013

> The name of the function would make the caller think that non-1D arrays should be invalid and I would have expected a ValueError myself.

This behavior is now implemented. I have also tried to clarify the doc.

@jaquesgrobler

Very nice. Nice docstrings too. All looks good to me +1


arjoly commented Mar 20, 2013

Since people agree that it could be merged and @jaquesgrobler is ok with the new version, I would like to squash everything and push to master.

I would like one last +1 before ending this PR.

@larsmans

Go ahead!

@jaquesgrobler

👍

@jaquesgrobler

oops.. close button by accident.. go ahead 😊


arjoly commented Mar 20, 2013

Waiting for Travis, then I will merge into the trunk.

@arjoly arjoly merged commit 5cc8032 into scikit-learn:master Mar 20, 2013

arjoly commented Mar 20, 2013

Merge by rebase! Thanks for the reviews :-)

@arjoly arjoly deleted the refactor-accuracy-zero branch March 20, 2013 10:38