-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG + 2] make check_array convert object to float. #4057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG + 2] make check_array convert object to float. #4057
Conversation
2c54c5e
to
7d2224c
Compare
Heisenbug in LogisticRegresion.... |
It looks a bit like converting to float is not that big of an issue, as sting kernels already didn't work, see http://stackoverflow.com/questions/26391367/how-to-use-string-kernels-in-scikit-learn |
b018097
to
e9a998b
Compare
NaN/Inf error in |
@ogrisel if you have time, I'd love to get your input on this, as this is a somewhat major behavior change. |
@GaelVaroquaux might also have an opinion about this (and will hopefully forgive me from distracting him from grand proposals) |
e9a998b
to
8485f77
Compare
no objection on my side |
travis seems unstable again :-/ |
if dtype == "numeric": | ||
if getattr(array, "dtype", None) is np.dtype(object): | ||
# if input is object, convert to float. | ||
dtype = np.float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to ban the use of np.float
which is just an alias for Python float
. Instead we should use the more explicit and platform-independent np.float64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. FYI:
git grep "np.float" | grep -v float64 | grep -v float32 | grep py | wc -l
138
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, those other np.float
will have to be dealt with separately ;)
8485f77
to
a354b65
Compare
suggestion: when the kernel is an arbitrary callable, we don't enforce the input dtype to be numeric and keep it a pass-through ( |
We could do it like that, but currently we don't have a working example of that afaik. |
@@ -1143,10 +1142,12 @@ def feature_importances_(self): | |||
|
|||
def _validate_y(self, y): | |||
self.n_classes_ = 1 | |||
|
|||
if y.dtype is np.dtype(object): | |||
y = y.astype(np.float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more np.float
here.
I wonder if we should use |
Damn, must have been rebase issues. Should be fixed now. |
As this is a somewhat major change, but will fix several issues, I feel it would be good to get a couple more reviews.... |
Not sure if the OMP test failure is related. |
aa59c60
to
a23d978
Compare
LGTM but you'll need to rebase. |
@@ -1146,7 +1146,8 @@ def feature_importances_(self): | |||
|
|||
def _validate_y(self, y): | |||
self.n_classes_ = 1 | |||
|
|||
if y.dtype is np.dtype(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this logic is present in a classifier, it is problematic; strings should be valid as dtype object (e.g. this is default for a pandas.Series of string). If it's only relevant to regressors, it should probably not be in Base
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that I would prefer if the check was written as:
y.dtype.kind == 'O'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnothman this is how the code is factored. The regressor implementation is in Base, and overwritten in the classifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could move the regressor implementation from base to the regessor but that seems pretty unrelated.
This LGTM apart from those minor things. |
array = _ensure_sparse_format(array, accept_sparse, dtype, order, | ||
copy, force_all_finite) | ||
else: | ||
if ensure_2d: | ||
array = np.atleast_2d(array) | ||
if dtype == "numeric": | ||
if getattr(array, "dtype", None) is np.dtype(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here: I think that checking dtype.kind is more robust, in particular to future evolutions of numpy.
fix dtype check, add test. unfriend all multi-output estimators on facebook. try to fix what is happening to y (by doing nothing to y) make test work... Make everything accept object y or say "invalid label" fix multioutput linear models add test for sensible error message.
25214ad
to
3f10f94
Compare
rebased, used |
9e13310
to
657a657
Compare
657a657
to
96d4b3e
Compare
Should be good now. |
@ogrisel merge? |
[MRG + 2] make check_array convert object to float.
Merged! |
Thanks :) That should close some issues. |
think about people passing strings as objects to SVM with callable kernelNot too concerned about this any more. Maybe @larsmans can comment?The problem with y is that for regressors we might want to convert y to float, whereas for classifiers we map any unique objects to ints anyhow, so object is fine. Fixes #4006, #3142 and #4055.