Skip to content

[MRG + 2] make check_array convert object to float. #4057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 24, 2015

Conversation

amueller
Copy link
Member

@amueller amueller commented Jan 6, 2015

  • add tests
  • think about people passing strings as objects to SVM with callable kernel Not too concerned about this any more. Maybe @larsmans can comment?
  • think about what to do with y

The problem with y is that for regressors we might want to convert y to float, whereas for classifiers we map any unique objects to ints anyhow, so object is fine. Fixes #4006, #3142 and #4055.

@amueller amueller force-pushed the dtype_object_conversion branch 2 times, most recently from 2c54c5e to 7d2224c Compare January 7, 2015 00:20
@amueller
Copy link
Member Author

amueller commented Jan 7, 2015

Heisenbug in LogisticRegresion....

@amueller
Copy link
Member Author

amueller commented Jan 7, 2015

It looks a bit like converting to float is not that big of an issue, as sting kernels already didn't work, see http://stackoverflow.com/questions/26391367/how-to-use-string-kernels-in-scikit-learn
It would be better to fix the issue, but that seems independent of the PR. It is a bit hard to test for features that we don't about.

@amueller amueller changed the title WIP make check_array convert object to float. [MRG] make check_array convert object to float. Jan 9, 2015
@amueller amueller force-pushed the dtype_object_conversion branch 4 times, most recently from b018097 to e9a998b Compare January 15, 2015 16:11
@amueller
Copy link
Member Author

NaN/Inf error in OrthogonalMatchingPursuitCV on Python2.6.... I don't really understand....

@amueller
Copy link
Member Author

@ogrisel if you have time, I'd love to get your input on this, as this is a somewhat major behavior change.

@amueller
Copy link
Member Author

@GaelVaroquaux might also have an opinion about this (and will hopefully forgive me from distracting him from grand proposals)

@agramfort
Copy link
Member

no objection on my side

@amueller
Copy link
Member Author

travis seems unstable again :-/

if dtype == "numeric":
if getattr(array, "dtype", None) is np.dtype(object):
# if input is object, convert to float.
dtype = np.float
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to ban the use of np.float which is just an alias for Python float. Instead we should use the more explicit and platform-independent np.float64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. FYI:

git grep "np.float" | grep -v float64 | grep -v float32 | grep py | wc -l
138

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, those other np.float will have to be dealt with separately ;)

@amueller amueller force-pushed the dtype_object_conversion branch from 8485f77 to a354b65 Compare February 6, 2015 13:36
@ogrisel
Copy link
Member

ogrisel commented Feb 6, 2015

think about people passing strings as objects to SVM with callable kernel Not too concerned about this any more. Maybe @larsmans can comment?

suggestion: when the kernel is an arbitrary callable, we don't enforce the input dtype to be numeric and keep it a pass-through (dtype=None).

@amueller
Copy link
Member Author

amueller commented Feb 6, 2015

We could do it like that, but currently we don't have a working example of that afaik.

@@ -1143,10 +1142,12 @@ def feature_importances_(self):

def _validate_y(self, y):
self.n_classes_ = 1

if y.dtype is np.dtype(object):
y = y.astype(np.float)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more np.float here.

@ogrisel
Copy link
Member

ogrisel commented Feb 6, 2015

This looks good to me. Any more comments @jnothman @arjoly @larsmans?

@ogrisel
Copy link
Member

ogrisel commented Feb 6, 2015

I wonder if we should use y_dtype=None instead of y_numeric=False though. That would make it possible to implement finer control of the y_dtype precision level for some cython code. But maybe this is a YAGNI.

@amueller
Copy link
Member Author

Damn, must have been rebase issues. Should be fixed now.

@amueller
Copy link
Member Author

As this is a somewhat major change, but will fix several issues, I feel it would be good to get a couple more reviews....

@amueller
Copy link
Member Author

Not sure if the OMP test failure is related.

@amueller amueller force-pushed the dtype_object_conversion branch from aa59c60 to a23d978 Compare February 14, 2015 22:55
@agramfort
Copy link
Member

LGTM but you'll need to rebase.

@@ -1146,7 +1146,8 @@ def feature_importances_(self):

def _validate_y(self, y):
self.n_classes_ = 1

if y.dtype is np.dtype(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this logic is present in a classifier, it is problematic; strings should be valid as dtype object (e.g. this is default for a pandas.Series of string). If it's only relevant to regressors, it should probably not be in Base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that I would prefer if the check was written as:

y.dtype.kind == 'O'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman this is how the code is factored. The regressor implementation is in Base, and overwritten in the classifier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could move the regressor implementation from base to the regessor but that seems pretty unrelated.

@jnothman
Copy link
Member

This LGTM apart from those minor things.

array = _ensure_sparse_format(array, accept_sparse, dtype, order,
copy, force_all_finite)
else:
if ensure_2d:
array = np.atleast_2d(array)
if dtype == "numeric":
if getattr(array, "dtype", None) is np.dtype(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here: I think that checking dtype.kind is more robust, in particular to future evolutions of numpy.

fix dtype check, add test. unfriend all multi-output estimators on facebook.

try to fix what is happening to y (by doing nothing to y)

make test work...

Make everything accept object y or say "invalid label"

fix multioutput linear models

add test for sensible error message.
@amueller amueller force-pushed the dtype_object_conversion branch 2 times, most recently from 25214ad to 3f10f94 Compare February 15, 2015 17:21
@amueller
Copy link
Member Author

rebased, used dtype.kind == "O".

@amueller amueller force-pushed the dtype_object_conversion branch 3 times, most recently from 9e13310 to 657a657 Compare February 18, 2015 00:51
@amueller amueller force-pushed the dtype_object_conversion branch from 657a657 to 96d4b3e Compare February 18, 2015 22:12
@coveralls
Copy link

Coverage Status

Coverage increased (+0.0%) to 95.04% when pulling 96d4b3e on amueller:dtype_object_conversion into 3d86d1f on scikit-learn:master.

@amueller
Copy link
Member Author

Should be good now.

@amueller amueller changed the title [MRG] make check_array convert object to float. [MRG + 2] make check_array convert object to float. Feb 19, 2015
@amueller
Copy link
Member Author

@ogrisel merge?

ogrisel added a commit that referenced this pull request Feb 24, 2015
[MRG + 2] make check_array convert object to float.
@ogrisel ogrisel merged commit 3468e00 into scikit-learn:master Feb 24, 2015
@ogrisel
Copy link
Member

ogrisel commented Feb 24, 2015

Merged!

@amueller
Copy link
Member Author

Thanks :) That should close some issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

QDA may meet broken 2d array
6 participants