[MRG + 2] make check_array convert object to float. #4057

amueller · 2015-01-06T22:46:53Z

add tests
~~think about people passing strings as objects to SVM with callable kernel~~ Not too concerned about this any more. Maybe @larsmans can comment?
think about what to do with y

The problem with y is that for regressors we might want to convert y to float, whereas for classifiers we map any unique objects to ints anyhow, so object is fine. Fixes #4006, #3142 and #4055.
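To make the regressor/classifier distinction concrete, here is a minimal sketch in plain NumPy (not the scikit-learn implementation; the variable names are illustrative assumptions). A regression target stored as `object` is cast to a float dtype, while a classification target of arbitrary labels is mapped to integer codes and can stay `object`.

```python
import numpy as np

# Regression target that arrived with dtype=object (numeric values inside).
y_reg = np.array([1, 2.5, 3], dtype=object)
y_reg_float = y_reg.astype(np.float64)  # explicit cast to a numeric dtype

# Classification target: arbitrary sortable labels stored as objects.
y_clf = np.array(["spam", "ham", "spam", "eggs"], dtype=object)

# np.unique returns the sorted unique labels and, with return_inverse=True,
# the integer index of each sample's label within that sorted array.
classes, y_encoded = np.unique(y_clf, return_inverse=True)
```

For the classifier case no float conversion is needed, since only the integer codes are used downstream.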

amueller · 2015-01-07T20:01:23Z

Heisenbug in LogisticRegression....

amueller · 2015-01-07T20:41:53Z

It looks a bit like converting to float is not that big of an issue, as string kernels already didn't work, see http://stackoverflow.com/questions/26391367/how-to-use-string-kernels-in-scikit-learn
It would be better to fix the issue, but that seems independent of the PR. It is a bit hard to test for features that we don't know about.

amueller · 2015-01-15T20:58:07Z

NaN/Inf error in OrthogonalMatchingPursuitCV on Python2.6.... I don't really understand....

amueller · 2015-01-16T17:14:01Z

@ogrisel if you have time, I'd love to get your input on this, as this is a somewhat major behavior change.

amueller · 2015-01-16T17:18:27Z

@GaelVaroquaux might also have an opinion about this (and will hopefully forgive me for distracting him from grand proposals)

agramfort · 2015-01-17T08:25:24Z

no objection on my side

amueller · 2015-01-20T16:26:41Z

travis seems unstable again :-/

ogrisel · 2015-02-06T13:26:52Z

sklearn/utils/validation.py

+        if dtype == "numeric":
+            if getattr(array, "dtype", None) is np.dtype(object):
+                # if input is object, convert to float.
+                dtype = np.float


I would like to ban the use of np.float which is just an alias for Python float. Instead we should use the more explicit and platform-independent np.float64.

Sounds like a good idea.

Fixed. FYI:

git grep "np.float" | grep -v float64 | grep -v float32 | grep py | wc -l
138

Well, those other np.float will have to be dealt with separately ;)
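As a small illustration of the point about `np.float` in this review thread, the sketch below converts an object array using the explicit `np.float64` type. It avoids the `np.float` alias entirely (later NumPy releases removed it); the array contents are made up for the example.

```python
import numpy as np

# Numeric values stored in an object array.
X_obj = np.array([[1, 2], [3, 4.5]], dtype=object)

# Spell out the target precision instead of relying on an alias of the
# builtin float.
X = X_obj.astype(np.float64)

# The builtin float resolves to the same dtype; np.float64 just makes the
# width explicit in the source code.
same = np.dtype(float) == np.dtype(np.float64)
```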

ogrisel · 2015-02-06T13:37:18Z

> ~~think about people passing strings as objects to SVM with callable kernel~~ Not too concerned about this any more. Maybe @larsmans can comment?

suggestion: when the kernel is an arbitrary callable, we don't enforce the input dtype to be numeric and keep it a pass-through (dtype=None).
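A rough sketch of what this suggestion could look like, written against plain NumPy. `validate_for_kernel` is a hypothetical helper invented for illustration, not a scikit-learn function, and the real validation code would do considerably more.

```python
import numpy as np


def validate_for_kernel(X, kernel):
    """Hypothetical helper: coerce object input to float64 unless the
    kernel is an arbitrary callable, in which case X passes through."""
    X = np.asarray(X)
    if callable(kernel):
        # Pass-through (the dtype=None case): the callable may know how to
        # compare strings or other non-numeric objects.
        return X
    if X.dtype.kind == "O":
        # Built-in kernels need numbers, so convert object input to float.
        X = X.astype(np.float64)
    return X


def shared_prefix_kernel(a, b):
    # Toy callable kernel for the example: 1.0 if first characters match.
    return float(a[0] == b[0])


docs = np.array(["abc", "abd", "xyz"], dtype=object)
kept = validate_for_kernel(docs, shared_prefix_kernel)
converted = validate_for_kernel(np.array([1, 2, 3], dtype=object), "rbf")
```

With a callable kernel the string data keeps its `object` dtype, while the named kernel path yields a `float64` array.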

amueller · 2015-02-06T13:38:04Z

We could do it like that, but currently we don't have a working example of that afaik.

ogrisel · 2015-02-06T13:38:34Z

sklearn/ensemble/gradient_boosting.py

@@ -1143,10 +1142,12 @@ def feature_importances_(self):

    def _validate_y(self, y):
        self.n_classes_ = 1
-
+        if y.dtype is np.dtype(object):
+            y = y.astype(np.float)


One more np.float here.

ogrisel · 2015-02-06T13:50:41Z

This looks good to me. Any more comments @jnothman @arjoly @larsmans?

ogrisel · 2015-02-06T13:52:22Z

I wonder if we should use y_dtype=None instead of y_numeric=False though. That would make it possible to implement finer control of the y_dtype precision level for some cython code. But maybe this is a YAGNI.

amueller · 2015-02-13T22:30:14Z

Damn, must have been rebase issues. Should be fixed now.

amueller · 2015-02-13T22:31:03Z

As this is a somewhat major change, but will fix several issues, I feel it would be good to get a couple more reviews....

amueller · 2015-02-14T22:54:48Z

Not sure if the OMP test failure is related.

agramfort · 2015-02-15T07:47:41Z

LGTM but you'll need to rebase.

jnothman · 2015-02-15T08:39:14Z

sklearn/ensemble/gradient_boosting.py

@@ -1146,7 +1146,8 @@ def feature_importances_(self):

    def _validate_y(self, y):
        self.n_classes_ = 1
-
+        if y.dtype is np.dtype(object):


If this logic is present in a classifier, it is problematic; strings should be valid as dtype object (e.g. this is default for a pandas.Series of string). If it's only relevant to regressors, it should probably not be in Base.

I think that I would prefer if the check was written as:

y.dtype.kind == 'O'

@jnothman this is how the code is factored. The regressor implementation is in Base, and overwritten in the classifier.

I could move the regressor implementation from base to the regressor but that seems pretty unrelated.

jnothman · 2015-02-15T09:37:47Z

This LGTM apart from those minor things.

GaelVaroquaux · 2015-02-15T11:19:22Z

sklearn/utils/validation.py

        array = _ensure_sparse_format(array, accept_sparse, dtype, order,
                                      copy, force_all_finite)
    else:
        if ensure_2d:
            array = np.atleast_2d(array)
+        if dtype == "numeric":
+            if getattr(array, "dtype", None) is np.dtype(object):


Same thing here: I think that checking dtype.kind is more robust, in particular to future evolutions of numpy.
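A short sketch of the `dtype.kind` check suggested in these review comments. `is_object_dtype` is a hypothetical helper name used only for this example; it relies on NumPy's one-character kind code, where `"O"` denotes Python objects.

```python
import numpy as np


def is_object_dtype(array):
    """Hypothetical helper: True if `array` has a NumPy object dtype."""
    dtype = getattr(array, "dtype", None)
    # Inputs such as plain lists have no dtype attribute at all.
    return dtype is not None and dtype.kind == "O"


mixed = np.array([1, "a", 2.0], dtype=object)
floats = np.array([1.0, 2.0])
plain_list = [1, 2, 3]

results = (
    is_object_dtype(mixed),
    is_object_dtype(floats),
    is_object_dtype(plain_list),
)
```

Comparing on `kind` avoids depending on dtype object identity, which is the robustness concern raised above.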

fix dtype check, add test. unfriend all multi-output estimators on facebook. try to fix what is happening to y (by doing nothing to y) make test work... Make everything accept object y or say "invalid label" fix multioutput linear models add test for sensible error message.

amueller · 2015-02-15T17:22:55Z

rebased, used dtype.kind == "O".

coveralls · 2015-02-18T22:28:45Z

Coverage increased (+0.0%) to 95.04% when pulling 96d4b3e on amueller:dtype_object_conversion into 3d86d1f on scikit-learn:master.

amueller · 2015-02-19T03:47:50Z

Should be good now.

amueller · 2015-02-22T23:57:40Z

@ogrisel merge?

[MRG + 2] make check_array convert object to float.

ogrisel · 2015-02-24T21:33:25Z

Merged!

amueller · 2015-02-24T21:37:39Z

Thanks :) That should close some issues.

amueller mentioned this pull request Jan 6, 2015

GMM bic crashes on dtype object #4055

Closed

amueller force-pushed the dtype_object_conversion branch 2 times, most recently from 2c54c5e to 7d2224c Compare January 7, 2015 00:20

amueller mentioned this pull request Jan 7, 2015

QDA may meet broken 2d array #4006

Closed

amueller changed the title ~~WIP make check_array convert object to float.~~ [MRG] make check_array convert object to float. Jan 9, 2015

amueller force-pushed the dtype_object_conversion branch 4 times, most recently from b018097 to e9a998b Compare January 15, 2015 16:11

amueller added this to the 0.16 milestone Jan 16, 2015

This was referenced Jan 16, 2015

Broken handling of numeric data with dtype=object #3617

Closed

DBScan Clustering of string data #3737

Closed

cross_val_score can't handle type 'float' when a scoring parameter is given #3616

Closed

amueller force-pushed the dtype_object_conversion branch from e9a998b to 8485f77 Compare January 16, 2015 22:47

ogrisel reviewed Feb 6, 2015

amueller force-pushed the dtype_object_conversion branch from 8485f77 to a354b65 Compare February 6, 2015 13:36

ogrisel reviewed Feb 6, 2015

amueller force-pushed the dtype_object_conversion branch from aa59c60 to a23d978 Compare February 14, 2015 22:55

jnothman reviewed Feb 15, 2015

GaelVaroquaux reviewed Feb 15, 2015

amueller force-pushed the dtype_object_conversion branch 2 times, most recently from 25214ad to 3f10f94 Compare February 15, 2015 17:21

amueller force-pushed the dtype_object_conversion branch 3 times, most recently from 9e13310 to 657a657 Compare February 18, 2015 00:51

more robust check for dtype object

96d4b3e

amueller force-pushed the dtype_object_conversion branch from 657a657 to 96d4b3e Compare February 18, 2015 22:12

amueller changed the title ~~[MRG] make check_array convert object to float.~~ [MRG + 2] make check_array convert object to float. Feb 19, 2015

ogrisel added a commit that referenced this pull request Feb 24, 2015

Merge pull request #4057 from amueller/dtype_object_conversion

3468e00

[MRG + 2] make check_array convert object to float.

ogrisel merged commit 3468e00 into scikit-learn:master Feb 24, 2015

amueller mentioned this pull request Feb 24, 2015

linear_model giving AttributeError: 'numpy.float64' object has no attribute 'exp' #3142

Closed

amueller deleted the dtype_object_conversion branch February 24, 2015 21:39

amueller mentioned this pull request Mar 18, 2015

Test fails in least_angle on master #4399

Closed

amueller mentioned this pull request May 7, 2015

Revisit dtype="numeric" #4685

Closed
