`predict` fails for multioutput ensemble models with non-numeric DVs #12831

elsander · 2018-12-19T16:58:58Z

Description

Multioutput forest models assume that the dependent variables are numeric. Passing string DVs to `predict` raises the following error:

ValueError: could not convert string to float:

I'm going to take a stab at submitting a fix today, but I wanted to file an issue to document the problem in case I'm not able to finish a fix.

Steps/Code to Reproduce

I wrote a test based on ensemble/tests/test_forest:test_multioutput which currently fails:

def check_multioutput_string(name):
    # Check estimators on multi-output problems with string outputs.

    X_train = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1], [-2, 1],
               [-1, 1], [-1, 2], [2, -1], [1, -1], [1, -2]]
    y_train = [["red", "blue"], ["red", "blue"], ["red", "blue"], ["green", "green"],
               ["green", "green"], ["green", "green"], ["red", "purple"],
               ["red", "purple"], ["red", "purple"], ["green", "yellow"],
               ["green", "yellow"], ["green", "yellow"]]
    X_test = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
    y_test = [["red", "blue"], ["green", "green"], ["red", "purple"], ["green", "yellow"]]

    est = FOREST_ESTIMATORS[name](random_state=0, bootstrap=False)
    y_pred = est.fit(X_train, y_train).predict(X_test)
    assert_array_almost_equal(y_pred, y_test)

    if name in FOREST_CLASSIFIERS:
        with np.errstate(divide="ignore"):
            proba = est.predict_proba(X_test)
            assert_equal(len(proba), 2)
            assert_equal(proba[0].shape, (4, 2))
            assert_equal(proba[1].shape, (4, 4))

            log_proba = est.predict_log_proba(X_test)
            assert_equal(len(log_proba), 2)
            assert_equal(log_proba[0].shape, (4, 2))
            assert_equal(log_proba[1].shape, (4, 4))


@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
@pytest.mark.parametrize('name', FOREST_CLASSIFIERS_REGRESSORS)
def test_multioutput_string(name):
    check_multioutput_string(name)

Expected Results

No error is thrown, and `predict` runs for all multioutput ensemble models.

Actual Results

ValueError: could not convert string to float: <DV class>

Versions

I replicated this error using the current master branch of sklearn (0.21.dev0).

elsander · 2018-12-19T20:55:56Z

Is numeric-only an intentional limitation in this case? There are lines that explicitly cast to double (scikit-learn/sklearn/ensemble/forest.py, line 279 in e73acef):

y = np.ascontiguousarray(y, dtype=DOUBLE)

It's not an issue for single-output models, though.

amueller · 2018-12-19T21:31:39Z

Sorry, what do you mean by "DV"?

amueller · 2018-12-19T21:33:14Z

You're using the regressors:
FOREST_CLASSIFIERS_REGRESSORS

They are not supposed to work with classes, you want the classifiers.

Can you please provide a minimum self-contained example?

elsander · 2018-12-19T22:01:13Z

Ah, sorry. "DV" is "dependent variable".

Here's an example:

from sklearn.ensemble import RandomForestClassifier

# Two string-valued outputs per sample (multi-output classification).
X_train = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1], [-2, 1],
           [-1, 1], [-1, 2], [2, -1], [1, -1], [1, -2]]
y_train = [["red", "blue"], ["red", "blue"], ["red", "blue"], ["green", "green"],
           ["green", "green"], ["green", "green"], ["red", "purple"],
           ["red", "purple"], ["red", "purple"], ["green", "yellow"],
           ["green", "yellow"], ["green", "yellow"]]
X_test = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
est = RandomForestClassifier()
est.fit(X_train, y_train)
y_pred = est.predict(X_test)  # raises the ValueError shown below

Returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-a3b5313a012b> in <module>
----> 1 y_pred = est.predict(X_test)

~/repos/forks/scikit-learn/sklearn/ensemble/forest.py in predict(self, X)
    553                 predictions[:, k] = self.classes_[k].take(np.argmax(proba[k],
    554                                                                     axis=1),
--> 555                                                           axis=0)
    556
    557             return predictions

ValueError: could not convert string to float: 'green'
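
The traceback points at the assignment into `predictions`. Below is a minimal NumPy-only sketch of why that line can fail, assuming the buffer is allocated with NumPy's default float dtype. The `classes` and `proba` arrays are made-up stand-ins for `est.classes_` and the per-output probabilities, not values taken from a fitted estimator:

```python
import numpy as np

# Hypothetical stand-ins for est.classes_ and the per-output probabilities.
classes = [np.array(["green", "red"]),
           np.array(["blue", "green", "purple", "yellow"])]
proba = [np.array([[0.1, 0.9], [0.8, 0.2]]),
         np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]])]
n_samples, n_outputs = 2, 2

# Assumption: the prediction buffer uses the default float64 dtype.
float_buf = np.zeros((n_samples, n_outputs))
try:
    float_buf[:, 0] = classes[0].take(np.argmax(proba[0], axis=1), axis=0)
    float_failed = False
except ValueError:
    # String labels cannot be cast into a float array.
    float_failed = True

# One possible way around it: a buffer that can hold arbitrary labels.
label_buf = np.empty((n_samples, n_outputs), dtype=object)
for k in range(n_outputs):
    label_buf[:, k] = classes[k].take(np.argmax(proba[k], axis=1), axis=0)
```

Using `dtype=object` here is only an illustration of the idea; it is not necessarily the change made in the eventual fix.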

amueller · 2018-12-19T22:12:18Z

Thanks, this indeed looks like a bug.

The multi-output multi-class support is fairly untested tbh and I'm not a big fan of it (we don't implement score for that case!).
So even if we fix this it's likely you'll run into more issues. I have been arguing for removing this feature for a while (which is probably not what you wanted to hear ;)
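
Until string targets are handled, one possible workaround is to integer-encode each output column before fitting and decode the predictions afterwards. This is a rough sketch using only NumPy; `encode_targets` and `decode_targets` are hypothetical helper names, not scikit-learn functions:

```python
import numpy as np


def encode_targets(y):
    """Map each output column of string labels to integer codes."""
    y = np.asarray(y)
    classes = []
    codes = np.empty(y.shape, dtype=int)
    for k in range(y.shape[1]):
        # np.unique returns the sorted labels and, per row, the index into them.
        labels, inverse = np.unique(y[:, k], return_inverse=True)
        classes.append(labels)
        codes[:, k] = inverse
    return codes, classes


def decode_targets(codes, classes):
    """Map integer codes back to the original labels, column by column."""
    # Cast to int in case the predictions come back as floats.
    codes = np.asarray(codes, dtype=int)
    columns = [classes[k][codes[:, k]] for k in range(codes.shape[1])]
    return np.column_stack(columns)


y = [["red", "blue"], ["green", "green"], ["red", "purple"], ["green", "yellow"]]
codes, classes = encode_targets(y)
decoded = decode_targets(codes, classes)
```

The idea would be to fit on `codes` instead of the string targets and pass the output of `predict` through `decode_targets`. Note that `predict_proba` columns would then follow the sorted label order stored in `classes`.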

elsander · 2018-12-19T22:15:47Z

For what it's worth, I think this specific issue may be resolved with a one-line fix. For my use case, having predict and predict_proba work is all I need.

amueller · 2018-12-19T22:16:59Z

Feel free to submit a PR if you like

jnothman · 2018-12-20T10:29:05Z

I suspect this case is handled just fine in KNNClassifier and DecisionTreeClassifier, so we should probably handle it here... and add common tests.

amueller added the Bug label Dec 19, 2018

elsander mentioned this issue Dec 19, 2018

Fix predict method for multiclass multioutput ensemble models #12834

Merged

jnothman closed this as completed in #12834 Jan 2, 2019
