predict fails for multioutput ensemble models with non-numeric DVs #12831

Closed
elsander opened this issue Dec 19, 2018 · 8 comments · Fixed by #12834

@elsander
Contributor

Description

Multioutput forest models assume that the dependent variables are numeric. Calling predict on a model fitted with string DVs raises the following error:

ValueError: could not convert string to float:

I'm going to take a stab at submitting a fix today, but I wanted to file an issue to document the problem in case I'm not able to finish a fix.

Steps/Code to Reproduce

I wrote a test based on ensemble/tests/test_forest:test_multioutput which currently fails:

def check_multioutput_string(name):
    # Check estimators on multi-output problems with string outputs.

    X_train = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1], [-2, 1],
               [-1, 1], [-1, 2], [2, -1], [1, -1], [1, -2]]
    y_train = [["red", "blue"], ["red", "blue"], ["red", "blue"], ["green", "green"],
               ["green", "green"], ["green", "green"], ["red", "purple"],
               ["red", "purple"], ["red", "purple"], ["green", "yellow"],
               ["green", "yellow"], ["green", "yellow"]]
    X_test = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
    y_test = [["red", "blue"], ["green", "green"], ["red", "purple"], ["green", "yellow"]]

    est = FOREST_ESTIMATORS[name](random_state=0, bootstrap=False)
    y_pred = est.fit(X_train, y_train).predict(X_test)
    # Exact comparison: the "almost equal" variant is meant for numeric arrays.
    assert_array_equal(y_pred, y_test)

    if name in FOREST_CLASSIFIERS:
        with np.errstate(divide="ignore"):
            proba = est.predict_proba(X_test)
            assert_equal(len(proba), 2)
            assert_equal(proba[0].shape, (4, 2))
            assert_equal(proba[1].shape, (4, 4))

            log_proba = est.predict_log_proba(X_test)
            assert_equal(len(log_proba), 2)
            assert_equal(log_proba[0].shape, (4, 2))
            assert_equal(log_proba[1].shape, (4, 4))


@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
@pytest.mark.parametrize('name', FOREST_CLASSIFIERS_REGRESSORS)
def test_multioutput_string(name):
    check_multioutput_string(name)

Expected Results

No error is thrown, and predict runs for all multioutput ensemble models.

Actual Results

ValueError: could not convert string to float: <DV class>

Versions

I replicated this error using the current master branch of sklearn (0.21.dev0).

@elsander
Contributor Author

Is numeric-only an intentional limitation in this case? There are lines that explicitly cast to double (y = np.ascontiguousarray(y, dtype=DOUBLE)). It's not an issue for single-output models, though.
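
As a rough illustration of the question above (a NumPy-only sketch, not scikit-learn's code): casting raw string labels to a float array fails, while integer codes obtained by encoding the labels first cast cleanly. Whether the forest code encodes labels before the quoted cast is not stated in this thread. `DOUBLE` is assumed here to be `np.float64`, and `encode_column` is a hypothetical helper.

```python
import numpy as np

DOUBLE = np.float64  # assumption: stands in for the DOUBLE alias quoted above


def encode_column(labels):
    """Hypothetical helper: map labels to integer codes, return (classes, codes)."""
    classes, codes = np.unique(labels, return_inverse=True)
    return classes, codes


y_col = np.array(["red", "red", "green", "green"])

# Casting the raw strings to double raises ValueError.
try:
    np.ascontiguousarray(y_col, dtype=DOUBLE)
    raw_cast_ok = True
except ValueError:
    raw_cast_ok = False

# Casting the integer codes works; classes maps the codes back to labels.
classes, codes = encode_column(y_col)
y_encoded = np.ascontiguousarray(codes, dtype=DOUBLE)
decoded = classes.take(y_encoded.astype(int))
```

Here `raw_cast_ok` ends up `False`, while `decoded` recovers the original labels from the float-encoded column.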

@amueller
Member

Sorry, what do you mean by "DV"?

@amueller
Member

You're using the regressors:
FOREST_CLASSIFIERS_REGRESSORS

They are not supposed to work with classes, you want the classifiers.

Can you please provide a minimum self-contained example?

@elsander
Contributor Author

Ah, sorry. "DV" is "dependent variable".

Here's an example:

from sklearn.ensemble import RandomForestClassifier

# Two features per sample; each target row holds two string labels (two outputs).
X_train = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1], [-2, 1],
           [-1, 1], [-1, 2], [2, -1], [1, -1], [1, -2]]
y_train = [["red", "blue"], ["red", "blue"], ["red", "blue"], ["green", "green"],
           ["green", "green"], ["green", "green"], ["red", "purple"],
           ["red", "purple"], ["red", "purple"], ["green", "yellow"],
           ["green", "yellow"], ["green", "yellow"]]
X_test = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
est = RandomForestClassifier()
est.fit(X_train, y_train)
y_pred = est.predict(X_test)  # raises the ValueError shown below

Returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-a3b5313a012b> in <module>
----> 1 y_pred = est.predict(X_test)

~/repos/forks/scikit-learn/sklearn/ensemble/forest.py in predict(self, X)
    553                 predictions[:, k] = self.classes_[k].take(np.argmax(proba[k],
    554                                                                     axis=1),
--> 555                                                           axis=0)
    556
    557             return predictions

ValueError: could not convert string to float: 'green'
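
The failing statement in the traceback can be imitated with NumPy alone. This is a sketch under an assumption the traceback does not show: that `predictions` is preallocated as a float array. `classes_k` and `proba_k` are made-up stand-ins for `self.classes_[k]` and `proba[k]`.

```python
import numpy as np

# Made-up stand-ins for one output column k.
classes_k = np.array(["green", "red"])
proba_k = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])

# Assumption: the prediction buffer is float (not visible in the traceback).
predictions = np.zeros((proba_k.shape[0], 1))

# Same pattern as the traceback: look up the most probable class per sample.
labels = classes_k.take(np.argmax(proba_k, axis=1), axis=0)

try:
    predictions[:, 0] = labels  # strings cannot be stored in a float array
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Under that assumption, the lookup itself succeeds and only the assignment into the float buffer raises.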

@amueller
Member

Thanks, this indeed looks like a bug.

The multi-output multi-class support is fairly untested tbh and I'm not a big fan of it (we don't implement score for that case!).
So even if we fix this it's likely you'll run into more issues. I have been arguing for removing this feature for a while (which is probably not what you wanted to hear ;)

amueller added the Bug label Dec 19, 2018

@elsander
Contributor Author

For what it's worth, I think this specific issue may be resolved with a one-line fix. For my use case, having predict and predict_proba work is all I need.
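
A sketch of what such a small change could look like, using NumPy only. It assumes the failure comes from a float prediction buffer and sidesteps it with an object-dtype buffer; this is an illustration, not necessarily the change made in the linked fix. `predict_labels` is a hypothetical helper, with `classes` and `proba` standing in for the per-output `classes_` and `predict_proba` results.

```python
import numpy as np


def predict_labels(classes, proba):
    """Hypothetical helper: pick the most probable class label per output."""
    n_samples = proba[0].shape[0]
    # object dtype holds strings and numbers alike, whatever their length
    predictions = np.empty((n_samples, len(classes)), dtype=object)
    for k in range(len(classes)):
        predictions[:, k] = classes[k].take(np.argmax(proba[k], axis=1), axis=0)
    return predictions


# Made-up class lists and probabilities for two outputs and two samples.
classes = [np.array(["green", "red"]),
           np.array(["blue", "green", "purple", "yellow"])]
proba = [np.array([[0.0, 1.0], [1.0, 0.0]]),
         np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])]

y_pred = predict_labels(classes, proba)
```

An object buffer is the most permissive choice; a buffer typed after the class arrays would be tighter, but it has to be wide enough for the longest label across all outputs.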

@amueller
Member

Feel free to submit a PR if you like

@jnothman
Member

jnothman commented Dec 20, 2018 via email
