FIX check_estimator fails when validating SGDClassifier with log_loss #24071


Merged

Conversation

MaxwellLZH
Contributor

Reference Issues/PRs

Fixes #24025.

What does this implement/fix? Explain your changes.

The failing test was check_decision_proba_consistency, which checks that the outputs of predict_proba and decision_function have perfect rank correlation. However, the probability output may contain ties, which makes the two rankings differ.

Instead of requiring the ranks to be exactly the same, the proposed fix checks that the average decision score is strictly increasing when grouped by the rank of the predicted probability.
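To illustrate the failure mode, here is a minimal sketch with hypothetical values (not taken from the actual test data): rounding can collapse two distinct probabilities into a tie, so rankdata assigns them an average rank even though the decision scores remain distinct. Grouping the decision scores by probability rank and checking strict monotonicity tolerates such ties:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical model outputs: the probabilities contain a tie after
# rounding, while the decision scores stay distinct.
y_proba = np.array([0.1, 0.5, 0.5, 0.9])    # tie at 0.5
y_score = np.array([-2.0, -0.1, 0.1, 2.0])  # no tie

# The strict check fails: rankdata gives the tied entries an average
# rank (2.5), so the two rankings are not equal.
print(rankdata(y_proba))  # [1.  2.5 2.5 4. ]
print(rankdata(y_score))  # [1.  2.  3.  4. ]

# Relaxed check: the mean decision score per probability-rank group
# must be strictly increasing.
rank_proba = rankdata(y_proba)
grouped = np.array(
    [y_score[rank_proba == g].mean() for g in np.unique(rank_proba)]
)
assert np.all(np.diff(grouped) > 0)
```

The relaxed check still fails if a tie in the probabilities corresponds to a genuine rank inversion in the decision scores, which is the property the original check was meant to guard.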

Comment on lines 3463 to 3470
a = estimator.predict_proba(X_test)[:, 1].round(decimals=10)
b = estimator.decision_function(X_test).round(decimals=10)
assert_array_equal(rankdata(a), rankdata(b))

prob_rank = rankdata(a)
# compute the average decision score grouped by the rank of the predicted probability
avg_decision_score = [np.mean(b[prob_rank == i]) for i in np.unique(prob_rank)]
# check if the average decision score is strictly increasing
assert all(x < y for x, y in zip(avg_decision_score, avg_decision_score[1:]))
Member


I think we can relax the check in this manner. Just for readability, I would propose the following:

        y_proba = estimator.predict_proba(X_test)[:, 1].round(decimals=10)
        y_score = estimator.decision_function(X_test).round(decimals=10)

        rank_proba = rankdata(y_proba)
        rank_score = rankdata(y_score)

        try:
            assert_array_almost_equal(rank_proba, rank_score)
        except AssertionError:
            # Sometimes, the rounding applied to the probabilities will result
            # in ties that are not present in the scores, because the scores
            # are numerically more precise. In this case, we relax the test by
            # grouping the decision function scores based on the probability
            # rank and checking that the scores are monotonically increasing.
            grouped_y_score = np.array(
                [y_score[rank_proba == group].mean() for group in np.unique(rank_proba)]
            )
            sorted_idx = np.argsort(grouped_y_score)
            assert_array_equal(sorted_idx, np.arange(len(sorted_idx)))

Contributor Author


That's much more readable, updated :)

Member

@glemaitre glemaitre left a comment


We can also add a non-regression test in sklearn/utils/tests/test_estimator_checks.py:

def test_decision_proba_tie_ranking():
    """Check that in the case of ties in the probabilities, we relax the
    ranking comparison with the decision function.
    Non-regression test for:
    https://github.com/scikit-learn/scikit-learn/issues/24025
    """
    # Move these imports to the top of the file
    from sklearn.linear_model import SGDClassifier
    from sklearn.utils.estimator_checks import check_decision_proba_consistency
    estimator = SGDClassifier(loss="log_loss")
    check_decision_proba_consistency("SGDClassifier", estimator)

@glemaitre glemaitre added this to the 1.2 milestone Oct 18, 2022
Member

@glemaitre glemaitre left a comment


LGTM. I put this PR in the 1.2 milestone.

Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan merged commit 0e4e418 into scikit-learn:main Oct 25, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022
andportnoy pushed a commit to andportnoy/scikit-learn that referenced this pull request Nov 5, 2022
Successfully merging this pull request may close these issues.

check_estimator cannot validate SGDClassifier with log_loss