FIX check_estimator fails when validating SGDClassifier with log_loss #24071
Conversation
sklearn/utils/estimator_checks.py
Outdated
  a = estimator.predict_proba(X_test)[:, 1].round(decimals=10)
  b = estimator.decision_function(X_test).round(decimals=10)
- assert_array_equal(rankdata(a), rankdata(b))
+ prob_rank = rankdata(a)
+ # calculate the average decision score grouped by the rank of the predicted probability
+ avg_decision_score = [np.mean(b[prob_rank == i]) for i in np.unique(prob_rank)]
+ # check that the average decision score is strictly increasing
+ assert all(x < y for x, y in zip(avg_decision_score, avg_decision_score[1:]))
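As background on the mechanics (not part of the PR itself), scipy.stats.rankdata assigns tied values the average of the ranks they would span, which is why ties in the rounded probabilities break exact rank equality with the decision scores:

```python
import numpy as np
from scipy.stats import rankdata

# With the default method="average", tied values share the mean of the
# ranks they occupy: the two equal values both receive rank 3.5 here.
ranks = rankdata([0.1, 0.5, 1.0, 1.0])
assert np.array_equal(ranks, np.array([1.0, 2.0, 3.5, 3.5]))
```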
I think we can relax the check in this manner. Just for readability, I would propose the following:
y_proba = estimator.predict_proba(X_test)[:, 1].round(decimals=10)
y_score = estimator.decision_function(X_test).round(decimals=10)
rank_proba = rankdata(y_proba)
rank_score = rankdata(y_score)
try:
    assert_array_almost_equal(rank_proba, rank_score)
except AssertionError:
    # Sometimes, the rounding applied to the probabilities will result
    # in ties that are not present in the scores because the scores are
    # numerically more precise. In this case, we relax the test by
    # grouping the decision function scores based on the probability
    # rank and checking that the score is monotonically increasing.
    grouped_y_score = np.array(
        [y_score[rank_proba == group].mean() for group in np.unique(rank_proba)]
    )
    sorted_idx = np.argsort(grouped_y_score)
    assert_array_equal(sorted_idx, np.arange(len(sorted_idx)))
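To make the tie scenario concrete, here is a small self-contained sketch (with synthetic scores, not taken from the PR) in which rounding the sigmoid-transformed probabilities creates a tie that the grouped-mean relaxation tolerates:

```python
import numpy as np
from scipy.stats import rankdata

# Synthetic decision scores; the two largest are distinct values.
y_score = np.array([-2.0, 0.5, 25.0, 26.0])
# The sigmoid squashes large scores so close to 1 that rounding to
# 10 decimals collapses both sigmoid(25) and sigmoid(26) to exactly 1.0.
y_proba = (1.0 / (1.0 + np.exp(-y_score))).round(decimals=10)

rank_proba = rankdata(y_proba)  # [1, 2, 3.5, 3.5] -- tie from rounding
rank_score = rankdata(y_score)  # [1, 2, 3, 4]     -- no tie
assert not np.array_equal(rank_proba, rank_score)

# Relaxed check: group the scores by probability rank and require the
# group means to be strictly increasing.
grouped = np.array(
    [y_score[rank_proba == g].mean() for g in np.unique(rank_proba)]
)
assert np.array_equal(np.argsort(grouped), np.arange(len(grouped)))
```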
That's much more readable, updated :)
We can also add a non-regression test in sklearn/utils/tests/test_estimator_checks.py:
def test_decision_proba_tie_ranking():
    """Check that in case of ties in the predicted probabilities, we relax
    the ranking comparison with the decision function.

    Non-regression test for:
    https://github.com/scikit-learn/scikit-learn/issues/24025
    """
    # Move these imports to the top of the file
    from sklearn.linear_model import SGDClassifier
    from sklearn.utils.estimator_checks import check_decision_proba_consistency

    estimator = SGDClassifier(loss="log_loss")
    check_decision_proba_consistency("SGDClassifier", estimator)
LGTM. I put this PR in the 1.2 milestone.
LGTM
…scikit-learn#24071) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
Fixes #24025.
What does this implement/fix? Explain your changes.
The test that failed was check_decision_proba_consistency, which checks that the outputs of predict_proba and decision_function have perfect rank correlation. However, the probability output may contain ties, which causes the ranks to differ. Instead of requiring the ranks to be exactly the same, the proposed fix checks that the average decision score, grouped by the rank of the predicted probability, is strictly increasing.