Remove median_absolute_error from METRICS_WITHOUT_SAMPLE_WEIGHT #30787
Conversation
sklearn/metrics/tests/test_common.py (outdated)
@@ -1550,12 +1549,13 @@ def check_sample_weight_invariance(name, metric, y1, y2):
     ),
 )
 def test_regression_sample_weight_invariance(name):
-    n_samples = 50
     random_state = check_random_state(0)
+    n_samples = 51
This cannot be an even number due to #17370, inspired by @ogrisel's last paragraph in #17370 (comment), even though I agree it's a temporary fix.
Since #29907 was merged, it should be possible to update median_absolute_error to handle even numbers of data points (with or without weights), as explained below.
sklearn/metrics/tests/test_common.py (outdated)
-    n_samples = 50
-    random_state = check_random_state(0)
+    n_samples = 51
+    random_state = check_random_state(1)
I don't quite understand why this test would be 'flaky' for median_absolute_error for the 'check that the weighted and unweighted scores are unequal' check in check_sample_weight_invariance. Both seeds 0 and 2 fail here 🤷

To check that it was not our _weighted_percentile doing something wrong, I used numpy's quantile with sample weights (which I trust, because I saw that PR and the scrutiny and all the tests):
import numpy as np
from sklearn.metrics import median_absolute_error
from sklearn.utils.validation import check_random_state

n_samples = 51
random_state = check_random_state(0)
# regression
y_true = random_state.random_sample(size=(n_samples,))
y_pred = random_state.random_sample(size=(n_samples,))

rng = np.random.RandomState(0)
sample_weight = rng.randint(1, 10, size=len(y_true))

# unweighted median absolute error
no_weights = np.median(np.abs(y_pred - y_true), axis=0)
# gives np.float64(0.26234528145390845)

# weighted "median" absolute error via numpy's weighted quantile
weights = np.quantile(
    np.abs(y_pred - y_true), 0.5, axis=0, method='inverted_cdf',
    weights=sample_weight
)
# gives array(0.26234528)
I may be missing something, but I don't know why this would be likely to be the same with and without weights.
Note that changing the max value in sample_weight = rng.randint(1, 10, size=len(y_true)) from 10 to a smaller value (5 or 8, etc.) was also very effective.
On even numbers of samples (taking weights into account), inverted_cdf yields an asymmetric result, while np.median (on repeated samples) yields a symmetric result (as it uses the "linear" interpolation definition of "in between data points" quantiles). We need to use a symmetric version of _weighted_percentile, named _averaged_weighted_percentile, which was merged yesterday as part of #29907 and is equivalent to method="averaged_inverted_cdf" of np.quantile.
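As an aside, a minimal sketch of the asymmetry on made-up values (not the test data):

import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # even number of samples

# inverted_cdf picks one of the two middle values, so negating the data and
# the result gives a different answer (asymmetric):
print(np.quantile(values, 0.5, method="inverted_cdf"))            # 2.0
print(-np.quantile(-values, 0.5, method="inverted_cdf"))          # 3.0

# np.median and averaged_inverted_cdf average the two middle values (symmetric):
print(np.median(values))                                          # 2.5
print(np.quantile(values, 0.5, method="averaged_inverted_cdf"))   # 2.5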
Sorry I was not clear: random state seeds 0 and 2 fail, but 1 passes.
The specific check that is failing is:
scikit-learn/sklearn/metrics/tests/test_common.py, lines 1466 to 1475 in 2c2e970:
    # check that the weighted and unweighted scores are unequal
    weighted_score = metric(y1, y2, sample_weight=sample_weight)

    # use context manager to supply custom error message
    with pytest.raises(AssertionError):
        assert_allclose(unweighted_score, weighted_score)
        raise ValueError(
            "Unweighted and weighted scores are unexpectedly "
            "almost equal (%s) and (%s) "
            "for %s" % (unweighted_score, weighted_score, name)
        )
Which is odd, right? See the sanity check above.
Also, I agree that it would be better to use the numpy implementation of weights, but our minimum supported version is not high enough. Do you think we should vendor the code?
Okay, I looked into this more, and it seems that this is just not a good check for median_absolute_error (red line is the median value; the weighted 'median' is calculated with np.quantile(method="inverted_cdf")):
Code
import matplotlib.pyplot as plt
import numpy as np
from sklearn.utils.validation import check_random_state

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(12, 10))

# Different weight ranges for each row
weight_ranges = [(None, "No Weights"), (np.arange(1, 10), "Weights (1-9)"), (np.arange(1, 5), "Weights (1-5)")]
# Different random seeds for each column
seeds = [0, 1, 2]

# Loop over rows (weights) and columns (seeds)
for row, (weight_range, weight_label) in enumerate(weight_ranges):
    for col, seed in enumerate(seeds):
        random_state = check_random_state(seed)
        y_true = random_state.random_sample(size=(51,))
        y_pred = random_state.random_sample(size=(51,))
        values = np.abs(y_pred - y_true)
        rng = np.random.RandomState(seed)
        sample_weight = (
            rng.randint(weight_range.min(), weight_range.max() + 1, size=len(y_true))
            if weight_range is not None
            else None
        )
        # Plot histogram
        if sample_weight is None:
            axes[row, col].hist(values, bins=15, edgecolor="black", alpha=0.7, density=True)
            median_value = np.median(values)
        else:
            axes[row, col].hist(values, bins=15, weights=sample_weight, edgecolor="black", alpha=0.7, density=True)
            median_value = np.quantile(values, 0.5, method="inverted_cdf", weights=sample_weight)
        # Add median line
        axes[row, col].axvline(median_value, color="red", linestyle="dashed", linewidth=1, label="Median")
        # Set title
        axes[row, col].set_title(f"Seed {seed} - {weight_label}")

# Adjust layout for readability
plt.tight_layout()
plt.show()
Weights don't change the distribution of abs(y_pred - y_true) enough to change the median value.
We can change the test so that this check passes (e.g., we could try adding only high or low weights to a subset of the values), but I am not 100% sure about changing this check for all regression metrics. To be fair, this test is just ensuring that weighted and un-weighted scores are different, so it should probably be okay?
cc @ogrisel
Also, I agree that it would be better to use the numpy implementation of weights, but our minimum supported version is not high enough. Do you think we should vendor the code?

If the complexity of vendoring is not too high, we should definitely benefit from it.
OK I see. I think that I would define weights as np.arange(len(y_true)), and I think that would be good enough for this test. I indeed agree with your analysis @lucyleeow.
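A quick sanity sketch of the suggestion (assuming the same data generation as the test above and NumPy >= 2.0 for weighted quantiles; I have not verified these exact numbers):

import numpy as np
from sklearn.utils.validation import check_random_state

n_samples = 51
random_state = check_random_state(0)
y_true = random_state.random_sample(size=(n_samples,))
y_pred = random_state.random_sample(size=(n_samples,))
errors = np.abs(y_pred - y_true)

# Weights tied to the sample index skew the weighted distribution towards
# samples that appear later, so the weighted median should move away from
# the unweighted one.
sample_weight = np.arange(len(y_true))

unweighted = np.median(errors)
weighted = np.quantile(errors, 0.5, method="inverted_cdf", weights=sample_weight)
print(unweighted, weighted)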
I'll check the subsequent changes in the test to see if np.arange could be an issue.
I would like to use _averaged_weighted_percentile directly, to later be able to add an array API compatible version.
At the time of writing, numpy.quantile(..., method="averaged_inverted_cdf") does not support weights (yet) and is not part of the array API spec. So we need a version that supports weights, the symmetric behavior with even numbers of data points, and possibly the array API.
This is a waste, but shape consistency checks are not very costly (compared to value-dependent checks, e.g. checking that there are no negative weights or no nan/inf weights), so maybe we can live with it. Unless removing those redundant checks can help simplify the code, in which case we should do it.

Thanks @ogrisel. With regard to duplicate checks: #30787 (comment), thinking about it more, I think the problem is more:
WDYT?

It sounds like a good plan.
    random_state = check_random_state(0)
    # regression
    y_true = random_state.random_sample(size=(n_samples,))
    y_pred = random_state.random_sample(size=(n_samples,))
    sample_weight = np.arange(len(y_true))
Changed as suggested in #30787 (comment). This works great.
I've amended check_sample_weight_invariance to optionally take sample_weight, as I didn't want to change the sample_weight for all tests that use check_sample_weight_invariance. WDYT @glemaitre?
@@ -1552,7 +1552,8 @@ def check_sample_weight_invariance(name, metric, y1, y2):
     if not name.startswith("unnormalized"):
         # check that the score is invariant under scaling of the weights by a
         # common factor
-        for scaling in [2, 0.3]:
+        scaling_values = [2] if name == "median_absolute_error" else [2, 0.3]
This is a problem due to numerical instability in the cumulative sum (NOT fixed by stable_cumsum) in _weighted_percentile. It is rare for this problem to appear.

In this test, the final value in cumulative_sum was a small amount (within the tolerance in stable_cumsum) higher than the 'actual' value; this resulted in the adjusted_percentile_rank being a small amount higher than the 'actual' value: adjusted_percentile_rank was 17.400000000000002, when it should have been 17.4, which just happens to be a value in weight_cdf. Thus when we do searchsorted

scikit-learn/sklearn/utils/stats.py, lines 98 to 99 in bff3d7d:

xp.searchsorted(
    weight_cdf[feature_idx, ...], adjusted_percentile_rank[feature_idx]

the index should be that of the 17.4 element in weight_cdf, but instead it is the next index. stable_cumsum does not fix this particular problem, as it does not matter how close the adjusted_percentile_rank value is to the true value: if the true value is itself present within weight_cdf, searchsorted will take the adjacent index.

Note that I checked using numpy.quantile (with "inverted_cdf", which now supports weights) and got the same test failure.

For completeness, I will reference the recent discussion regarding the use of stable_cumsum (#29431 (comment)) - it was decided that it is not required in _weighted_percentile (cc @ogrisel). Further, numpy's quantile implementation simply uses cumsum.
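A tiny illustration of this failure mode (the 17.4 rank mirrors the value above, but the surrounding weight_cdf entries are made up):

import numpy as np

# Hypothetical weight CDF where the target rank coincides exactly with one entry.
weight_cdf = np.array([5.0, 10.0, 17.4, 20.0])

exact_rank = 17.4                  # what adjusted_percentile_rank should be
noisy_rank = 17.400000000000002    # what the cumulative sum actually produced

print(np.searchsorted(weight_cdf, exact_rank))  # 2 -> the 17.4 entry
print(np.searchsorted(weight_cdf, noisy_rank))  # 3 -> the adjacent entry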
Note that I checked using numpy.quantile (with "inverted_cdf", which now supports weights) and got the same test failure.

Can you check the behavior of numpy.quantile with "averaged_inverted_cdf" and _averaged_weighted_percentile on this kind of edge case?
EDIT: I am not sure if the comment above was written before the switch from _weighted_percentile to _averaged_weighted_percentile or not.
Sorry, I should have clarified that the failure persists with _averaged_weighted_percentile. But I agree that I also thought _averaged_weighted_percentile should fix the problem, so I did some more digging.

_averaged_weighted_percentile errors with:
E Max absolute difference among violations: 0.00704464
E Max relative difference among violations: 0.0095152
E ACTUAL: array(0.733313)
E DESIRED: array(0.740357)
_weighted_percentile errors with:
E Max absolute difference among violations: 0.01408929
E Max relative difference among violations: 0.01903039
E ACTUAL: array(0.726268)
E DESIRED: array(0.740357)
So _averaged_weighted_percentile halves the error. And the reason is because:
- with the (+) array, the cumulative sum has the instability problem (i.e. adjusted_percentile_rank is 17.400000000000002)
- with the -array, the cumulative sum does not have the instability problem (i.e. adjusted_percentile_rank is 17.4) 🙃

If the cumulative sum had the instability problem with both array and -array, I think the problem would have been fixed (see the sketch after this comment).
numpy.quantile with weights only supports inverted_cdf (I double checked in the dev docs https://numpy.org/devdocs/reference/generated/numpy.quantile.html and the numpy PRs).
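For context, a minimal sketch of how I understand the (+)/(-) averaging works (illustrative only, not the actual scikit-learn implementation), which shows why an off-by-one that hits only one of the two passes shifts the result by only half the gap between adjacent values:

import numpy as np


def weighted_percentile(values, weights, percentile_rank=50.0):
    # Minimal "inverted CDF" weighted percentile, for illustration only.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    weight_cdf = np.cumsum(weights)
    adjusted_rank = percentile_rank / 100.0 * weight_cdf[-1]
    idx = min(np.searchsorted(weight_cdf, adjusted_rank), len(values) - 1)
    return values[idx]


def averaged_weighted_percentile(values, weights, percentile_rank=50.0):
    # Symmetrize by also running the computation on the negated array and
    # averaging, so an error that occurs in only one of the two passes is
    # diluted by half.
    plus = weighted_percentile(values, weights, percentile_rank)
    minus = -weighted_percentile(-values, weights, 100.0 - percentile_rank)
    return (plus + minus) / 2.0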
🙃
Just documenting a reminder to myself - potentially when we refactor _averaged_weighted_percentile to avoid sorting twice, we won't use -array, in which case scale=0.3 may pass.
(I've been wanting to do the refactoring, but it has not reached the top of my todo list yet 😬)
We probably need to enable faulthandler on the CI to understand what's going on. Maybe we should always enable faulthandler with a global timeout of 45 min or so (or a 30s timeout per test) before dumping the tracebacks.
The diff of the PR looks good to me, but we need to understand where the CI failure comes from. Maybe it's not related to that PR in particular?
If so, +1 for merge on my side.
Gentle ping to @glemaitre. I re-ran the failed check and it passed, so it seems to just be a timeout. I'll try and look into faulthandler if I have time.
LGTM as well.
What does this implement/fix? Explain your changes.

sample_weights was added to median_absolute_error in 0.24 but it was never removed from METRICS_WITHOUT_SAMPLE_WEIGHT. Removing it has highlighted several issues:

Redundancy in checking
I had to add check_consistent_length to median_absolute_error to get the same error message as other metrics when y_pred, y_true or sample_weights are not of the same length - it looks like this check was added to most sample weight metrics in #9903 but not median_absolute_error.

However it is worth noting that there is redundancy in our checking:

- _check_reg_targets* - checks y_pred and y_true are of consistent length, performs check_array, checks multioutput is acceptable, and various other regression-related checks.
- check_consistent_length - checks that y_pred, y_true and sample_weights are of the same length; used in most (all?) regression metrics.
- _check_sample_weight - checks sample_weights is the same length as y, performs check_array on sample_weights, plus various other checks.

If all 3 checks are done in a metric, we are effectively checking that sample_weight is the correct length 3 times, as illustrated below.
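To make the overlap concrete, a small sketch of my own (not code from the PR) showing that check_consistent_length and _check_sample_weight both reject the same length mismatch:

import numpy as np
from sklearn.utils.validation import _check_sample_weight, check_consistent_length

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
sample_weight = np.array([1.0, 2.0])  # deliberately the wrong length

# Both utilities reject the mismatch, so calling them back to back (on top of
# the length check inside _check_reg_targets*) validates the same thing again.
try:
    check_consistent_length(y_true, y_pred, sample_weight)
except ValueError as exc:
    print("check_consistent_length:", exc)

try:
    _check_sample_weight(sample_weight, y_true)
except ValueError as exc:
    print("_check_sample_weight:", exc)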
Quantile problems

median_absolute_error fails check_sample_weight_invariance in test_regression_sample_weight_invariance - I will put a detailed description in the review comments.

Classification data used to test regression metrics
test_multilabel_sample_weight_invariance fails with:

This makes sense because we are passing multilabel classification data (0/1's) to MULTIOUTPUT_METRICS, which are regression metrics with "multioutput-continuous" format support (e.g., "mean_squared_error", "r2_score", etc). I am not sure why we would not use regression data for these metrics? The tests do pass for all the other regression metrics, but as abs(y_pred - y_true) would be either 1 or 0 for every sample, it is very likely that weighted and unweighted median_absolute_error would be the same value.
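A quick sketch of why 0/1 data makes this check weak (my own illustration, assuming NumPy >= 2.0 for weighted quantiles):

import numpy as np

rng = np.random.RandomState(0)
# Multilabel-style 0/1 "targets" and "predictions"
y_true = rng.randint(0, 2, size=50)
y_pred = rng.randint(0, 2, size=50)
sample_weight = rng.randint(1, 10, size=50)

errors = np.abs(y_pred - y_true)  # every entry is 0 or 1

# Both the unweighted and weighted medians collapse to 0 or 1, so they are
# very likely to coincide unless the weights flip the majority class.
print(np.median(errors))
print(np.quantile(errors, 0.5, method="inverted_cdf", weights=sample_weight))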
I think we should amend test_multilabel_sample_weight_invariance so multi-output regression data is passed to the MULTIOUTPUT_METRICS tests (and maybe even change the name of this variable to make it clear that these are regression metrics).

Any other comments?
This is a draft as I don't think this PR should be merged without resolving underlying issues.
cc @glemaitre @ogrisel