NDCG score doesn't work with binary relevance and a list of 1 element #21335

Closed
cBournhonesque opened this issue Oct 14, 2021 · 14 comments · Fixed by #25672
Labels: Enhancement, good first issue (Easy with clear instructions to resolve), module:metrics

Comments

@cBournhonesque

See this code example:

>>> from sklearn import metrics
>>> t = [[1]]
>>> p = [[0]]
>>> metrics.ndcg_score(t, p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cbournhonesque/.pyenv/versions/bento/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/Users/cbournhonesque/.pyenv/versions/bento/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 1567, in ndcg_score
    _check_dcg_target_type(y_true)
  File "/Users/cbournhonesque/.pyenv/versions/bento/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 1307, in _check_dcg_target_type
    raise ValueError(
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got binary instead

It works correctly when the number of elements is greater than 1: https://stackoverflow.com/questions/64303839/how-to-calculate-ndcg-with-binary-relevances-using-sklearn
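
For comparison, here is a minimal sketch of the working multi-element case (outputs shown approximately, assuming scikit-learn's default linear gains and log2 discount):

>>> from sklearn import metrics
>>> # Two documents: the relevant one is ranked last by the scores,
>>> # so NDCG = DCG / IDCG = (1 / log2(3)) / 1, roughly 0.63.
>>> metrics.ndcg_score([[1, 0]], [[0, 1]])
0.63...
>>> # A perfect ranking gives 1.0.
>>> metrics.ndcg_score([[1, 0]], [[1, 0]])
1.0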

@adrinjalali
Member

It doesn't seem like a well-defined problem in the case of a single input to me. I'm not sure what you'd expect to get.

@cBournhonesque
Author

cBournhonesque commented Oct 15, 2021

I'm skipping the computation if there are 0 relevant documents (any(truths) is False), since the metric is undefined.
For a single input, where truth = [1], I would expect to get 1 if the prediction is 1, or 0 if the prediction is 0 (according to the NDCG definition).

@adrinjalali
Member

pinging @jeremiedbb and @jeromedockes who worked on the implementation.

@jeromedockes
Contributor

I would expect to get 1 if the prediction is 1, or 0 if the prediction is 0 (according to the NDCG definition)

Which NDCG definition? Could you point to a reference? (I ask because IIRC there is some variability in the definitions people use.)

Normalized DCG is the ratio between the DCG obtained for the predicted ranking and the DCG of the ideal (true) ranking. In my understanding, when there is only one possible ranking (i.e. only one candidate, as in this example), both rankings are the same, so the ratio should be 1. (This is the value we obtain if we disable this check.)

However, ranking a list of length 1 is not meaningful, so if y_true has only one column it seems more likely that there was a mistake in the formatting/representation of the true gains, or that a user applied this ranking metric to a binary classification task. Therefore raising an error seems reasonable to me, but I guess the message could be improved (although it is hard to guess what the mistake was). Showing a warning and returning 1.0 could also be an option.
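
To make the ratio argument concrete, here is a small hand computation (a sketch; the dcg helper below is made up for illustration and uses linear gains with a log2 discount, matching scikit-learn's defaults):

>>> import numpy as np
>>> def dcg(relevances_in_ranked_order):
...     # DCG with linear gains: sum of rel_i / log2(i + 1) over positions i = 1..n.
...     rel = np.asarray(relevances_in_ranked_order, dtype=float)
...     positions = np.arange(1, len(rel) + 1)
...     return float(np.sum(rel / np.log2(positions + 1)))
...
>>> # With a single candidate there is only one possible ranking, so the
>>> # predicted DCG and the ideal DCG are always equal.
>>> dcg([1]) / dcg([1])   # truth = [1]: the ratio is 1 whatever the score
1.0
>>> dcg([0]), dcg([0])    # truth = [0]: the ratio is 0 / 0, i.e. undefined
(0.0, 0.0)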

@jeromedockes
Contributor

Note this is a duplicate of #20119, AFAICT.

@cBournhonesque
Author

cBournhonesque commented Oct 18, 2021

Hi Jerome, you are right, I made a mistake. I'm using the definition on Wikipedia.
It looks like the result would be 0.0 if the document isn't relevant (relevance = 0), or 1.0 if it is (relevance > 0). So the returned value could be equal to y_true[0] > 0?
In any case, I think that just updating the error message but keeping the current behaviour could be fine too.

@jeromedockes
Contributor

Indeed, when all documents are truly irrelevant the NDCG is 0 / 0 (undefined), and currently 0 is returned (as seen here).

But I still think that measuring NDCG for a list of 1 document is not meaningful (regardless of the value of the relevance), so raising an error about the shape of y_true makes sense.
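
A quick illustration of that fallback with more than one document (behaviour as described above; the value shown is what the current implementation returns):

>>> from sklearn import metrics
>>> # All-zero true relevances: the ideal DCG is 0, so NDCG is 0 / 0;
>>> # the implementation returns 0.0 for such samples instead of raising.
>>> metrics.ndcg_score([[0, 0]], [[1, 2]])
0.0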

@glemaitre
Member

So we should improve the error message in this case.
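
A minimal sketch of what such a check could look like (the helper name and message wording here are hypothetical, not necessarily the fix merged in #25672):

import numpy as np

def _check_ndcg_shape(y_true):
    # Hypothetical guard: give a clearer error when there is only one document
    # to rank, instead of the generic "Got binary instead" target-type message.
    y_true = np.asarray(y_true)
    if y_true.ndim == 2 and y_true.shape[1] < 2:
        raise ValueError(
            "Computing NDCG is only meaningful when there is more than one "
            "document to rank; y_true has a single column."
        )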

@georged4s

I am happy to work on this if it hasn't been assigned yet.

@glemaitre
Member

@georged4s I can see that #24482 has been opened but it seems stalled. I think you can claim the issue and propose a fix. You can also look at the review done in the older PR.

@georged4s

Thanks @glemaitre for replying and for the heads up. Cool, I will look into this one.

@kayuksel

I came here as I ran into the same problem: it doesn't support binary targets.

Also, it would be great if it could be calculated simultaneously for a batch of users.
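
For reference, ndcg_score can take one row per user and averages the per-row scores (sample_weight allows weighting); a small sketch, with the output shown approximately:

>>> from sklearn import metrics
>>> # Each row is one user's relevance list / score list; the result is the
>>> # mean NDCG over the rows.
>>> y_true = [[1, 0, 1], [0, 1, 0]]
>>> y_score = [[0.8, 0.4, 0.1], [0.3, 0.9, 0.2]]
>>> metrics.ndcg_score(y_true, y_score)
0.95...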

@JanFidor
Contributor

Hi, there doesn't seem to be a linked PR (excluding the stalled one); could I pick it up?

@lene
Contributor

lene commented Feb 23, 2023

Picking it up as part of the PyLadies "Contribute to scikit-learn" workshop.
