Added gini coefficient to ranking and scorer #10084

Closed
wants to merge 4 commits into from

Conversation

tagomatech

Added a function at the end of sklearn\metrics\ranking.py to compute the Gini coefficient, which is used in some Kaggle competitions.

I added the corresponding import declaration in sklearn\metrics\__init__.py.

Finally, I created a scorer à la sklearn in sklearn\metrics\scorer.py, so that the Gini coefficient can be used across sklearn validation/metrics functions, e.g. cross_val_score.

The reference was taken from here, and the results were checked against several entries on Kaggle and against the sklearn ROC AUC score (it is not rocket science, to be honest).
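For readers who want to try the idea without patching sklearn, here is a minimal sketch of how such a scorer could be plugged into cross_val_score. It is not the code from this PR: `gini_scorer` is a hypothetical name, and the sketch assumes a binary classifier that exposes `decision_function` and relies on the Gini = 2 * AUC - 1 relation discussed in this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score


def gini_scorer(estimator, X, y):
    """Hypothetical scorer: binary Gini computed as 2 * ROC AUC - 1.

    Uses the (estimator, X, y) callable signature accepted by the
    `scoring` parameter. Assumes the estimator has `decision_function`.
    """
    scores = estimator.decision_function(X)
    return 2.0 * roc_auc_score(y, scores) - 1.0


# Toy binary problem, only for illustration.
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000)

gini = cross_val_score(clf, X, y, cv=5, scoring=gini_scorer)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")

# The per-fold Gini values are an affine transform of the per-fold AUCs.
print(np.allclose(gini, 2 * auc - 1))
```

Passing a plain callable avoids depending on a registered scorer name, which is what the PR would have added.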

Member

@jnothman jnothman left a comment

Please add this to metrics/tests/test_common.py and also add specific tests that this matches known scores on toy datasets.

@@ -858,3 +858,33 @@ def ndcg_score(y_true, y_score, k=5):
        scores.append(actual / best)

    return np.mean(scores)


def gini(y_true, y_score):
Member

Perhaps name this gini_score for consistency.

----------
.. [1] David J. Hand and Robert J. Till (2001).
A Simple Generalisation of the Area Under the ROC Curve for
Multiple Class Classification Problems. In Machine Learning, 45,
Member

Your implementation does not currently extend to multiclass. You have merely implemented a chance-corrected binary ROC AUC.
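To make the "chance-corrected binary ROC AUC" reading concrete, here is a small pure-Python sketch. It is an illustration and not the PR's implementation: `binary_auc` and `gini_score` are hypothetical helpers, and the AUC is computed from its pairwise (Mann-Whitney) definition, with ties counted as one half.

```python
def binary_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_score(y_true, y_score):
    """Chance-corrected AUC: 0 for a random ranking, 1 for a perfect one."""
    return 2.0 * binary_auc(y_true, y_score) - 1.0


print(binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
print(gini_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.5
```

The rescaling maps AUC = 0.5 (chance) to 0 and AUC = 1 to 1, which is why the metric only works for binary targets as written.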

@qinhanmin2014
Member

@tagomatech Could you please explain why we need the Gini coefficient when we already have roc_auc_score? It can almost be replaced by roc_auc_score, and it seems hard to find any reference on its definition and application in ML. I don't think the paper you provide is a good reference: it only states that the Gini index (Gini coefficient?) is equivalent to roc_auc_score, and the whole paper is based on roc_auc_score.
(Forgive me if I got something wrong :) )

@tagomatech
Author

@qinhanmin2014
Adding this function is a small improvement, indeed. Personally, I find it useful when playing around with Kaggle competitions.
As for the sources, there is a lot of confusion between "Gini index", "Gini coefficient", and "Normalized Coefficient". The source I suggested has the virtue of being unambiguous, since it defines Gini in relation to AUC.

@qinhanmin2014
Member

@tagomatech Thanks.
I think we have reached consensus that:
(1) The metric can almost be replaced by roc_auc_score.
(2) It is difficult to find references on its definition and application in ML. Right?
So I might be -1 on the metric.
Also, from my perspective, Kaggle can be an application of our metrics, but it is hard for it to serve as their (main) origin, because in some cases Kaggle metrics are designed for a special scenario.
This is only my personal opinion, so feel free to fix the conflict, make the CIs green, provide more persuasive literature, and wait for opinions from the core devs.

@glemaitre
Member

I am -1 on merging, since the score can easily be computed from the ROC AUC.
I also think there could be some confusion between the Gini impurity used in decision trees and the Gini coefficient.

@qinhanmin2014
Member

@tagomatech Thanks a lot for your contribution. Sorry, but I'm going to close this one given the additional -1 above. I think the general consensus is that it can be replaced by roc_auc_score and that there is no clear definition.

@ogrisel
Member

ogrisel commented Oct 10, 2019

Actually, the Gini coefficient is defined in terms of the area under the Lorenz curve (for positive regression models), which is not the same as ROC AUC. I started an undocumented prototype implementation in #15176.
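As a rough illustration of the Lorenz-curve definition, here is a NumPy sketch. It is not the prototype from #15176: `lorenz_curve`, `gini_lorenz`, and `normalized_gini` are hypothetical helpers, the targets are assumed to be non-negative with a positive sum, and ties in the predictions are broken by input order.

```python
import numpy as np


def lorenz_curve(y_true, y_pred):
    """Cumulative share of y_true when samples are sorted by y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true < 0) or y_true.sum() <= 0:
        raise ValueError("y_true must be non-negative with a positive sum")
    order = np.argsort(y_pred, kind="stable")  # ascending predicted value
    cum_true = np.cumsum(y_true[order])
    cum_true = np.insert(cum_true / cum_true[-1], 0, 0.0)  # start at (0, 0)
    cum_samples = np.linspace(0.0, 1.0, len(cum_true))
    return cum_samples, cum_true


def gini_lorenz(y_true, y_pred):
    """Gini = 1 - 2 * (area under the Lorenz curve), trapezoidal rule."""
    x, lorenz = lorenz_curve(y_true, y_pred)
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2.0 * np.diff(x))
    return 1.0 - 2.0 * area


def normalized_gini(y_true, y_pred):
    """Model Gini divided by the Gini of the oracle ordering."""
    return gini_lorenz(y_true, y_pred) / gini_lorenz(y_true, y_true)


y = [1, 2, 3, 4]
print(np.isclose(gini_lorenz(y, y), 0.25))                  # → True
print(np.isclose(normalized_gini(y, [4, 3, 2, 1]), -1.0))   # → True
```

Unlike the AUC-based variant, this quantity depends on the magnitudes of y_true and not only on a binary label, which is why the two are not interchangeable for regression targets.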
