Regression metrics - which strategy ? #13482

smarie opened this issue Mar 20, 2019 · 2 comments

smarie commented Mar 20, 2019

I recently came across #12895 (with PR #13467) and the older #6457, which revived an old topic that I would like to share.

In our team, we needed to provide performance metrics for regression models. This is a slightly different goal from using metrics for grid search or model selection: the metric is not only used to "select the best model" but also to give users feedback on "how good a model is".

For regression models, I introduced three categories of metrics that turned out to be quite intuitive (a rough sketch in code follows the list):

  • Absolute performance (L2 RMSE, L1 MAE): these metrics can all be interpreted as an "average prediction error" ("average" in the broad sense here) expressed in the unit of the prediction target (e.g. "average error of 12 kWh").

  • Relative performance (L2 CVRMSE, L1 CVMAE, and per-point relative metrics such as MAPE or MARE, MARES, MAREL...): these metrics can all be interpreted as an "average relative prediction error" expressed as a percentage of the target (e.g. "average error of 10%").

  • Comparison to a dummy model (L2 RRSE, L1 RAE): these metrics can all be interpreted as the ratio between the performance of the model at hand and the performance of a dummy, constant model (one that always predicts the mean). These need to be inverted to be intuitive, e.g. "20% -> 5 times better than a dummy model".
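
For concreteness, here is a minimal sketch of the three categories on plain 1-D numpy arrays. The function names (cvrmse, rrse, ...) are mine, for illustration only, not existing sklearn API:

import numpy as np

def rmse(y_true, y_pred):
    """Absolute performance (L2): average error, in the unit of the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Absolute performance (L1)."""
    return np.mean(np.abs(y_true - y_pred))

def cvrmse(y_true, y_pred):
    """Relative performance (L2): RMSE as a fraction of the mean target."""
    return rmse(y_true, y_pred) / np.mean(y_true)

def mape(y_true, y_pred):
    """Relative performance, per point: mean absolute percentage error."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

def rrse(y_true, y_pred):
    """Comparison to a dummy model (L2): RMSE relative to a constant-mean predictor."""
    dummy = np.full_like(y_true, np.mean(y_true), dtype=float)
    return rmse(y_true, y_pred) / rmse(y_true, dummy)

def rae(y_true, y_pred):
    """Comparison to a dummy model (L1): MAE relative to the same dummy predictor."""
    dummy = np.full_like(y_true, np.mean(y_true), dtype=float)
    return mae(y_true, y_pred) / mae(y_true, dummy)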

Of course these categories are "applicative": they all make sense from a user's point of view. However, as far as model selection is concerned, only two make sense (MAE and RMSE). Not even R², because R² = 1 - RRSE², so it is not a performance metric but a comparison-to-dummy metric (but I don't want to open that debate here, so please refrain from objecting on that one :) ).
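
As a side note, the R² = 1 - RRSE² identity is easy to check numerically (here with sklearn's r2_score and the rrse sketch above, on arbitrary data):

import numpy as np
from sklearn.metrics import r2_score

# arbitrary data, only to check that r2_score == 1 - rrse**2
rng = np.random.RandomState(0)
y_true = rng.uniform(10, 100, size=200)
y_pred = y_true + rng.normal(0, 5, size=200)

assert np.isclose(r2_score(y_true, y_pred), 1 - rrse(y_true, y_pred) ** 2)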

Anyway, my question for the core sklearn team is: shall I propose a pull request with all these metrics? I'm ready to go, since we have already implemented them in our private repo, aligned with sklearn's regression.py file. So it is rather a matter of deciding whether this is a good idea, and if so, introducing categories might be needed to help users understand them better.

An alternative might be to create a small independent project containing all the metrics, leaving only mean_absolute_error (L1) and mean_squared_error (L2) in sklearn.

Any thoughts on this?

smarie commented Mar 21, 2019

Oh, and I forgot: related to this, I realized that the r2_score in regression.py is wrong, since it is automatically set to 0 when y_true is constant, instead of -Inf (x/0) or NaN (0/0). This may be a good approximation for model selection, but it is mathematically wrong if the metric is used to report model performance back to users.

See the source code of r2_score:

# arbitrary set to zero to avoid -inf scores, having a constant
# y_true is not interesting for scoring a regression anyway
output_scores[nonzero_numerator & ~nonzero_denominator] = 0.

In my implementation, I instead keep all NaN and Inf values, but temporarily silence numpy's division-by-zero and invalid-value warnings.
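
A minimal sketch of that alternative (not the actual sklearn code, and assuming uniform averaging over a single output):

import numpy as np

def r2_score_raw(y_true, y_pred):
    """R² that keeps -inf (x/0) or nan (0/0) for constant y_true instead of forcing 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = np.sum((y_true - y_pred) ** 2)
    denominator = np.sum((y_true - np.mean(y_true)) ** 2)
    # silence only the division warnings locally, so inf/nan propagate to the result
    with np.errstate(divide='ignore', invalid='ignore'):
        return 1.0 - numerator / denominator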

smarie commented Mar 21, 2019

And finally: as a user, I would rather manipulate first-class functions for all of the base metrics above than have to partialize more generic metrics. Of course the implementation can be shared through more generic routines, but a user looking for rmse should be able to find it directly.

In my implementation, I even added aliases for all metrics:

  • initials: e.g. rmse = root_mean_squared_error
  • popular alternate naming: e.g. cv_rmsd = cv_rmse.

Indeed, the overhead of creating aliases is extremely low compared to their educational value: all users would be able to find their favorite metric, and then realize that it is the same as another one.
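
To be concrete, an alias is just one extra module-level assignment per name (illustrative sketch, not a proposal for actual sklearn names):

import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Canonical, explicit name that the documentation would point to."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# aliases: initials and popular alternate naming
rmse = root_mean_squared_error
root_mean_squared_deviation = root_mean_squared_error
rmsd = root_mean_squared_error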
