I recently came across #12895 (with PR #13467) and the older #6457; this woke up an old topic that I would like to share.
In our team, we needed to provide model performance metrics for regression models. This is a slightly different goal from using metrics for grid search or model selection: the metric is not only used to "select the best model" but also to give users feedback about "how good a model is".
For regression models, I introduced three categories of metrics that turned out to be quite intuitive:
Absolute performance (L2 RMSE, L1 MAE): these metrics can all be interpreted as an "average prediction error" ("average" in the broad sense here), expressed in the unit of the prediction target (e.g. "average error of 12 kWh").
Relative performance (L2 CVRMSE, L1 CVMAE, and per-point relative metrics such as MAPE or MARE, MARES, MAREL...): these metrics can all be interpreted as an "average relative prediction error", expressed as a percentage of the target (e.g. "average error of 10%").
Comparison to a dummy model (L2 RRSE, L1 RAE): these metrics can all be interpreted as the ratio between the performance of the model at hand and the performance of a dummy, constant model (always predicting the average). These need to be inverted to be intuitive, e.g. "20% -> 5 times better than a dummy model" (see the sketch just below this list).
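To make the three categories concrete, here is a minimal numpy sketch using the standard textbook formulas. The helper name and the exact definitions I picked (e.g. CVRMSE as RMSE divided by the mean of y_true) are illustrative assumptions, not an existing scikit-learn API:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Hypothetical helper illustrating the three metric categories."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    # 1) Absolute performance: in the unit of the target.
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))

    # 2) Relative performance: as a fraction of the target.
    cvrmse = rmse / np.mean(y_true)
    mape = np.mean(np.abs(err / y_true))

    # 3) Comparison to a dummy, constant model always predicting the mean of y_true.
    dummy = np.mean(y_true)
    rrse = np.sqrt(np.sum(err ** 2) / np.sum((y_true - dummy) ** 2))
    rae = np.sum(np.abs(err)) / np.sum(np.abs(y_true - dummy))

    return {"rmse": rmse, "mae": mae, "cvrmse": cvrmse, "mape": mape,
            "rrse": rrse, "rae": rae}
```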
Of course these categories are "applicative": they all make sense from a user point of view. However, as far as model selection is concerned, only two make sense (MAE and RMSE). Not even R², because R² = 1 - RRSE², so it is not a performance metric but a comparison-to-dummy metric (but I don't want to open the debate here, so please refrain from objecting on that one :) ).
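For reference, the R² = 1 - RRSE² relation is easy to check numerically against sklearn.metrics.r2_score, using the RRSE definition sketched above:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RRSE: root relative squared error w.r.t. a constant model predicting the mean.
rrse = np.sqrt(np.sum((y_true - y_pred) ** 2)
               / np.sum((y_true - y_true.mean()) ** 2))

assert np.isclose(r2_score(y_true, y_pred), 1.0 - rrse ** 2)
```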
Anyway, my question for the core sklearn team is: shall I propose a pull request with all these metrics? I'm ready to shoot, since we've already done it in our private repo, aligned with sklearn's regression.py file. So it is rather a matter of deciding whether this is a good idea. And if so, introducing categories might be needed to help users understand them better.
An alternative might be to create a small independent project containing all the metrics, leaving only mean_absolute_error (L1) and mean_squared_error (L2) in sklearn.
Any thoughts on this?
Oh, and I forgot: related to this, I realized that the r2_score in regression.py is wrong, since it is automatically set to 0 when y_true is constant, instead of -Inf (x/0) or NaN (0/0). This may well be a good approximation for model selection, but it is definitely mathematically wrong if that metric is used to report model performance back to users.
```python
# arbitrary set to zero to avoid -inf scores, having a constant
# y_true is not interesting for scoring a regression anyway
output_scores[nonzero_numerator & ~nonzero_denominator] = 0.
```
In my implementation, I instead keep all NaN and Inf values, but temporarily switch numpy to silent mode for the division-by-zero and NaN-related warnings.
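For illustration, a minimal sketch of that approach (a hypothetical r2_score_raw helper, not the scikit-learn implementation):

```python
import numpy as np

def r2_score_raw(y_true, y_pred):
    # Hypothetical helper: keep the mathematically "raw" result instead of
    # substituting an arbitrary 0 when y_true is constant.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = np.sum((y_true - y_pred) ** 2)
    denominator = np.sum((y_true - y_true.mean()) ** 2)
    # Temporarily silence numpy's divide-by-zero / invalid-value warnings,
    # so a constant y_true yields -inf (x/0) or NaN (0/0) as expected.
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - numerator / denominator
```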
And finally: as a user, I would prefer to manipulate first-class citizens for all the base metrics above rather than having to partialize more generic metrics. Of course the implementation can be shared through more generic routines, but a user looking for RMSE should be able to find it directly.
In my implementation I even added aliases for all metrics:
initials: e.g. rmse = root_mean_squared_error
popular alternate naming: e.g. cv_rmsd = cv_rmse.
Indeed the overhead of creating aliases is extremely low compared to the educational payoff: all users would be able to find their favorite metric, and then realize that it is the same as some other.
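A minimal sketch of the aliasing idea (illustrative names, not a proposed final API): since an alias is just another binding to the same function object, the maintenance cost is a single assignment per name.

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Long, descriptive name: the canonical entry point."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# initials
rmse = root_mean_squared_error
# popular alternate naming
root_mean_squared_deviation = root_mean_squared_error
rmsd = root_mean_squared_deviation

# All names point to the same function object.
assert rmse([1, 2], [1, 4]) == root_mean_squared_deviation([1, 2], [1, 4])
```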