-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Plotting API starting with ROC curve #14357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doc/modules/classes.rst
Outdated
Plotting metrics | ||
---------------- | ||
|
||
.. automodule:: sklearn.metrics.plot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we've decide to use sklearn.XXX.plot.plot_XXX, right?
We'll put things like plot_decision_boundary
in sklearn.inspection
, right?
sklearn/metrics/plot/roc_curve.py
Outdated
(default='predict_proba') | ||
Method to call estimator to get target scores | ||
|
||
label : str or None, optional (default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used :)
So we've decided to: I think this is a great solution, but I'd like to see more opinions here (>=+3 maybe). Are you able to reach out to more people during the sprint, or maybe a SLEP/mail in the mailing list? You need to change our matplotlib dependency in readme and install.rst to something like We need to take care of |
CC @NicolasHug |
There's a reimplementation of I think it's worth emphasizing that the common simple use-case involves just calling the function, and the user not having to worry about the object. The main motivation for the object is to allow replotting without recomputation. Maybe if we want to make the SLEPs more lightweight it might make sense to write a slep for this, I'm not sure. I listed alternatives in #13448 (#13448 (comment)) There's some discussion there as well. |
Regarding plot_roc_curve, will it be useful to return the roc_auc_score? |
Maybe to summarize, we could:
I like being able to plot a roc curve with a single line. I think that should be our goal, so I'm strongly in favor of 2. Having to recompute, say, partial dependence, for plotting it again is not acceptable to me, though. That means we need to return or store the results of the computation somehow. There's two immediate solutions (maybe there's better ones that I can't think of): We opted for b) here because it hides the complexity somewhat from the user. The simple case still remains simple with a function call, not a class construction, which seems more natural to me (the difference is a bit academic, though, for the user it mostly manifests in whether we use |
docstring tests are failing @thomasjpfan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need an example in the gallery (or we can modify existing example), at least for plot_roc_curve.
doc/developers/contributing.rst
Outdated
return viz | ||
``` | ||
|
||
Note that the ``__init__`` method defines attributes that are going to be used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this paragraph can be integrated into the previous one IMO.
There's two roc curve examples, one for multi-class and one for cross-validation. The cross-validation example we can pretty easily do with the new function, apart from computing the mean.
For the multi-class one, I would use the plot function and manually apply it to each label for plotting in OVR fashion, and use the newly added multi-class roc to actually compute the metric. |
sklearn/metrics/plot/roc_curve.py
Outdated
return self | ||
|
||
|
||
def plot_roc_curve(estimator, X, y, pos_label=None, sample_weight=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's only one artist, I think we don't need to have a dict as an argument and can just do **kwargs
instead of having line_kw={...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, current API design seems strange, I think the visualizer should have a consistent interface compared with the helper function, i.e., it should accept estimator&X&y instead of fpr&tpr, otherwise the visualizer seems useless.
doc/visualizations.rst
Outdated
import matplotlib.pyplot as plt | ||
from sklearn.ensemble import RandomForestClassifier | ||
|
||
rfc = DecisionTreeClassifier(random_state=42) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RandomForestClassifier?
sklearn/metrics/_plot/roc_curve.py
Outdated
def __init__(self, fpr, tpr, auc_roc, estimator_name): | ||
self.fpr = fpr | ||
self.tpr = tpr | ||
self.auc_roc = auc_roc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
users need to pass auc_roc manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the computation is done in plot_roc_curve
and passed into RocCurveDisplay
.
self.fpr = fpr | ||
self.tpr = tpr | ||
self.auc_roc = auc_roc | ||
self.estimator_name = estimator_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems redundant, we already have name in plot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gives the plot a default to name to display in the legend. This is useful when plotting multiple roc curves on the same axes.
The user can later adjust the name by passing name
into plot
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done because how plot
does custom formatting to the label to add the AUC score in the label for convenience.
sklearn/metrics/_plot/roc_curve.py
Outdated
|
||
if y_pred.ndim != 1: | ||
if y_pred.shape[1] > 2: | ||
raise ValueError("Estimator must be a binary classifier") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binary classifier
seems strange, maybe things like binary classification problem
.
Naming sugestion: changer "Visualizer" to "Display": "RocCurveDisplay" |
""" | ||
if response_method != "auto": | ||
prediction_method = getattr(estimator, response_method, None) | ||
if prediction_method is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should check that user provided prediction_method
is one of the accepted values.
@qinhanmin2014 the ultimate goal here is to enable changing the style of plots without having to recompute values. We want to keep functions |
Maybe it's common for a helper function to do similar things compared with the class in scikit-learn? |
True, but I guess estimators are intended to be only used once. The advantage of the current design is that if you realize you want to change the style of the plot after the fact, your code can stay (almost) the same. With your proposal uses have to replace the function call by an object instantiation. |
@qinhanmin2014 raises the question in the dev meeting whether we need a visualizer for |
I kinda like the |
I think Andy persuaded me during the meeting. I'll vote +1 here.
I agree that we don't need a visualizer for plot_tree, but the inconsistency is indeed a problem.
+1 for either solution. I don't care much about the name. Maybe @NicolasHug can organize a vote in the mailing list (since it's a brand new API in scikit-learn) and then we can merge this one. |
waiting for @thomasjpfan to post an update and ping (not sure why I should be the one calling for a vote though?) |
because you organize the meeting, hmm, strange reason right :) |
I have updated this PR and the original post with the following summary: SummaryThe key features of this API is to run calculations once and to have the flexibility to adjust the visualizations after the fact. This logic is encapsulated into a display object where the computed data is stored and the plotting is done in a For example, the class RocCurveDisplay:
def __init__(self, fpr, tpr, roc_auc, estimator_name):
...
self.fpr = fpr
self.tpr = tpr
self.roc_auc = roc_auc
self.estimator_name = estimator_name
def plot(self, ax=None, name=None, **kwargs):
...
self.line_ = ...
self.ax_ = ax
self.figure_ = ax.figure_ Together with a plotting function to create this object: def plot_roc_curve(estimator, X, y, pos_label=None, sample_weight=None,
drop_intermediate=True, response_method="auto",
name=None, ax=None, **kwargs):
# do computation
viz = RocCurveDisplay(fpr, tpr, roc_auc,
estimator.__class__.__name__)
return viz.plot(ax=ax, name=name, **kwargs) |
I think Display is ambiguous but much more succinct. I don't think there's
any reason these need to take the same convention as "Classifier",
"Estimator" as they are entirely different. Go with Display? Btw I've not
looked in detail but at a glance I like this API.
|
I'm not sure if a vote is going to help here, it'll complicate things and this PR would be merged much much later, if we go for that process, which also includes writing a SLEP. |
So +1 from Joel, Andy, Adrin, Nicolas and me (maybe Gael not sure, sorry if I miss someone), |
doc/visualizations.rst
Outdated
Available Plotting Utilities | ||
============================ | ||
|
||
Fucntions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling Error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Welcome to a brave new world!
|
The brave new world of plotting? |
Yes, the brave new world of maintaining a wide ranging plotting API!
|
(Post Updated 08/05/19)
Reference Issues/PRs
Related to
What does this implement/fix? Explain your changes.
Scikit-learn defines a simple API for creating visualizations for machine learning. The key features of this API is to run calculations once and to have the flexibility to adjust the visualizations after the fact. This logic is encapsulated into a display object where the computed data is stored and the plotting is done in a
plot
method. The display object's__init__
method contains only the data needed to create the visualization. Theplot
method takes in parameters that only have to do with visualization, such as a matplotlib axes. Theplot
method will store the matplotlib artists as attributes allowing for style adjustments through the display object. Aplot_*
helper function accepts parameters to do the computation and the parameters used for plotting. After the helper function creates the display object with the computed values, it calls the display's plot method. Note that theplot
method defines attributes related to matplotlib, such as the line artist. This allows for customizations after calling theplot
method.For example, the
RocCurveDisplay
defines the following methods:Together with a plotting function to create this object:
Any other comments?
Here is a notebook demonstrating a workflow for using this API.
Here are what the API implemenation looks like for other visualizations: