ENH get column names by default in PDP when passing data… #15429
Conversation
Mostly looks good
Is there a way to check if the correct feature names are used in the plot?
if not (hasattr(X, '__array__') or sparse.issparse(X)):
    X = check_array(X, force_all_finite='allow-nan', dtype=np.object)
I feel like at some point this should be inside check_array. Also, why not pass accept_sparse to check_array and not check it here?
> I feel like at some point this should be inside check_array

Agreed. I think that @jorisvandenbossche intended a PR on this a while ago.

> Also, why not pass accept_sparse to check_array and not check it here?

I think that the idea was to delegate the check to the underlying pipeline. In fact, I am not sure that we need to do any checking at all. If we only need to get a column, _safe_indexing should be smart enough to deal with a list.
Actually, the next line, n_features=X.shape[0], will fail, so we need something other than a list.
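For context, here is a minimal sketch of the conversion being discussed, under the assumption that only X.shape and column indexing are needed downstream; the helper name _ensure_indexable is made up for illustration and is not part of the PR:

```python
from scipy import sparse
from sklearn.utils import check_array


def _ensure_indexable(X):
    """Convert plain Python sequences to an object array so that X.shape
    exists later; leave arrays, sparse matrices and dataframes untouched
    so the underlying estimator or pipeline can run its own checks.
    """
    if not (hasattr(X, '__array__') or sparse.issparse(X)):
        # object dtype keeps heterogeneous columns (e.g. strings) intact,
        # and 'allow-nan' avoids rejecting missing values at this stage
        X = check_array(X, force_all_finite='allow-nan', dtype=object)
    return X
```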
if hasattr(X, "loc"):
    # get the column names for a pandas dataframe
    feature_names = X.columns.tolist()
Should this not explicitly check for columns instead?
Until now, we have always duck-typed dataframes in this way.
Maybe it's time to have an is_dataframe helper.
We can leave it for another PR though. I'm happy as is for this PR.
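A possible shape for such a helper, sketched only to illustrate the suggestion; the name is_dataframe and the extra columns check are assumptions, not part of this PR:

```python
def is_dataframe(X):
    """Duck-type check for a pandas-like DataFrame without importing pandas."""
    return hasattr(X, "loc") and hasattr(X, "columns")


# The hunk above would then read:
# if is_dataframe(X):
#     feature_names = X.columns.tolist()
```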
LGTM, thanks @glemaitre
Is there a way to check if the correct feature names are used in the plot?
That'd be nice; I think @thomasjpfan would know?
It is done already in l.131-132 in the file.
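For reference, a rough sketch of what such a check can look like; this is assumed test code, not the lines cited above, and it presumes the returned display exposes its subplots as axes_ and labels each one-way plot's x-axis with the feature name:

```python
import pandas as pd

from sklearn.datasets import make_regression
from sklearn.inspection import plot_partial_dependence
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=50, n_features=3, random_state=0)
df = pd.DataFrame(X, columns=["a", "b", "c"])
est = LinearRegression().fit(df, y)

# with a dataframe, string features resolve against the column names
disp = plot_partial_dependence(est, df, features=["a", "b"])
xlabels = [ax.get_xlabel() for ax in disp.axes_.ravel() if ax is not None]
assert xlabels == ["a", "b"]
```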
Actually, my change to this test creates a dataframe, so we no longer test with numpy arrays. We should do both there as well.
I added back the test where the input data is a numpy array and …
Merged with master to get the fix for CI. Will merge when green.
Thank you @glemaitre!
Follow-up of #14028; partially addresses #14969.

This allows not having to specify feature_names with a pandas DataFrame, by taking X.columns.tolist() by default.
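A small usage illustration of the new default; the dataset and estimator are chosen just for the example and assume the 0.22-era plot_partial_dependence API:

```python
import pandas as pd

from sklearn.datasets import load_diabetes
from sklearn.inspection import plot_partial_dependence
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
est = LinearRegression().fit(X, data.target)

# Previously the names had to be given explicitly:
#   plot_partial_dependence(est, X, ["bmi", "bp"], feature_names=X.columns.tolist())
# With this change the column names are picked up from the DataFrame:
plot_partial_dependence(est, X, ["bmi", "bp"])
```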