Skip to content

[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 #14475

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 55 commits into from
Sep 4, 2019

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Jul 25, 2019

This is the follow-up of #14035.

It is a refactoring of safe_indexing splitting the X indexing depending on the type and reorganizing the tests as well.

@glemaitre glemaitre changed the title EHN add support for scalar, slice and mask in safe_indexing axis=0 [MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 Jul 25, 2019
@glemaitre
Copy link
Member Author

@jnothman @NicolasHug As promised with a bit of delay, I draft something to make safe_indexing consistent. Happy to have feedback.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :)

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question, but is there a real use-case for this?

@glemaitre
Copy link
Member Author

Dumb question, but is there a real use-case for this?

Passing a mask array to select sample does not seem inappropriate.
This is feature is mainly to make safe_indexing consistent after @jnothman had concerned about it. I agree with him that having a close behaviour and support for axis=0 and axis=1 would make sense.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that previously unsupported code will now work, such as cv splitters returning masks or feature selectors...? Are we being too lenient in our APIs then?

- To select multiple elements (i.e. rows or columns), `indices` can be
one of the following: `list`, `array`, `slice`. The type used in
these containers can be one of the following: `int`, `bool`, and
`str`. `str` is only supported when `X` is a dataframe.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are slices of bool and str?

Would this work?

To select multiple elements (i.e. rows or columns), indices can be
a slice, or a container (list orarray). The type used in
these containers can be one of the following: int, bool, and
str. str is only supported when X is a dataframe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slice of bool will give you the first line (it casting stuff into int). We should raise an error for such case I think.

slice of string is useful with dataframe:

df.loc[slice('xxx', 'yyy')]

I see that we have an issue because we only use iloc and then it would not work. So the docstring is wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really want to support slices of strings? I'd say we don't for now to keep it simple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already supported for the column with dataframe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is:

from sklearn.datasets import fetch_openml
iris = fetch_openml('iris', as_frame=True)
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
transformer = make_column_transformer(
    (FunctionTransformer(), slice('sepallength', 'sepalwidth'))
)
transformer.fit_transform(iris.data, iris.target)

@glemaitre
Copy link
Member Author

I refactor the code which should make it more understandable but less easy to review by looking at the diff (sorry for that). I think that I will need to do the same regarding the tests if we want to keep this indexing manageable.

So before to go further, I think that we need to answer a couple of questions:

I think that this is the main blocker here.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

@glemaitre
Copy link
Member Author

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

As mentioned in one of my comment, I reintroduced support for boolean array-like and integer slice meaning since we were supporting it.

@jnothman
Copy link
Member

jnothman commented Aug 20, 2019 via email

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat confused by this _check_key_type(key, superclass) design. It repeats very similar checks for each of three classes, but also has specialised code paths depending on superclass. Why not _determine_key_type(key) -> {'s', 'i', 'b', 'none-slice'} called exactly once per safe_indexing?

"""Index a pandas dataframe or a series."""
# check whether we should index with loc or iloc
if _check_key_type(key, int):
by_name = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be raising a warning/error if X.index.dtype.kind in 'iu' (and similar on the other axis)? This won't handle the case where there are ints of object dtype in the index.


if hasattr(key, 'shape'):
# Work-around for indexing with read-only key in pandas
# FIXME: solved in pandas 0.25
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use an explicit version comparison to take this codepath?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we need to make an explicit pandas import

if np.isscalar(key) or isinstance(key, slice):
# key is a slice or a scalar
return X[key]
key_set = set(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key will usually be an array, won't it? Should we cast into an array at some point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we need to do so. We can always postpone until required.

return None
if isinstance(key, tuple(dtype_to_str.keys())):
try:
return dtype_to_str[type(key)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is the right behaviour for a scalar bool

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? bool and np.bool_ will return the string 'bool'. What would you expect instead?

Note that in safe indexing, we are not using scalar bool to index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in safe indexing, we are not using scalar bool to index.

Okay ;)

return set_type
if hasattr(key, 'dtype'):
try:
return array_dtype_to_str[key.dtype.kind]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is behaviour with dtype='O' where the array contains only ints or only bools correct? (How does numpy handle this?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy will return 'O' and nothing about 'i' or 'b' so it will fail. I am not really sure how to solve this issue without adding so much complexity.

One would need to try converting the array to bool and int first. However, I thought that we are usually dtype=object to handle string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maintains the behavior on master where we use ('O', 'U', 'S') to mean string:

if hasattr(key, 'dtype'):
if superclass is int:
return key.dtype.kind == 'i'
elif superclass is bool:
return key.dtype.kind == 'b'
else:
# superclass = str
return key.dtype.kind in ('O', 'U', 'S')

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not reviewed tests thoroughly, but I'm pretty happy with this!

glemaitre and others added 2 commits August 21, 2019 13:47
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@rth
Copy link
Member

rth commented Sep 4, 2019

Looks like there are two approvals. Anything blocking the merge?

@glemaitre
Copy link
Member Author

I think this is ready to be merged. I address all comments.

@jnothman jnothman merged commit ae5f558 into scikit-learn:master Sep 4, 2019
@jnothman
Copy link
Member

jnothman commented Sep 4, 2019

Thanks @glemaitre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants