-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 #14475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 #14475
Conversation
@jnothman @NicolasHug As promised with a bit of delay, I draft something to make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! :)
…into is/consistent_safe_indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumb question, but is there a real use-case for this?
Passing a mask array to select sample does not seem inappropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that previously unsupported code will now work, such as cv splitters returning masks or feature selectors...? Are we being too lenient in our APIs then?
sklearn/utils/__init__.py
Outdated
- To select multiple elements (i.e. rows or columns), `indices` can be | ||
one of the following: `list`, `array`, `slice`. The type used in | ||
these containers can be one of the following: `int`, `bool`, and | ||
`str`. `str` is only supported when `X` is a dataframe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are slices of bool and str?
Would this work?
To select multiple elements (i.e. rows or columns),
indices
can be
a slice, or a container (list
orarray
). The type used in
these containers can be one of the following:int
,bool
, and
str
.str
is only supported whenX
is a dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slice of bool will give you the first line (it casting stuff into int). We should raise an error for such case I think.
slice of string is useful with dataframe:
df.loc[slice('xxx', 'yyy')]
I see that we have an issue because we only use iloc
and then it would not work. So the docstring is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really want to support slices of strings? I'd say we don't for now to keep it simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already supported for the column with dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is:
from sklearn.datasets import fetch_openml
iris = fetch_openml('iris', as_frame=True)
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
transformer = make_column_transformer(
(FunctionTransformer(), slice('sepallength', 'sepalwidth'))
)
transformer.fit_transform(iris.data, iris.target)
I refactor the code which should make it more understandable but less easy to review by looking at the diff (sorry for that). I think that I will need to do the same regarding the tests if we want to keep this indexing manageable. So before to go further, I think that we need to answer a couple of questions:
I think that this is the main blocker here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand what's going on. Are we considering safe_indexing
not public and so we can break backward-compatibility?
…into is/consistent_safe_indexing
As mentioned in one of my comment, I reintroduced support for boolean array-like and integer slice meaning since we were supporting it. |
Seems that masks were used for indexing in several safe_indexing uses.
Sorry for sending you off the path. And CV is the only place where the mask
is provided effectively by users, so we should be safe to support more
formats if all formats are acceptable there.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat confused by this _check_key_type(key, superclass)
design. It repeats very similar checks for each of three classes, but also has specialised code paths depending on superclass
. Why not _determine_key_type(key) -> {'s', 'i', 'b', 'none-slice'}
called exactly once per safe_indexing
?
sklearn/utils/__init__.py
Outdated
"""Index a pandas dataframe or a series.""" | ||
# check whether we should index with loc or iloc | ||
if _check_key_type(key, int): | ||
by_name = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be raising a warning/error if X.index.dtype.kind in 'iu'
(and similar on the other axis)? This won't handle the case where there are ints of object dtype in the index.
|
||
if hasattr(key, 'shape'): | ||
# Work-around for indexing with read-only key in pandas | ||
# FIXME: solved in pandas 0.25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use an explicit version comparison to take this codepath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we need to make an explicit pandas import
sklearn/utils/__init__.py
Outdated
if np.isscalar(key) or isinstance(key, slice): | ||
# key is a slice or a scalar | ||
return X[key] | ||
key_set = set(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key will usually be an array, won't it? Should we cast into an array at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we need to do so. We can always postpone until required.
return None | ||
if isinstance(key, tuple(dtype_to_str.keys())): | ||
try: | ||
return dtype_to_str[type(key)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is the right behaviour for a scalar bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? bool
and np.bool_
will return the string 'bool'
. What would you expect instead?
Note that in safe indexing, we are not using scalar bool to index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that in safe indexing, we are not using scalar bool to index.
Okay ;)
return set_type | ||
if hasattr(key, 'dtype'): | ||
try: | ||
return array_dtype_to_str[key.dtype.kind] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is behaviour with dtype='O' where the array contains only ints or only bools correct? (How does numpy handle this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy will return 'O'
and nothing about 'i' or 'b' so it will fail. I am not really sure how to solve this issue without adding so much complexity.
One would need to try converting the array to bool and int first. However, I thought that we are usually dtype=object
to handle string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maintains the behavior on master where we use ('O', 'U', 'S') to mean string:
scikit-learn/sklearn/utils/__init__.py
Lines 311 to 318 in 68044b0
if hasattr(key, 'dtype'): | |
if superclass is int: | |
return key.dtype.kind == 'i' | |
elif superclass is bool: | |
return key.dtype.kind == 'b' | |
else: | |
# superclass = str | |
return key.dtype.kind in ('O', 'U', 'S') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not reviewed tests thoroughly, but I'm pretty happy with this!
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
Looks like there are two approvals. Anything blocking the merge? |
I think this is ready to be merged. I address all comments. |
Thanks @glemaitre |
This is the follow-up of #14035.
It is a refactoring of
safe_indexing
splitting theX
indexing depending on the type and reorganizing the tests as well.