feature names - NamedArray #14315
Conversation
Awesome, thanks for working on this! What does the repr of the output of

Looks to me like pandas is pretty deeply baked into xarray.

wanna add xarray to the CI and see what happens? ;)

What are the requirements for a DataArray as input for column names to work correctly in your proposal? call the [1] dimension

Looking now at the PR, I see that you do the wrapping of the output into an xarray DataArray only at the level of the Pipeline (in order to pass it to the next step).

I was actually also thinking of returning them in the estimators, not the pipeline. In terms of backward incompatibility: this is something we'll have to swallow at some point if we want to return something with more metadata, and I think it makes sense.

I've called them I didn't try to fix all the estimators. These two need to be done:

The first one I'd like to do once we have #13603 in, because then it'd be really easy. The second one, we need to discuss how we'd like the API to behave. To start, my proposal is to allow the user to ask for the data with the metadata, or just a numpy array. By default, it's a numpy array, and the pipeline and other meta-estimators can activate the metadata output if they need it.
Drafting the API I mentioned, here's how it'd look:

```python
t = PCA(n_components=3).fit(X)
t._return_dataarray_ = True
t.transform(X)
```

```
<xarray.DataArray (samples: 150, features: 3)>
array([[-2.684126,  0.319397, -0.027915],
       [-2.714142, -0.177001, -0.210464],
       [-2.888991, -0.144949,  0.0179  ],
       ...,
       [ 1.764346,  0.078859,  0.130482],
       [ 1.900942,  0.116628,  0.723252],
       [ 1.390189, -0.282661,  0.36291 ]])
Coordinates:
  * samples   (samples) int64 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
  * features  (features) <U4 'pca0' 'pca1' 'pca2'
```

It would return a usual numpy array if we don't set the flag, and the pipeline would set the flag (WIP). EDIT: I'm thinking of having that flag as an
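The opt-in flag mechanism discussed above can be sketched in plain Python. This is only an illustration of the idea, not the PR's code: `NamedArray`, `FlagDemoTransformer`, and the generated placeholder names are all hypothetical, and the real proposal attaches actual column names rather than `x0`, `x1`, ...:

```python
import numpy as np

class NamedArray:
    """Hypothetical lightweight container holding data plus feature names
    (an illustration only, not the object implemented in this PR)."""
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

class FlagDemoTransformer:
    """Sketch of the opt-in flag idea: return a plain ndarray by default,
    and a named container only when the flag is set (e.g. by a Pipeline)."""
    _return_dataarray_ = False

    def transform(self, X):
        X = np.asarray(X)
        # Placeholder names; a real transformer would derive these from
        # the input columns or its own output features.
        names = [f"x{i}" for i in range(X.shape[1])]
        if self._return_dataarray_:
            return NamedArray(X, names)
        return X
```

With the flag unset, `transform` returns a bare ndarray; a meta-estimator that wants the metadata would flip the flag before calling it.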
I would rather have a global flag that allows returning the metadata whenever we can, which is probably whenever the input was a pandas DataFrame or DataArray, and in some featurizers; I'd mark that as an experimental feature.
I did try subclassing an ndarray; it is indeed trickier and dirtier than I thought. So for now, I've just added a simple container (

```python
import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [('categories', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(X)
column_trans.feature_names_out_
```

```
['categories__city_London',
 'categories__city_Paris',
 'categories__city_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']
```

```python
t = CountVectorizer()
t._return_dataarray_ = True
t.fit(X).transform(X)
```

```
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>
feature names: ['city', 'expert_rating', 'title', 'user_rating']
```
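The `<transformer>__<feature>` naming convention shown in `feature_names_out_` above can be sketched with a small helper. This function is hypothetical, for illustration only, and not part of the PR:

```python
def prefixed_feature_names(named_outputs):
    """Prefix each transformer's output feature names with the transformer's
    name, mirroring names like 'categories__city_London' above.

    named_outputs: list of (transformer_name, feature_names) pairs.
    Hypothetical helper, for illustration only.
    """
    return [f"{name}__{feat}"
            for name, feats in named_outputs
            for feat in feats]
```

For example, `prefixed_feature_names([('categories', ['city_London', 'city_Paris'])])` gives `['categories__city_London', 'categories__city_Paris']`.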
With a custom

@adrinjalali what were the issues with subclassing? @thomasjpfan surely you didn't mean Also, I just realized following this discussion: we can actually make it "just work" for many cases and don't need to implement

@thomasjpfan I think I would first start with just having

With only

somehow I can't find your proposal

@adrinjalali I removed it. It was not a good idea and the discussion is better left for scikit-learn/enhancement_proposals#17.

This is likely doable with some effort. You would lose label-based indexing/alignment, support for datetime dtypes, and groupby operations. With more effort, most of these could be replaced by less efficient pure NumPy/Python implementations. Fundamentally, xarray uses pandas for two things: (1) low-level operations that rely upon hash tables and (2) some fast low-level loops that aren't implemented by NumPy (mostly for non-numeric dtypes).

Almost all the tests now pass, and almost all the remaining failing ones are related to me removing This is also not super clean, but I think it's at a good state to be considered as a proposal for us to see what it could look like.

I'll try to check on it today or tomorrow.
should we enforce that names are unique?
Sounds reasonable to me, but that needs a much better way of feature name generation in some of the transformers. For instance, having two separate PCAs on two different sets of columns, then unioning them, would result in duplicate names right now.
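One cheap way to surface such collisions would be an explicit uniqueness check along these lines. This is only a sketch; `check_unique_feature_names` is a hypothetical helper, not existing scikit-learn API:

```python
from collections import Counter

def check_unique_feature_names(names):
    """Raise if any generated feature name appears more than once,
    e.g. two PCAs in a union both emitting 'pca0', 'pca1', ...
    Sketch only; no such check exists in the PR.
    """
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate feature names: {duplicates}")
    return names
```

Whether such a check is worth its cost on every transform is exactly the open question raised later in this thread.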
@adrinjalali I would have

Sorry, I find this a bit hard to follow as there are changes in so many files (now I know how y'all feel about my PRs lol). So this is not using the xarray DataArray idea now but doing a NamedArray?
```python
cv_coefs = np.concatenate([cv_pipeline.named_steps["classifier"].coef_
                           for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(pipeline.named_steps["classifier"].feature_names_in_,
```
This is a bit confusing since this comes from the `fit` way above and not from the `cross_validate`.
I haven't really fixed the documentation part yet
sklearn/base.py
```python
    It only attaches the feature names to the data if `return_dataarray`
    is set to `True`."""
    if self._return_dataarray_:
```
I don't understand this mechanism. How does this work currently? This is only set in the pipeline? Why? To not show the dataarray (or rather NamedArray) to users? And doesn't that need to be set in FeatureUnion and ColumnTransformer as well? And how does it propagate through meta-estimators?

I wonder if most of the time we don't need names to be unique and we would be wasting computation to check.
Here's my effort to summarize what has happened here. I started by passing the column names around using an `xarray.DataArray`; then it seemed sensible to have a lightweight implementation of a `NamedArray` instead. On the estimator API side, right now:

Now, these are the questions I have:

If the answer is yes to the above question:
Maybe we should go back to trying to write a SLEP? Why do you introduce a flag? Why would we not always return a NamedArray?

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue, though. Xarray depends on pandas, and that itself means we cannot depend on it (we don't want to depend on pandas). Also, xarray uses indices substantially (which is why it depends on pandas), and we don't want them. We did explore using xarray before coming to the conclusion that we need our own array. It was not suitable.

Could you point me to his latest suggestion? My apologies; just regarding the issues of sample and feature properties, there are so many different open issues and SLEPs.

And apologies too if I appeared to question your choices; totally not my intention, and I'm just happy that these issues are being discussed. I just don't want the scikit-learn core team to go through all the effort of developing new functionality that then doesn't solve the basic needs of, e.g., life-sciences researchers who use ML. For us, we need 2D sample metadata and 2D feature metadata dataframes that are in some way transparently attached or associated with

That's a fair point, and good to know.

I think this matters more for compatibility with the scikit-learn API itself? Like

If we want to use the new data structure to solve the categorical/continuous issue that @thomasjpfan points to, we're basically reimplementing the chunks that are currently used in pandas, right? The main reason we're not going with pandas is the fact that they want to move away from this model because it's too hard to maintain, from what I understand.

@adrinjalali can you please elaborate on what you wanted to write in the SLEP re categorical data? @thomasjpfan had the same question above, I think. My assumption was that we would have homogeneous data, so categorical values would indeed be stored as floats.

Can someone give an example where the output of an estimator would contain both continuous and categorical columns? I guess that would involve a ColumnTransformer, but right now we have nothing that produces categorical columns (that are not integers), right?
ColumnTransformer can have remainder='passthrough', which can pass through the categorical features. For example:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

column_trans = ColumnTransformer(
    [('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')
column_trans.fit_transform(X)
```

As an aside, I think using homogeneous data works for most use cases. Even for strings, we can include the categories in the metadata and encode them as floats.

Edit: Fix example to actually pass through the "city" category.
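The "encode categories as floats, keep the labels in metadata" idea could look roughly like this. This is a sketch under assumed names; `encode_with_metadata` is not an actual API from the PR or scikit-learn:

```python
import numpy as np

def encode_with_metadata(values):
    """Store the string categories in side metadata and the column itself
    as float codes, keeping the array homogeneous.
    Hypothetical illustration of the idea discussed above.
    """
    categories = sorted(set(values))
    lookup = {cat: float(i) for i, cat in enumerate(categories)}
    codes = np.array([lookup[v] for v in values], dtype=float)
    return codes, {"categories": categories}
```

For the `city` column above, this would yield codes `[0.0, 0.0, 1.0, 2.0]` with categories `['London', 'Paris', 'Sallisaw']` in the metadata, so the data array itself stays all-float.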
Regardless of the content of the SLEP, I really think we need a summary of the current state of this matter. With all the discussions happening in the SLEPs, issues, and PRs, it's very hard for people who haven't been following it since the beginning to understand what's going on.
@NicolasHug Isn't the summary the new SLEP?

I'll write up a new SLEP with the summary and all. I just wanted to talk to Thomas before doing that, and we're doing that in the coming week. Regarding an estimator with categorical and numerical data as output, some resamplers would be another example. And regarding the question of whether we want homogeneous data or not, the answer is yes as far as I know, or at least it has been the case. I haven't confirmed it since I need to know more about what we want to include in this new object.

Hi, I just wanted to ask what the current state of this issue is?

#16772 is the latest attempt to pass feature names around.

With #23734 merged, can we close?

Yes. We should think of supporting things other than pandas, but it'd be very far from the implementation here anyway.

@thomasjpfan also a reminder that this is what we thought we'd have to do regarding sparse arrays back then.
This is based on #13307 by @amueller; related SLEP: scikit-learn/enhancement_proposals#18

This PR explores the idea of passing around the column names with the X object. It is the (b) option in scikit-learn/enhancement_proposals#18 (comment)

The idea is to pass the column names with the object, and for that we can either use an xarray.DataArray, or a sklearn.NamedArray. This PR uses xarray, exploring the idea, and the focus is to see what it would look like if we do it. There are certain aspects we need to consider when deciding between xarray and an internal NamedArray:

- xarray depends on pandas! I haven't checked/asked if it'd be easy/doable to remove that hard dependency, but for now it's there.
- A NamedArray requires careful discussion on whether we really wanna have such an object, and what it should include as features.

The code behaves as such ATM:

input:
output:

input:
output

input:
output

EDIT: an update/summary is written under #14315 (comment)