feature names - NamedArray #14315

Closed (19 commits)

Conversation

@adrinjalali (Member) commented Jul 12, 2019

This is based on #13307 by @amueller, related SLEP: scikit-learn/enhancement_proposals#18

This PR explores the idea of passing around the column names with the X object. It is the (b) option in scikit-learn/enhancement_proposals#18 (comment)

The idea is to pass the column names with the object, and for that we can either use an xarray.DataArray or a sklearn.NamedArray. This PR uses xarray to explore the idea, and the focus is to see what it would look like if we do it.

There are certain aspects we need to consider when deciding between xarray and an internal NamedArray:

  • xarray depends on pandas! I haven't checked/asked if it'd be easy/doable to remove that hard dependency, but for now it's there.
  • On the other hand, NamedArray requires careful discussion on whether we really wanna have such an object, and what it should include as features.

At the moment the code behaves as follows:

input:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.compose import make_column_transformer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2), LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].feature_names_in_

output:

array(['pca0', 'pca1'], dtype='<U4')

input:

pipe = make_pipeline(make_column_transformer((make_pipeline(StandardScaler(),
                                                            PCA()), X.columns),
                                             (StandardScaler(), X.columns[:2])),
                     SelectKBest(k=2), LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].feature_names_in_

output:

array(['pipeline__pca0', 'standardscaler__sepal length (cm)'],
      dtype='<U33')

input:

pipe = make_pipeline(make_column_transformer((PCA(), X.columns),
                                             (StandardScaler(), X.columns[:2])),
                     SelectKBest(k=2), LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].feature_names_in_

output:

array(['pipeline__pca0', 'standardscaler__sepal length (cm)'],
      dtype='<U33')

EDIT: an update/summary is written under #14315 (comment)

@amueller (Member)

Awesome, thanks for working on this! What does the repr of the output of transform look like, say for the PCA or standard scaler?

@amueller (Member)

Looks to me like pandas is pretty deeply baked into xarray.

@amueller (Member)

wanna add xarray to the CI and see what happens? ;)

@amueller (Member)

What are the requirements for a DataArray as input for column names to work correctly in your proposal? Do we call the [1] dimension columns?

@jorisvandenbossche (Member)

Looking at the PR now, I see that you do the wrapping of the output into an xarray DataArray only at the level of the Pipeline (in order to pass it to the next step).
That's of course much less invasive than what I first thought the idea was (namely that the actual fit method of the estimator would do this wrapping; that would also be quite a breaking change in behaviour, since fit methods would start returning a DataArray instead of a numpy array). So my description of this idea in scikit-learn/enhancement_proposals#18 (comment) is not fully correct.

@amueller (Member)

I was actually also thinking of returning them in the estimators, not the pipeline.
If we do it in the pipeline, I think we'll have tons of inconsistent behavior. I haven't looked at the PR yet but I'm not sure I like this option.

in terms of backward-incompatibility: this is something we'll have to swallow at some point if we want to return something with more metadata, and I think it makes sense.
If we implement code paths for pandas to not convert dataframes, that will also break the API. We could have a global option or an experimental activation thing or something like that.

@adrinjalali (Member Author)

What are the requirements for a DataArray as input for column names to work correctly in your proposal? Do we call the [1] dimension columns?

I've called that dimension features, and I check for it in _feature_names() in validation.py.
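
For illustration, here's a sketch (not code from the PR) of how an input would be labeled under that convention, using the public xarray API:

import xarray as xr
from sklearn.datasets import load_iris

iris = load_iris()
# the proposal expects the second dimension to be named "features"
X = xr.DataArray(iris.data, dims=('samples', 'features'),
                 coords={'features': iris.feature_names})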

I didn't try to fix all the estimators. These two things still need to be done:

  • check the input data at the beginning of fit
  • potentially wrap the data in a DataArray at the end of transform or keep the data as a DataArray during transform

The first one I'd like to do once we have #13603 in, cause then it'd be really easy.

The second one, we need to discuss how we'd like the API to behave.

To start, my proposal is to allow the user to ask for the data with the metadata, or just a numpy array. By default, it's a numpy array, and the pipeline and other meta estimators can activate the metadata output if they need it.

@adrinjalali (Member Author) commented Jul 15, 2019

Awesome, thanks for working on this! What does the repr of the output of transform look like, say for the PCA or standard scaler?

Drafting the API I mentioned, here's how it'd look:

t = PCA(n_components=3).fit(X)
t._return_dataarray_ = True
t.transform(X)

<xarray.DataArray (samples: 150, features: 3)>
array([[-2.684126,  0.319397, -0.027915],
       [-2.714142, -0.177001, -0.210464],
       [-2.888991, -0.144949,  0.0179  ],
       ...,
       [ 1.764346,  0.078859,  0.130482],
       [ 1.900942,  0.116628,  0.723252],
       [ 1.390189, -0.282661,  0.36291 ]])
Coordinates:
  * samples   (samples) int64 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
  * features  (features) <U4 'pca0' 'pca1' 'pca2'

It would return a usual numpy array if we don't set the flag, and the pipeline would set the flag (WIP).

EDIT: I'm thinking of having that flag as an __init__ parameter to all estimators, but it doesn't have to be there at the beginning.

@amueller (Member)

I would rather have a global flag that allows returning the metadata whenever we can, which is probably whenever the input was a pandas dataframe or DataArray, and in some featurizers; I'd mark that as an experimental feature.
Why would people want their meta-data lost if they provided it in the input?

@adrinjalali (Member Author)

I did try subclassing an ndarray; it is indeed trickier and dirtier than I thought.

So for now, I've just added a simple container (NamedArray), and since it doesn't care what the data is, it supports sparse as well:

import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
    [('categories', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(X)
column_trans.feature_names_out_

['categories__city_London',
 'categories__city_Paris',
 'categories__city_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']

t = CountVectorizer()
t._return_dataarray_ = True
t.fit(X).transform(X)

<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>
feature names: ['city', 'expert_rating', 'title', 'user_rating']

@thomasjpfan (Member) commented Jul 17, 2019

I did try subclassing an ndarray; it is indeed trickier and dirtier than I thought.

I agree with not subclassing ndarray. We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html

With a custom NamedArray, we can have "namedarray in -> namedarray out" for every estimator. make_named_array(pd_df) would be used to wrap a dataframe before passing it into our estimators; make_named_array(np_array, features=[...]) would wrap plain numpy arrays. We would still be backward compatible, since only NamedArrays would have their metadata preserved.
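
A minimal sketch of such a duck-typed container, assuming make_named_array works as described in this comment (neither name is necessarily what the PR implements):

import numpy as np
import pandas as pd

class NamedArray:
    """Thin container holding data plus feature names; not an ndarray subclass."""
    def __init__(self, data, features=None):
        self.data = data
        self.features = list(features) if features is not None else None

    def __array__(self, dtype=None):
        # lets numpy coerce us transparently, so np.mean(named) etc. work
        return np.asarray(self.data, dtype=dtype)

def make_named_array(X, features=None):
    # wrap a DataFrame (taking names from its columns) or a plain array
    if isinstance(X, pd.DataFrame):
        return NamedArray(X.to_numpy(), X.columns)
    return NamedArray(X, features)

named = make_named_array(np.eye(3), features=['a', 'b', 'c'])
np.mean(named)  # 0.333... -- goes through __array__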

@amueller (Member) commented Jul 17, 2019

@adrinjalali what were the issues with subclassing?

@thomasjpfan surely you didn't mean named_named_array?

Also, I just realized following this discussion:
scikit-learn/enhancement_proposals#18 (comment)

we can actually make it "just work" for many cases, and we won't need to implement get_feature_names in those cases!!!

stupid github not supporting gifs

@amueller (Member)

@thomasjpfan I think I would first start with just having columns

@thomasjpfan (Member)

With only columns, each transformer would have to do some name mangling with the original column names. In my API proposal, _orig_columns and _columns_to_orig are used together to form columns. (I updated them to be private.)

@adrinjalali (Member Author)

@thomasjpfan

With only columns, each transformer would have to do some name mangling with the original column names. In my API proposal, _orig_columns and _columns_to_orig are used together to form columns. (I updated them to be private.)

somehow I can't find your proposal

@thomasjpfan (Member)

@adrinjalali I removed it. It was not a good idea and the discussion is better left for scikit-learn/enhancement_proposals#17.

@shoyer (Contributor) commented Jul 17, 2019

  • xarray depends on pandas! I haven't checked/asked if it'd be easy/doable to remove that hard dependency, but for now it's there.

This is likely doable with some effort. You would lose label-based indexing/alignment, support for datetime dtypes and groupby operations. With more effort, most of these could be replaced by less efficient pure NumPy/Python implementations. Fundamentally, xarray uses pandas for two things: (1) low-level operations that rely upon hash-tables and (2) some fast low-level loops that aren't implemented by NumPy (mostly for non-numeric dtypes).

@adrinjalali (Member Author)

Almost all the tests now pass, and almost all the remaining failures are related to me removing get_feature_names(). We can of course keep the function with the old behavior and deprecate it in the final proposal.

This is also not super clean, but I think it's in a good enough state to be considered as a proposal, so we can see how it could look.

@amueller (Member)

I'll try to check on it today or tomorrow.

@jnothman (Member) left a comment:

should we enforce that names are unique?

@adrinjalali (Member Author)

should we enforce that names are unique?

Sounds reasonable to me, but that needs a much better way of generating feature names in some of the transformers. For instance, having two separate PCAs on two different sets of columns and then unioning them would result in duplicate names right now.

@amueller (Member)

@adrinjalali I would have FeatureUnion and ColumnTransformer do name mangling. I think these are the only places where we need to do it, but that should solve most issues.
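
A sketch of the kind of name mangling meant here: each sub-transformer's output names get prefixed with the transformer's name, matching the pipeline__pca0-style names shown earlier (the helper name is illustrative, not from the PR):

def _mangle_feature_names(name, feature_names):
    # ('pca-a', ['pca0', 'pca1']) -> ['pca-a__pca0', 'pca-a__pca1']
    return [f'{name}__{f}' for f in feature_names]

# two PCAs that would otherwise collide now produce unique names
(_mangle_feature_names('pca-1', ['pca0', 'pca1'])
 + _mangle_feature_names('pca-2', ['pca0', 'pca1']))
# ['pca-1__pca0', 'pca-1__pca1', 'pca-2__pca0', 'pca-2__pca1']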

@amueller (Member)

Sorry I find this a bit hard to follow as there are changes in so many files (now I know how y'all feel about my PRs lol).

So this is not using the xarray DataArray idea any more, but a NamedArray instead?

cv_coefs = np.concatenate([cv_pipeline.named_steps["classifier"].coef_
                           for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(pipeline.named_steps["classifier"].feature_names_in_,
Review comment (Member):

This is a bit confusing, since this comes from the fit call way above and not from the cross_validate.

@adrinjalali (Member Author):

I haven't really fixed the documentation part yet

sklearn/base.py (Outdated)

It only attaches the feature names to the data if `return_dataarray`
is set to `True`."""
if self._return_dataarray_:
Review comment (Member):

I don't understand this mechanism. How does this work currently? Is this only set in pipeline? Why?
To not show the dataarray (or rather NamedArray) to users?
And doesn't that need to be set in FeatureUnion and ColumnTransformer as well?
And how does it propagate through meta-estimators?

@jnothman (Member) commented Jul 23, 2019 via email

@adrinjalali changed the title from "feature names - xarray" to "feature names - NamedArray" on Jul 24, 2019
@adrinjalali (Member Author)

Here's my effort to summarize what has happened here.

I started by passing the column names around using an xarray.DataArray, to see what it would look like in terms of the changes required in the code, etc. Then, for the following reasons:
(1) xarray has a hard dependency on pandas,
(2) it doesn't support sparse matrices, and when it does, it'll be pydata.sparse which is COO only for now,

it seemed sensible to have a lightweight implementation of a NamedArray. For now, the implementation accepts the data and the feature names, and as long as the data is a numpy array it works without issue in operations with other numpy objects (it implements __array__). However, things get a bit tricky once the data is a sparse matrix, in which case the user should take the data out of the NamedArray and work with that (for instance, scipy.issparse() does an isinstance() check which clearly fails on a NamedArray, unless we hook into that).
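
To make the sparse corner concrete (reusing the NamedArray sketch from the earlier comment; illustration only, not the PR's exact class):

from scipy import sparse

named = NamedArray(sparse.eye(3, format='csr'), features=['a', 'b', 'c'])
sparse.issparse(named)       # False: issparse is an isinstance() check
sparse.issparse(named.data)  # True: callers have to unwrap first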

On the estimator API side, right now:

  • the estimators return the usual numpy array or scipy sparse matrix by default
  • they return a NamedArray if a flag is set by the user; this is done in Pipeline, but still has to be done in other places such as ColumnTransformer and meta-estimators.
  • feature_names_in_ is filled with the feature names of the data if a pandas.DataFrame or a NamedArray is passed.
  • feature_names_out_ is populated regardless of whether feature names were passed: it is x0, x1, ... by default, and derived from the actual column names when present (see the sketch below).
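
A sketch of that default-naming fallback for a transformer that passes names through unchanged (the helper is illustrative, not the PR's actual code):

def _make_feature_names(n_features, input_names=None):
    # use the input names when available, otherwise fall back to x0, x1, ...
    if input_names is not None:
        return list(input_names)
    return [f'x{i}' for i in range(n_features)]

_make_feature_names(3)                     # ['x0', 'x1', 'x2']
_make_feature_names(2, ['age', 'height'])  # ['age', 'height']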

Now, these are the questions I have:

  • What is left to be able to answer the question: is this what we'd like to develop further (i.e. some sort of an internal NamedArray, plus having the transformers handle the feature names)?

If the answer is yes to the above question:

  • should it be a global [experimental] flag telling the estimators to return a NamedArray if a pandas DataFrame or a NamedArray is given, or should it be a per-estimator flag, as it is now?

  • NamedArray should probably live in its own PR, shouldn't it?

  • How should NamedArray behave if the data is sparse (or not a numpy array in general)?

  • Should we enforce unique column names?
    ...

@amueller (Member)

Maybe we should go back to trying to write a SLEP?
You made many decisions in this PR lol.

Why do you introduce a flag? Why would we not always return a NamedArray?

@adrinjalali (Member Author)

I think this would make the property routing issue totally moot? Since we wouldn't have such metadata being passed via fit or transform method params, it would be in the NamedArray.

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue though.

Xarray depends on pandas; that alone means we cannot depend on it (we don't want to depend on pandas). Also, xarray uses indices substantially (which is why it depends on pandas), and we don't want them. We did explore using xarray before coming to the conclusion that we need our own array; it was not suitable.

@hermidalc (Contributor)

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue though.

Could you point me to his latest suggestion? My apologies; regarding the issues of sample and feature properties, there are so many different open issues and SLEPs.

@adrinjalali (Member Author)

Solution 4: https://github.com/scikit-learn/enhancement_proposals/pull/16/files

@hermidalc (Contributor)

And apologies too if I appeared to question your choices; that was totally not my intention, and I'm just happy that these issues are being discussed.

I just don't want the scikit-learn core team to go through all the effort of developing new functionality that then doesn't solve the basic needs of e.g. life-sciences researchers who use ML. For us, we need 2D sample metadata and 2D feature metadata dataframes that are in some way transparently attached or associated with the X numpy array, such that everything gets passed around and subset the right way. As long as you are keeping these needs in your plan, I'm happy.

@adrinjalali (Member Author)

That's a fair point, and good to know.

@hermidalc (Contributor)

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue though.

I think this matters more for compatibility with the scikit-learn API itself, like groups and sample_weight. From the user side, when we write our own custom extensions: if I do not need to specify new named parameters in fit and transform to hold specific 2D sample or feature metadata, and they are instead just attached to X, then this is a great solution and there is no need to worry about routing. In my custom classes I would only access the metadata columns I need, and there would be no potential namespace clashes with existing scikit-learn named parameters being passed to fit, transform, or predict, for example.

@amueller (Member)

If we want to use the new data structure to solve the categorical/continuous issue that @thomasjpfan points to, we're basically reimplementing the chunks that are currently used in pandas, right? The main reason we're not going with pandas is the fact that they want to move away from this model because it's too hard to maintain, from what I understand.
So before we consider implementing this, we might talk to @jorisvandenbossche a bit more first ;)

@amueller (Member)

@adrinjalali can you please elaborate on what you wanted to write in the SLEP re categorical data? @thomasjpfan had the same question above, I think.

My assumption was that we would have homogeneous data, so categorical values would indeed be stored as floats.

@amueller (Member)

Can someone give an example where the output of an estimator would contain both continuous and categorical columns? I guess that would involve a ColumnTransformer, but right now we have nothing that produces categorical columns (that are not integers), right?

@thomasjpfan (Member) commented Nov 15, 2019

ColumnTransformer can have remainder='passthrough', which passes the categorical features through. For example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# 'title' is vectorized into counts; 'city' (a string column) and the
# numeric ratings pass through, so the output mixes column types
column_trans = ColumnTransformer(
    [('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')

column_trans.fit_transform(X)

As an aside, I think using homogeneous data works for most use cases. Even for strings, we can include the categories in the metadata and encode the values as floats.

Edit: Fix example to actually pass through the "city" category.
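
A sketch of the float-encoding idea mentioned above (using pandas.Categorical purely for illustration; none of this is from the PR):

import pandas as pd

city = pd.Categorical(['London', 'London', 'Paris', 'Sallisaw'])
codes = city.codes.astype(float)            # homogeneous float column: [0., 0., 1., 2.]
metadata = {'city': list(city.categories)}  # ['London', 'Paris', 'Sallisaw']
# the float codes go into X; the category labels ride along as metadata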

@NicolasHug (Member)

@adrinjalali can you please elaborate on what you wanted to write in the SLEP re categorical data?

Regardless of the content of the SLEP, I really think we need a summary of the current state of this matter. With all the discussions happening on the SLEPs, issues and PRs, it's very hard for people who haven't been following since the beginning to understand what's going on.
(I'd like to help but can't because of this)

@amueller (Member)

@NicolasHug Isn't the summary the new SLEP?
@thomasjpfan that's the only one I could come up with as well.
We could have ColumnTransformer optionally return a dataframe in that case....

@adrinjalali (Member Author)

I'll write up a new SLEP with the summary and all. I just wanted to talk to Thomas before doing that, and we're doing that in the coming week.

Regarding an estimator with categorical and numerical data as output, some resamplers would be another example.

And regarding the question of whether we want homogeneous data or not, the answer is yes as far as I know, or at least that has been the case. I haven't confirmed it, since I need to know more about what we want to include in this new object.

@Diadochokinetic (Contributor)

Hi,

I just wanted to ask: what is the current state of this issue?

@jnothman (Member) commented Aug 4, 2020

#16772 is the latest attempt to pass feature names around.

Base automatically changed from master to main January 22, 2021 10:51
@lorentzenchr (Member)

With #23734 merged, can we close?

@adrinjalali (Member Author)

Yes. We should think about supporting things other than pandas, but that would be very far from the implementation here anyway.

@adrinjalali closed this on Nov 2, 2022
@adrinjalali (Member Author)

@thomasjpfan also, a reminder that this is what we thought we had to do regarding sparse arrays back then.

@adrinjalali deleted the xarray branch on Nov 2, 2022