feature names - NamedArray #14315
Conversation
Awesome, thanks for working on this! What does the repr of the output of

Looks to me like pandas is pretty deeply baked into xarray.

wanna add xarray to the CI and see what happens? ;)

What are the requirements for a DataArray as input for column names to work correctly in your proposal? call the [1] dimension

Looking now at the PR, I see that you do the wrapping of the output into an xarray DataArray only at the level of the Pipeline (in order to pass it to the next step).

I was actually also thinking of returning them in the estimators, not the pipeline. In terms of backward incompatibility: this is something we'll have to swallow at some point if we want to return something with more metadata, and I think it makes sense.

I've called them I didn't try to fix all the estimators. These two need to be done:

The first one I'd like to do once we have #13603 in, because then it'd be really easy. The second one, we need to discuss how we'd like the API to behave. To start, my proposal is to allow the user to ask for the data with the metadata, or just a numpy array. By default, it's a numpy array, and the pipeline and other meta-estimators can activate the metadata output if they need it.
Drafting the API I mentioned, here's how it'd look:

```python
t = PCA(n_components=3).fit(X)
t._return_dataarray_ = True
t.transform(X)
```

```
<xarray.DataArray (samples: 150, features: 3)>
array([[-2.684126,  0.319397, -0.027915],
       [-2.714142, -0.177001, -0.210464],
       [-2.888991, -0.144949,  0.0179  ],
       ...,
       [ 1.764346,  0.078859,  0.130482],
       [ 1.900942,  0.116628,  0.723252],
       [ 1.390189, -0.282661,  0.36291 ]])
Coordinates:
  * samples   (samples) int64 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
  * features  (features) <U4 'pca0' 'pca1' 'pca2'
```

It would return a usual numpy array if we don't set the flag, and the pipeline would set the flag (WIP). EDIT: I'm thinking of having that flag as an
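The opt-in flag mechanism discussed above can be sketched in plain Python. This is only an illustration of the idea, not the PR's code: `NamedArray`, `FlagDemoTransformer`, and the generated placeholder names are all hypothetical, and the real proposal attaches actual column names rather than `x0`, `x1`, ...:

```python
import numpy as np

class NamedArray:
    """Hypothetical lightweight container holding data plus feature names
    (an illustration only, not the object implemented in this PR)."""
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

class FlagDemoTransformer:
    """Sketch of the opt-in flag idea: return a plain ndarray by default,
    and a named container only when the flag is set (e.g. by a Pipeline)."""
    _return_dataarray_ = False

    def transform(self, X):
        X = np.asarray(X)
        # Placeholder names; a real transformer would derive these from
        # the input columns or its own output features.
        names = [f"x{i}" for i in range(X.shape[1])]
        if self._return_dataarray_:
            return NamedArray(X, names)
        return X
```

With the flag unset, `transform` returns a bare ndarray; a meta-estimator that wants the metadata would flip the flag before calling it.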
I would rather have a global flag that allows returning the metadata whenever we can, which is probably whenever the input was a pandas DataFrame or DataArray, and in some featurizers; I'd mark that as an experimental feature.
I did try subclassing an ndarray; it is indeed trickier and dirtier than I thought. So for now, I've just added a simple container (

```python
import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [('categories', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(X)
column_trans.feature_names_out_
```

```
['categories__city_London',
 'categories__city_Paris',
 'categories__city_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']
```

```python
t = CountVectorizer()
t._return_dataarray_ = True
t.fit(X).transform(X)
```

```
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>
feature names: ['city', 'expert_rating', 'title', 'user_rating']
```
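The `<transformer>__<feature>` naming convention shown in `feature_names_out_` above can be sketched with a small helper. This function is hypothetical, for illustration only, and not part of the PR:

```python
def prefixed_feature_names(named_outputs):
    """Prefix each transformer's output feature names with the transformer's
    name, mirroring names like 'categories__city_London' above.

    named_outputs: list of (transformer_name, feature_names) pairs.
    Hypothetical helper, for illustration only.
    """
    return [f"{name}__{feat}"
            for name, feats in named_outputs
            for feat in feats]
```

For example, `prefixed_feature_names([('categories', ['city_London', 'city_Paris'])])` gives `['categories__city_London', 'categories__city_Paris']`.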
With a custom

@adrinjalali what were the issues with subclassing? @thomasjpfan surely you didn't mean Also, I just realized following this discussion: we can actually make it "just work" for many cases and don't need to implement

@thomasjpfan I think I would first start with just having

With only

somehow I can't find your proposal

@adrinjalali I removed it. It was not a good idea and the discussion is better left for scikit-learn/enhancement_proposals#17.

This is likely doable with some effort. You would lose label-based indexing/alignment, support for datetime dtypes, and groupby operations. With more effort, most of these could be replaced by less efficient pure NumPy/Python implementations. Fundamentally, xarray uses pandas for two things: (1) low-level operations that rely upon hash tables and (2) some fast low-level loops that aren't implemented by NumPy (mostly for non-numeric dtypes).

Almost all the tests now pass, and almost all the remaining failing ones are related to me removing This is also not super clean, but I think it's at a good state to be considered as a proposal for us to see what it could look like.

I'll try to check on it today or tomorrow.
should we enforce that names are unique?
Sounds reasonable to me, but that needs a much better way of feature name generation in some of the transformers. For instance, having two separate PCAs on two different sets of columns, then unioning them, would result in duplicate names right now.
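One cheap way to surface such collisions would be an explicit uniqueness check along these lines. This is only a sketch; `check_unique_feature_names` is a hypothetical helper, not existing scikit-learn API:

```python
from collections import Counter

def check_unique_feature_names(names):
    """Raise if any generated feature name appears more than once,
    e.g. two PCAs in a union both emitting 'pca0', 'pca1', ...
    Sketch only; no such check exists in the PR.
    """
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate feature names: {duplicates}")
    return names
```

Whether such a check is worth its cost on every transform is exactly the open question raised later in this thread.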
@adrinjalali I would have

Sorry, I find this a bit hard to follow as there are changes in so many files (now I know how y'all feel about my PRs lol). So this is not using the xarray DataArray idea now but doing a NamedArray?
```python
cv_coefs = np.concatenate([cv_pipeline.named_steps["classifier"].coef_
                           for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(pipeline.named_steps["classifier"].feature_names_in_,
```
This is a bit confusing since this comes from the `fit` way above and not from the `cross_validate`.
I haven't really fixed the documentation part yet
sklearn/base.py
```python
    It only attaches the feature names to the data if `return_dataarray`
    is set to `True`."""
    if self._return_dataarray_:
```
I don't understand this mechanism. How does this work currently? This is only set in the pipeline? Why? To not show the dataarray (or rather NamedArray) to users? And doesn't that need to be set in FeatureUnion and ColumnTransformer as well? And how does it propagate through meta-estimators?

I wonder if most of the time we don't need names to be unique and we would be wasting computation to check.
Here's my effort to summarize what has happened here. I started by passing the column names around using an `xarray.DataArray`; then it seemed sensible to have a lightweight implementation of a `NamedArray` instead. On the estimator API side, right now:

Now, these are the questions I have:

If the answer is yes to the above question:
Maybe we should go back to trying to write a SLEP? Why do you introduce a flag? Why would we not always return a NamedArray?

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue, though. Xarray depends on pandas, and that itself means we cannot depend on it (we don't want to depend on pandas). Also, xarray uses indices substantially (which is why it depends on pandas), and we don't want them. We did explore using xarray before coming to the conclusion that we need our own array. It was not suitable.

Could you point me to his latest suggestion? My apologies; just regarding the issues of sample and feature properties, there are so many different open issues and SLEPs.

And apologies too if I appeared to question your choices; totally not my intention, and I'm just happy that these issues are being discussed. I just don't want the scikit-learn core team to go through all the effort of developing new functionality that then doesn't solve the basic needs of, e.g., life-sciences researchers who use ML. For us, we need 2D sample metadata and 2D feature metadata dataframes that are in some way transparently attached or associated with

That's a fair point, and good to know.

I think this matters more for compatibility with the scikit-learn API itself? Like

If we want to use the new data structure to solve the categorical/continuous issue that @thomasjpfan points to, we're basically reimplementing the chunks that are currently used in pandas, right? The main reason we're not going with pandas is the fact that they want to move away from this model because it's too hard to maintain, from what I understand.

@adrinjalali can you please elaborate on what you wanted to write in the SLEP re categorical data? @thomasjpfan had the same question above, I think. My assumption was that we would have homogeneous data, so categorical values would indeed be stored as floats.

Can someone give an example where the output of an estimator would contain both continuous and categorical columns? I guess that would involve a ColumnTransformer, but right now we have nothing that produces categorical columns (that are not integers), right?
ColumnTransformer can have remainder='passthrough', which can pass through the categorical features. For example:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

column_trans = ColumnTransformer(
    [('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')
column_trans.fit_transform(X)
```

As an aside, I think using homogeneous data works for most use cases. Even for strings, we can include the categories in the metadata and encode them as floats.

Edit: Fix example to actually pass through the "city" category.
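The "encode categories as floats, keep the labels in metadata" idea could look roughly like this. This is a sketch under assumed names; `encode_with_metadata` is not an actual API from the PR or scikit-learn:

```python
import numpy as np

def encode_with_metadata(values):
    """Store the string categories in side metadata and the column itself
    as float codes, keeping the array homogeneous.
    Hypothetical illustration of the idea discussed above.
    """
    categories = sorted(set(values))
    lookup = {cat: float(i) for i, cat in enumerate(categories)}
    codes = np.array([lookup[v] for v in values], dtype=float)
    return codes, {"categories": categories}
```

For the `city` column above, this would yield codes `[0.0, 0.0, 1.0, 2.0]` with categories `['London', 'Paris', 'Sallisaw']` in the metadata, so the data array itself stays all-float.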
Regardless of the content of the SLEP, I really think we need a summary of the current state of this matter. With all the discussions happening in the SLEPs, issues, and PRs, it's very hard for people who haven't been following it since the beginning to understand what's going on.
@NicolasHug Isn't the summary the new SLEP?

I'll write up a new SLEP with the summary and all. I just wanted to talk to Thomas before doing that, and we're doing that in the coming week. Regarding an estimator with categorical and numerical data as output, some resamplers would be another example. And regarding the question of whether we want homogeneous data or not, the answer is yes as far as I know, or at least it has been the case. I haven't confirmed it since I need to know more about what we want to include in this new object.

Hi, I just wanted to ask what the current state of this issue is?

#16772 is the latest attempt to pass feature names around.

With #23734 merged, can we close?

Yes. We should think of supporting things other than pandas, but it'd be very far from the implementation here anyway.

@thomasjpfan also a reminder that this is what we thought we'd have to do regarding sparse arrays back then.
This is based on #13307 by @amueller; related SLEP: scikit-learn/enhancement_proposals#18

This PR explores the idea of passing around the column names with the X object. It is the (b) option in scikit-learn/enhancement_proposals#18 (comment)

The idea is to pass the column names with the object, and for that we can either use an xarray.DataArray, or a sklearn.NamedArray. This PR uses xarray, exploring the idea, and the focus is to see what it would look like if we do it. There are certain aspects we need to consider when deciding between xarray and an internal NamedArray:

- xarray depends on pandas! I haven't checked/asked if it'd be easy/doable to remove that hard dependency, but for now it's there.
- A NamedArray requires careful discussion on whether we really wanna have such an object, and what it should include as features.

The code behaves as such ATM:

input:
output:

input:
output

input:
output

EDIT: an update/summary is written under #14315 (comment)