[MRG] Add get_feature_names to PCA #6445
Conversation
Thanks for the contribution! Looking at this, I think using only the dominant feature name might be a bit misleading. That said, I'm not sure what to suggest as an alternative.
In some contexts, the full feature names might be desirable (see e.g. cell 19 of this notebook). In other cases, that will lead to feature names with thousands of characters. I think probably the safest default is to just call them "
Thanks for the input. I have just created an initial PR, and it seems helpful if the variance corresponding to the input feature could also be mentioned in the results. Another concern is the criterion for deciding whether multiple features are equally dominant. Will proceed as per your feedback.
@jakevdp, thanks for the notebook. They really are a great help for me in general too. If I understand it right, your suggestion is to print the components as a linear combination of the features weighted by the variance. That definitely gives the end user a clearer understanding, but it might become too lengthy. Please do correct me if I misunderstood the comment.
Yes – though not weighted by the variance per se, but by the contribution to each principal vector. I think that could be useful in some cases, but far too long to be the default.
Ah sorry, I mistook the contribution to the component for the variance. Thanks!
So maybe having an option
That's what I had in mind. What do you think?
So, when
Features output by PCA are called principal components, is
Yes, I think that's a good idea.
Thanks, will go ahead and make the suggested changes!
Does something like this seem fine?
Also, should the value of
Looks good! I'd probably cut off the coefficients at 3 places past the decimal point, and perhaps change
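To make the "3 places past the decimal point" suggestion concrete, a minimal sketch (the names `coef` and `feature` are illustrative placeholders, not part of the PR):

```python
# Truncate a component coefficient to 3 decimal places for display.
coef, feature = 0.8377, "x0"
term = "{0:.3f} * {1}".format(coef, feature)
print(term)  # 0.838 * x0
```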
sklearn/decomposition/pca.py (outdated diff):

    "equal number of features when fitted: {1}.".format(
        len(input_features), self.n_features))

def get_contr(component):
I don't think this is the best way to handle the signs of the contribution. It would be helpful if you could suggest a better way to do this.
I should definitely use formatting here, but then it will look like
['0.838 * x0 + 0.544 * x1', '0.544 * x0 -0.838 * x1']
with no space between the negative sign and the digit. Please bear with my naivety in asking about such a trivial issue.
Aesthetically it would be nice to have a space there. I think with a bit of generator-fu it could be done pretty straightforwardly 😄
Here's a quick approach:

import numpy as np

def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

coefficients = np.random.rand(5) - 0.5
names = ['pc{0}'.format(i) for i in range(len(coefficients))]
' '.join(name_generator(coefficients, names))
# '-0.111 * pc0 + 0.476 * pc1 + 0.241 * pc2 - 0.011 * pc3 + 0.329 * pc4'
Hi, I have made the changes, though probably not in the best possible implementation. Do have a look and let me know your opinion. Thanks!
Not sure if you saw my other comment, because the diff was outdated. Here's a quick & clear way to create the feature name with generator expressions:

import numpy as np

def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

coefficients = np.random.rand(5) - 0.5
names = ['pc{0}'.format(i) for i in range(len(coefficients))]
' '.join(name_generator(coefficients, names))
# '-0.111 * pc0 + 0.476 * pc1 + 0.241 * pc2 - 0.011 * pc3 + 0.329 * pc4'
Sorry, my old approach was really bad, so I changed it a bit. The generator-fu is really cool :) One question out of curiosity: is it intentional not to leave a space after the sign for the first feature?
Yes, no space on the first feature is intentional – I think it reads better that way.
I have made the changes as suggested. Please have a look at your convenience. Thanks.
pca.fit(X)
assert_equal(pca.get_feature_names(), ['pc0', 'pc1'])
assert_equal(pca.get_feature_names(full_output=True),
             ['0.838 * x0 + 0.545 * x1', '0.545 * x0 - 0.838 * x1'])
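For reference, the behaviour these assertions expect can be sketched as a standalone helper. `pca_feature_names` is a hypothetical name for illustration; the actual PR implements this as a method on PCA and may differ in detail:

```python
import numpy as np

def pca_feature_names(components, input_features, full_output=False):
    # Default: generic names, one per principal component.
    n_components = components.shape[0]
    if not full_output:
        return ['pc{0}'.format(i) for i in range(n_components)]

    # Full output: each component as a signed linear combination
    # of the input features, coefficients cut at 3 decimal places.
    def name_generator(coefficients, names):
        yield "{0:.3f} * {1}".format(coefficients[0], names[0])
        for c, n in zip(coefficients[1:], names[1:]):
            yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

    return [' '.join(name_generator(comp, input_features))
            for comp in components]

components = np.array([[0.838, 0.545],
                       [0.545, -0.838]])
print(pca_feature_names(components, ['x0', 'x1']))
# ['pc0', 'pc1']
print(pca_feature_names(components, ['x0', 'x1'], full_output=True))
# ['0.838 * x0 + 0.545 * x1', '0.545 * x0 - 0.838 * x1']
```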
Probably should construct a test in which one of the components has a negative sign in the first coefficient.
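Such a test is easy to check against the generator approach discussed above; the coefficient values here are made up for illustration:

```python
# The generator from the earlier comment, applied to a component whose
# first coefficient is negative.
def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

print(' '.join(name_generator([-0.545, 0.838], ['x0', 'x1'])))
# -0.545 * x0 + 0.838 * x1
```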
PCA is mostly used when there are many input features. Listing a fixed number of the largest-contributing features for each component seems more useful to me than listing all of them. Perhaps the appropriate parameter is
I would also use '{:2g}*{s}' to format the weight and the input name to reduce verbosity (remove the spaces around the multiplication and reduce the precision to 2), but this could be configurable.
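The "show only the k largest contributors" idea can be sketched with `numpy.argsort`; the helper name `top_k_terms` and the exact formatting are illustrative, not the PR's final API:

```python
import numpy as np

def top_k_terms(component, names, k=2):
    # Indices of the k coefficients with the largest absolute value,
    # ordered from largest to smallest contribution.
    order = np.argsort(np.abs(component))[::-1][:k]
    # Compact formatting: 2 significant digits, no spaces around '*'.
    return ' '.join('{0:+.2g}*{1}'.format(component[i], names[i])
                    for i in order)

component = np.array([0.05, -0.70, 0.71, 0.02])
names = ['x0', 'x1', 'x2', 'x3']
print(top_k_terms(component, names))  # +0.71*x2 -0.7*x1
```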
Hi everyone, thanks for the inputs. I understand that the addition of
I'd be ok with
Hi everyone, I understand the embargo on adding this feature, but I still wanted to let you know that I did try enabling an integer value for
Hi, this is the present state of the work, with the option of choosing the number of features to show, in decreasing order of contribution to the component. Please let me know your opinions on this. Thanks.
I'd rather
sklearn/decomposition/pca.py (outdated diff):

    feature_names = [' '.join(name_generator(components[i][required[i]],
                                             input_features[required[i]]))
                     for i in range(self.n_components)]
else:
    raise ValueError("full_output must be integer or boolean")
This will be a bad error message when full_output >= n_features.
Oops, my bad. Would it be okay to check for this condition and raise the error in the previous elif block?
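One way the validation discussed here could look, as a hedged sketch (the helper name `check_full_output` is hypothetical; note that bool must be checked before int, since `isinstance(True, int)` is true in Python):

```python
def check_full_output(full_output, n_features):
    # Booleans: True means "all features", False means "generic names only".
    if isinstance(full_output, bool):
        return n_features if full_output else 0
    # Integers: must request between 1 and n_features features.
    if isinstance(full_output, int):
        if not 0 < full_output <= n_features:
            raise ValueError("full_output must be in [1, {0}], got {1}"
                             .format(n_features, full_output))
        return full_output
    raise ValueError("full_output must be an integer or boolean")
```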
@jnothman I have made the changes and also raised meaningful errors when
Actually, now that I look closely at your suggestion, shouldn't it be
Yes, what you said. But perhaps you should use a stronger test.
Thanks for confirming. I just used the example already given. I'll use a better one and update the PR.
Attempt to add get_feature_names to PCA as suggested in #6425 .
The present idea is to name each component after the input feature with the most variance along that component. But this doesn't consider the importance of the component itself. The present approach could likely be modified to accommodate the significance of the components.
Please let me know about the best way to move forward here. Thanks!
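The "dominant input feature" idea described above can be sketched as follows; `dominant_feature_names` is a hypothetical helper for illustration, not scikit-learn API, and it names each component after the input feature with the largest absolute loading:

```python
import numpy as np

def dominant_feature_names(components, input_features):
    # For each principal component, pick the input feature whose
    # loading has the largest absolute value.
    return [input_features[np.argmax(np.abs(comp))] for comp in components]

components = np.array([[0.838, 0.545],
                       [0.545, -0.838]])
print(dominant_feature_names(components, ['x0', 'x1']))  # ['x0', 'x1']
```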