
[MRG] Add get_feature_names to PCA #6445


Closed

Conversation

maniteja123
Contributor

Attempt to add get_feature_names to PCA as suggested in #6425 .
The present idea is to name each component after the input feature with the most variance along that component, but this doesn't consider the relative importance of the components themselves. The approach could certainly be modified to account for the significance of the components.
Please let me know about the best way to move forward here. Thanks!
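The approach described in this opening comment can be sketched roughly as follows. This is a minimal illustration under my reading of the proposal, not the PR's actual code; the helper name dominant_feature_names is made up:

```python
import numpy as np

# Hypothetical sketch: name each principal component after the input
# feature with the largest absolute loading on that component.
def dominant_feature_names(components, input_features):
    # components has shape (n_components, n_features)
    dominant = np.argmax(np.abs(components), axis=1)
    return [input_features[i] for i in dominant]

components = np.array([[0.84, 0.54],
                       [0.54, -0.84]])
print(dominant_feature_names(components, ["x0", "x1"]))
# ['x0', 'x1']
```

When several loadings are of comparable magnitude, a single dominant name like this discards most of the information in the component.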

@jakevdp
Member

jakevdp commented Feb 24, 2016

Thanks for the contribution! Looking at this, I think using only the dominant feature name might be a bit misleading. That said, I'm not sure what to suggest as an alternative.

@jakevdp
Member

jakevdp commented Feb 24, 2016

In some contexts, the full feature names might be desirable (see e.g. cell 19 of this notebook). In other cases, that will lead to feature names with thousands of characters.

I think probably the safest default is to just call them "pca0", "pca1", etc. and perhaps have an optional argument along the lines of full_names or something like that.

@maniteja123
Contributor Author

Thanks for the input. I have just created an initial PR; it seems helpful if the variance corresponding to each input feature could also be mentioned in the results. Another concern is the criterion for choosing among multiple features that are equally dominant. I will proceed as per your feedback.

@maniteja123
Contributor Author

@jakevdp, thanks for the notebook; it is a great help for me in general too. If I understand it right, your suggestion is to print the components as linear combinations of the features weighted by the variance. That definitely provides a clearer understanding to the end user, but it might become too lengthy. Please do correct me if I misunderstood the comment.

@jakevdp
Member

jakevdp commented Feb 24, 2016

your suggestion is to print the components as the linear combination of the features weighted by the variance.

Yes – though not weighted by the variance per se, but by the contribution to each principal vector. I think that could be useful in some cases, but far too long to be the default.

@maniteja123
Contributor Author

Ah sorry, I mistook the contribution to the component for the variance. Thanks!

@yenchenlin
Contributor

So maybe having an option full_names like @jakevdp suggested above can solve this?
When it is set to True, show each feature's contribution to each principal component as the output feature names.

@jakevdp
Member

jakevdp commented Feb 24, 2016

So maybe having an option full_names like @jakevdp suggested above can solve this?
When it is set to True, show each feature's contribution to each principal component as the output feature names.

That's what I had in mind. What do you think?

@maniteja123
Contributor Author

So, when full_names is set to True, mention the contribution of the features to each principal component, and simply return ['pca0', 'pca1', ...] when full_names is False, right?

@yenchenlin
Contributor

So, when full_names is set to True, mention the contribution of the features to each principal component, and simply return ['pca0', 'pca1', ...] when full_names is False, right?

The features output by PCA are called principal components; is 'pc0' better than 'pca0'?
Just curious, I have no strong opinion on it.

@jakevdp
Member

jakevdp commented Feb 24, 2016

Yes, I think that's a good idea.

@maniteja123
Contributor Author

Thanks, will go ahead and make the suggested changes!

@maniteja123
Contributor Author

Does something like this seem fine?

>>> pca.get_feature_names(full_output=True)
['0.838492 * x0 + 0.544914 * x1', '0.544914 * x0 + -0.838492 * x1']
>>> pca.get_feature_names()
['pc0', 'pc1']

Also, should the value of full_output be False by default, or should it have to be provided explicitly? Do let me know your opinion!
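The behaviour in the snippet above can be mocked up as a standalone function. This is a hypothetical sketch, not the PR's actual method; the naive '+' join deliberately reproduces the '+ -0.838492' artifact visible in the output:

```python
import numpy as np

# Hypothetical sketch of the proposed get_feature_names behaviour:
# short "pc{i}" names by default, full linear combinations with
# full_output=True. Sign handling is intentionally naive here.
def get_feature_names(components, input_features, full_output=False):
    if not full_output:
        return ["pc{0}".format(i) for i in range(len(components))]
    return [" + ".join("{0:.6f} * {1}".format(c, name)
                       for c, name in zip(row, input_features))
            for row in components]

components = np.array([[0.838492, 0.544914],
                       [0.544914, -0.838492]])
print(get_feature_names(components, ["x0", "x1"]))
# ['pc0', 'pc1']
print(get_feature_names(components, ["x0", "x1"], full_output=True))
# ['0.838492 * x0 + 0.544914 * x1', '0.544914 * x0 + -0.838492 * x1']
```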

@jakevdp
Member

jakevdp commented Feb 24, 2016

Looks good! I'd probably cut off the coefficients at 3 places past the decimal point, and perhaps change + -0.838 to - 0.838.

"equal number of features when fitted: {1}.".format(
    len(input_features), self.n_features))

def get_contr(component):
Contributor Author

I don't think this is the best way to handle the signs of the contributions. It would be helpful if you could suggest a better way to do this.

Contributor Author

I should definitely use formatting here, but then it will look like
['0.838 * x0 + 0.544 * x1', '0.544 * x0 -0.838 * x1'] with no space between the negative sign and the digit. Please bear with my naivety in asking about such a trivial issue.

Member

Aesthetically it would be nice to have a space there. I think with a bit of generator-fu it could be done pretty straightforwardly 😄

Member

Here's a quick approach:

import numpy as np

def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

coefficients = np.random.rand(5) - 0.5
names = ['pc{0}'.format(i) for i in range(len(coefficients))]
' '.join(name_generator(coefficients, names))
# '-0.111 * pc0 + 0.476 * pc1 + 0.241 * pc2 - 0.011 * pc3 + 0.329 * pc4'

@maniteja123
Contributor Author

Hi, I have made the changes, though probably not in the best possible implementation. Do have a look and let me know your opinion. Thanks!

@jakevdp
Member

jakevdp commented Feb 24, 2016

Not sure if you saw my other comment, because the diff was outdated. Here's a quick & clear way to create the feature name with generator expressions:

import numpy as np

def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

coefficients = np.random.rand(5) - 0.5
names = ['pc{0}'.format(i) for i in range(len(coefficients))]

' '.join(name_generator(coefficients, names))
# '-0.111 * pc0 + 0.476 * pc1 + 0.241 * pc2 - 0.011 * pc3 + 0.329 * pc4'

@maniteja123
Contributor Author

Sorry, my old approach was really bad, so I changed it a bit. The generator-fu is really cool :) Just one question out of curiosity: is it intentional not to leave a space after the sign for the first feature?

@jakevdp
Member

jakevdp commented Feb 24, 2016

Yes, no space on the first feature is intentional – I think it reads better that way.

@maniteja123
Contributor Author

I have made the changes as suggested. Please have a look at your convenience. Thanks.

pca.fit(X)
assert_equal(pca.get_feature_names(), ['pc0', 'pc1'])
assert_equal(pca.get_feature_names(full_output=True),
             ['0.838 * x0 + 0.545 * x1', '0.545 * x0 - 0.838 * x1'])
Member

Probably should construct a test in which one of the components has a negative sign in the first coefficient.
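For illustration, such a case can be exercised directly with the name_generator helper suggested earlier in this thread (the coefficients here are invented, not from any fitted PCA):

```python
# name_generator as proposed earlier in this conversation: the first term
# keeps its own sign, subsequent terms get an explicit ' + ' or ' - '.
def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

# A component whose first coefficient is negative:
print(' '.join(name_generator([-0.838, 0.545], ['x0', 'x1'])))
# '-0.838 * x0 + 0.545 * x1'
```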

@jnothman
Member

PCA is mostly used when there are a lot of input features. Listing a fixed number of the largest contributing features to each component seems most useful to me, rather than listing all. Perhaps the appropriate parameter is show_coef: one of {False, True, integer} where False and True are like the current full_output option; integer n shows the sorted top n contributions to each component. I'm not sure about the need for True.

I would also use '{:.2g}*{}' to format the weight and the input name to reduce verbosity (remove the spaces around the multiplication and reduce the precision to 2 significant figures), but this could be configurable.
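That show_coef idea can be sketched roughly as follows. The function name and details are my own, not from the PR; it combines the top-n selection by absolute loading with the compact '{:.2g}*{}' formatting:

```python
import numpy as np

# Hypothetical sketch of show_coef=n: for each component, keep only the
# n features with the largest absolute loadings, in decreasing order.
def top_contributions(components, input_features, n):
    # indices of the n largest |loading| per component, largest first
    order = np.argsort(np.abs(components), axis=1)[:, ::-1][:, :n]
    names = []
    for row, idx in zip(components, order):
        terms = ["{0:.2g}*{1}".format(row[i], input_features[i]) for i in idx]
        names.append(" + ".join(terms))
    return names

components = np.array([[0.84, 0.54],
                       [0.54, -0.84]])
print(top_contributions(components, ["x0", "x1"], 1))
# ['0.84*x0', '-0.84*x1']
```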

@maniteja123
Contributor Author

Hi everyone, thanks for the inputs. I understand that the addition of get_feature_names warrants discussion. The idea of showing only a specified number of the highest-contributing features for each component seems like a good approach here. Please let me know whether I should proceed with the currently suggested approach; I would also happily wait until a consensus is reached. Thanks.

@amueller
Member

I'd be ok with False, True, int

@maniteja123
Contributor Author

Hi everyone, I understand this feature is on hold, but I still wanted to let you know that I tried enabling an integer value for full_output as suggested. Please have a look at your convenience and let me know your thoughts and suggestions. Thanks.

@maniteja123 maniteja123 force-pushed the pca-get_feature_names branch from d47ac7c to cd600a1 Compare March 3, 2016 17:35
@maniteja123
Contributor Author

Hi, this is the present behaviour, with the option of choosing the number of features to show in decreasing order of contribution to the component. Please let me know your opinions on this. Thanks.

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2).fit(X)
>>> pca.get_feature_names(full_output=True)
['0.84*x0 + 0.54*x1', '0.54*x0 - 0.84*x1']
>>> pca.get_feature_names(full_output=1)
['0.84*x0', '-0.84*x1']
>>> pca.get_feature_names()
['pc0', 'pc1']

@jnothman
Member

jnothman commented Mar 8, 2016

I'd prefer show_coef over full_output.

    feature_names = [' '.join(name_generator(components[i][required[i]],
                                             input_features[required[i]]))
                     for i in range(self.n_components)]
else:
    raise ValueError("full_output must be integer or boolean")
Member

this will give a misleading error message when full_output >= n_features

Contributor Author

Oops, my bad. Would it be okay to check for this condition and raise the error in the previous elif block?

@maniteja123
Contributor Author

@jnothman I have made the changes and now raise meaningful errors when show_coef is less than 1 or greater than n_features. Please let me know if you have any more suggestions. Thanks.

@maniteja123
Contributor Author

Actually, now that I look closely at your suggestion, shouldn't it be np.argsort(np.abs(components), axis=1)[:, ::-1], since we want to reverse the sorted order along the second axis? Please correct me if I am wrong. The tests here will pass either way, since the matrix in the example is symmetric. Thanks.
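For what it's worth, a quick check with a non-symmetric matrix (where the two variants would differ) shows what the expression does: argsort with axis=1 sorts within each row, i.e. within each component, and [:, ::-1] reverses that sorted order:

```python
import numpy as np

# Invented non-symmetric loadings, just to demonstrate the indexing.
components = np.array([[0.1, 0.9, 0.4],
                       [0.7, 0.2, 0.6]])

# Per row: feature indices sorted by decreasing absolute loading.
order = np.argsort(np.abs(components), axis=1)[:, ::-1]
print(order)
# [[1 2 0]
#  [0 2 1]]
```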

@jnothman
Member

jnothman commented Mar 9, 2016

Yes, what you said. But perhaps you should use a stronger test.

@maniteja123
Contributor Author

Thanks for confirming. I just used the example already given. Will use a better one and update the PR.

@maniteja123 maniteja123 force-pushed the pca-get_feature_names branch from 3c3ecec to 4fd22e1 Compare March 29, 2016 19:00
@maniteja123 maniteja123 force-pushed the pca-get_feature_names branch from 4fd22e1 to 8659788 Compare March 29, 2016 19:04
@maniteja123 maniteja123 changed the title Add get_feature_names to PCA [MRG] Add get_feature_names to PCA Mar 29, 2016
@amueller amueller added the Superseded (PR has been replaced by a newer PR) label Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:49
@thomasjpfan
Member

thomasjpfan commented Feb 2, 2022

Closing because with SLEP007, PCA now has get_feature_names_out: #21334

@thomasjpfan thomasjpfan closed this Feb 2, 2022