[MRG] Add get_feature_names to PCA #6445
Conversation
Thanks for the contribution! Looking at this, I think using only the dominant feature name might be a bit misleading. That said, I'm not sure what to suggest as an alternative.
In some contexts, the full feature names might be desirable (see e.g. cell 19 of this notebook). In other cases, that will lead to feature names with thousands of characters. I think probably the safest default is to just call them "
Thanks for the input. I have just created an initial PR, and it seems helpful if the variance corresponding to the input feature could also be mentioned in the results. Another concern is the criterion for deciding whether multiple features are equally dominant. Will proceed as per your feedback.
@jakevdp, thanks for the notebook. They really are a great help for me in general too. If I understand it right, your suggestion is to print the components as a linear combination of the features weighted by the variance. That definitely gives the end user a clearer understanding, but it might become too lengthy. Please do correct me if I misunderstood the comment.
Yes – though not weighted by the variance per se, but by the contribution to each principal vector. I think that could be useful in some cases, but far too long to be the default.
Ah sorry, I mistook the contribution to the component for the variance. Thanks!
So maybe having an option
That's what I had in mind. What do you think?
So, when
Features output by PCA are called principal components, is
Yes, I think that's a good idea.
Thanks, will go ahead and make the suggested changes!
Does something like this seem fine?
Also, should the value of
Looks good! I'd probably cut off the coefficients at 3 places past the decimal point, and perhaps change
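To make the "3 places past the decimal point" suggestion concrete, a minimal sketch (the names `coef` and `feature` are illustrative placeholders, not part of the PR):

```python
# Truncate a component coefficient to 3 decimal places for display.
coef, feature = 0.8377, "x0"
term = "{0:.3f} * {1}".format(coef, feature)
print(term)  # 0.838 * x0
```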
sklearn/decomposition/pca.py (outdated diff):

    "equal number of features when fitted: {1}.".format(
        len(input_features), self.n_features))

def get_contr(component):
I don't think this is the best way to handle the signs of the contribution. It would be helpful if you could suggest a better way to do this.
I should definitely use formatting here, but then it will look like
['0.838 * x0 + 0.544 * x1', '0.544 * x0 -0.838 * x1']
with no space between the negative sign and the digit. Please bear with my naivety in asking about such a trivial issue.
Aesthetically it would be nice to have a space there. I think with a bit of generator-fu it could be done pretty straightforwardly 😄
Here's a quick approach:

import numpy as np

def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

coefficients = np.random.rand(5) - 0.5
names = ['pc{0}'.format(i) for i in range(len(coefficients))]
' '.join(name_generator(coefficients, names))
# '-0.111 * pc0 + 0.476 * pc1 + 0.241 * pc2 - 0.011 * pc3 + 0.329 * pc4'
Hi, I have made the changes, though probably not in the best possible implementation. Do have a look and let me know your opinion. Thanks!
Not sure if you saw my other comment, because the diff was outdated. Here's a quick & clear way to create the feature name with generator expressions:

import numpy as np

def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

coefficients = np.random.rand(5) - 0.5
names = ['pc{0}'.format(i) for i in range(len(coefficients))]
' '.join(name_generator(coefficients, names))
# '-0.111 * pc0 + 0.476 * pc1 + 0.241 * pc2 - 0.011 * pc3 + 0.329 * pc4'
Sorry, my old approach was really bad, so I changed it a bit. The generator-fu is really cool :) One question out of curiosity: is it intentional not to leave a space after the sign for the first feature?
Yes, no space on the first feature is intentional – I think it reads better that way.
I have made the changes as suggested. Please have a look at your convenience. Thanks.
pca.fit(X)
assert_equal(pca.get_feature_names(), ['pc0', 'pc1'])
assert_equal(pca.get_feature_names(full_output=True),
             ['0.838 * x0 + 0.545 * x1', '0.545 * x0 - 0.838 * x1'])
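For reference, the behaviour these assertions expect can be sketched as a standalone helper. `pca_feature_names` is a hypothetical name for illustration; the actual PR implements this as a method on PCA and may differ in detail:

```python
import numpy as np

def pca_feature_names(components, input_features, full_output=False):
    # Default: generic names, one per principal component.
    n_components = components.shape[0]
    if not full_output:
        return ['pc{0}'.format(i) for i in range(n_components)]

    # Full output: each component as a signed linear combination
    # of the input features, coefficients cut at 3 decimal places.
    def name_generator(coefficients, names):
        yield "{0:.3f} * {1}".format(coefficients[0], names[0])
        for c, n in zip(coefficients[1:], names[1:]):
            yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

    return [' '.join(name_generator(comp, input_features))
            for comp in components]

components = np.array([[0.838, 0.545],
                       [0.545, -0.838]])
print(pca_feature_names(components, ['x0', 'x1']))
# ['pc0', 'pc1']
print(pca_feature_names(components, ['x0', 'x1'], full_output=True))
# ['0.838 * x0 + 0.545 * x1', '0.545 * x0 - 0.838 * x1']
```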
Probably should construct a test in which one of the components has a negative sign in the first coefficient.
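Such a test is easy to check against the generator approach discussed above; the coefficient values here are made up for illustration:

```python
# The generator from the earlier comment, applied to a component whose
# first coefficient is negative.
def name_generator(coefficients, names):
    yield "{0:.3f} * {1}".format(coefficients[0], names[0])
    for c, n in zip(coefficients[1:], names[1:]):
        yield "{0:s} {1:.3f} * {2}".format('-' if c < 0 else '+', abs(c), n)

print(' '.join(name_generator([-0.545, 0.838], ['x0', 'x1'])))
# -0.545 * x0 + 0.838 * x1
```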
PCA is mostly used when there are many input features. Listing a fixed number of the largest-contributing features for each component seems more useful to me than listing all of them. Perhaps the appropriate parameter is
I would also use '{:2g}*{s}' to format the weight and the input name to reduce verbosity (remove the spaces around the multiplication and reduce the precision to 2), but this could be configurable.
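The "show only the k largest contributors" idea can be sketched with `numpy.argsort`; the helper name `top_k_terms` and the exact formatting are illustrative, not the PR's final API:

```python
import numpy as np

def top_k_terms(component, names, k=2):
    # Indices of the k coefficients with the largest absolute value,
    # ordered from largest to smallest contribution.
    order = np.argsort(np.abs(component))[::-1][:k]
    # Compact formatting: 2 significant digits, no spaces around '*'.
    return ' '.join('{0:+.2g}*{1}'.format(component[i], names[i])
                    for i in order)

component = np.array([0.05, -0.70, 0.71, 0.02])
names = ['x0', 'x1', 'x2', 'x3']
print(top_k_terms(component, names))  # +0.71*x2 -0.7*x1
```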
Hi everyone, thanks for the inputs. I understand that the addition of
I'd be ok with
Hi everyone, I understand the embargo on adding this feature, but I still wanted to let you know that I did try enabling an integer value for
Hi, this is the present state of the work, with the option of choosing the number of features to show, in decreasing order of contribution to the component. Please let me know your opinions on this. Thanks.
I'd rather
sklearn/decomposition/pca.py (outdated diff):

    feature_names = [' '.join(name_generator(components[i][required[i]],
                                             input_features[required[i]]))
                     for i in range(self.n_components)]
else:
    raise ValueError("full_output must be integer or boolean")
This will be a bad error message when full_output >= n_features.
Oops, my bad. Would it be okay to check for this condition and raise the error in the previous elif block?
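One way the validation discussed here could look, as a hedged sketch (the helper name `check_full_output` is hypothetical; note that bool must be checked before int, since `isinstance(True, int)` is true in Python):

```python
def check_full_output(full_output, n_features):
    # Booleans: True means "all features", False means "generic names only".
    if isinstance(full_output, bool):
        return n_features if full_output else 0
    # Integers: must request between 1 and n_features features.
    if isinstance(full_output, int):
        if not 0 < full_output <= n_features:
            raise ValueError("full_output must be in [1, {0}], got {1}"
                             .format(n_features, full_output))
        return full_output
    raise ValueError("full_output must be an integer or boolean")
```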
@jnothman I have made the changes and also raised meaningful errors when
Actually, now that I look closely at your suggestion, shouldn't it be
Yes, what you said. But perhaps you should use a stronger test.
Thanks for confirming. I just used the example already given. I'll use a better one and update the PR.
Attempt to add get_feature_names to PCA as suggested in #6425 .
The present idea is to name each component after the input feature with the most variance along that component. But this doesn't consider the importance of the component itself. The present approach could likely be modified to accommodate the significance of the components.
Please let me know about the best way to move forward here. Thanks!
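The "dominant input feature" idea described above can be sketched as follows; `dominant_feature_names` is a hypothetical helper for illustration, not scikit-learn API, and it names each component after the input feature with the largest absolute loading:

```python
import numpy as np

def dominant_feature_names(components, input_features):
    # For each principal component, pick the input feature whose
    # loading has the largest absolute value.
    return [input_features[np.argmax(np.abs(comp))] for comp in components]

components = np.array([[0.838, 0.545],
                       [0.545, -0.838]])
print(dominant_feature_names(components, ['x0', 'x1']))  # ['x0', 'x1']
```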