Ledoit Wolf covariance estimator should standardize data #3508


Closed · cbrnr opened this issue Jul 30, 2014 · 17 comments

@cbrnr
Contributor

cbrnr commented Jul 30, 2014

If the features are on different scales, the Ledoit-Wolf algorithm yields an incorrect (suboptimal) shrinkage estimate unless the data are standardized. Therefore, the function ledoit_wolf_shrinkage in shrunk_covariance_.py should standardize the data before computing the shrinkage parameter, and then scale the shrunk covariance matrix back at the end.
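A minimal sketch of the effect, using scikit-learn's public LedoitWolf estimator (which exposes the fitted shrinkage intensity as shrinkage_; the random data are only for illustration):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.RandomState(0)
X = rng.randn(100, 5)

# Same data, but with one feature blown up by three orders of magnitude.
X_scaled = X.copy()
X_scaled[:, 0] *= 1e3

# The estimated shrinkage intensity changes drastically with feature scale.
print(LedoitWolf().fit(X).shrinkage_)
print(LedoitWolf().fit(X_scaled).shrinkage_)
```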

@eickenberg
Contributor

According to the docstring, centering is done by default, but not standardization, just as you say.

def ledoit_wolf_shrinkage(X, assume_centered=False, block_size=1000):
    """Estimates the shrunk Ledoit-Wolf covariance matrix.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Data from which to compute the Ledoit-Wolf shrunk covariance shrinkage.

    assume_centered : Boolean
        If True, data are not centered before computation.
        Useful when working with data whose mean is almost, but
        not exactly, zero.
        If False, data are centered before computation.

Maybe you could similarly add a kwarg assume_unit_variance (or something better named)?

Alternatively, adding a line in the docstring hinting at the usefulness of prepending a StandardScaler to this estimator could be helpful, too.
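For example (a sketch; a Pipeline's final step only needs a fit method, so LedoitWolf can be used there directly):

```python
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 5) * [1, 10, 100, 1000, 10000]

# Standardize the features, then estimate the shrunk covariance.
pipe = make_pipeline(StandardScaler(), LedoitWolf()).fit(X)
print(pipe.named_steps["ledoitwolf"].shrinkage_)
```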

Unfortunately I cannot find what I am convinced I saw somewhere: I have a vague recollection that in one of their papers, Ledoit and Wolf just write cov = X.T.dot(X), accompanied by a line saying something along the lines of "assuming zero mean" (and I can't remember whether they also said unit variance). Maybe I am wrong about this.

@agramfort
Member

Centering is necessary, but not standardization, AFAIK. However, I have the feeling that if you have a very noisy feature, the performance will not be good. As LW does not assume standardized data, I would let users apply a StandardScaler beforehand if they need to.

@cbrnr
Contributor Author

cbrnr commented Aug 1, 2014

@eickenberg, I can certainly add a new parameter. Maybe we should call it assume_scaled? IMO, centering and scaling are necessary, but it still makes sense to have parameters to turn this off (people might have centered and scaled their data beforehand, so we don't have to do it twice).

@agramfort, I think the issue is really that if features are on different scales, the estimated shrinkage parameter will not be optimal. I will try to figure out whether standardization is required, but we should add this optional feature anyway (I would argue for scaling by default, though).

I created PR #3521 to address this issue.

@eickenberg
Contributor

If you have time to look closely at the proof in "Honey, I Shrunk the Covariance Matrix", that should provide an answer, right?

@agramfort
Member

@cle1109 I've been using LW for MEG data with gradiometers and magnetometers, and indeed I need to scale to bring both sensor types onto the same order of magnitude, but I use a scalar scaling (one factor per sensor type) and don't standardize the individual features.
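Something like this sketch (the channel counts and scale factors are made up for illustration):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.RandomState(0)
n_samples = 500

# Hypothetical MEG data: gradiometers and magnetometers live on very
# different physical scales.
grads = rng.randn(n_samples, 4) * 1e-11
mags = rng.randn(n_samples, 4) * 1e-13
X = np.hstack([grads, mags])

# One scalar per sensor type, not one per feature, so the diagonal of
# the resulting covariance matrix is not constant.
X_scaled = X.copy()
X_scaled[:, :4] /= 1e-11
X_scaled[:, 4:] /= 1e-13

print(LedoitWolf().fit(X_scaled).shrinkage_)
```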

cc @dengemann

@mbillingr
Contributor

@agramfort Maybe we got some terms mixed up. By standardization @cle1109 was referring to scaling each feature with its standard deviation. Isn't that scalar scaling too?

@agramfort
Member

> @agramfort Maybe we got some terms mixed up. By standardization @cle1109 was referring to scaling each feature with its standard deviation. Isn't that scalar scaling too?

I meant using one scalar for all features of the same type (e.g., all gradiometers). The diagonal of my matrices is not constant.

@mbillingr
Contributor

This makes sense, but I don't see how it could be applied if you don't have groups or don't know them.
(From a different perspective, standardizing the features amounts to assuming a group size of 1 :) )

@agramfort
Member

:)

Do you agree that there is no reason why LW should only work with a matrix of constant diagonal? To me, standardizing features should really be a preprocessing step outside of LW.

@eickenberg
Contributor

As far as I understand, @agramfort's normalization may in some settings not even be estimated from the data: it is just scalar multiplication with constants that bring the features into the range of unit variance, the emphasis being on getting the two types of features onto the same scale.

@mbillingr
Contributor

@agramfort I agree to some extent. Standardization is probably not strictly required for LW. However, LW fails when feature scales differ by orders of magnitude.
The need to rescale features might be common enough to justify putting this inside LW.

@agramfort
Member

Exactly

@mbillingr
Contributor

I have been reading through the shrinkage literature in the hope of shedding light on this issue.

  1. In their earlier papers, Ledoit and Wolf use a common-variance shrinkage target, which is also what sklearn implements.
  2. Later, in "Honey, I Shrunk the Sample Covariance Matrix", they use a constant-correlation shrinkage target (which is useful for positively correlated variables).
  3. Schäfer and Strimmer briefly discuss different shrinkage targets in "A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics", focusing on an unequal-variance target.

It seems that standardization is indeed not part of the LW estimator.
However, the LW estimator is not limited to the common-variance target. Indeed, prior standardization of the data is equivalent to using an unequal-variance target.

Wouldn't it be nice to support different shrinkage targets?
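For instance, a hand-rolled sketch of shrinking toward the unequal-variance (diagonal) target, with the intensity alpha fixed by hand rather than estimated as in LW:

```python
import numpy as np

def shrink_to_diagonal(X, alpha):
    """Shrink the sample covariance toward its own diagonal
    (the unequal-variance target) with fixed intensity alpha."""
    S = np.cov(X, rowvar=False)
    return (1.0 - alpha) * S + alpha * np.diag(np.diag(S))

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
S_shrunk = shrink_to_diagonal(X, alpha=0.2)
```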

@agramfort
Member

> It seems that standardization is indeed not part of the LW estimator.

We agree.

> However, the LW estimator is not limited to the common-variance target. Indeed, prior standardization of the data is equivalent to using an unequal-variance target.
>
> Wouldn't it be nice to support different shrinkage targets?

That would be a way forward indeed. We could have a target_covariance parameter to specify it. To me, however, this is a different aim than improving LDA: you can make LDA + LW work with a StandardScaler.

@cbrnr
Contributor Author

cbrnr commented Aug 5, 2014

I agree that the best solution for now is not to include standardization in the LW estimator. Supporting shrinkage targets other than the diagonal matrix with the mean variance would indeed be great in the future. Specifically, LDA works best with the diagonal matrix of the individual feature variances as the shrinkage target.

I wrote to Olivier Ledoit, and he said that if we precondition our matrix by standardizing our features, we're not really minimizing the standard Frobenius norm but a generalized one. In principle, preconditioning is a good idea if it captures important characteristics of the data (which it does in our case). We only need to be aware of the consequences: since the standard deviations are estimated from the same data as the sample covariance matrix, their errors might interact; also, we're using a loss function that downweights estimation errors in highly variable features.

Most importantly, he says:

> Our 2004 JMVA paper makes it clear that the improvement over the sample covariance matrix is highest when the population eigenvalues are not too dispersed. It also has a proposition saying that the population eigenvalues are at least as dispersed as the variances. Therefore we should expect that this type of pre-conditioning would enhance the performance of the shrinkage estimator when features have wildly differing variances, because it should reduce the cross-sectional dispersion of the eigenvalues of the pre-conditioned population covariance matrix.

In summary, all this can be solved by supporting different shrinkage targets.
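A sketch of that preconditioning (standardize, estimate on the standardized data, then map the covariance back to the original scale), using the public sklearn.covariance.ledoit_wolf function:

```python
import numpy as np
from sklearn.covariance import ledoit_wolf

rng = np.random.RandomState(0)
X = rng.randn(200, 5) * [1, 10, 100, 1000, 10000]

# Precondition: divide each feature by its standard deviation.
std = X.std(axis=0)
cov_std, shrinkage = ledoit_wolf(X / std)

# Undo the preconditioning on the covariance scale.
cov = cov_std * np.outer(std, std)
```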

For now, I'm standardizing the features in the LDA class if we're using shrinkage. I've addressed this in #3523, which will hopefully soon be merged once I figure out the build problems.

cbrnr closed this as completed on Aug 5, 2014
@xiaoxionglin

Standardization leads to nearly zero eigenvalue dispersion, and thus increases the shrinkage systematically and dramatically.
I am not saying whether that is good or bad, but it makes a huge difference.
The problem now is that shrinkage='auto' standardizes while shrinkage=[any constant] does not, which creates a big, unexpected difference.
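A quick sketch to observe the difference in context, assuming the current LinearDiscriminantAnalysis API (random data, only for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(100, 5) * [1, 10, 100, 1000, 10000]
y = rng.randint(0, 2, size=100)

# shrinkage='auto' (Ledoit-Wolf) and a fixed constant take different
# code paths and can yield very different decision rules.
lda_auto = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
lda_fixed = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.1).fit(X, y)
print(lda_auto.coef_)
print(lda_fixed.coef_)
```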

@charlesbmi

I know this is an old thread, but I would love to see the diagonal matrix supported as a shrinkage target in the future!
