Skip to content

BUG: source matrix rank not equal to size of non zero number of singular values after linear svd #28063

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dtanalytic opened this issue Dec 24, 2024 · 4 comments
Labels

Comments

@dtanalytic
Copy link

Describe the issue:

While I was trying to reproduce source code of het_white test in statsmodels package i got an error of unequality of matrix ranks. Diggering in the problem I understood that question was in 2 ways of calculating degrees of freedom in statmodels. One uses a rank of exog matrix and another - rank of diagonal matrix of singular values, computed after np.linalg.svd of the same exog matric. As far as I understand statistics 2 ways must be equal, but no... I have an example of colab dataset and source code where computed ranks are different. Please help me to solve the problem...

Reproduce the code example:

import statsmodels.api as sm

import pandas as pd
import numpy as np

df = pd.read_csv('/content/sample_data/california_housing_train.csv')
X = sm.add_constant(df.drop(columns='median_house_value'))
y = df['median_house_value']

# Обучение модели линейной регрессии
model_res = sm.OLS(y, X).fit()
x = model_res.model.exog
y = model_res.resid

# тест предполагает наличие константы
x = sm.add_constant(x)
nvars0 = x.shape[1]
i0, i1 = np.triu_indices(nvars0)
exog = x[:, i0] * x[:, i1]
nobs, nvars = exog.shape

u, s, vt = np.linalg.svd(exog, False)

np.linalg.matrix_rank(np.diag(s))==np.linalg.matrix_rank(exog)

Error message:

result is False

Python and NumPy Versions:

1.26.4
3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]

Runtime Environment:

[{'numpy_version': '1.26.4',
'python': '3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]',
'uname': uname_result(system='Linux', node='a0806de48948', release='6.1.85+', version='#1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_KNM',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': '/usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so',
'internal_api': 'openblas',
'num_threads': 2,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'}]

Context for the issue:

This issue is connected with possible problem in calcuating white test of heteroscedasticity

@jakevdp
Copy link
Contributor

jakevdp commented Dec 24, 2024

The matrix rank is computed based on the number of singular values that are larger than a particular floating point tolerance. This means that when a singular value happens to be very near that cutoff, the rank will be sensitive to floating point roundoff errors. In general when working with floating point math, computing the "same" value in different ways will often lead to results that are not equivalent. A classic example is this:

>>> 0.1 + 0.2 == 0.3
False

I suspect in your case, you have singular values which are right at the edge of the cutoff, meaning that when you compute the "same" result two different ways, different floating point roundoff leads to values on either side of the cutoff. You can adjust for this by passing a larger tol parameter to np.matrix_rank; see the np.matrix_rank docs for details on its default value.

@dtanalytic
Copy link
Author

Thanks for your answer, I' ve already read your book (it is one of my favourite!). Passing the same tolerance helped, but I cannot understand why when the default value is used (which is the same , because types of arrays are the same on both sides), the result is different. Reformulating question why when you pass the exact tol value result is correct, and when using default - not.
Your "classic" example is teachful, but how is it connected with my problem, the lowest singular value is rather big - larger than 0.01, could there be problems with roundoff ?

@jakevdp
Copy link
Contributor

jakevdp commented Dec 24, 2024

Reformulating question why when you pass the exact tol value result is correct, and when using default - not.

Neither is correct and neither is incorrect: what matrix_rank does is look at the singular values and count the number above some cutoff. You're computing two slightly different sets of singular values, so you get a different answer. If what you're computing is sensitive to that difference, you need a looser tolerance to reflect that.

Your "classic" example is teachful, but how is it connected with my problem, the lowest singular value is rather big - larger than 0.01, could there be problems with roundoff ?

Sorry to be unclear. My point was to illustrate that if you compute the "same" value two different ways in floating point arithmetic, you may not get the exact same answer. You are (implicitly) computing the "same" singular values two different ways, and getting different answers. This is expected in floating point arithmetic, because all computations are subject to floating point roundoff error, and different ways of expressing the same computation will lead to errors accumulating differently.

Does that make sense?

@rgommers rgommers changed the title BUG: <Please write a comprehensive title after the 'BUG: ' prefix>source matrix rank not equal to size of non zero number of singular values after linear svd BUG: source matrix rank not equal to size of non zero number of singular values after linear svd Dec 24, 2024
@dtanalytic
Copy link
Author

Thank you very much, I appreciate your help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants