-
-
Notifications
You must be signed in to change notification settings - Fork 26k
Description
Description
When using SparsePCA
, the transform()
method incorrectly scales the results based on the number of rows in the data matrix passed.
Proposed Fix
I am regrettably unable to do a pull request from where I sit. The issue is with this chunk of code, as of writing this at line number 179 in sparse_pca.py:
U = ridge_regression(self.components_.T, X.T, ridge_alpha,
solver='cholesky')
s = np.sqrt((U ** 2).sum(axis=0))
s[s == 0] = 1
U /= s
return U
I honestly do not understand the details of the chosen implementation of SparsePCA. Depending on the objectives of the class, making use of the features as significant for unseen examples requires one of two modifications. Either (a) learn the scale factor s
from the training data (i.e., make it an instance attribute like .scale_factor_
), or (b) use .mean(axis=0)
instead of .sum(axis=0)
to remove the number-of-examples dependency.
Steps/Code to Reproduce
from sklearn.decomposition import SparsePCA
import numpy as np
def get_data( count, seed ):
np.random.seed(seed)
col1 = np.random.random(count)
col2 = np.random.random(count)
data = np.hstack([ a[:,np.newaxis] for a in [
col1 + .01*np.random.random(count),
-col1 + .01*np.random.random(count),
2*col1 + col2 + .01*np.random.random(count),
col2 + .01*np.random.random(count),
]])
return data
train = get_data(1000,1)
spca = SparsePCA(max_iter=20)
results_train = spca.fit_transform( train )
test = get_data(10,1)
results_test = spca.transform( test )
print( "Training statistics:" )
print( " mean: %12.3f" % results_train.mean() )
print( " max: %12.3f" % results_train.max() )
print( " min: %12.3f" % results_train.min() )
print( "Testing statistics:" )
print( " mean: %12.3f" % results_test.mean() )
print( " max: %12.3f" % results_test.max() )
print( " min: %12.3f" % results_test.min() )
Output:
Training statistics:
mean: -0.009
max: 0.067
min: -0.080
Testing statistics:
mean: -0.107
max: 0.260
min: -0.607
Expected Results
The test results min/max values are on the same scale as the training results.
Actual Results
The test results min/max values are much larger than the training results, because fewer examples were used. It is trivial to repeat this process with various sizes of training and testing data to see the relationship.