Skip to content

SparsePCA incorrectly scales results in .transform() #9394

@AndreusUltimus

Description

@AndreusUltimus

Description

When using SparsePCA, the transform() method incorrectly scales the results based on the number of rows in the data matrix passed.

Proposed Fix

I am regrettably unable to do a pull request from where I sit. The issue is with this chunk of code, as of writing this at line number 179 in sparse_pca.py:

        U = ridge_regression(self.components_.T, X.T, ridge_alpha,
                             solver='cholesky')
        s = np.sqrt((U ** 2).sum(axis=0))
        s[s == 0] = 1
        U /= s
        return U

I honestly do not understand the details of the chosen implementation of SparsePCA. Depending on the objectives of the class, making use of the features as significant for unseen examples requires one of two modifications. Either (a) learn the scale factor s from the training data (i.e., make it an instance attribute like .scale_factor_), or (b) use .mean(axis=0) instead of .sum(axis=0) to remove the number-of-examples dependency.

Steps/Code to Reproduce

from sklearn.decomposition import SparsePCA
import numpy as np


def get_data( count, seed ):
    np.random.seed(seed)
    col1 = np.random.random(count)
    col2 = np.random.random(count)

    data = np.hstack([ a[:,np.newaxis] for a in [
        col1 + .01*np.random.random(count),
        -col1 + .01*np.random.random(count),
        2*col1 + col2 + .01*np.random.random(count),
        col2 + .01*np.random.random(count),
        ]])
    return data


train = get_data(1000,1)
spca = SparsePCA(max_iter=20)
results_train = spca.fit_transform( train )

test = get_data(10,1)
results_test = spca.transform( test )

print( "Training statistics:" )
print( "  mean: %12.3f" % results_train.mean() )
print( "   max: %12.3f" % results_train.max() )
print( "   min: %12.3f" % results_train.min() )
print( "Testing statistics:" )
print( "  mean: %12.3f" % results_test.mean() )
print( "   max: %12.3f" % results_test.max() )
print( "   min: %12.3f" % results_test.min() )

Output:

Training statistics:
  mean:       -0.009
   max:        0.067
   min:       -0.080
Testing statistics:
  mean:       -0.107
   max:        0.260
   min:       -0.607

Expected Results

The test results min/max values are on the same scale as the training results.

Actual Results

The test results min/max values are much larger than the training results, because fewer examples were used. It is trivial to repeat this process with various sizes of training and testing data to see the relationship.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions