Wrong results from pairwise_distances #13929

@TheRevanchist

Hey guys,

I recently updated scikit-learn to version 0.21, and the results of one of my projects suddenly became very bad. After more than a day of investigation, I traced the problem to pairwise_distances: the distance matrix becomes all zeros after some index n, in my case 2028.

Essentially, in this line of code:

distances = sklearn.metrics.pairwise.pairwise_distances(X)

where X is a feature matrix of size (8131, 1024).

The values in distances are correct for the first 2028 elements, but from element 2029 onward every value is 0.
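To locate where a distance matrix degenerates like this, a small check along the following lines may help. It is only a sketch: the helper name `first_all_zero_row` is hypothetical, it assumes that "elements" refers to rows of the matrix, and the toy matrix is a stand-in, not the real data.

```python
import numpy as np


def first_all_zero_row(distances, atol=0.0):
    """Return the index of the first row whose entries are all zero
    (within atol), or None if no such row exists."""
    zero_rows = np.all(np.abs(distances) <= atol, axis=1)
    idx = np.flatnonzero(zero_rows)
    return int(idx[0]) if idx.size else None


# Toy stand-in for the reported output: a valid 5x5 Euclidean distance
# matrix whose rows 3 and 4 have been overwritten with zeros.
rng = np.random.default_rng(0)
points = rng.random((5, 3))
toy = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
toy[3:] = 0.0

print(first_all_zero_row(toy))  # 3
```

A valid row always contains one zero on the diagonal, so the check requires the whole row to be zero before flagging it.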

This error doesn't happen if you use version 0.20.2 or 0.20.3.

I also saw that if we use scipy.spatial.distance.pdist, the results match those of sklearn 0.20.
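The comparison against scipy can be sketched roughly as follows. The random matrix and its shape are assumptions chosen to keep the example small, so it will not necessarily trigger the problem described here; it only shows how the two results can be lined up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for the real (8131, 1024) float32 feature matrix.
rng = np.random.default_rng(0)
X = rng.random((200, 16)).astype(np.float32)

# pairwise_distances uses the Euclidean metric by default.
d_sklearn = pairwise_distances(X)

# pdist returns a condensed vector; squareform expands it to a full matrix.
d_scipy = squareform(pdist(X))

# Largest element-wise disagreement between the two libraries.
max_abs_diff = float(np.max(np.abs(d_sklearn - d_scipy)))
print(d_sklearn.shape, d_scipy.shape, max_abs_diff)
```

A large `max_abs_diff`, or a block of zeros in only one of the two matrices, would point to the kind of discrepancy reported above.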

The error appears to be an overflow, and it can be mitigated by casting X to float64. I saw that there have been similar errors in the past. But since this works in 0.20 and fails in 0.21, it might be worth mentioning in the documentation that a float32 feature matrix (X) should be cast to float64, which I confirmed solves the issue.
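The workaround amounts to something like the sketch below. The random matrix is again a hypothetical stand-in for the real features. Note that the float64 copy doubles the memory used by X.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical float32 features standing in for the real X.
rng = np.random.default_rng(0)
X = rng.random((200, 16)).astype(np.float32)

# Cast to float64 before computing distances; astype returns a new array,
# so the original float32 matrix is left untouched.
X64 = X.astype(np.float64)
distances = pairwise_distances(X64)

print(distances.shape, distances.dtype)
```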

I can provide X and the three distance matrices (sklearn 0.21, sklearn 0.20, and pdist) if someone wants to investigate further or replicate these results.

Thanks!
