
TSNE with correlation metric: ValueError: Distance matrix 'X' must be symmetric #4475


Closed
cel4 opened this issue Mar 31, 2015 · 18 comments
Labels: Bug, module:manifold, Needs Decision - Close, Needs Decision

Comments

@cel4

cel4 commented Mar 31, 2015

from sklearn.manifold import TSNE
import numpy as np
np.random.seed(42)

data = np.random.rand(10, 3)
data[-1, :] = 0

model = TSNE(metric="correlation")
model.fit_transform(data)

TSNE raises an obscure error when the data set contains rows with standard deviation 0, and therefore undefined correlations:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-658142c1e315> in <module>()
      1 model = TSNE(metric="correlation")
----> 2 res = model.fit_transform(data)
      3 ran = model.fit_transform(ran_data)

/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/sklearn/manifold/t_sne.py in fit_transform(self, X, y)
    522             Embedding of the training data in low-dimensional space.
    523         """
--> 524         self.fit(X)
    525         return self.embedding_

/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/sklearn/manifold/t_sne.py in fit(self, X, y)
    447         self.training_data_ = X
    448 
--> 449         P = _joint_probabilities(distances, self.perplexity, self.verbose)
    450         if self.init == 'pca':
    451             pca = RandomizedPCA(n_components=self.n_components,

/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/sklearn/manifold/t_sne.py in _joint_probabilities(distances, desired_perplexity, verbose)
     52     P = conditional_P + conditional_P.T
     53     sum_P = np.maximum(np.sum(P), MACHINE_EPSILON)
---> 54     P = np.maximum(squareform(P) / sum_P, MACHINE_EPSILON)
     55     return P
     56 

/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/scipy/spatial/distance.py in squareform(X, force, checks)
   1479             raise ValueError('The matrix argument must be square.')
   1480         if checks:
-> 1481             is_valid_dm(X, throw=True, name='X')
   1482 
   1483         # One-side of the dimensions is set here.

/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/scipy/spatial/distance.py in is_valid_dm(D, tol, throw, name, warning)
   1562                 if name:
   1563                     raise ValueError(('Distance matrix \'%s\' must be '
-> 1564                                      'symmetric.') % name)
   1565                 else:
   1566                     raise ValueError('Distance matrix must be symmetric.')

ValueError: Distance matrix 'X' must be symmetric
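The root cause can be sketched without scipy or scikit-learn. The helper below is a numpy-only illustration (not scipy's actual code) of the correlation distance: a constant row has zero variance, so the denominator is zero and the distance is NaN; and since NaN != NaN, the symmetry check in squareform fails even though the matrix looks symmetric.

```python
import numpy as np

def correlation_dist(u, v):
    # Illustrative scipy-style correlation distance: 1 - Pearson correlation.
    u = u - u.mean()
    v = v - v.mean()
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - (u @ v) / denom

constant = np.zeros(3)                 # zero-variance row, as in the report
other = np.array([0.1, 0.7, 0.3])
d = correlation_dist(constant, other)  # 0/0 -> NaN
print(np.isnan(d))                     # True

# NaN != NaN is exactly why is_valid_dm rejects the matrix as non-symmetric:
D = np.array([[0.0, d], [d, 0.0]])
print(np.array_equal(D, D.T))          # False, despite the symmetric layout
```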
@amueller amueller added the Bug label Mar 31, 2015
@amueller
Member

Thanks for the report. Maybe the best fix would be to not have the metric produce NaNs, and instead produce zeros. That is how we usually deal with zero-variance cases.

@amueller amueller added the Easy label Mar 31, 2015
LowikC pushed a commit to LowikC/scikit-learn that referenced this issue Apr 2, 2015
…ith zero variance samples when using correlation metric.

The best fix would be to have the metric not return NaN values, but as the correlation metric is actually computed by scipy, we can't modify it directly.
So, in the case of metric == 'correlation', we replace rows and columns corresponding to zero-variance samples with the maximum distance (here 1.0).
LowikC pushed a commit to LowikC/scikit-learn that referenced this issue Apr 2, 2015
@littmus

littmus commented Aug 25, 2015

It also happens when using cosine metric.

@amueller
Member

@littmus can you give a small example?

@amueller amueller added this to the 0.17 milestone Sep 9, 2015
@giorgiop
Contributor

@littmus can you give a small example?

Same as above, with cosine

from sklearn.manifold import TSNE
import numpy as np
np.random.seed(42)

data = np.random.rand(10, 3)
data[-1, :] = 0

model = TSNE(metric="cosine")
model.fit_transform(data)

In the case of cosine the problem is with negative values instead of NaN:

>>> pairwise_distances(data, metric='cosine')
array([[  2.22044605e-16,   3.93068804e-01,   3.16458410e-02,
          3.41088501e-01,   4.14053926e-01,   6.83873363e-02,
          1.22318330e-01,   2.66348778e-02,   8.99886820e-02,
          1.00000000e+00],
       [  3.93068804e-01,   1.11022302e-16,   6.08604862e-01,
          2.45187852e-01,   7.51253761e-04,   4.08564041e-01,
          2.20856854e-01,   4.02945244e-01,   2.80685191e-01,
          1.00000000e+00],
       [  3.16458410e-02,   6.08604862e-01,   2.22044605e-16,
          4.93788565e-01,   6.31623359e-01,   1.18832183e-01,
          2.39696816e-01,   6.75443089e-02,   1.57980145e-01,
          1.00000000e+00],
       [  3.41088501e-01,   2.45187852e-01,   4.93788565e-01,
         -2.22044605e-16,   2.69769895e-01,   1.52418674e-01,
          6.20247116e-02,   2.16408658e-01,   5.22888116e-01,
          1.00000000e+00],
       [  4.14053926e-01,   7.51253761e-04,   6.31623359e-01,
          2.69769895e-01,   0.00000000e+00,   4.38078471e-01,
          2.45155524e-01,   4.29871796e-01,   2.86287653e-01,
          1.00000000e+00],
       [  6.83873363e-02,   4.08564041e-01,   1.18832183e-01,
          1.52418674e-01,   4.38078471e-01,   0.00000000e+00,
          3.99266855e-02,   1.00043973e-02,   2.74710246e-01,
          1.00000000e+00],
       [  1.22318330e-01,   2.20856854e-01,   2.39696816e-01,
          6.20247116e-02,   2.45155524e-01,   3.99266855e-02,
          0.00000000e+00,   5.95202718e-02,   2.66727644e-01,
          1.00000000e+00],
       [  2.66348778e-02,   4.02945244e-01,   6.75443089e-02,
          2.16408658e-01,   4.29871796e-01,   1.00043973e-02,
          5.95202718e-02,   0.00000000e+00,   1.94449773e-01,
          1.00000000e+00],
       [  8.99886820e-02,   2.80685191e-01,   1.57980145e-01,
          5.22888116e-01,   2.86287653e-01,   2.74710246e-01,
          2.66727644e-01,   1.94449773e-01,   2.22044605e-16,
          1.00000000e+00],
       [  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          1.00000000e+00]])

@ogrisel
Member

ogrisel commented Sep 30, 2015

For the cosine metric it should be fixed a posteriori by setting negative numbers with small absolute value (e.g. smaller than 10 * np.finfo(X.dtype).eps) to zero.

For the correlation distance NaNs, it looks like a problem in scipy's pdist.
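The a posteriori fix suggested above could look like this. The helper name is hypothetical (not a scikit-learn API); it zeroes only negative entries whose magnitude is within rounding noise of zero, leaving genuinely negative values alone.

```python
import numpy as np

def clip_tiny_negatives(D, factor=10):
    # Hypothetical helper: zero out negatives smaller in magnitude than
    # factor * machine epsilon, i.e. values that are zero up to rounding.
    tol = factor * np.finfo(D.dtype).eps
    D = D.copy()
    D[(D < 0) & (D > -tol)] = 0.0
    return D

# A tiny negative like the -2.22e-16 seen in the cosine matrix above:
D = np.array([[2.22e-16, 0.39],
              [0.39, -2.22e-16]])
print(clip_tiny_negatives(D))
```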

@giorgiop
Contributor

giorgiop commented Oct 1, 2015

Returning NaN seems to be a chosen behaviour in SciPy and not a bug. See for example scipy/#3728. I think we should handle those on our side.

scipy/#3728 also discusses the problem of negative values due to finite-arithmetic issues. I will do some experiments.

By the way, we would get a similar issue with missing values, but I don't think we want to handle that explicitly. See scipy/#3870.

@giorgiop
Contributor

giorgiop commented Oct 1, 2015

I have tried all of sklearn.metrics.pairwise._VALID_METRICS except wminkowski, which is not compatible with t-SNE. They also include scipy's metrics. The only issues are with correlation, yule, dice and sokalsneath (NaN) and with cosine (negative values). The latter is the only one built into sklearn.

Regarding the negative values, we can follow what was suggested by @ogrisel and it works fine. Or we could use the trick with np.ptp as proposed in scipy/#3728. However, clipping values to zero was already done for euclidean_distances, and we should have a consistent solution.

@amueller
Member

@giorgiop there are no missing values (in scikit-learn, apart from the preprocessing module)

@amueller
Member

Clipping to zero sounds good. For correlation, I'd say we should set the distance between constant points to zero.
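A minimal sketch of that suggestion for correlation: replace the undefined (NaN) entries that a constant row produces with zero before handing the matrix on. The matrix here is illustrative, and later comments in the thread argue that zero is not actually a meaningful value for these distances.

```python
import numpy as np

# Illustrative distance matrix: the third point is a constant row,
# so its correlation distances to the others came back as NaN.
D = np.array([[0.0, 0.4, np.nan],
              [0.4, 0.0, np.nan],
              [np.nan, np.nan, 0.0]])

# Replace undefined entries with 0.0, as suggested above:
D_fixed = np.nan_to_num(D, nan=0.0)
print(np.isnan(D_fixed).any())  # False
```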

@amueller
Member

as post-processing?

@giorgiop
Contributor

👍

@asanakoy
Contributor

This is a problem of cosine_distances. For example, we get small negative values when computing the similarity of a vector to itself.

from sklearn.metrics.pairwise import cosine_distances
import numpy as np
x = np.abs(np.random.RandomState(1337).rand(910))
X = np.vstack([x, x])
print(cosine_distances(X))

[[ -4.44089210e-16  -4.44089210e-16]
 [ -4.44089210e-16  -4.44089210e-16]]

The fix is to add, in pairwise.py#L573, almost the same as was done in euclidean_distances:

np.clip(S, 0, 2, S)
if X is Y or Y is None:
    # Ensure that distances between vectors and themselves are set to 0.0.
    # This may not be the case due to floating point rounding errors.
    S.flat[::S.shape[0] + 1] = 0.0
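The effect of that post-processing can be sketched self-containedly. The function below is a numpy-only stand-in for sklearn's cosine_distances (so the sketch runs without scikit-learn), followed by the two proposed lines:

```python
import numpy as np

def cosine_distances_np(X):
    # numpy-only stand-in for sklearn's cosine_distances; rounding in the
    # dot products can leave tiny negatives, e.g. on the diagonal.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

rng = np.random.RandomState(1337)
x = np.abs(rng.rand(910))
S = cosine_distances_np(np.vstack([x, x]))

# The proposed post-processing:
np.clip(S, 0, 2, S)              # cosine distance lies in [0, 2]
S.flat[::S.shape[0] + 1] = 0.0   # exact zeros on the diagonal
print((S >= 0).all(), S[0, 0])
```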

@asanakoy
Contributor

Pull request fixing the negative values: #7732.

@amueller amueller modified the milestone: 0.19 Jun 12, 2017
@jnothman
Member

So it seems the cosine bug has been fixed, but correlation is open.

@haiatn
Contributor

haiatn commented Sep 21, 2020

I think we should close this issue. As I understand it, this metric does not work if one of the vectors has zero variance; returning NaN seems to be the chosen behaviour for this metric, and no solution seems better. We could add a different assertion, but I am not sure we should.

@dkobak
Contributor

dkobak commented Mar 9, 2021

I agree with @haiatn -- I think there is nothing to do here. If there is a row with std=0, then correlation distances are undefined. They are definitely not zero! They are undefined. So one cannot proceed with t-SNE. That's correct behaviour, not a bug. @amueller

@cmarmo cmarmo added the module:manifold and Needs Decision labels and removed the Easy label Dec 7, 2021
@thomasjpfan
Member

thomasjpfan commented Apr 22, 2022

I agree with the comments above and this issue can be closed. On main, the snippet from the opening comment now errors with:

AssertionError: All probabilities should be finite

which is more informative about there being non-finite values, such as NaNs.

Edit: I am leaving this open for now, to see what other maintainers think.
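A user-side guard follows naturally from the discussion: check for zero-variance rows before asking t-SNE for correlation distances. The helper name is hypothetical, not a scikit-learn API.

```python
import numpy as np

def has_zero_variance_rows(X):
    # Hypothetical pre-check: any row with std == 0 makes correlation
    # distances (and hence metric="correlation") undefined.
    return bool((X.std(axis=1) == 0).any())

np.random.seed(42)
data = np.random.rand(10, 3)
data[-1, :] = 0
print(has_zero_variance_rows(data))       # True: correlation is undefined
print(has_zero_variance_rows(data[:-1]))  # False: safe to use the metric
```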

@thomasjpfan thomasjpfan reopened this Apr 22, 2022
@thomasjpfan thomasjpfan added the Needs Decision - Close label Apr 22, 2022
@haiatn
Contributor

haiatn commented May 5, 2022

This looks good imo

@thomasjpfan thomasjpfan closed this as not planned Jun 5, 2022