Skip to content

K-Means: Subtract X_means from initial centroids iff it's also subtracted from X #6741

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

tomtung
Copy link

@tomtung tomtung commented May 2, 2016

Reference Issue

Fixes #6740

What does this implement/fix? Explain your changes.

Subtract X_means from initial centroids iff it's also subtracted from X

Any other comments?

The bug happens when X is sparse and initial cluster centroids are given. In this case the means of each of X's columns are computed and subtracted from init for no reason.

To reproduce:

import numpy as np
import scipy
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Get a local optimum
centers = KMeans(n_clusters=3).fit(X).cluster_centers_

# Fit starting from a local optimum shouldn't change the solution
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
)

# The same should be true when X is sparse, but wasn't before the bug fix
X_sparse = scipy.sparse.csr_matrix(X)
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
)

…cted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )
@maniteja123
Copy link
Contributor

Hi, thanks for the fix. It looks like the bug is now fixed. Can you please add the above example as a regression test? Thanks.

@raghavrv
Copy link
Member

Hi, thanks for the fix. It looks like the bug is now fixed.

You mean to say it will be fixed by this PR right?

Besides +1 for the NRT like @maniteja123 suggested

@tguillemot
Copy link
Contributor

@tomtung Have you done any update of this PR ?
Have you add the regression test ?

@amueller
Copy link
Member

are you still working on this? I'll put the "needs contributor" tag on this for now.

@amueller amueller added Bug Easy Well-defined and straightforward way to resolve labels Oct 25, 2016
@amueller amueller added this to the 0.19 milestone Oct 25, 2016
@kiote
Copy link
Contributor

kiote commented Nov 7, 2016

Seems like bug is fixed and the code need some care to merge it? I think I can handle this.

@jnothman
Copy link
Member

jnothman commented Nov 7, 2016

Thanks @kiote. It also needs a test.

@amueller
Copy link
Member

amueller commented Nov 9, 2016

I think @jkarno might be working on this, see #6740. (sorry I was offline for the last 4 days)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Easy Well-defined and straightforward way to resolve
Projects
None yet
Development

Successfully merging this pull request may close these issues.

K-Means: Should subtract X_means from initial centroids iff it's also subtracted from X
7 participants