[Help wanted] KMeans error thrown on updating to most recent SKLearn package #11636

Closed
@czhao028

Could this be caused by @jnhansen's pull request #10933? It is the most recent commit that changes the lines involved here.

Description

I'm hitting what looks like a simple error, but I can't work out the cause.
I updated scikit-learn today to the most recent version on GitHub, and now I get errors along the lines of "expected 6 values but given 4" or "expected 4 values but given 3". The code is exactly the same as in the documentation, and it worked before the update. Why could it be throwing this error?

Steps/Code to Reproduce

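from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

# list_of_sentences is defined elsewhere (a list of strings)
X = pipeline.fit_transform(list_of_sentences)  # sparse tf-idf matrix

km = KMeans(n_clusters=3, random_state=0)
km.fit(X)            # sparse input: raises the first error below
km.fit(X.toarray())  # dense input: raises the second error below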
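For anyone who wants to try this without my data, here is a self-contained sketch of the same steps. The corpus is a hypothetical stand-in (the real list_of_sentences is not shown in this report), and n_init is passed explicitly only so the run does not depend on the library default. On a consistent install I would expect both fits to complete without error.

```python
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus; the original list_of_sentences is not shown.
list_of_sentences = [
    "the cat sat on the mat",
    "a cat chased a mouse",
    "stocks fell sharply on monday",
    "the market rallied after the report",
    "rain is expected this weekend",
    "the forecast calls for heavy rain",
]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
X = pipeline.fit_transform(list_of_sentences)  # SciPy sparse tf-idf matrix

# n_init is set explicitly (an addition to the original snippet).
km_sparse = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
km_dense = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X.toarray())

# One cluster label per sentence for each fit.
print(km_sparse.labels_, km_dense.labels_)
```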

Expected Results

No error thrown: KMeans fits the tfidf matrix.

Actual Results

When I call fit on the KMeans model with X as a sparse matrix, I get this error:

    kmeans_model.fit(tfidf_matrix)
  File "/Users/..../anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 896, in fit
    return_n_iter=True)
  File "/Users/.../anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 346, in k_means
    x_squared_norms=x_squared_norms, random_state=random_state)
  File "/Users/.../anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 493, in _kmeans_single_lloyd
    distances=distances)
  File "/Users/.../anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 616, in _labels_inertia
    X, x_squared_norms, centers, labels, distances=distances)
  File "sklearn/cluster/_k_means.pyx", line 104, in sklearn.cluster._k_means.__pyx_fuse_1_assign_labels_csr
TypeError: __pyx_fuse_1_assign_labels_csr() takes exactly 6 positional arguments (4 given)

I think it may have something to do with the sample weights. Specifically, this call is where the error is thrown:
inertia = _k_means._assign_labels_csr(X, x_squared_norms, centers, labels, distances=distances)
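To illustrate the mismatch the traceback suggests, here is a minimal pure-Python sketch. The function below is a hypothetical stand-in, not scikit-learn code: its six parameter names (including sample_weight) are my guess, chosen only to match the count of six in the TypeError. The wrapper calls it the way line 616 of k_means_.py does in the traceback, with four positional arguments plus distances as a keyword. One possible reading is that the Python file and the compiled extension come from different revisions, but that is an assumption.

```python
import inspect


# Hypothetical stand-in for the compiled helper: six positional parameters,
# matching only the count reported in the TypeError. The names are assumed.
def assign_labels_six_args(X, sample_weight, x_squared_norms, centers, labels,
                           distances):
    return 0.0


def call_like_traceback(helper):
    """Call helper with 4 positional arguments and distances as a keyword."""
    return helper("X", "x_squared_norms", "centers", "labels",
                  distances="distances")


def positional_arity(func):
    """Count the parameters that can be passed positionally."""
    params = inspect.signature(func).parameters.values()
    return sum(
        p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD) for p in params
    )


message = None
try:
    call_like_traceback(assign_labels_six_args)
except TypeError as exc:
    # Pure Python words this differently from the Cython message above,
    # but the cause is the same kind of argument-count mismatch.
    message = str(exc)

print(positional_arity(assign_labels_six_args), message)
```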

When I run KMeans on a dense matrix (X.toarray()), I get:

     1 km = KMeans(n_clusters=3, random_state=0)
----> 2 km.fit(X.toarray())

~/anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
    894                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    895                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 896                 return_n_iter=True)
    897         return self
    898 

~/anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    344                 X, n_clusters, max_iter=max_iter, init=init, verbose=verbose,
    345                 precompute_distances=precompute_distances, tol=tol,
--> 346                 x_squared_norms=x_squared_norms, random_state=random_state)
    347             # determine if these results are the best so far
    348             if best_inertia is None or inertia < best_inertia:

~/anaconda3/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in _kmeans_single_elkan(X, n_clusters, max_iter, init, verbose, x_squared_norms, random_state, tol, precompute_distances)
    398         print('Initialization complete')
    399     centers, labels, n_iter = k_means_elkan(X, n_clusters, centers, tol=tol,
--> 400                                             max_iter=max_iter, verbose=verbose)
    401     inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
    402     return labels, inertia, centers, n_iter

sklearn/cluster/_k_means_elkan.pyx in sklearn.cluster._k_means_elkan.k_means_elkan()

Versions

Please run the following snippet and paste the output below.

import platform; print(platform.platform())
Darwin-17.6.0-x86_64-i386-64bit
import sys; print("Python", sys.version)
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.14.5
import scipy; print("SciPy", scipy.__version__)
SciPy 1.1.0
import sklearn; print("Scikit-Learn", sklearn.__version__)
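The same information can be collected in one pass. This is a small sketch; describe_module is a hypothetical helper name, not part of any library. It also prints each module's file path, which may help show which installation is actually being imported.

```python
import importlib


def describe_module(name):
    """Return (version, file path) for a module, or (None, None) if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None, None
    # Not every module defines __version__ or __file__, so fall back to None.
    return getattr(module, "__version__", None), getattr(module, "__file__", None)


for name in ("numpy", "scipy", "sklearn"):
    version, path = describe_module(name)
    print(name, version, path)
```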

What could be causing the issue?
