Closed
Description
LabelSpreading fails during predict when using a callable kernel that returns a sparse matrix
sklearn.semi_supervised.LabelSpreading allows the user to provide a callable as the kernel function. However, if this callable returns a sparse matrix, LabelSpreading.predict() fails.
The root cause is that np.dot(sparse, dense) behaves differently from sparse.dot(dense) and does not give the intended result: instead of a matrix product, it returns an object array whose elements are sparse matrices.
For example, the following session shows the issue with np.dot(sparse, dense):
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix([[1,0,0,0,0], [0,1,0,0,0],[0,0,1,0,0]])
>>> b = np.ones((5,8))
>>> a.dot(b).shape
(3, 8)
>>> np.dot(a, b).shape
(5, 8)
>>> a.dot(b)
array([[1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1.]])
>>> np.dot(a, b)
array([[<3x5 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>,
<3x5 sparse matrix of type '<class 'numpy.float64'>'
...
The fix is a one-liner: change np.dot(...) to A.dot(B) or to A @ B (whichever style is preferred) on the following line:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/semi_supervised/_label_propagation.py#L198
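To illustrate the proposed change, here is a minimal sketch that mimics the kind of product computed at prediction time, using small stand-in arrays. The names weight_matrices and label_distributions and their shapes are assumptions chosen for illustration, not copied from the scikit-learn source; only the replacement of np.dot with the @ operator reflects the proposed fix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-ins with assumed shapes: 3 test samples, 5 training samples, 2 classes.
weight_matrices = csr_matrix(
    [[1.0, 0, 0, 0, 0], [0, 1.0, 0, 0, 0], [0, 0, 1.0, 0, 0]]
)
label_distributions = np.arange(10, dtype=float).reshape(5, 2)

# Proposed style: the @ operator dispatches to the sparse implementation
# and yields a regular dense array of shape (3, 2).
probas = weight_matrices @ label_distributions

# Reference result computed entirely with dense arrays.
expected = weight_matrices.toarray() @ label_distributions

print(type(probas).__name__, probas.shape)
print(np.allclose(probas, expected))
```

Both A @ B and A.dot(B) should give the same result here, so the choice between them is purely stylistic.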
Steps/Code to Reproduce
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import classification_report, confusion_matrix
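# Our custom kernel is a sparse RBF kernel containing only the top K nearest neighbors
def topk_rbf(X, Y=None, n_neighbors=10, gamma=1e-5):
    # Fit the neighbor index on X, honoring the n_neighbors argument
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean', n_jobs=-1).fit(X)
    # Sparse graph holding -gamma * squared distance to each nearest neighbor
    W = -1 * nn.kneighbors_graph(Y, mode='distance').power(2) * gamma
    # Exponentiate the stored entries in place to obtain the RBF weights
    np.exp(W.data, out=W.data)
    assert isinstance(W, csr_matrix)
    return W.T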
digits = datasets.load_digits()
rng = np.random.RandomState(2)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
Xtrain = digits.data[indices[:1000]]
Ytrain = digits.target[indices[:1000]]
Xtest = digits.data[indices[1000:1100]]
Ytest = digits.target[indices[1000:1100]]
# The "transductive" learning phase happens during fit, but is not the concern here
# Therefore, none of the labels were masked
model = LabelSpreading(kernel=topk_rbf, n_jobs=-1).fit(Xtrain, Ytrain)
# Here, we try the "inductive" learning phase
predicted_labels = model.predict(Xtest)
print(f"Confusion matrix: {confusion_matrix(y_true=Ytest, y_pred=predicted_labels, labels=model.classes_)}")
print(f"Classification_report: {classification_report(y_true=Ytest, y_pred=predicted_labels)}")
Expected Results
Confusion matrix: [[ 8  0  0  0  0  0  0  0  0  0]
 [ 0 16  0  0  0  0  0  0  0  0]
 [ 0  0  7  0  0  0  0  0  0  0]
 [ 0  0  0  8  0  0  0  0  0  0]
 [ 0  0  0  0  6  0  0  0  0  0]
 [ 0  0  0  0  0 11  0  0  0  0]
 [ 0  0  0  0  0  0 12  0  0  0]
 [ 0  0  0  0  0  0  0 11  0  0]
 [ 0  0  0  0  0  0  0  0 10  0]
 [ 0  0  0  0  0  0  0  0  0 11]]
Classification_report:               precision    recall  f1-score   support
           0       1.00      1.00      1.00         8
           1       1.00      1.00      1.00        16
           2       1.00      1.00      1.00         7
           3       1.00      1.00      1.00         8
           4       1.00      1.00      1.00         6
           5       1.00      1.00      1.00        11
           6       1.00      1.00      1.00        12
           7       1.00      1.00      1.00        11
           8       1.00      1.00      1.00        10
           9       1.00      1.00      1.00        11
    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100
Actual Results
...
Traceback (most recent call last):
  File "sklearn_bugreport_example.py", line 33, in <module>
    predicted_labels = model.predict(Xtest)
  File "/py37/lib/python3.7/site-packages/sklearn/semi_supervised/label_propagation.py", line 169, in predict
    return self.classes_[np.argmax(probas, axis=1)].ravel()
  File "<__array_function__ internals>", line 6, in argmax
  File "/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1153, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
  File "/py37/lib/python3.7/site-packages/numpy/matrixlib/defmatrix.py", line 171, in __array_finalize__
    if (isinstance(obj, matrix) and obj._getitem): return
SystemError: <built-in function isinstance> returned a result with an error set
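Until the library is patched, one possible workaround is to wrap the kernel so that it hands back a dense array. The sketch below is only an assumption about how that could look: densify and sparse_linear are hypothetical names introduced for illustration, and the approach gives up the memory savings of the sparse kernel.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def densify(kernel):
    """Wrap a kernel callable so that it always returns a dense ndarray."""
    def wrapped(X, Y=None, **kwargs):
        W = kernel(X, Y, **kwargs)
        # Convert sparse results to dense; pass dense results through.
        return W.toarray() if issparse(W) else np.asarray(W)
    return wrapped


# Hypothetical toy kernel standing in for topk_rbf: a sparse linear kernel.
def sparse_linear(X, Y=None):
    Y = X if Y is None else Y
    return csr_matrix(X @ Y.T)


dense_kernel = densify(sparse_linear)
K = dense_kernel(np.eye(3), np.ones((2, 3)))
print(type(K).__name__, K.shape)
```

With the reproduction script above, this would be used as LabelSpreading(kernel=densify(topk_rbf), n_jobs=-1); that combination has not been verified here.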
Versions
import sklearn; sklearn.show_versions()
System:
python: 3.7.5rc1 (default, Oct 8 2019, 16:47:45) [GCC 9.2.1 20191008]
executable: /py37/bin/python3
machine: Linux-5.3.0-23-generic-x86_64-with-Ubuntu-19.10-eoan
Python deps:
pip: 19.3.1
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.2
Cython: None
pandas: 0.25.3
Thanks!