Closed
Description
I want to use the DBSCAN clustering algorithm from scikit-learn on a dataset of character strings. I passed metric=Levenshtein.distance, and my data is stored in a numpy array. Why does scikit-learn try to convert the strings to float?
I am using the following packages:
CPython 2.7.3
IPython 2.0.0
numpy 1.9.0
scipy 0.14.0
scikit-learn 0.15.2
matplotlib 1.4.0
pyprind 2.5.0
My data shape and DBSCAN object are defined as below:
d_arr.shape
(30079, 1)
dbs = sklearn.cluster.DBSCAN(eps=4, min_samples=2, metric=Levenshtein.distance)
dbs.fit(d_arr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-a7573a2d5bba> in <module>()
----> 1 dbs.fit(d_arr)
/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in fit(self, X)
246 Overwrite keywords from __init__.
247 """
--> 248 clust = dbscan(X, **self.get_params())
249 self.core_sample_indices_, self.labels_ = clust
250 self.components_ = X[self.core_sample_indices_].copy()
/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, random_state)
99 leaf_size=leaf_size,
100 metric=metric, p=p)
--> 101 neighbors_model.fit(X)
102
103 # Calculate neighborhood for all samples. This leaves the original point
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
659 Training data. If array or matrix, shape = [n_samples, n_features]
660 """
--> 661 return self._fit(X)
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
232 self._tree = BallTree(X, self.leaf_size,
233 metric=self.effective_metric_,
--> 234 **self.effective_metric_params_)
235 elif self._fit_method == 'kd_tree':
236 self._tree = KDTree(X, self.leaf_size,
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: TACGTAGGGGGCAA
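The traceback above ends inside BallTree, which calls numpy's asarray on X, so the string data fails before the custom metric is ever used. One possible workaround, offered as a sketch rather than a confirmed fix for this setup, is to compute the pairwise distances first and pass them with metric="precomputed". The helper names levenshtein, levenshtein_matrix and cluster_strings are hypothetical, and the pure-Python distance is a stand-in for Levenshtein.distance so that the example needs no extra package.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_matrix(strings):
    """Symmetric n x n list of lists holding all pairwise edit distances."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(strings[i], strings[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


def cluster_strings(strings, eps, min_samples):
    """Run DBSCAN on precomputed edit distances and return the labels."""
    # Imported here so the distance helpers work without scikit-learn installed.
    import numpy as np
    from sklearn.cluster import DBSCAN

    # DBSCAN receives a numeric distance matrix, never the raw strings.
    distances = np.array(levenshtein_matrix(strings), dtype=float)
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit(distances).labels_
```

With the array from the report, the call would look like cluster_strings(list(d_arr[:, 0]), eps=4, min_samples=2). Note that a full matrix for 30079 strings has roughly 900 million entries, so this approach may only be practical on a subsample or on deduplicated strings.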