DBScan Clustering of string data #3737

Closed
@vmirly

Description

I want to use the DBSCAN clustering algorithm from scikit-learn on a dataset of character strings. I set metric=Levenshtein.distance, and my data is stored in a NumPy array. Why does scikit-learn try to convert the strings to float?

I am using the following packages:

CPython 2.7.3
IPython 2.0.0

numpy 1.9.0
scipy 0.14.0
scikit-learn 0.15.2
matplotlib 1.4.0
pyprind 2.5.0

My data shape and DBSCAN object are defined as follows:

d_arr.shape
(30079, 1)

dbs = sklearn.cluster.DBSCAN(eps=4, min_samples=2, metric=Levenshtein.distance)
dbs.fit(d_arr)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-a7573a2d5bba> in <module>()
----> 1 dbs.fit(d_arr)

/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in fit(self, X)
    246             Overwrite keywords from __init__.
    247         """
--> 248         clust = dbscan(X, **self.get_params())
    249         self.core_sample_indices_, self.labels_ = clust
    250         self.components_ = X[self.core_sample_indices_].copy()

/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, random_state)
     99                                            leaf_size=leaf_size,
    100                                            metric=metric, p=p)
--> 101         neighbors_model.fit(X)
    102 
    103     # Calculate neighborhood for all samples. This leaves the original point

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
    659             Training data. If array or matrix, shape = [n_samples, n_features]
    660         """
--> 661         return self._fit(X)

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
    232             self._tree = BallTree(X, self.leaf_size,
    233                                   metric=self.effective_metric_,
--> 234                                   **self.effective_metric_params_)
    235         elif self._fit_method == 'kd_tree':
    236             self._tree = KDTree(X, self.leaf_size,

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()

/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):

ValueError: could not convert string to float: TACGTAGGGGGCAA
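
The traceback shows the input being passed through numpy's asarray inside the BallTree constructor, which is where the string-to-float conversion fails. One possible workaround, sketched below, is to compute the pairwise edit distances up front and pass the resulting matrix to DBSCAN with metric="precomputed". The helper names (levenshtein, pairwise_levenshtein, cluster_strings) are hypothetical, and the pure-Python distance function is only a stand-in for Levenshtein.distance so that the sketch runs without that package.

```python
import numpy as np


def levenshtein(a, b):
    """Edit distance between two strings (stand-in for Levenshtein.distance)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def pairwise_levenshtein(strings):
    """Symmetric (n, n) float matrix of edit distances between all strings."""
    n = len(strings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(strings[i], strings[j])
    return dist


def cluster_strings(strings, eps=4, min_samples=2):
    """Run DBSCAN on a precomputed distance matrix and return the labels."""
    # Imported here so the distance helpers stay usable without scikit-learn.
    from sklearn.cluster import DBSCAN

    dist = pairwise_levenshtein(strings)
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit(dist).labels_
```

With the data from this issue, the call might look like cluster_strings(list(d_arr[:, 0])). Note that a full matrix for 30079 strings has roughly 900 million entries, so this approach may only be practical on a subsample or with a more memory-efficient neighbor search.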
