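Hello, I was wondering if it is possible to use the BallTree from sklearn with a string metric such as the Levenshtein distance. Here is what I tried:

$ pip install scikit-learn python-Levenshtein
$ python
>>> import Levenshtein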
>>> import numpy as np
>>> from sklearn.neighbors import BallTree, DistanceMetric
>>>
>>> X = np.array(["Some string", "Some other string", "Yet another string"])
>>> tree = BallTree(X, metric=DistanceMetric.get_metric('pyfunc', func=Levenshtein.distance))
ValueError: could not convert string to float: 'Some string'

Do you have any smart ideas to coax sklearn into accepting strings as the input?
Replies: 1 comment 6 replies
Below is quite a nice hack: it maps each string to a one-element double-precision float vector holding the string's index in a list, and maps such vectors back to strings inside the metric. Since a double-precision float can represent every integer up to 2^53 exactly, 2^53 is also the maximum number of strings in a dataset. This is not a serious limitation for most datasets, since 2^53 one-character strings already correspond to roughly 8 PiB of data at one byte per character.

$ pip install scikit-learn python-Levenshtein
$ python
>>> from typing import Optional
>>> import Levenshtein
>>> import numpy as np
>>> from sklearn.neighbors import BallTree, DistanceMetric
>>>
>>> texts = ["Some string", "Some other string", "Yet another string"]
>>> query = "Smeo srting"
>>> knn = 2
>>>
>>> def text_to_vector(text: str, idx: Optional[int] = None) -> np.ndarray:
...     # A text without an index is new: register it and use its list position.
...     if idx is None:
...         idx = len(texts)
...         texts.append(text)
...     return np.array([idx], dtype=np.float64)
...
>>> def vector_to_text(idx: np.ndarray) -> str:
...     return texts[int(idx[0])]
...
>>> def levdist(x: np.ndarray, y: np.ndarray) -> int:
...     # Decode both index vectors back to strings before measuring the distance.
...     x, y = map(vector_to_text, (x, y))
...     return Levenshtein.distance(x, y)
...
>>> X = [text_to_vector(text, idx) for idx, text in enumerate(texts)]
>>> x = [text_to_vector(query)]
>>>
>>> tree = BallTree(X, metric=DistanceMetric.get_metric('pyfunc', func=levdist))
>>> (dists,), (idxs,) = tree.query(x, k=knn)
>>> for dist, idx in zip(dists, idxs):
...     print(f'{texts[idx]}, {int(dist)}')
...
Some string, 4
Some other string, 8
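To sanity-check what the tree returns, the same index-indirection idea can be sketched without sklearn. This is only an illustration and not part of the original reply: `levenshtein` is a hypothetical pure-Python stand-in for `Levenshtein.distance`, `brute_force_knn` is a made-up helper that compares the query against every stored string, and vectors are plain one-element lists of floats rather than NumPy arrays.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (stand-in for Levenshtein.distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


texts = ["Some string", "Some other string", "Yet another string"]


def index_to_vector(idx: int) -> List[float]:
    # The "vector" only carries the position of the string in `texts`.
    return [float(idx)]


def vector_to_text(vec: List[float]) -> str:
    return texts[int(vec[0])]


def levdist(x: List[float], y: List[float]) -> int:
    # The metric sees index vectors and decodes them back to strings.
    return levenshtein(vector_to_text(x), vector_to_text(y))


def brute_force_knn(query: str, k: int) -> List[Tuple[str, int]]:
    n_indexed = len(texts)
    texts.append(query)  # register the query so it gets an index too
    q = index_to_vector(n_indexed)
    scored = sorted((levdist(q, index_to_vector(i)), i) for i in range(n_indexed))
    return [(texts[i], dist) for dist, i in scored[:k]]


nearest = brute_force_knn("Smeo srting", k=2)
for text, dist in nearest:
    print(f"{text}, {dist}")
```

A linear scan like this costs one distance computation per stored string, so it is only practical for small datasets, but it gives a reference answer to compare against the `BallTree` query.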
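The 2^53 limit on the dataset size can be checked directly. The snippet below is a small sketch, not part of the original reply: `survives_round_trip` is a hypothetical helper that tests whether an index comes back unchanged after being stored as a float, and the size estimate assumes one byte per character.

```python
LIMIT = 2 ** 53  # largest n such that every integer in [0, n] is an exact double


def survives_round_trip(idx: int) -> bool:
    """Return True if idx is unchanged after being stored as a float."""
    return int(float(idx)) == idx


print(survives_round_trip(LIMIT - 1))  # True
print(survives_round_trip(LIMIT))      # True
print(survives_round_trip(LIMIT + 1))  # False: rounds to 2**53

# 2**53 one-character strings at one byte each, expressed in PiB (2**50 bytes).
size_in_pib = LIMIT / 2 ** 50
print(size_in_pib)  # 8.0
```

Past this point two different strings could be assigned the same float index, so the metric would silently compare the wrong texts.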