Closed
The LSHForest implementation in scikit-learn currently supports only a single hasher, `GaussianRandomProjectionHash`, and its corresponding metric, cosine distance. It would be much more useful if it supported Minkowski distances (including Euclidean) and distances facilitated by other hash functions (Minkowski distances might use the p-stable distributions technique).
I propose the following parameters:
```
metric : string or DistanceMetric object
    The exact metric calculated to select neighbors. ...

p : integer, optional (default=2)
    Power parameter for the Minkowski metric ...

hasher : estimator or 'auto'
    An estimator whose ``transform`` method hashes each
    query into a 32-bit integer. If 'auto' (default), a hasher
    appropriate for the given metric is used, or a
    ``ValueError`` is raised if none is available.
```
Thus the 'auto' hasher for `metric='cosine'` might be `GaussianRandomProjectionHash(32)`, but a user could specify an equivalent based on `SparseRandomProjection`, for instance.
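To illustrate the proposed `hasher` contract, here is a minimal sketch of a cosine-LSH hasher that packs the signs of 32 random hyperplane projections into a 32-bit integer. The class name and its parameters are hypothetical, not part of scikit-learn's API:

```python
import numpy as np


class RandomHyperplaneHasher:
    """Sketch of a hasher satisfying the proposed ``transform`` contract
    (hypothetical class; illustrates the interface, not scikit-learn code).

    Each sample is mapped to a 32-bit integer: one bit per random
    hyperplane, set when the projection onto that hyperplane is positive.
    """

    def __init__(self, n_bits=32, random_state=0):
        self.n_bits = n_bits
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.RandomState(self.random_state)
        # One random Gaussian hyperplane per output bit.
        self.components_ = rng.randn(self.n_bits, X.shape[1])
        return self

    def transform(self, X):
        # Boolean sign matrix of shape (n_samples, n_bits).
        signs = (X @ self.components_.T) > 0
        # Pack the bits: bit i weighted by 2**i, then narrow to uint32.
        weights = 1 << np.arange(self.n_bits, dtype=np.uint64)
        return (signs.astype(np.uint64) @ weights).astype(np.uint32)
```

Because the hash depends only on the signs of the projections, two vectors collide with probability proportional to the angle between them, which is what makes this family suitable for the cosine metric.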
The following extensions may be best implemented as separate PRs:
- For the moment we should retain the requirement that the hash be represented as a 32-bit integer.
- We should also implement the p-stable distributions hash (or a better alternative if it exists) to support Minkowski metrics, and make this the default hasher before the next scikit-learn release; see #3990 (Support approximating Euclidean/Manhattan metrics in LSHForest).
- Very low priority: In some cases calculating the exact distance in the original feature space may be a waste of time, while an approximate metric can be evaluated as the Hamming distance between the query and returned hashes (assuming all hash functions are independent, which they are currently). One might thus be able to set `metric='approximate'` and rely on `hasher` to calculate the returned approximate distances. This also means `_fit_X` does not need to be stored.
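The p-stable scheme mentioned above could look roughly like the following sketch (function and parameter names are hypothetical). For the Euclidean case it draws projections from a Gaussian, which is 2-stable, and quantizes each projection into buckets of width `w`, following Datar et al. (2004):

```python
import numpy as np


def pstable_hash(X, n_hashes=8, w=4.0, seed=0):
    """Sketch of p-stable LSH for the Euclidean metric (hypothetical API).

    Computes h(x) = floor((a . x + b) / w) for each of ``n_hashes``
    projections, where a is drawn from a 2-stable (Gaussian)
    distribution and b is uniform in [0, w).
    """
    rng = np.random.RandomState(seed)
    a = rng.randn(X.shape[1], n_hashes)    # 2-stable projection vectors
    b = rng.uniform(0, w, size=n_hashes)   # random offsets within a bucket
    return np.floor((X @ a + b) / w).astype(np.int64)
```

Nearby points fall into the same bucket with high probability, so concatenating several such hashes (and mapping them into the required 32-bit representation) would give an LSHForest hasher for Minkowski metrics.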
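For the `metric='approximate'` idea, the approximate distance between a query hash and the stored 32-bit hashes reduces to a bitwise Hamming distance, which needs no access to `_fit_X`. A minimal sketch (function name is hypothetical):

```python
import numpy as np


def hamming_distance32(query_hash, stored_hashes):
    """Sketch of an approximate distance between 32-bit LSH codes
    (hypothetical helper): the number of differing bits per stored hash.
    """
    x = np.bitwise_xor(np.uint32(query_hash),
                       np.asarray(stored_hashes, dtype=np.uint32))
    # Popcount: view each uint32 as 4 bytes, unpack to bits, sum per row.
    return np.unpackbits(x.reshape(-1, 1).view(np.uint8), axis=1).sum(axis=1)
```

Since the bits come from independent hash functions, the expected Hamming distance is monotone in the underlying metric, which is what justifies returning it directly instead of recomputing exact distances.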