Closed
The LSHForest implementation in scikit-learn currently supports only a single hasher, `GaussianRandomProjectionHash`, and its corresponding metric, cosine distance. It would be much more useful if it supported Minkowski distances (including Euclidean) and distances facilitated by other hash functions (Minkowski distances might use the p-stable distributions technique).
I propose the following parameters:
```
metric : string or DistanceMetric object
    The exact metric calculated to select neighbors. ...

p : integer, optional (default=2)
    Power parameter for the Minkowski metric ...

hasher : estimator or 'auto'
    An estimator whose ``transform`` method hashes each
    query into a 32-bit integer. If 'auto' (default), a hasher
    appropriate for the given metric is used, or a
    ``ValueError`` is raised if none is available.
```
Thus the 'auto' hasher for `metric='cosine'` might be `GaussianRandomProjectionHash(32)`, but a user could specify an equivalent based on `SparseRandomProjection`, for instance.
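To illustrate the proposed `hasher` contract, here is a minimal sketch of a cosine-LSH hasher that packs the signs of 32 random hyperplane projections into a 32-bit integer. The class name and its parameters are hypothetical, not part of scikit-learn's API:

```python
import numpy as np


class RandomHyperplaneHasher:
    """Sketch of a hasher satisfying the proposed ``transform`` contract
    (hypothetical class; illustrates the interface, not scikit-learn code).

    Each sample is mapped to a 32-bit integer: one bit per random
    hyperplane, set when the projection onto that hyperplane is positive.
    """

    def __init__(self, n_bits=32, random_state=0):
        self.n_bits = n_bits
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.RandomState(self.random_state)
        # One random Gaussian hyperplane per output bit.
        self.components_ = rng.randn(self.n_bits, X.shape[1])
        return self

    def transform(self, X):
        # Boolean sign matrix of shape (n_samples, n_bits).
        signs = (X @ self.components_.T) > 0
        # Pack the bits: bit i weighted by 2**i, then narrow to uint32.
        weights = 1 << np.arange(self.n_bits, dtype=np.uint64)
        return (signs.astype(np.uint64) @ weights).astype(np.uint32)
```

Because the hash depends only on the signs of the projections, two vectors collide with probability proportional to the angle between them, which is what makes this family suitable for the cosine metric.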
The following extensions may be best implemented as separate PRs:
- For the moment we should retain the requirement that the hash be represented as a 32-bit integer.
- We should also implement the p-stable distributions hash (or a better alternative if it exists) to support Minkowski metrics, and make this the default hasher before the next scikit-learn release; see #3990 (Support approximating Euclidean/Manhattan metrics in LSHForest).
- Very low priority: In some cases calculating the exact distance in the original feature space may be a waste of time, while an approximate metric can be evaluated as the Hamming distance between the query and returned hashes (assuming all hash functions are independent, which they are currently). One might thus be able to set `metric='approximate'` and rely on `hasher` to calculate the returned approximate distances. This also means `_fit_X` does not need to be stored.
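The p-stable scheme mentioned above could look roughly like the following sketch (function and parameter names are hypothetical). For the Euclidean case it draws projections from a Gaussian, which is 2-stable, and quantizes each projection into buckets of width `w`, following Datar et al. (2004):

```python
import numpy as np


def pstable_hash(X, n_hashes=8, w=4.0, seed=0):
    """Sketch of p-stable LSH for the Euclidean metric (hypothetical API).

    Computes h(x) = floor((a . x + b) / w) for each of ``n_hashes``
    projections, where a is drawn from a 2-stable (Gaussian)
    distribution and b is uniform in [0, w).
    """
    rng = np.random.RandomState(seed)
    a = rng.randn(X.shape[1], n_hashes)    # 2-stable projection vectors
    b = rng.uniform(0, w, size=n_hashes)   # random offsets within a bucket
    return np.floor((X @ a + b) / w).astype(np.int64)
```

Nearby points fall into the same bucket with high probability, so concatenating several such hashes (and mapping them into the required 32-bit representation) would give an LSHForest hasher for Minkowski metrics.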
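For the `metric='approximate'` idea, the approximate distance between a query hash and the stored 32-bit hashes reduces to a bitwise Hamming distance, which needs no access to `_fit_X`. A minimal sketch (function name is hypothetical):

```python
import numpy as np


def hamming_distance32(query_hash, stored_hashes):
    """Sketch of an approximate distance between 32-bit LSH codes
    (hypothetical helper): the number of differing bits per stored hash.
    """
    x = np.bitwise_xor(np.uint32(query_hash),
                       np.asarray(stored_hashes, dtype=np.uint32))
    # Popcount: view each uint32 as 4 bytes, unpack to bits, sum per row.
    return np.unpackbits(x.reshape(-1, 1).view(np.uint8), axis=1).sum(axis=1)
```

Since the bits come from independent hash functions, the expected Hamming distance is monotone in the underlying metric, which is what justifies returning it directly instead of recomputing exact distances.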