Support alternative metrics/hashers in LSHForest #3988

Closed
@jnothman

Description

The LSHForest implementation in scikit-learn currently supports only a single hasher, GaussianRandomProjectionHash, and its corresponding metric, cosine distance. It would be much more useful if it also supported Minkowski distances (including Euclidean) and distances facilitated by other hash functions. Minkowski distances could use the p-stable distributions technique.
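For readers unfamiliar with the p-stable technique, here is a minimal sketch, assuming the standard construction h(x) = floor((a·x + b) / w) with Gaussian `a` for p=2 and Cauchy `a` for p=1. The function name `make_p_stable_hash` and its parameters (`width`, `n_hashes`, `seed`) are hypothetical and not part of scikit-learn. The sketch does not address how the resulting integer buckets would be packed into the 32-bit hash that LSHForest expects.

```python
import numpy as np


def make_p_stable_hash(n_features, p=2, width=4.0, n_hashes=8, seed=0):
    """Return a function mapping samples to integer bucket ids.

    Hypothetical helper for illustration only, not a scikit-learn API.
    """
    rng = np.random.RandomState(seed)
    if p == 2:
        # The Gaussian distribution is 2-stable (Euclidean distance).
        A = rng.standard_normal((n_hashes, n_features))
    elif p == 1:
        # The Cauchy distribution is 1-stable (Manhattan distance).
        A = rng.standard_cauchy((n_hashes, n_features))
    else:
        raise ValueError("this sketch only handles p=1 and p=2")
    # A random offset in [0, width) for each hash function.
    b = rng.uniform(0.0, width, size=n_hashes)

    def hash_fn(X):
        # Project, shift, and quantise into buckets of the given width.
        projected = np.asarray(X, dtype=float) @ A.T + b
        return np.floor(projected / width).astype(np.int64)

    return hash_fn
```

Nearby points fall into the same bucket for most of the hash functions, while distant points rarely do. The bucket `width` trades off recall against selectivity.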

I propose the following parameters:

metric : metric string or DistanceMetric object
    The exact metric calculated to select neighbors. ...
p : integer, optional (default=2)
    Power parameter for the Minkowski metric ...
hasher : estimator or 'auto'
    An estimator whose ``transform`` method hashes each
    query into a 32-bit integer. If 'auto' (default), a hasher
    appropriate for the given metric is used, or a
    ``ValueError`` is raised if none is available.

Thus the 'auto' hasher for metric='cosine' might be GaussianRandomProjectionHash(32), but a user could instead specify an equivalent hasher based on SparseRandomProjection, for instance.
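The proposed `hasher` parameter could behave roughly as in the following sketch. Everything here is an assumption made for illustration: `SignRandomProjectionHash` is a hypothetical stand-in rather than the scikit-learn class, and `resolve_hasher` and `DEFAULT_HASHERS` are invented names. Only string metrics are handled.

```python
import numpy as np


class SignRandomProjectionHash:
    """Hypothetical cosine hasher: one Gaussian hyperplane per output bit."""

    def __init__(self, n_bits=32, random_state=0):
        self.n_bits = n_bits
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.RandomState(self.random_state)
        self.components_ = rng.randn(self.n_bits, np.asarray(X).shape[1])
        return self

    def transform(self, X):
        # Bit i is set when the sample lies on the positive side of plane i.
        bits = (np.asarray(X) @ self.components_.T > 0).astype(np.uint64)
        # Pack the bits of each row into a single unsigned integer.
        weights = np.uint64(1) << np.arange(self.n_bits, dtype=np.uint64)
        return (bits * weights).sum(axis=1).astype(np.uint32)


# Hypothetical registry of default hashers, keyed by metric name.
DEFAULT_HASHERS = {"cosine": SignRandomProjectionHash}


def resolve_hasher(hasher="auto", metric="cosine"):
    """Return the user's hasher, or a default one suited to the metric."""
    if not (isinstance(hasher, str) and hasher == "auto"):
        return hasher  # a user-supplied estimator is used unchanged
    try:
        return DEFAULT_HASHERS[metric]()
    except KeyError:
        raise ValueError(
            "no default hasher is available for metric=%r" % (metric,)
        ) from None
```

With this shape, `resolve_hasher("auto", "cosine")` yields the default cosine hasher, while passing any object with `fit` and `transform` methods bypasses the lookup entirely.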

The following extensions may be best implemented as separate PRs:

  • For the moment we should retain the requirement that the hash be represented as a 32-bit integer.
  • We should also implement the p-stable distributions hash (or a better alternative, if one exists) to support Minkowski metrics, and make this the default hasher before the next scikit-learn release. See "Support approximating Euclidean/Manhattan metrics in LSHForest" (#3990).
  • Very low priority: in some cases, calculating the exact distance in the original feature space may be a waste of time. An approximate distance can instead be evaluated from the Hamming distance between the query's hash and the returned hashes (assuming all hash functions are independent, which they currently are). One might thus set metric='approximate' and rely on the hasher to calculate the returned approximate distances. In that case, _fit_X would not need to be stored.
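The last point can be illustrated with a small sketch. It assumes 32-bit hashes produced by independent sign random projections, for which the fraction of differing bits estimates the angle between two vectors divided by π. The function names are hypothetical, and a hasher for a different metric would need a different estimator.

```python
import numpy as np


def hamming_distance(codes, query_code):
    """Count differing bits between each 32-bit code and the query code."""
    xor = np.bitwise_xor(np.asarray(codes, dtype=np.uint32),
                         np.uint32(query_code))
    # unpackbits works on uint8, so view each 32-bit value as 4 bytes.
    as_bytes = xor.reshape(-1, 1).view(np.uint8)
    return np.unpackbits(as_bytes, axis=1).sum(axis=1)


def approximate_cosine_distance(codes, query_code, n_bits=32):
    """Estimate cosine distance from hashes alone (no access to _fit_X)."""
    # For sign random projections, P(bit differs) = angle / pi.
    theta = np.pi * hamming_distance(codes, query_code) / n_bits
    return 1.0 - np.cos(theta)
```

With only 32 bits the estimate is coarse (the angle is resolved in steps of π/32), which may be one reason this option would suit candidate ranking better than reporting precise distances.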

Labels: Moderate (anything that requires some knowledge of conventions and best practices)
