Skip to content

make ProjectionToHashMixin and GaussianRandomProjectionHash private? #8029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amueller opened this issue Dec 9, 2016 · 5 comments
Closed

Comments

@amueller
Copy link
Member

amueller commented Dec 9, 2016

These two classes are not really documented and seem more like implementation details.
Given how the LSHForest is not great, I'm not sure how much work we should invest in the module, though.
@jnothman @ogrisel is there a path forward with LSHForest? (in other words, should we deprecate the whole thing?)

@jnothman
Copy link
Member

I think the module needs some dedicated work, something that I once thought I might put into it, but have not. We have not facilitated its integration as a substitute for NN methods. It also seems that other ANN approaches seem more successful for the general case (especially when you can learn something about the distribution of your data, i.e. assuming samples will be identically distributed). LSH is good when identical distribution is a poor assumption, and has been prominent in fast textual retrieval, using hash functions particularly suited to n-grams. Yes, I think we should get someone dedicated to work on it, or abandon it.

@jnothman
Copy link
Member

As long as the hash is not really configurable, those can certainly be private.

@amueller
Copy link
Member Author

I think banking on you or Olivier to invest a lot of time into this issue is not a good idea. I remember that there was some effort to work in this, but I think that didn't work out so well?

We could do what we did for HMMs, which was basically say "this is abandonware, use at own risk" - Or simply say "this is not fast enough and will be removed".

@vene
Copy link
Member

vene commented Jun 5, 2017

So GaussianRandomProjectionHash can be instantiated and fit with any value for n_components, but transform fails if it's not a multiple of 32. It fails with a variety of exceptions:


In [39]: obj = GaussianRandomProjectionHash(n_components=1)

In [40]: obj.fit(X)
Out[40]: GaussianRandomProjectionHash(n_components=1, random_state=None)

In [41]: obj.transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-f09c70b94a52> in <module>()
----> 1 obj.transform(X)

/home/vlad/conda/lib/python3.5/site-packages/sklearn/neighbors/approximate.py in transform(self, X, y)
     87
     88     def transform(self, X, y=None):
---> 89         return self._to_hash(super(ProjectionToHashMixin, self).transform(X))
     90
     91

/home/vlad/conda/lib/python3.5/site-packages/sklearn/neighbors/approximate.py in _to_hash(projected)
     76     def _to_hash(projected):
     77         if projected.shape[1] % 8 != 0:
---> 78             raise ValueError('Require reduced dimensionality to be a multiple '
     79                              'of 8 for hashing')
     80         # XXX: perhaps non-copying operation better

ValueError: Require reduced dimensionality to be a multiple of 8 for hashing

In [42]: obj = GaussianRandomProjectionHash(n_components=8)

In [43]: obj.fit(X)
Out[43]: GaussianRandomProjectionHash(n_components=8, random_state=None)

In [44]: obj.transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-f09c70b94a52> in <module>()
----> 1 obj.transform(X)

/home/vlad/conda/lib/python3.5/site-packages/sklearn/neighbors/approximate.py in transform(self, X, y)
     87
     88     def transform(self, X, y=None):
---> 89         return self._to_hash(super(ProjectionToHashMixin, self).transform(X))
     90
     91

/home/vlad/conda/lib/python3.5/site-packages/sklearn/neighbors/approximate.py in _to_hash(projected)
     79                              'of 8 for hashing')
     80         # XXX: perhaps non-copying operation better
---> 81         out = np.packbits((projected > 0).astype(int)).view(dtype=HASH_DTYPE)
     82         return out.reshape(projected.shape[0], -1)
     83

ValueError: new type not compatible with array.

In [45]: obj = GaussianRandomProjectionHash(n_components=16)

In [46]: obj.fit(X)
Out[46]: GaussianRandomProjectionHash(n_components=16, random_state=None)

In [47]: obj.transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-f09c70b94a52> in <module>()
----> 1 obj.transform(X)

/home/vlad/conda/lib/python3.5/site-packages/sklearn/neighbors/approximate.py in transform(self, X, y)
     87
     88     def transform(self, X, y=None):
---> 89         return self._to_hash(super(ProjectionToHashMixin, self).transform(X))
     90
     91

/home/vlad/conda/lib/python3.5/site-packages/sklearn/neighbors/approximate.py in _to_hash(projected)
     80         # XXX: perhaps non-copying operation better
     81         out = np.packbits((projected > 0).astype(int)).view(dtype=HASH_DTYPE)
---> 82         return out.reshape(projected.shape[0], -1)
     83
     84     def fit_transform(self, X, y=None):

ValueError: total size of new array must be unchanged

In [48]: obj = GaussianRandomProjectionHash(n_components=32 * 5)

In [49]: obj.fit(X)
Out[49]: GaussianRandomProjectionHash(n_components=160, random_state=None)

In [50]: obj.transform(X)
Out[50]:
array([[1768520822,   26520965, 2691117481,  607667819, 1718707950],
       [1079154558,  150277543,  871633455, 1214824012, 2028314012],
...
       [1149736519,  418897671, 2080675020, 2209257966,  625223291],
       [ 474200689, 2151285614, 1264544891, 3903350547, 2231551203]], dtype=uint32)


@vene
Copy link
Member

vene commented Jun 5, 2017

Presumably n_components should be exposed as the multiplier of 32 used internally, however this leads to obj.components_.shape[0] != obj.n_components, unlike at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants