
[MRG+1] FIX Hash collisions in the FeatureHasher #7565

Merged · 17 commits · Jun 8, 2017
9 changes: 9 additions & 0 deletions doc/whats_new.rst
@@ -284,9 +284,18 @@ Bug fixes
left `coef_` as a list, rather than an ndarray.
:issue:`8160` by :user:`CJ Carey <perimosocordiae>`.


- Fix a bug where :class:`sklearn.feature_extraction.FeatureHasher`
unconditionally applied a sparse random projection to the hashed
features, preventing the use of
:class:`sklearn.feature_extraction.text.HashingVectorizer` in a
pipeline with :class:`sklearn.feature_extraction.text.TfidfTransformer`
(a usage sketch follows this hunk).
:issue:`7513` by :user:`Roman Yurchak <rth>`.

- Fix a bug in cases where `numpy.cumsum` may be numerically unstable,
raising an exception if instability is identified. :issue:`7376` and
:issue:`7331` by `Joel Nothman`_ and :user:`yangarbiter`.

- Fix a bug where :meth:`sklearn.base.BaseEstimator.__getstate__`
obstructed pickling customizations of child-classes, when used in a
multiple inheritance context.
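The pipeline unblocked by this fix can be sketched as follows; this is a minimal example assuming scikit-learn 0.19 with the `alternate_sign` flag from this PR (the corpus and `n_features` value are illustrative):

```python
# Minimal sketch of the HashingVectorizer -> TfidfTransformer pipeline
# this fix enables; assumes scikit-learn >= 0.19 (this PR merged).
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat", "the dog sat on the log"]

pipe = make_pipeline(
    # alternate_sign=False keeps hashed counts non-negative, which
    # TfidfTransformer requires; norm=None leaves normalization to tf-idf
    HashingVectorizer(n_features=2 ** 8, alternate_sign=False, norm=None),
    TfidfTransformer(),
)
X = pipe.fit_transform(docs)
print(X.shape)  # (2, 256)
```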
6 changes: 4 additions & 2 deletions sklearn/feature_extraction/_hashing.pyx
@@ -15,7 +15,7 @@ np.import_array()

@cython.boundscheck(False)
@cython.cdivision(True)
def transform(raw_X, Py_ssize_t n_features, dtype):
def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign=1):
"""Guts of FeatureHasher.transform.

Returns
@@ -63,7 +63,9 @@ def transform(raw_X, Py_ssize_t n_features, dtype):

array.resize_smart(indices, len(indices) + 1)
indices[len(indices) - 1] = abs(h) % n_features
value *= (h >= 0) * 2 - 1
# improve inner product preservation in the hashed space
if alternate_sign:
value *= (h >= 0) * 2 - 1
values[size] = value
size += 1

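For readers unfamiliar with Cython, here is a pure-Python sketch of the kernel above; `hash32` is a hypothetical stand-in for the signed 32-bit hash the real extension uses:

```python
# Pure-Python sketch of the transform kernel above (not the actual
# implementation); hash32 stands in for the real signed 32-bit hash.
def hashed_row(pairs, n_features, alternate_sign=True, hash32=hash):
    row = [0.0] * n_features
    for feature, value in pairs:
        h = hash32(feature)
        if alternate_sign:
            # flip the sign from the hash so colliding features cancel
            # in expectation, approximately preserving inner products
            value *= (h >= 0) * 2 - 1
        row[abs(h) % n_features] += value
    return row

# e.g. hashed_row([("foo", 1.0), ("bar", 2.0)], n_features=8)
```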
26 changes: 20 additions & 6 deletions sklearn/feature_extraction/hashing.py
@@ -2,6 +2,7 @@
# License: BSD 3 clause

import numbers
import warnings

import numpy as np
import scipy.sparse as sp
@@ -53,11 +54,17 @@ class FeatureHasher(BaseEstimator, TransformerMixin):
The feature_name is hashed to find the appropriate column for the
feature. The value's sign might be flipped in the output (but see
non_negative, below).
alternate_sign : boolean, optional, default True
When True, an alternating sign is added to the features so as to
approximately conserve the inner product in the hashed space, even for
small n_features. This approach is similar to sparse random projection.
non_negative : boolean, optional, default False
Whether output matrices should contain non-negative values only;
effectively calls abs on the matrix prior to returning it.
When True, output values can be interpreted as frequencies.
When False, output values will have expected value zero.
When True, an absolute value is applied to the features matrix prior to
returning it. When used in conjunction with alternate_sign=True, this
significantly weakens the preservation of inner products in the hashed
space.
.. deprecated:: 0.19
This option will be removed in 0.21.


Examples
--------
@@ -77,12 +84,17 @@
"""

def __init__(self, n_features=(2 ** 20), input_type="dict",
dtype=np.float64, non_negative=False):
dtype=np.float64, alternate_sign=True, non_negative=False):
self._validate_params(n_features, input_type)
if non_negative:
warnings.warn("the option non_negative=True has been deprecated"
" in 0.19 and will be removed"
" in version 0.21.", DeprecationWarning)

self.dtype = dtype
self.input_type = input_type
self.n_features = n_features
self.alternate_sign = alternate_sign
self.non_negative = non_negative

@staticmethod
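A quick sketch of the deprecation path added to `__init__` above (the warning check is illustrative, not part of this PR's test suite):

```python
# Sketch: constructing a FeatureHasher with the deprecated flag should
# emit the DeprecationWarning added in __init__ above.
import warnings
from sklearn.feature_extraction import FeatureHasher

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    FeatureHasher(non_negative=True)

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```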
@@ -139,7 +151,8 @@ def transform(self, raw_X, y=None):
elif self.input_type == "string":
raw_X = (((f, 1) for f in x) for x in raw_X)
indices, indptr, values = \
_hashing.transform(raw_X, self.n_features, self.dtype)
_hashing.transform(raw_X, self.n_features, self.dtype,
self.alternate_sign)
n_samples = indptr.shape[0] - 1

if n_samples == 0:
@@ -148,6 +161,7 @@
X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
shape=(n_samples, self.n_features))
X.sum_duplicates() # also sorts the indices

if self.non_negative:
np.abs(X.data, X.data)
return X
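The `input_type` branches dispatched in `transform` above can be exercised like so (a sketch; the feature names and counts are illustrative):

```python
# Sketch of the input_type modes handled in transform() above.
from sklearn.feature_extraction import FeatureHasher

# dict mode (default): mapping of feature name -> value
X1 = FeatureHasher(n_features=16).transform([{"cat": 2, "dog": 1}])

# pair mode: iterables of (feature, value) pairs
X2 = FeatureHasher(n_features=16, input_type="pair").transform(
    [[("cat", 2), ("dog", 1)]])

# string mode: each token implicitly carries value 1, matching the
# (f, 1) generator expression in the code above
X3 = FeatureHasher(n_features=16, input_type="string").transform(
    [["cat", "cat", "dog"]])

assert (X1 != X2).nnz == 0  # dict and pair encodings agree
```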
46 changes: 45 additions & 1 deletion sklearn/feature_extraction/tests/test_feature_hasher.py
@@ -4,7 +4,8 @@
from numpy.testing import assert_array_equal

from sklearn.feature_extraction import FeatureHasher
from sklearn.utils.testing import assert_raises, assert_true, assert_equal
from sklearn.utils.testing import (assert_raises, assert_true, assert_equal,
ignore_warnings)


def test_feature_hasher_dicts():
@@ -106,3 +107,46 @@ def test_hasher_zeros():
# Assert that no zeros are materialized in the output.
X = FeatureHasher().transform([{'foo': 0}])
assert_equal(X.data.shape, (0,))


@ignore_warnings(category=DeprecationWarning)
def test_hasher_alternate_sign():
# the last two tokens produce a hash collision that sums to 0
X = [["foo", "bar", "baz", "investigation need", "records"]]

Xt = FeatureHasher(alternate_sign=True, non_negative=False,
input_type='string').fit_transform(X)
assert_true(Xt.data.min() < 0 and Xt.data.max() > 0)
# check that we have a collision that produces a 0 count
assert_true(len(Xt.data) < len(X[0]))
assert_true((Xt.data == 0.).any())

Xt = FeatureHasher(alternate_sign=True, non_negative=True,
input_type='string').fit_transform(X)

Member: This will raise a deprecation warning, right? Can you wrap it
using ignore_warnings(DeprecationWarning), maybe?
assert_true((Xt.data >= 0).all()) # all counts are positive
assert_true((Xt.data == 0.).any()) # we still have a collision
Xt = FeatureHasher(alternate_sign=False, non_negative=True,
input_type='string').fit_transform(X)
assert_true((Xt.data > 0).all()) # strictly positive counts
Xt_2 = FeatureHasher(alternate_sign=False, non_negative=False,
input_type='string').fit_transform(X)
# With initially positive features, the non_negative option should
# have no impact when alternate_sign=False
assert_array_equal(Xt.data, Xt_2.data)


@ignore_warnings(category=DeprecationWarning)
def test_hasher_negative():
X = [{"foo": 2, "bar": -4, "baz": -1}.items()]
Xt = FeatureHasher(alternate_sign=False, non_negative=False,
input_type="pair").fit_transform(X)
assert_true(Xt.data.min() < 0 and Xt.data.max() > 0)
Xt = FeatureHasher(alternate_sign=False, non_negative=True,
input_type="pair").fit_transform(X)
assert_true(Xt.data.min() > 0)
Xt = FeatureHasher(alternate_sign=True, non_negative=False,
input_type="pair").fit_transform(X)
assert_true(Xt.data.min() < 0 and Xt.data.max() > 0)
Xt = FeatureHasher(alternate_sign=True, non_negative=True,
input_type="pair").fit_transform(X)
assert_true(Xt.data.min() > 0)
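Token pairs like "investigation need" / "records" used in the tests above can be found by brute force; a sketch using only the public API and the explicit zeros that `sum_duplicates` leaves behind:

```python
# Brute-force sketch: find two tokens whose hashes land in the same
# bucket with opposite signs, cancelling to an explicit zero.
from itertools import combinations
from sklearn.feature_extraction import FeatureHasher

def find_cancelling_pair(tokens, n_features=2 ** 20):
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    for a, b in combinations(tokens, 2):
        X = hasher.transform([[a, b]])
        # sum_duplicates leaves a stored 0 when +1 and -1 collide
        if (X.data == 0.).any():
            return a, b
    return None
```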
25 changes: 18 additions & 7 deletions sklearn/feature_extraction/text.py
@@ -404,11 +404,20 @@ class HashingVectorizer(BaseEstimator, VectorizerMixin):
dtype : type, optional
Type of the matrix returned by fit_transform() or transform().

non_negative : boolean, default=False
Whether output matrices should contain non-negative values only;
effectively calls abs on the matrix prior to returning it.
When True, output values can be interpreted as frequencies.
When False, output values will have expected value zero.
alternate_sign : boolean, optional, default True
When True, an alternating sign is added to the features so as to
approximately conserve the inner product in the hashed space, even for
small n_features. This approach is similar to sparse random projection.
Member: you need a new feature (or feature added?) tag?

Member: (has been addressed below)

.. versionadded:: 0.19

non_negative : boolean, optional, default False
When True, an absolute value is applied to the features matrix prior to
returning it. When used in conjunction with alternate_sign=True, this
significantly weakens the preservation of inner products in the hashed
space.

.. deprecated:: 0.19
This option will be removed in 0.21.

See also
--------
@@ -420,8 +429,8 @@ def __init__(self, input='content', encoding='utf-8',
lowercase=True, preprocessor=None, tokenizer=None,
stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
ngram_range=(1, 1), analyzer='word', n_features=(2 ** 20),
binary=False, norm='l2', non_negative=False,
dtype=np.float64):
binary=False, norm='l2', alternate_sign=True,
non_negative=False, dtype=np.float64):
self.input = input
self.encoding = encoding
self.decode_error = decode_error
@@ -436,6 +445,7 @@ def __init__(self, input='content', encoding='utf-8',
self.ngram_range = ngram_range
self.binary = binary
self.norm = norm
self.alternate_sign = alternate_sign
self.non_negative = non_negative
self.dtype = dtype

@@ -496,6 +506,7 @@ def transform(self, X, y=None):
def _get_hasher(self):
return FeatureHasher(n_features=self.n_features,
input_type='string', dtype=self.dtype,
alternate_sign=self.alternate_sign,
non_negative=self.non_negative)


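Finally, a short sketch of what the new flag changes at the vectorizer level, since `_get_hasher` above now forwards `alternate_sign` (the input text is illustrative):

```python
# Sketch: compare default signed output with alternate_sign=False.
from sklearn.feature_extraction.text import HashingVectorizer

text = ["hash the tokens then sum the buckets"]

X_signed = HashingVectorizer(n_features=2 ** 10, norm=None).transform(text)
X_counts = HashingVectorizer(n_features=2 ** 10, norm=None,
                             alternate_sign=False).transform(text)

print(X_signed.data)  # entries may be negative under sign alternation
assert (X_counts.data >= 0).all()  # plain counts, safe for tf-idf
```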