-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
KNN-Search is not translation invariant #20278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I don't understand what you mean. Which of the following cases happens?
|
I should have been clearer: what I meant is that generally one is interested by identifying correctly nearest neighbors (up to a few rank inversions) rather than having exact values for their distances. Yet, different distances give different nearest neighbors, and this is what is happening in this case, and it is not just a few rank index inversions in the results. For example, to build on top of the given example above, this: # ... previous example lines
# Testing for identical nearest neighbors sets
np.testing.assert_array_equal(np.sort(reference_nns), np.sort(nns)) gives:
As for distances: rtol = 1e-03 if dtype is np.float32 else 1e-05
np.testing.assert_allclose(reference_dist, dist, rtol=rtol)
This is a rather small mismatch for nearest neighbors, but in some configurations (float32, bigger translation, etc.) it gets larger (this is what I observe on #20279). Is this a notable issue? Could / should this be alleviated? |
Although I'm not sure whether this matters much, this doesn't account for the fact that the ties could happen at the edge of the neighborhood, like a tie between the
That sounds reasonably low but it could definitely explain why there are not just rank inversions |
Maybe it's expected to observe neighbors rank inversions caused by near ties and rounding errors with |
also maybe try after spreading the samples a bit? X_train = rng.rand(n, d).astype(dtype) * 100
X_test = rng.rand(n, d).astype(dtype) * 100 |
See discussion starting from: scikit-learn#20278 (comment)
With both proposed changes (including a greater spreading, see eb9d406) a few tests are failing, but it's negligible. pytest trace on eb9d406 → pytest -k "test_translation_invariance" sklearn/neighbors/tests/test_neighbors.py
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jjerphan/dev/skd, configfile: setup.cfg
plugins: metadata-1.11.0, html-3.1.1
collected 167 items / 95 deselected / 72 selected
sklearn/neighbors/tests/test_neighbors.py ..F.........FF......F.....F...FFF..........................F............ [100%]
============================================================================================ FAILURES =============================================================================================
_____________________________________________________________________ test_translation_invariance[float32-10-euclidean-10000] _____________________________________________________________________
translation = 10000, metric = 'euclidean', n_neighbors = 10, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 2 / 100000 (0.002%)
E Max absolute difference: 2322
E Max relative difference: 0.24107143
E x: array([[1629, 3616, 5074, ..., 7344, 8215, 9920],
E [ 457, 625, 2250, ..., 8429, 8598, 8694],
E [ 617, 5024, 5490, ..., 8355, 8996, 9721],...
E y: array([[1629, 3616, 5074, ..., 7344, 8215, 9920],
E [ 457, 625, 2250, ..., 8429, 8598, 8694],
E [ 617, 5024, 5490, ..., 8355, 8996, 9721],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
______________________________________________________________________ test_translation_invariance[float32-10-chebyshev-100] ______________________________________________________________________
translation = 100, metric = 'chebyshev', n_neighbors = 10, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 8 / 100000 (0.008%)
E Max absolute difference: 2526
E Max relative difference: 0.55176933
E x: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
E y: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-10-chebyshev-1000] ______________________________________________________________________
translation = 1000, metric = 'chebyshev', n_neighbors = 10, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 8 / 100000 (0.008%)
E Max absolute difference: 2526
E Max relative difference: 0.55176933
E x: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
E y: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-euclidean-10000] _____________________________________________________________________
translation = 10000, metric = 'euclidean', n_neighbors = 100, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 19 / 1000000 (0.0019%)
E Max absolute difference: 193
E Max relative difference: 0.0475823
E x: array([[ 308, 384, 482, ..., 9790, 9920, 9929],
E [ 42, 83, 114, ..., 9672, 9676, 9812],
E [ 19, 76, 93, ..., 9650, 9721, 9735],...
E y: array([[ 308, 384, 482, ..., 9790, 9920, 9929],
E [ 42, 83, 114, ..., 9672, 9676, 9812],
E [ 19, 76, 93, ..., 9650, 9721, 9735],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-manhattan-10000] _____________________________________________________________________
translation = 10000, metric = 'manhattan', n_neighbors = 100, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 173 / 1000000 (0.0173%)
E Max absolute difference: 506
E Max relative difference: 0.31873727
E x: array([[ 217, 326, 384, ..., 9881, 9920, 9929],
E [ 42, 164, 197, ..., 9672, 9738, 9865],
E [ 76, 93, 116, ..., 9735, 9852, 9900],...
E y: array([[ 217, 326, 384, ..., 9881, 9920, 9929],
E [ 42, 164, 197, ..., 9672, 9738, 9865],
E [ 76, 93, 116, ..., 9735, 9852, 9900],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-100] ______________________________________________________________________
translation = 100, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 10 / 1000000 (0.001%)
E Max absolute difference: 196
E Max relative difference: 0.03285283
E x: array([[ 69, 293, 323, ..., 9591, 9634, 9910],
E [ 83, 251, 310, ..., 9676, 9757, 9812],
E [ 18, 77, 143, ..., 9652, 9800, 9972],...
E y: array([[ 69, 293, 323, ..., 9591, 9634, 9910],
E [ 83, 251, 310, ..., 9676, 9757, 9812],
E [ 18, 77, 143, ..., 9652, 9800, 9972],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-1000] _____________________________________________________________________
translation = 1000, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 27 / 1000000 (0.0027%)
E Max absolute difference: 516
E Max relative difference: 0.15542169
E x: array([[ 69, 293, 323, ..., 9591, 9634, 9910],
E [ 83, 251, 310, ..., 9676, 9757, 9812],
E [ 18, 77, 143, ..., 9652, 9800, 9972],...
E y: array([[ 69, 293, 323, ..., 9591, 9634, 9910],
E [ 83, 251, 310, ..., 9676, 9757, 9812],
E [ 18, 77, 143, ..., 9652, 9800, 9972],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-10000] _____________________________________________________________________
translation = 10000, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 287 / 1000000 (0.0287%)
E Max absolute difference: 516
E Max relative difference: 0.15542169
E x: array([[ 69, 293, 323, ..., 9591, 9634, 9910],
E [ 83, 251, 310, ..., 9676, 9757, 9812],
E [ 18, 77, 143, ..., 9652, 9800, 9972],...
E y: array([[ 69, 293, 323, ..., 9591, 9634, 9910],
E [ 83, 251, 310, ..., 9676, 9757, 9812],
E [ 18, 77, 143, ..., 9652, 9800, 9972],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
___________________________________________________________________ test_translation_invariance[float64-100-euclidean-10000000] ___________________________________________________________________
translation = 10000000, metric = 'euclidean', n_neighbors = 100, dtype = <class 'numpy.float64'>
@pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
@pytest.mark.parametrize("n_neighbors", [10, 100])
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_translation_invariance(
translation,
metric,
n_neighbors,
dtype,
):
""" K-NN search must be translation-invariant. """
n = 10_000
d = 50
rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype) * 1000
X_test = rng.rand(n, d).astype(dtype) * 1000
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
n_neighbors=n_neighbors,
return_distance=True)
neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
algorithm="brute",
metric=metric).fit(X_train +
translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
n_neighbors=n_neighbors,
return_distance=True)
> np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
np.sort(nns, axis=1))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 145 / 1000000 (0.0145%)
E Max absolute difference: 494
E Max relative difference: 0.12893642
E x: array([[ 308, 384, 482, ..., 9790, 9920, 9929],
E [ 42, 83, 114, ..., 9672, 9676, 9812],
E [ 19, 76, 93, ..., 9650, 9721, 9735],...
E y: array([[ 308, 384, 482, ..., 9790, 9920, 9929],
E [ 42, 83, 114, ..., 9672, 9676, 9812],
E [ 19, 76, 93, ..., 9650, 9721, 9735],...
sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError |
I propose to:
|
Maybe it would be easier to write tests that check invariance to translation for the WDYT? |
Yes, this would also scope tests to a lower level, testing not just for neighbors queries but more interfaces and use-cases built on top. |
Closing as made irrelevant by #20254. |
Describe the bug
K-NN search via
neighbors.NearestNeighbors
is not translation invariant currently.Steps/Code to Reproduce
Expected Results
No exception should be thrown.
Actual Results
Distances aren't equal up to a small tolerance as well but I guess this is not that/as important.
More test cases can be found on #20279.
Comment
Working on #20254 made me test the proposed implementation for translation invariance and check if it thus was the case of scikit-learn's.
I don't know if this would be easy to alleviate: are we hitting the representation capabilities of floats?
Versions
The text was updated successfully, but these errors were encountered: