KNN-Search is not translation invariant #20278

jjerphan · 2021-06-16T09:06:59Z

Describe the bug

K-NN search via neighbors.NearestNeighbors is not translation invariant currently.

Steps/Code to Reproduce

import numpy as np
from sklearn import neighbors

n = 10_000
d = 50
n_neighbors = 100
translation = 100_000
metric = 'euclidean'
dtype = np.float64

rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype)
X_test = rng.rand(n, d).astype(dtype)

neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                   algorithm="brute",
                                   metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                 n_neighbors=n_neighbors,
                                                 return_distance=True)

neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                   algorithm="brute",
                                   metric=metric).fit(X_train +
                                                      translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
                             n_neighbors=n_neighbors,
                             return_distance=True)

np.testing.assert_array_equal(reference_nns, nns)

Expected Results

No exception should be thrown.

Actual Results

AssertionError: 
Arrays are not equal

Mismatched elements: 27458 / 1000000 (2.75%)
Max absolute difference: 9926
Max relative difference: 7249.
 x: array([[5326, 6273, 9920, ..., 2516, 6960, 4326],
       [3946, 8598,  457, ..., 6428,  551, 9508],
       [8355, 5921,  617, ..., 4665, 5981, 4848],...
 y: array([[5326, 6273, 9920, ..., 2516, 6960, 4326],
       [3946, 8598,  457, ..., 6428,  551, 9508],
       [8355, 5921,  617, ..., 4665, 5981, 4848],...

Distances aren't equal up to a small tolerance as well but I guess this is not that/as important.
More test cases can be found on #20279.

Comment

Working on #20254 made me test the proposed implementation for translation invariance and check if it thus was the case of scikit-learn's.

I don't know if this would be easy to alleviate: are we hitting the representation capabilities of floats?

Versions

from sklearn import show_versions
show_versions()

System:
    python: 3.9.5 (default, May 18 2021, 19:34:48)  [GCC 7.3.0]
executable: /home/jjerphan/.local/share/miniconda3/envs/sk/bin/python
   machine: Linux-5.12.9-300.fc34.x86_64-x86_64-with-glibc2.33

Python dependencies:
          pip: 21.1.1
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.dev0
        numpy: 1.20.3
        scipy: 1.6.3
       Cython: 0.29.23
       pandas: 1.2.4
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

The text was updated successfully, but these errors were encountered:

ogrisel · 2021-06-16T10:00:57Z

Distances aren't equal up to a small tolerance as well but I guess this is not that/as important.

I don't understand what you mean. Which of the following cases happens?

change of distances of nearest neighbors are all within expected rounding errors, in which case it's normal to observe a few rank index inversions in the nn results
change of distances of nearest neighbors are larger than the expected rounding errors, in which case we have a real problem that cannot be explained by rounding errors.

jjerphan · 2021-06-16T10:33:59Z

I should have been clearer: what I meant is that generally one is interested by identifying correctly nearest neighbors (up to a few rank inversions) rather than having exact values for their distances.

Yet, different distances give different nearest neighbors, and this is what is happening in this case, and it is not just a few rank index inversions in the results. For example, to build on top of the given example above, this:

# ... previous example lines

# Testing for identical nearest neighbors sets
np.testing.assert_array_equal(np.sort(reference_nns), np.sort(nns))

gives:

AssertionError: 
Arrays are not equal

Mismatched elements: 8378 / 1000000 (0.838%)
Max absolute difference: 790
Max relative difference: 1.9037037
 x: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
       [  42,   83,  114, ..., 9672, 9676, 9812],
       [  19,   76,   93, ..., 9650, 9721, 9735],...
 y: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
       [  42,   83,  114, ..., 9672, 9676, 9812],
       [  19,   76,   93, ..., 9650, 9721, 9735],...

As for distances:

rtol = 1e-03 if dtype is np.float32 else 1e-05
np.testing.assert_allclose(reference_dist, dist, rtol=rtol)

AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 606089 / 1000000 (60.6%)
Max absolute difference: 0.00021526
Max relative difference: 0.0001057
 x: array([[1.997964, 2.05102 , 2.107675, ..., 2.353607, 2.355517, 2.356003],
       [1.933536, 1.992537, 1.993712, ..., 2.280523, 2.282047, 2.283121],
       [1.859538, 1.896128, 1.948302, ..., 2.243647, 2.245024, 2.246497],...
 y: array([[1.997893, 2.05103 , 2.107725, ..., 2.353625, 2.355517, 2.356036],
       [1.93359 , 1.99254 , 1.993719, ..., 2.280474, 2.282039, 2.283135],
       [1.859539, 1.896137, 1.948291, ..., 2.243657, 2.24503 , 2.246579],...

This is a rather small mismatch for nearest neighbors, but in some configurations (float32, bigger translation, etc.) it gets larger (this is what I observe on #20279).

Is this a notable issue? Could / should this be alleviated?

NicolasHug · 2021-06-16T11:17:06Z

np.testing.assert_array_equal(np.sort(reference_nns), np.sort(nns))

Although I'm not sure whether this matters much, this doesn't account for the fact that the ties could happen at the edge of the neighborhood, like a tie between the n_neighbors'th neighbor and the n_neighbors + 1th one.

Max absolute difference: 0.00021526

That sounds reasonably low but it could definitely explain why there are not just rank inversions

ogrisel · 2021-06-16T12:41:13Z

Maybe it's expected to observe neighbors rank inversions caused by near ties and rounding errors with n_neighbors=100. Would this test pass with n_neighbors=10?

NicolasHug · 2021-06-16T12:46:22Z

also maybe try after spreading the samples a bit?

X_train = rng.rand(n, d).astype(dtype) * 100
X_test = rng.rand(n, d).astype(dtype) * 100

See discussion starting from: scikit-learn#20278 (comment)

jjerphan · 2021-06-16T13:47:49Z

With both proposed changes (including a greater spreading, see eb9d406) a few tests are failing, but it's negligible.

pytest trace on eb9d406

 → pytest -k "test_translation_invariance"  sklearn/neighbors/tests/test_neighbors.py
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jjerphan/dev/skd, configfile: setup.cfg
plugins: metadata-1.11.0, html-3.1.1
collected 167 items / 95 deselected / 72 selected

sklearn/neighbors/tests/test_neighbors.py ..F.........FF......F.....F...FFF..........................F............                                                                          [100%]

============================================================================================ FAILURES =============================================================================================
_____________________________________________________________________ test_translation_invariance[float32-10-euclidean-10000] _____________________________________________________________________

translation = 10000, metric = 'euclidean', n_neighbors = 10, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 2 / 100000 (0.002%)
E       Max absolute difference: 2322
E       Max relative difference: 0.24107143
E        x: array([[1629, 3616, 5074, ..., 7344, 8215, 9920],
E              [ 457,  625, 2250, ..., 8429, 8598, 8694],
E              [ 617, 5024, 5490, ..., 8355, 8996, 9721],...
E        y: array([[1629, 3616, 5074, ..., 7344, 8215, 9920],
E              [ 457,  625, 2250, ..., 8429, 8598, 8694],
E              [ 617, 5024, 5490, ..., 8355, 8996, 9721],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
______________________________________________________________________ test_translation_invariance[float32-10-chebyshev-100] ______________________________________________________________________

translation = 100, metric = 'chebyshev', n_neighbors = 10, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 8 / 100000 (0.008%)
E       Max absolute difference: 2526
E       Max relative difference: 0.55176933
E        x: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
E        y: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-10-chebyshev-1000] ______________________________________________________________________

translation = 1000, metric = 'chebyshev', n_neighbors = 10, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 8 / 100000 (0.008%)
E       Max absolute difference: 2526
E       Max relative difference: 0.55176933
E        x: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
E        y: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-euclidean-10000] _____________________________________________________________________

translation = 10000, metric = 'euclidean', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 19 / 1000000 (0.0019%)
E       Max absolute difference: 193
E       Max relative difference: 0.0475823
E        x: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...
E        y: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-manhattan-10000] _____________________________________________________________________

translation = 10000, metric = 'manhattan', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 173 / 1000000 (0.0173%)
E       Max absolute difference: 506
E       Max relative difference: 0.31873727
E        x: array([[ 217,  326,  384, ..., 9881, 9920, 9929],
E              [  42,  164,  197, ..., 9672, 9738, 9865],
E              [  76,   93,  116, ..., 9735, 9852, 9900],...
E        y: array([[ 217,  326,  384, ..., 9881, 9920, 9929],
E              [  42,  164,  197, ..., 9672, 9738, 9865],
E              [  76,   93,  116, ..., 9735, 9852, 9900],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-100] ______________________________________________________________________

translation = 100, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 10 / 1000000 (0.001%)
E       Max absolute difference: 196
E       Max relative difference: 0.03285283
E        x: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...
E        y: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-1000] _____________________________________________________________________

translation = 1000, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 27 / 1000000 (0.0027%)
E       Max absolute difference: 516
E       Max relative difference: 0.15542169
E        x: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...
E        y: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-10000] _____________________________________________________________________

translation = 10000, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 287 / 1000000 (0.0287%)
E       Max absolute difference: 516
E       Max relative difference: 0.15542169
E        x: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...
E        y: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
___________________________________________________________________ test_translation_invariance[float64-100-euclidean-10000000] ___________________________________________________________________

translation = 10000000, metric = 'euclidean', n_neighbors = 100, dtype = <class 'numpy.float64'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 145 / 1000000 (0.0145%)
E       Max absolute difference: 494
E       Max relative difference: 0.12893642
E        x: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...
E        y: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError

jjerphan · 2021-06-16T14:02:33Z

I propose to:

ignore and close this issue and TST Translation invariance for K-NN search #20279
or include TST Translation invariance for K-NN search #20279's tests, asserting almost- rather than strict-equality for nearest neighbors, allowing less than 0.1%-mismatches for instance.

ogrisel · 2021-06-16T14:09:36Z

Maybe it would be easier to write tests that check invariance to translation for the pairwise_distances method instead of neighbors queries. This would allow us to write test that checks that big distances are invariant but ignore differences for small distances that are typically caused by rounding errors.

WDYT?

jjerphan · 2021-06-16T14:14:34Z

Yes, this would also scope tests to a lower level, testing not just for neighbors queries but more interfaces and use-cases built on top.

jjerphan · 2021-06-24T08:51:26Z

I followed up on tests in #20279 with 7d08967.

jjerphan · 2021-10-18T14:18:24Z

Closing as made irrelevant by #20254.

jjerphan added the Bug: triage label Jun 16, 2021

jjerphan mentioned this issue Jun 16, 2021

TST Translation invariance for K-NN search #20279

Closed

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Jun 16, 2021

Improve test based on discussions

0bca493

See discussion starting from: scikit-learn#20278 (comment)

jjerphan closed this as completed Oct 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KNN-Search is not translation invariant #20278

KNN-Search is not translation invariant #20278

jjerphan commented Jun 16, 2021 •

edited

Loading

ogrisel commented Jun 16, 2021

jjerphan commented Jun 16, 2021

NicolasHug commented Jun 16, 2021

ogrisel commented Jun 16, 2021

NicolasHug commented Jun 16, 2021

jjerphan commented Jun 16, 2021

jjerphan commented Jun 16, 2021 •

edited

Loading

ogrisel commented Jun 16, 2021

jjerphan commented Jun 16, 2021

jjerphan commented Jun 24, 2021

jjerphan commented Oct 18, 2021

KNN-Search is not translation invariant #20278

KNN-Search is not translation invariant #20278

Comments

jjerphan commented Jun 16, 2021 • edited Loading

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Comment

Versions

ogrisel commented Jun 16, 2021

jjerphan commented Jun 16, 2021

NicolasHug commented Jun 16, 2021

ogrisel commented Jun 16, 2021

NicolasHug commented Jun 16, 2021

jjerphan commented Jun 16, 2021

jjerphan commented Jun 16, 2021 • edited Loading

ogrisel commented Jun 16, 2021

jjerphan commented Jun 16, 2021

jjerphan commented Jun 24, 2021

jjerphan commented Oct 18, 2021

jjerphan commented Jun 16, 2021 •

edited

Loading

jjerphan commented Jun 16, 2021 •

edited

Loading