Skip to content

KNN-Search is not translation invariant #20278

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jjerphan opened this issue Jun 16, 2021 · 11 comments
Closed

KNN-Search is not translation invariant #20278

jjerphan opened this issue Jun 16, 2021 · 11 comments

Comments

@jjerphan
Copy link
Member

jjerphan commented Jun 16, 2021

Describe the bug

K-NN search via neighbors.NearestNeighbors is not translation invariant currently.

Steps/Code to Reproduce

import numpy as np
from sklearn import neighbors

n = 10_000
d = 50
n_neighbors = 100
translation = 100_000
metric = 'euclidean'
dtype = np.float64

rng = np.random.RandomState(1)
X_train = rng.rand(n, d).astype(dtype)
X_test = rng.rand(n, d).astype(dtype)

neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                   algorithm="brute",
                                   metric=metric).fit(X_train)
reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                 n_neighbors=n_neighbors,
                                                 return_distance=True)

neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                   algorithm="brute",
                                   metric=metric).fit(X_train +
                                                      translation)
dist, nns = neigh.kneighbors(X=X_test + translation,
                             n_neighbors=n_neighbors,
                             return_distance=True)

np.testing.assert_array_equal(reference_nns, nns)

Expected Results

No exception should be thrown.

Actual Results

AssertionError: 
Arrays are not equal

Mismatched elements: 27458 / 1000000 (2.75%)
Max absolute difference: 9926
Max relative difference: 7249.
 x: array([[5326, 6273, 9920, ..., 2516, 6960, 4326],
       [3946, 8598,  457, ..., 6428,  551, 9508],
       [8355, 5921,  617, ..., 4665, 5981, 4848],...
 y: array([[5326, 6273, 9920, ..., 2516, 6960, 4326],
       [3946, 8598,  457, ..., 6428,  551, 9508],
       [8355, 5921,  617, ..., 4665, 5981, 4848],...

Distances aren't equal up to a small tolerance as well but I guess this is not that/as important.
More test cases can be found on #20279.

Comment

Working on #20254 made me test the proposed implementation for translation invariance and check if it thus was the case of scikit-learn's.

I don't know if this would be easy to alleviate: are we hitting the representation capabilities of floats?

Versions

from sklearn import show_versions
show_versions()
System:
    python: 3.9.5 (default, May 18 2021, 19:34:48)  [GCC 7.3.0]
executable: /home/jjerphan/.local/share/miniconda3/envs/sk/bin/python
   machine: Linux-5.12.9-300.fc34.x86_64-x86_64-with-glibc2.33

Python dependencies:
          pip: 21.1.1
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.dev0
        numpy: 1.20.3
        scipy: 1.6.3
       Cython: 0.29.23
       pandas: 1.2.4
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2021

Distances aren't equal up to a small tolerance as well but I guess this is not that/as important.

I don't understand what you mean. Which of the following cases happens?

  • change of distances of nearest neighbors are all within expected rounding errors, in which case it's normal to observe a few rank index inversions in the nn results
  • change of distances of nearest neighbors are larger than the expected rounding errors, in which case we have a real problem that cannot be explained by rounding errors.

@jjerphan
Copy link
Member Author

I should have been clearer: what I meant is that generally one is interested by identifying correctly nearest neighbors (up to a few rank inversions) rather than having exact values for their distances.

Yet, different distances give different nearest neighbors, and this is what is happening in this case, and it is not just a few rank index inversions in the results. For example, to build on top of the given example above, this:

# ... previous example lines

# Testing for identical nearest neighbors sets
np.testing.assert_array_equal(np.sort(reference_nns), np.sort(nns))                                                                                                                        

gives:

AssertionError: 
Arrays are not equal

Mismatched elements: 8378 / 1000000 (0.838%)
Max absolute difference: 790
Max relative difference: 1.9037037
 x: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
       [  42,   83,  114, ..., 9672, 9676, 9812],
       [  19,   76,   93, ..., 9650, 9721, 9735],...
 y: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
       [  42,   83,  114, ..., 9672, 9676, 9812],
       [  19,   76,   93, ..., 9650, 9721, 9735],...

As for distances:

rtol = 1e-03 if dtype is np.float32 else 1e-05
np.testing.assert_allclose(reference_dist, dist, rtol=rtol)
AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 606089 / 1000000 (60.6%)
Max absolute difference: 0.00021526
Max relative difference: 0.0001057
 x: array([[1.997964, 2.05102 , 2.107675, ..., 2.353607, 2.355517, 2.356003],
       [1.933536, 1.992537, 1.993712, ..., 2.280523, 2.282047, 2.283121],
       [1.859538, 1.896128, 1.948302, ..., 2.243647, 2.245024, 2.246497],...
 y: array([[1.997893, 2.05103 , 2.107725, ..., 2.353625, 2.355517, 2.356036],
       [1.93359 , 1.99254 , 1.993719, ..., 2.280474, 2.282039, 2.283135],
       [1.859539, 1.896137, 1.948291, ..., 2.243657, 2.24503 , 2.246579],...

This is a rather small mismatch for nearest neighbors, but in some configurations (float32, bigger translation, etc.) it gets larger (this is what I observe on #20279).

Is this a notable issue? Could / should this be alleviated?

@NicolasHug
Copy link
Member

np.testing.assert_array_equal(np.sort(reference_nns), np.sort(nns))

Although I'm not sure whether this matters much, this doesn't account for the fact that the ties could happen at the edge of the neighborhood, like a tie between the n_neighbors'th neighbor and the n_neighbors + 1th one.

Max absolute difference: 0.00021526

That sounds reasonably low but it could definitely explain why there are not just rank inversions

@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2021

Maybe it's expected to observe neighbors rank inversions caused by near ties and rounding errors with n_neighbors=100. Would this test pass with n_neighbors=10?

@NicolasHug
Copy link
Member

also maybe try after spreading the samples a bit?

X_train = rng.rand(n, d).astype(dtype) * 100
X_test = rng.rand(n, d).astype(dtype) * 100

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Jun 16, 2021
@jjerphan
Copy link
Member Author

With both proposed changes (including a greater spreading, see eb9d406) a few tests are failing, but it's negligible.

pytest trace on eb9d406
pytest -k "test_translation_invariance"  sklearn/neighbors/tests/test_neighbors.py
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jjerphan/dev/skd, configfile: setup.cfg
plugins: metadata-1.11.0, html-3.1.1
collected 167 items / 95 deselected / 72 selected

sklearn/neighbors/tests/test_neighbors.py ..F.........FF......F.....F...FFF..........................F............                                                                          [100%]

============================================================================================ FAILURES =============================================================================================
_____________________________________________________________________ test_translation_invariance[float32-10-euclidean-10000] _____________________________________________________________________

translation = 10000, metric = 'euclidean', n_neighbors = 10, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 2 / 100000 (0.002%)
E       Max absolute difference: 2322
E       Max relative difference: 0.24107143
E        x: array([[1629, 3616, 5074, ..., 7344, 8215, 9920],
E              [ 457,  625, 2250, ..., 8429, 8598, 8694],
E              [ 617, 5024, 5490, ..., 8355, 8996, 9721],...
E        y: array([[1629, 3616, 5074, ..., 7344, 8215, 9920],
E              [ 457,  625, 2250, ..., 8429, 8598, 8694],
E              [ 617, 5024, 5490, ..., 8355, 8996, 9721],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
______________________________________________________________________ test_translation_invariance[float32-10-chebyshev-100] ______________________________________________________________________

translation = 100, metric = 'chebyshev', n_neighbors = 10, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 8 / 100000 (0.008%)
E       Max absolute difference: 2526
E       Max relative difference: 0.55176933
E        x: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
E        y: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-10-chebyshev-1000] ______________________________________________________________________

translation = 1000, metric = 'chebyshev', n_neighbors = 10, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 8 / 100000 (0.008%)
E       Max absolute difference: 2526
E       Max relative difference: 0.55176933
E        x: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...
E        y: array([[1335, 2331, 2336, ..., 7116, 8629, 8711],
E              [ 941, 1046, 2426, ..., 8392, 8694, 9812],
E              [ 351, 2022, 2163, ..., 7486, 7620, 8881],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-euclidean-10000] _____________________________________________________________________

translation = 10000, metric = 'euclidean', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 19 / 1000000 (0.0019%)
E       Max absolute difference: 193
E       Max relative difference: 0.0475823
E        x: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...
E        y: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-manhattan-10000] _____________________________________________________________________

translation = 10000, metric = 'manhattan', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 173 / 1000000 (0.0173%)
E       Max absolute difference: 506
E       Max relative difference: 0.31873727
E        x: array([[ 217,  326,  384, ..., 9881, 9920, 9929],
E              [  42,  164,  197, ..., 9672, 9738, 9865],
E              [  76,   93,  116, ..., 9735, 9852, 9900],...
E        y: array([[ 217,  326,  384, ..., 9881, 9920, 9929],
E              [  42,  164,  197, ..., 9672, 9738, 9865],
E              [  76,   93,  116, ..., 9735, 9852, 9900],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-100] ______________________________________________________________________

translation = 100, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 10 / 1000000 (0.001%)
E       Max absolute difference: 196
E       Max relative difference: 0.03285283
E        x: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...
E        y: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
_____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-1000] _____________________________________________________________________

translation = 1000, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 27 / 1000000 (0.0027%)
E       Max absolute difference: 516
E       Max relative difference: 0.15542169
E        x: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...
E        y: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
____________________________________________________________________ test_translation_invariance[float32-100-chebyshev-10000] _____________________________________________________________________

translation = 10000, metric = 'chebyshev', n_neighbors = 100, dtype = <class 'numpy.float32'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 287 / 1000000 (0.0287%)
E       Max absolute difference: 516
E       Max relative difference: 0.15542169
E        x: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...
E        y: array([[  69,  293,  323, ..., 9591, 9634, 9910],
E              [  83,  251,  310, ..., 9676, 9757, 9812],
E              [  18,   77,  143, ..., 9652, 9800, 9972],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError
___________________________________________________________________ test_translation_invariance[float64-100-euclidean-10000000] ___________________________________________________________________

translation = 10000000, metric = 'euclidean', n_neighbors = 100, dtype = <class 'numpy.float64'>

    @pytest.mark.parametrize("translation", [10 ** i for i in [2, 3, 4, 5, 6, 7]])
    @pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
    @pytest.mark.parametrize("n_neighbors", [10, 100])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    def test_translation_invariance(
        translation,
        metric,
        n_neighbors,
        dtype,
    ):
        """ K-NN search must be translation-invariant. """
        n = 10_000
        d = 50

        rng = np.random.RandomState(1)
        X_train = rng.rand(n, d).astype(dtype) * 1000
        X_test = rng.rand(n, d).astype(dtype) * 1000

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train)
        reference_dist, reference_nns = neigh.kneighbors(X=X_test,
                                                         n_neighbors=n_neighbors,
                                                         return_distance=True)

        neigh = neighbors.NearestNeighbors(n_neighbors=n_neighbors,
                                           algorithm="brute",
                                           metric=metric).fit(X_train +
                                                              translation)
        dist, nns = neigh.kneighbors(X=X_test + translation,
                                     n_neighbors=n_neighbors,
                                     return_distance=True)

>       np.testing.assert_array_equal(np.sort(reference_nns, axis=1),
                                      np.sort(nns, axis=1))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 145 / 1000000 (0.0145%)
E       Max absolute difference: 494
E       Max relative difference: 0.12893642
E        x: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...
E        y: array([[ 308,  384,  482, ..., 9790, 9920, 9929],
E              [  42,   83,  114, ..., 9672, 9676, 9812],
E              [  19,   76,   93, ..., 9650, 9721, 9735],...

sklearn/neighbors/tests/test_neighbors.py:1805: AssertionError

@jjerphan
Copy link
Member Author

jjerphan commented Jun 16, 2021

I propose to:

@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2021

Maybe it would be easier to write tests that check invariance to translation for the pairwise_distances method instead of neighbors queries. This would allow us to write test that checks that big distances are invariant but ignore differences for small distances that are typically caused by rounding errors.

WDYT?

@jjerphan
Copy link
Member Author

Yes, this would also scope tests to a lower level, testing not just for neighbors queries but more interfaces and use-cases built on top.

@jjerphan
Copy link
Member Author

I followed up on tests in #20279 with 7d08967.

@jjerphan
Copy link
Member Author

Closing as made irrelevant by #20254.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants