
LLE utilizing kNN for sample points #29752


Open · wants to merge 3 commits into main

Conversation

davidcherney (Contributor) commented Aug 30, 2024

Description

This pull request addresses issue #29715 by

  • correcting the confusing error message, and
  • using the (pre-existing) functionality of KNeighborsMixin.kneighbors to calculate the kNN of the sample points.

Definitions

I refer to

  • the set $X = \{(i, x_i) : i = 1, \ldots, N\}$ of $N$ data points $x_i$ and their indices $i$, used to fit a NearestNeighbors object via nn = NearestNeighbors().fit(X), as the sample, and an element of that set as a sample point.
  • a point $x$ for which we calculate the nearest neighbors via nn.kneighbors(X=[x]) as a query point.

The distinction is important because

  • in the calculation of kNN of a sample point $(i,x_i)$, the sample point itself is excluded as a possible neighbor.
  • in the calculation of kNN of a query point $x$, none of the sample points is excluded as a possible neighbor.

From a math perspective there are two kNN functions: one for sample points, and one for query points. From a code perspective

  • kNN for all sample points is performed via nn.kneighbors(X=None)
  • kNN for a query point $x$ is performed via nn.kneighbors(X=[x])
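A minimal sketch of the two calls, on a made-up one-dimensional sample:

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X)

# kNN of the sample points: with X=None, each point is excluded as its
# own neighbor, so the neighbors of x_0 are x_1 and x_2.
print(nn.kneighbors(X=None, return_distance=False))

# kNN of a query point: nothing is excluded, so the nearest neighbor of
# the query [0.] is the sample point x_0 itself, at distance 0.
print(nn.kneighbors(X=[[0.0]], return_distance=False))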

Problem

The original code

  • raised a ValueError when n_neighbors was greater than or equal to n_samples.
    • This is not the correct condition for the kNN of a sample point because the point itself must be excluded as a possible neighbor, giving a max value for n_neighbors of n_samples-1.
    • This is the correct condition for the kNN of a query point because, with no train set points excluded, the max value for n_neighbors is n_samples.

The kind of kNN being computed for LLE is the former: kNN for sample points.

However, the message given with the ValueError was written for the latter case: kNN for query points.

Specifically, the code

if n_neighbors >= N:
    raise ValueError(
        "Expected n_neighbors <= n_samples, but n_samples = %d, n_neighbors = %d"
        % (N, n_neighbors)
    )

raised an error when n_neighbors=5 and n_samples=5, and gave the confusing message

"Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 5"

Further, the discussion in PR #29716 showed that the surrounding code was confusing to developers because of the n_neighbors + 1 in calls of the form NearestNeighbors(n_neighbors=n_neighbors + 1, n_jobs=n_jobs).

These calls tried to implement the kNN function for sample points (which should be done with nn.kneighbors(X=None)) by using the kNN function for query points (nn.kneighbors(X=[x])) and then removing the first neighbor.

This treatment neglected the edge cases (such as duplicate sample points) handled by the kNN function for sample points, nn.kneighbors(X=None).

Solution

The code nn.kneighbors(X=None) is now used to calculate kNNs of sample points.

The error message now describes the condition for that function to work.
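In outline (a sketch of the pattern only, not the exact diff applied here):

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.random.RandomState(0).rand(10, 3)
n_neighbors = 4

# Old pattern: ask for one extra neighbor, then drop the first column,
# assuming each point is always its own first neighbor (false for
# duplicate points).
nn_old = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
ind_old = nn_old.kneighbors(X, return_distance=False)[:, 1:]

# New pattern: X=None excludes each sample point from its own
# neighbor list, duplicates included.
nn_new = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
ind_new = nn_new.kneighbors(X=None, return_distance=False)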

Tests

The change was tested locally, and all existing tests passed successfully.

The following shows that the method of dropping the first element of a sample point's neighbor list sometimes leaves neighbors that are the sample point itself.

from sklearn.neighbors import NearestNeighbors
import numpy as np

n_neighbors = 2
k = n_neighbors + 1
# Create a sample containing a point of multiplicity 4 > k = 3.
X = np.array([[0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [1., 1., .5]])
# Query for k = n_neighbors + 1 neighbors of each sample point.
knn = NearestNeighbors(n_neighbors=k).fit(X)
ind = knn.kneighbors(X=X, n_neighbors=k, return_distance=False)

for i, neighbor_indices in enumerate(ind):
    flag = i == neighbor_indices[0]
    print(f"It is {flag} that the first NN of sample point x_{i} is x_{i}.")

Additional comments

I'm a new contributor, eager to learn.

Possible future work: It appears that

  1. A NearestNeighbors object nbrs is fit
  2. nbrs is passed to a function
  3. that function uses only the training data from the NearestNeighbors object nbrs to fit another NearestNeighbors object, knn.

The first fit seems unnecessary and might be doubling the computation time for the algorithm. Advice on how to assess that situation would be appreciated.


github-actions bot commented Aug 30, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ed87bf8. Link to the linter CI: here
