
LLE utilizing kNN for sample points #29752


Open · wants to merge 3 commits into main

Conversation

davidcherney (Contributor) commented Aug 30, 2024

Description

This pull request addresses issue #29715 by

  • correcting the confusing error message, and
  • using the (pre-existing) functionality of KNeighborsMixin.kneighbors to calculate the kNN of the sample points.

Definitions

I refer to

  • the set $X = \{(i, x_i) : i = 1, \ldots, N\}$ of $N$ data points $x_i$ and their indices $i$, used to fit a NearestNeighbors object via nn = NearestNeighbors().fit(X), as the sample, and an element of that set as a sample point.
  • a point $x$ for which we calculate the nearest neighbors via nn.kneighbors(X=[x]) as a query point.

The distinction is important because

  • in the calculation of kNN of a sample point $(i,x_i)$, the sample point itself is excluded as a possible neighbor.
  • in the calculation of kNN of a query point $x$, none of the sample points is excluded as a possible neighbor.

From a math perspective there are two kNN functions: one for sample points, and one for query points. From a code perspective

  • kNN for all sample points is performed via nn.kneighbors(X=None)
  • kNN for a query point $x$ is performed via nn.kneighbors(X=[x])
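A minimal sketch of the two calls, on a made-up one-dimensional sample:

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X)

# kNN of the sample points: with X=None, each point is excluded as its
# own neighbor, so the neighbors of x_0 are x_1 and x_2.
print(nn.kneighbors(X=None, return_distance=False))

# kNN of a query point: nothing is excluded, so the nearest neighbor of
# the query [0.] is the sample point x_0 itself, at distance 0.
print(nn.kneighbors(X=[[0.0]], return_distance=False))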

Problem

The original code

  • raised a ValueError when n_neighbors was greater than or equal to n_samples.
    • This is not the correct condition for the kNN of a sample point because the point itself must be excluded as a possible neighbor, giving a max value for n_neighbors of n_samples-1.
    • This is the correct condition for the kNN of a query point because, with no train set points excluded, the max value for n_neighbors is n_samples.

The kind of kNN being computed for LLE is the former: kNN for sample points.

However, the message given with the ValueError was written for the latter case: kNN for query points.

Specifically, the code

if n_neighbors >= N:
    raise ValueError(
        "Expected n_neighbors <= n_samples, but n_samples = %d, n_neighbors = %d"
        % (N, n_neighbors)
    )

raised an error when n_neighbors=5 and n_samples=5, and gave the confusing message

"Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 5"

Further, the discussion in PR #29716 showed that the surrounding code was confusing to developers because of the n_neighbors + 1 in calls of the form NearestNeighbors(n_neighbors=n_neighbors + 1, n_jobs=n_jobs).

These calls tried to implement the kNN function for sample points (which should be done with nn.kneighbors(X=None)) by using the kNN function for query points (nn.kneighbors(X=[x])) and then removing the first neighbor.

This treatment neglected the edge cases (such as duplicate sample points) handled by the kNN function for sample points, nn.kneighbors(X=None).

Solution

The code nn.kneighbors(X=None) is now used to calculate kNNs of sample points.

The error message now describes the condition for that function to work.
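In outline (a sketch of the pattern only, not the exact diff applied here):

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.random.RandomState(0).rand(10, 3)
n_neighbors = 4

# Old pattern: ask for one extra neighbor, then drop the first column,
# assuming each point is always its own first neighbor (false for
# duplicate points).
nn_old = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
ind_old = nn_old.kneighbors(X, return_distance=False)[:, 1:]

# New pattern: X=None excludes each sample point from its own
# neighbor list, duplicates included.
nn_new = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
ind_new = nn_new.kneighbors(X=None, return_distance=False)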

Tests

The change was tested locally, and all existing tests passed successfully.

The following shows that the method of dropping the first element of a sample point's neighbor list sometimes leaves neighbors that are the sample point itself.

from sklearn.neighbors import NearestNeighbors
import numpy as np

n_neighbors = 2
k = n_neighbors + 1
# Create a sample containing a point of multiplicity 4 > k = 3.
X = np.array([[0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [1., 1., .5]])
# Query for k = n_neighbors + 1 neighbors of each sample point.
knn = NearestNeighbors(n_neighbors=k).fit(X)
ind = knn.kneighbors(X=X, n_neighbors=k, return_distance=False)

for i, neighbor_indices in enumerate(ind):
    flag = i == neighbor_indices[0]
    print(f"It is {flag} that the first NN of sample point x_{i} is x_{i}.")

Additional comments

I'm a new contributor, eager to learn.

Possible future work: It appears that

  1. A NearestNeighbors object nbrs is fit
  2. nbrs is passed to a function
  3. that function uses only the training data from the NearestNeighbors object nbrs to fit another NearestNeighbors object, knn.

The first fit seems unnecessary and might be doubling the computation time for the algorithm. Advice on how to assess that situation would be appreciated.


github-actions bot commented Aug 30, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ed87bf8. Link to the linter CI: here
