Nearest neighbors Gaussian Process #31158

Open
stavoltafunzia opened this issue Apr 7, 2025 · 4 comments
Labels
Needs Decision - Include Feature Requires decision regarding including feature New Feature

Comments

@stavoltafunzia

stavoltafunzia commented Apr 7, 2025

Describe the workflow you want to enable

Recently I've been working on a Nearest Neighbor Gaussian Process Regressor as described in Datta et al. (2016) here. This kind of model exists in R, but not in scikit-learn. The Nearest Neighbor Gaussian Process Regressor is a simple enhancement over a standard GP that makes GPs usable on large datasets. It has also recently gained interest in the GPyTorch package, see e.g. here.

Describe your proposed solution

I already have a scikit-learn-style implementation that I could bring to this project. This implementation becomes more convenient (uses less memory and runs faster) than the classic Gaussian Process Regressor from a dataset size of approximately 10k samples. It is based on Datta's work, so it is not the same as the one in the GPyTorch package. If anyone deems this model interesting enough, I'm willing to make a PR.

Having a baseline CPU-based implementation in scikit-learn could also serve as a starting point for future GPU-based implementations, which is where this model really shines (e.g. inheriting from the scikit-learn class and implementing the most time-consuming operations on GPU). As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.

Describe alternatives you've considered, if relevant

As mentioned above, a version of NNGP is implemented in GPyTorch. The GPyTorch implementation, however, is based not only on nearest neighbors but also on a variational method. Datta's version is simpler, being based only on nearest neighbors, and can become competitive with more complex methods such as VNNGP when using GPUs.

Additional context

No response

@stavoltafunzia stavoltafunzia added Needs Triage Issue requires triage New Feature labels Apr 7, 2025
@ogrisel ogrisel added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Triage Issue requires triage labels Apr 7, 2025
@ogrisel
Member

ogrisel commented Apr 7, 2025

According to Google Scholar, the paper has 775 citations. However, I am not familiar enough with the GP literature to properly assess the usefulness/maintenance tradeoff.

If you already have a working implementation, could you please publish it or link to it (e.g. as a personal repo on GitHub) to get an idea of how complex the code is?

This implementation becomes more convenient (uses less memory and runs faster) than the classic Gaussian Process Regressor from a dataset size of approximately 10k samples.

Have you tried to run benchmarks? On which kinds of datasets / tasks was this most useful to you?

In particular, do you use it on high-dimensional or low-dimensional data? I have the feeling that using ball-tree or kd-tree indices might make this fast on low-dimensional data (roughly fewer than 10 dimensions).
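The point about tree indices can be illustrated directly with scikit-learn's existing `NearestNeighbors` API, which already supports `algorithm="kd_tree"` and `algorithm="ball_tree"` for exactly this kind of low-dimensional neighbor query (the dataset here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100_000, 3)  # low-dimensional data, where kd-trees work well

# Build a kd-tree index once, then answer k-NN queries in roughly
# logarithmic time per query instead of brute-force O(n) scans.
nn = NearestNeighbors(n_neighbors=10, algorithm="kd_tree").fit(X)
dist, idx = nn.kneighbors(X[:5])
print(idx.shape)  # (5, 10): the 10 nearest neighbors of each query point
```

In high dimensions the tree indices degrade toward brute force, which is why the dimensionality of the benchmark datasets matters for the claimed 10k crossover point.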

As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.

Same remark: I would love to see code and numbers to back this claim.

@stavoltafunzia
Author

If you already have a working implementation, could you please publish it or link to it (e.g. as a personal repo on GitHub) to get an idea of how complex the code is?

Sure, give me a couple of days to clean it up a bit and add a readme and unit tests, then I will publish it.

Have you tried to run benchmarks? On which kinds of datasets / tasks was this most useful to you?
In particular, do you use it on high-dimensional or low-dimensional data? I have the feeling that using ball-tree or kd-tree indices might make this fast on low-dimensional data (roughly fewer than 10 dimensions).

That's a good observation; I also expect the dimensionality of the problem to affect the speed. My statement (10k observations) was based on the 3D-road dataset, which indeed is low-dimensional. I'm now checking on something more high-dimensional, like the elevators dataset. Linking to the question above, I can include a few benchmarks in the code I will publish.

As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.

If relevant, I can publish some of the above code as well. Though, being CuPy-based, I thought it was not very relevant in the context of scikit-learn.

@stavoltafunzia
Author

stavoltafunzia commented Apr 10, 2025

As requested, I published the model in this project. I have also added some very basic benchmarks.
In the coming days I will also add the CUDA-based implementation.

@stavoltafunzia
Author

I added the CUDA-based NNGPR in the same project to give an idea of the possible speedup. See, for example, benchmark.ipynb.
