Nearest neighbors Gaussian Process #31158

Open
stavoltafunzia opened this issue Apr 7, 2025 · 4 comments
Labels
Needs Decision - Include Feature Requires decision regarding including feature New Feature

Comments

@stavoltafunzia

stavoltafunzia commented Apr 7, 2025

Describe the workflow you want to enable

Recently I've been working on a Nearest Neighbor Gaussian Process Regressor as described in Datta et al. (2016) here. This kind of model exists in R, but not in scikit-learn. The Nearest Neighbor Gaussian Process Regressor is a simple enhancement over a standard GP that makes GPs usable on large datasets. It has also recently gained interest in the GPyTorch package, see e.g. here.

Describe your proposed solution

I already have a scikit-learn-style implementation that I could bring to this project. This implementation becomes more convenient (uses less memory and runs faster) than the classic Gaussian Process Regressor from a dataset size of approximately 10k samples. It is based on Datta's work, so it is not the same as the one in the GPyTorch package. If anyone deems this model interesting enough, I'm willing to make a PR.

Having a baseline CPU-based implementation in scikit-learn could also serve as a starting point for future GPU-based implementations, which is where this model really shines (e.g. inheriting from the scikit-learn class and implementing the most time-consuming operations on GPU). As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.

Describe alternatives you've considered, if relevant

As mentioned above, a version of NNGP is implemented in GPyTorch. The GPyTorch implementation, however, is based not only on nearest neighbors but also on a variational method. Datta's version is simpler, being based only on nearest neighbors, and can become competitive with more complex methods such as VNNGP when using GPUs.

Additional context

No response

@stavoltafunzia stavoltafunzia added Needs Triage Issue requires triage New Feature labels Apr 7, 2025
@ogrisel ogrisel added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Triage Issue requires triage labels Apr 7, 2025
@ogrisel
Member

ogrisel commented Apr 7, 2025

According to Google Scholar, the paper has 775 citations. However, I am not familiar enough with the GP literature to properly assess the usefulness/maintenance tradeoff.

If you already have a working implementation, could you please publish it or link to it (e.g. as a personal repo on GitHub) to get an idea of how complex the code is?

This implementation becomes more convenient (uses less memory and runs faster) than the classic Gaussian Process Regressor from a dataset size of approximately 10k samples.

Have you tried to run benchmarks? On which kinds of datasets / tasks was this most useful to you?

In particular, do you use it on high-dimensional or low-dimensional data? I have the feeling that using ball-tree or kd-tree indices might make this fast on low-dimensional data (roughly fewer than 10 dimensions).
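The point about tree indices can be illustrated directly with scikit-learn's existing `NearestNeighbors` API, which already supports `algorithm="kd_tree"` and `algorithm="ball_tree"` for exactly this kind of low-dimensional neighbor query (the dataset here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100_000, 3)  # low-dimensional data, where kd-trees work well

# Build a kd-tree index once, then answer k-NN queries in roughly
# logarithmic time per query instead of brute-force O(n) scans.
nn = NearestNeighbors(n_neighbors=10, algorithm="kd_tree").fit(X)
dist, idx = nn.kneighbors(X[:5])
print(idx.shape)  # (5, 10): the 10 nearest neighbors of each query point
```

In high dimensions the tree indices degrade toward brute force, which is why the dimensionality of the benchmark datasets matters for the claimed 10k crossover point.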

As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.

Same remark: I would love to see code and numbers to back this claim.

@stavoltafunzia
Author

If you already have a working implementation, could you please publish it or link to it (e.g. as a personal repo on GitHub) to get an idea of how complex the code is?

Sure, give me a couple of days to clean it up a bit and add a readme and unit tests, then I will publish it.

Have you tried to run benchmarks? On which kinds of datasets / tasks was this most useful to you?
In particular, do you use it on high-dimensional or low-dimensional data? I have the feeling that using ball-tree or kd-tree indices might make this fast on low-dimensional data (roughly fewer than 10 dimensions).

That's a good observation; I also expect the dimensionality of the problem to affect the speed. My statement (10k observations) was based on the 3D-road dataset, which indeed is low-dimensional. I'm now checking on something more high-dimensional, like the elevators dataset. Linking to the question above, I can include a few benchmarks in the code I will publish.

As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.

If relevant, I can publish some of the above code as well. Though, being CuPy-based, I thought it was not very relevant in the context of scikit-learn.

@stavoltafunzia
Author

stavoltafunzia commented Apr 10, 2025

As requested, I published the model in this project. I have also added some very basic benchmarks.
In the coming days I will also add the CUDA-based implementation.

@stavoltafunzia
Author

I added the CUDA-based NNGPR in the same project to give an idea of the possible speedup. See, for example, benchmark.ipynb.
