Nearest neighbors Gaussian Process #31158
Comments
According to Google Scholar, the paper has 775 citations. However, I am not familiar enough with the GP literature to properly assess the usefulness/maintenance tradeoff. If you already have a working implementation, could you please publish it or link to it (e.g. as a personal repo on GitHub) so we can get an idea of how complex the code is?
Have you tried to run benchmarks? On which kinds of datasets/tasks was this most useful to you? In particular, do you use it on high-dimensional or low-dimensional data? I have the feeling that using ball-tree or kd-tree indices might make this fast on low-dimensional data (fewer than ~10 dimensions).
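For context, here is a minimal sketch of the neighbor-lookup step such a model needs, using scikit-learn's existing KDTree; the data shape and k below are made up for illustration. In low dimensions this query is cheap, while in high dimensions tree indices degrade towards brute force, which is the tradeoff raised above.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 3))  # low-dimensional toy data

tree = KDTree(X)
# The raw k-NN query that dominates the indexing cost; a real NNGP
# restricts each point's neighbors to points earlier in some ordering.
dist, ind = tree.query(X, k=16)
print(ind.shape)  # (10000, 16)
```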
Same remark: I would love to see code and numbers to back this claim.
Sure, give me a couple of days to clean it up a bit and add a README and unit tests, then I will publish it.
Good observation: I also expect the dimensionality of the problem to affect the speed. My statement (10k observations) was based on the 3D-road dataset, which is indeed low-dimensional. I am now checking on something more high-dimensional, like the elevators dataset. Linking to the question above, I can include a few benchmarks in the code I will publish.
If relevant, I can publish some code for the above as well. Though, being CuPy-based, I thought it was not very relevant in the context of scikit-learn.
As requested, I published the model in this project. I have also added some very basic benchmarks.
I added the CUDA-based NNGPR in the same project to give an idea of the possible speedup. See, for example, benchmark.ipynb.
Describe the workflow you want to enable
Recently I've been working on a Nearest Neighbor Gaussian Process Regressor as described in Datta 2016 here. This kind of model exists in R, but not in scikit-learn. The Nearest Neighbor Gaussian Process Regressor is a simple enhancement over the standard GP that makes it usable on large datasets. It has also recently gained interest in the GPyTorch package, see e.g. here.
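To give an idea of the underlying approximation, below is a minimal, illustrative sketch of the Vecchia-style factorization behind Datta's NNGP, not the actual implementation. The exact GP likelihood factorizes as a product of conditionals p(y_i | y_1, ..., y_{i-1}); the NNGP truncates each conditioning set to the m nearest preceding neighbors, so each factor costs O(m^3) instead of growing with i. The ordering and noise handling here are deliberately simplified.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

def nngp_log_likelihood(X, y, kernel=RBF(1.0), m=10, noise=1e-2):
    """Simplified Vecchia/NNGP approximation of the GP log marginal likelihood."""
    order = np.argsort(X[:, 0])  # crude ordering; Datta 2016 discusses better ones
    X, y = X[order], y[order]
    ll = 0.0
    for i in range(len(y)):
        if i == 0:
            mu, var = 0.0, kernel(X[:1])[0, 0] + noise
        else:
            # Conditioning set: the m nearest points among the preceding ones.
            d = np.linalg.norm(X[:i] - X[i], axis=1)
            nb = np.argsort(d)[:m]
            K_nn = kernel(X[nb]) + noise * np.eye(len(nb))
            k_in = kernel(X[i:i + 1], X[nb])[0]
            w = np.linalg.solve(K_nn, k_in)
            mu = w @ y[nb]
            var = kernel(X[i:i + 1])[0, 0] + noise - w @ k_in
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

# Cost grows roughly linearly in n, versus O(n^3) for the exact GP.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (2000, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
print(nngp_log_likelihood(X, y, m=10))
```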
Describe your proposed solution
I already have a scikit-learn-like implementation that I could bring to this project. This implementation becomes cheaper (less memory and less runtime) than the classic Gaussian Process Regressor from a dataset size of approximately 10k. It is based on Datta's work, so it is not the same as the one in the GPyTorch package. If anyone deems this model interesting enough, I'm willing to make a PR.
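For illustration, here is a hypothetical usage example assuming the estimator follows the existing GaussianProcessRegressor API; the class name and the n_neighbors parameter are placeholders, not the actual interface:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
# Hypothetical import; the final name/location would be decided in a PR:
# from sklearn.gaussian_process import NearestNeighborGaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50_000, 3))  # a size where the NNGPR starts to pay off
y = np.sin(X).sum(axis=1) + 0.1 * rng.standard_normal(len(X))

# nngpr = NearestNeighborGaussianProcessRegressor(
#     kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1),
#     n_neighbors=16,  # Datta's m: the size of each conditioning set
# ).fit(X, y)
# y_mean, y_std = nngpr.predict(X[:10], return_std=True)
```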
Having a baseline CPU-based implementation in scikit-learn could also serve as a starting point for future GPU-based implementations, which is where this model really shines (e.g. inheriting from the scikit-learn class and implementing the most time-consuming operations on the GPU). As an example, I also have a CuPy-based implementation of Datta's NNGP which competes very well against GPyTorch's VNNGP.
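A rough sketch of that inheritance pattern, where the base class, method name, and array shapes below are illustrative assumptions rather than any actual scikit-learn or project API:

```python
import numpy as np
import cupy as cp  # GPU array library with a NumPy-compatible API

class NNGPRBase:
    """Stand-in for the hypothetical CPU estimator; only the hot spot is shown."""
    def _solve_neighbor_systems(self, K_nn, k_in):
        # CPU path: batched solves of the small (m x m) neighbor systems,
        # with K_nn of shape (batch, m, m) and k_in of shape (batch, m).
        return np.linalg.solve(K_nn, k_in[..., None])[..., 0]

class CudaNNGPR(NNGPRBase):
    def _solve_neighbor_systems(self, K_nn, k_in):
        # GPU path: cupy.linalg.solve mirrors numpy.linalg.solve,
        # including the batched (stacked-matrix) form used here.
        w = cp.linalg.solve(cp.asarray(K_nn), cp.asarray(k_in)[..., None])
        return cp.asnumpy(w[..., 0])
```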
Describe alternatives you've considered, if relevant
As mentioned above, a version of NNGP is implemented in GPyTorch. The GPyTorch implementation, however, is based not only on nearest neighbors but also on a variational method. Datta's version is simpler, being based only on nearest neighbors, and can become competitive with more complex methods such as VNNGP when run on GPUs.
Additional context
No response