[WIP] Gaussian Process revisited (refactoring of correlation models + some new features/bugfixes) #3388


Closed
wants to merge 21 commits into from

Conversation

jmetzen
Member

@jmetzen jmetzen commented Jul 15, 2014

This PR contains a proposal for refactoring the gaussian_process subpackage (mainly the correlation_models module), some new features that this refactoring makes possible, and fixes for some subtle bugs. In general, I tried to keep the external interface of GaussianProcess fixed and to change only internals. The main purpose of this PR is to start a discussion on the future of the GP subpackage and to revive its development.

CHANGES:

  • Correlation models are now classes instead of functions and inherit from the abstract base class StationaryCorrelation. This reduces code redundancy (DRY) and separates GPs from their correlation models more strictly. Moreover, non-stationary correlation models become possible (see examples/gaussian_process/plot_gp_nonstationary.py). This was not possible before, since only the differences between datapoints were passed to the correlation models, not the datapoints themselves.
  • The hyperparameters theta are estimated as maximum-a-posteriori (MAP) estimates rather than maximum-likelihood estimates. At the moment, the prior on the hyperparameters theta is uniform, so both result in the same value. However, a non-uniform prior could be used in the future in cases where the risk of overfitting theta to the data is serious (e.g., forcing the factor analysis distance to be close to diagonal).
  • Hyperparameters theta are represented as 1D ndarrays instead of 2D 1xn arrays in GaussianProcess.
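To make the class-based design concrete, here is a minimal sketch of what such a correlation-model hierarchy could look like. All class and attribute names here are illustrative, not the PR's actual code:

```python
from abc import ABC, abstractmethod

import numpy as np


class StationaryCorrelationSketch(ABC):
    """Illustrative base class; names are hypothetical, not the PR's code."""

    def fit(self, X):
        # Store the training data and the componentwise |x_i - x_j|
        # differences once; every later call reuses them.
        self.X = np.asarray(X, dtype=float)
        n = self.X.shape[0]
        i, j = np.triu_indices(n, k=1)
        self.D = np.abs(self.X[i] - self.X[j])  # shape (n*(n-1)/2, n_features)
        return self

    @abstractmethod
    def __call__(self, theta):
        """Correlation values for the stored pairwise differences."""


class SquaredExponentialSketch(StationaryCorrelationSketch):
    def __call__(self, theta):
        theta = np.asarray(theta, dtype=float).ravel()  # 1D theta, as in the PR
        return np.exp(-np.sum(theta * self.D ** 2, axis=1))
```

A non-stationary model could override fit() and __call__ to use self.X directly, which the old purely functional interface (receiving only differences) could not express.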

NEW FEATURES:

  • Matern correlation models for nu=1.5 and nu=2.5 have been added (see https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function). An example script showing the potential benefit of the Matern correlation model compared to squared-exponential and absolute-exponential was added under examples/gaussian_process/plot_matern_kernel.py (see attached image)
  • squared_exponential, absolute_exponential and Matern correlation models support factor analysis distance. This can be seen as an extension of learning dimension-specific length scales in which also correlations of feature dimensions can be taken into account. See Rasmussen and Williams 2006, p107 for details. An example script showing the potential benefit of this extension was added under examples/gaussian_process/plot_gp_learning_curve.py (see attached image).
  • GP supports additional optimizers for ML estimation passed as callables. In addition to 'fmin_cobyla' and 'Welch', GaussianProcess now allows passing other optimizers directly as callables via the parameter optimizer.
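For reference, the Matern correlation has simple closed forms for nu=1.5 and nu=2.5; a minimal sketch (function name and signature are illustrative, not the PR's API):

```python
import numpy as np


def matern_correlation(nu, theta, d):
    """Matern correlation for nu in {1.5, 2.5}.  theta holds inverse
    squared length scales per dimension, d the componentwise |x - x'|
    differences.  The closed forms are the standard ones for these
    two nu values (see the Wikipedia article linked above)."""
    d = np.asarray(d, dtype=float)
    r = np.sqrt(np.sum(np.asarray(theta, dtype=float) * d ** 2, axis=-1))
    if nu == 1.5:
        t = np.sqrt(3.0) * r
        return (1.0 + t) * np.exp(-t)                 # once-differentiable samples
    elif nu == 2.5:
        t = np.sqrt(5.0) * r
        return (1.0 + t + t ** 2 / 3.0) * np.exp(-t)  # twice-differentiable samples
    raise ValueError("only nu = 1.5 and nu = 2.5 have these closed forms")
```

Both variants interpolate between the rough absolute-exponential (nu=0.5) and the very smooth squared-exponential (nu=inf) models, which is what the example script illustrates.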

BUGFIXES

  • Setting optimal_rlf_value correctly in GPs and actually maximizing the likelihood over several random starts. The sign of optimal_rlf_value was set wrongly in _arg_max_reduced_likelihood_function(), since it was set to the negative of the return value of reduced_likelihood_function(). This error was probably caused by confusing reduced_likelihood_function(theta) with minus_reduced_likelihood_function(log10t). As a result, _arg_max_reduced_likelihood_function() returned the worst local maximum instead of the best one when performing several random starts.
  • A typo in gp_diabetes_dataset.py was fixed which caused a crash of the example (gp.theta_ versus gp.theta)
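The effect of the sign error can be reproduced in a toy sketch (illustrative code, not the sklearn internals):

```python
import numpy as np


def best_theta(rlf, candidate_thetas, negate_bug=False):
    """Keep the theta with the highest reduced likelihood over several
    starts.  negate_bug=True records the value with the wrong sign,
    reproducing the flaw fixed here: the comparison then keeps the
    worst local optimum instead of the best.  (Toy illustration only.)"""
    best_value, best = -np.inf, None
    for theta in candidate_thetas:
        value = rlf(theta)
        if negate_bug:
            value = -value  # the bug: sign flipped as if still minimizing
        if value > best_value:
            best_value, best = value, theta
    return best


def rlf(theta):
    # A toy "reduced likelihood", maximized at theta = 2.
    return -(theta - 2.0) ** 2


starts = [0.0, 2.0, 5.0]
```

With the correct sign, theta = 2.0 wins; with the flipped sign, the worst start (theta = 5.0) is returned.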

To be discussed:

  • The factor analysis distance has hyperparameters that can take on arbitrary real values (not just positive ones). Since sklearn enforces the hyperparameters theta to be positive, this is handled internally by taking the log of the corresponding components of theta. Are there better ideas?
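A sketch of how this log-encoding can round-trip real-valued loadings through a strictly positive theta. The parameter layout and helper names are hypothetical, not the PR's exact encoding:

```python
import numpy as np


def unpack_theta(theta, n_dim, rank):
    """Split a strictly positive theta into positive length scales
    (used as-is) and real-valued factor-loading entries recovered as
    log(theta), i.e. theta = exp(param) internally."""
    theta = np.asarray(theta, dtype=float)
    length_scales = theta[:n_dim]                          # stay positive
    loadings = np.log(theta[n_dim:]).reshape(n_dim, rank)  # arbitrary reals
    return length_scales, loadings


def factor_analysis_sqdist(x, y, length_scales, loadings):
    """Squared distance under M = Lambda Lambda^T + diag(length_scales),
    cf. Rasmussen & Williams (2006), p. 107."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    M = loadings @ loadings.T + np.diag(length_scales)
    return float(diff @ M @ diff)
```

The obvious drawback is that a loading of exactly zero maps to theta = 1, so bounds and optimizer steps on theta act on an exponential scale, which is part of why better ideas are solicited.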

[Attached images: plot_gp_learning_curve, plot_matern_kernel]

Jan Hendrik Metzen added 21 commits July 4, 2014 09:53
…lation (work-in-progress)

Advantages
 * GPs and their correlation models more strictly separated
 * Allow non-stationary correlation models in the future
 * Reduce code redundancy
…g likelihood over several random starts.

The sign of the optimal_rlf_value was set wrongly in 
_arg_max_reduced_likelihood_function() since it was set to the negative
of the return of reduced_likelihood_function(). This error was probably
caused by confusing reduced_likelihood_function(theta) 
with minus_reduced_likelihood_function(log10t). It resulted, however,
in _arg_max_reduced_likelihood_function() returning the worst local
maxima instead of the best one when performing several random_starts.
…s callables

In addition to 'fmin_cobyla' and 'Welch', GaussianProcess now allows
passing other optimizers directly as callables via the parameter optimizer.
This allows calling __init__ of more complex correlation models outside
of the GP, before the training data X used in fit() is known.
…ead of ML

Instead of choosing the hyperparameters that maximize the likelihood,
the maximum-a-posteriori is used. While the prior is assumed to be
uniform for the implemented stationary correlation models, more 
complex correlation models with more hyperparameters may use an
appropriate prior to reduce the risk of overfitting when the training
set is small.
@jmetzen
Member Author

jmetzen commented Jul 15, 2014

This PR might be interesting for @dubourg, @agramfort, @larsmans, @JohnFNovak, and @AlexanderFabisch

raise Exception("Multiple input features cannot have the same"
" value.")

def __call__(self, theta, X=None):

Is there any way we could avoid overriding the call method and instead call transform() explicitly, setting properties on calls to fit()?

@kastnerkyle
Member

Several things here:

  1. While we are refactoring, is there a way we can deprecate the term "nugget"? I really, really dislike the name - there has to be a clearer name for general users.
  2. I don't know about the two abstract base classes. On the one hand, I see the logic in doing it this way, but I also feel that explicit independent functions for each would make things more modular (which is the reasoning behind the change), and would lend a hand toward using these kernels in an eventual KernelMixin class, which was discussed by @mblondel somewhere around here. At the very least, I currently don't see why overriding call is preferable to simply making that functionality live in .transform() or another method and setting the hyperparameters on fit().

@kastnerkyle
Member

Actually, I am not 100% sure why these classes need all the fit() machinery/stateful behavior at all. It seems like there isn't much need for the class structure, and they could be implemented as pure functions that also use some shared functions from the module level. Maybe I am missing something crucial?

@jmetzen
Member Author

jmetzen commented Jul 15, 2014

Regarding 1): I also dislike nugget, but it is part of the external GaussianProcess API, so I didn't change it.
Regarding 2): I am open to using a different interface than call; I mainly used call because it allows using the correlation model in a similar way to the old functional correlation models. I don't know about calling it transform; it is not really transforming the data and can even be called without data X (computing the autocorrelation on self.X). Regarding splitting the abstract base class into several independent functions: I am not sure this could easily be passed to the GP via the corr parameter of the constructor.

@kastnerkyle
Member

On the first, I am glad to see that I am not the only person who is bugged by the name :) . It would be a solid opportunity to deprecate nugget at the API level, then introduce a new parameter that we expose as well and use internally, though I am not sure of a good name yet.

I will try to work up a skeleton example off your branch to show what I mean on the second count. Thanks for the work so far - GP definitely needs some love!

@jmetzen
Member Author

jmetzen commented Jul 15, 2014

Regarding the fit() machinery/stateful behavior: the l1_cross_differences stored in StationaryCorrelation.D need to be computed only once. Formerly, they were stored in the GP itself (where they are not really required). The stateful behavior would also be required for implementing some non-stationary correlation models from the literature (not part of this PR).

@kastnerkyle
Member

On the l1_cross_differences, it would be pretty easy to do an "if None" check to avoid recomputing.

I would actually prefer storing state in the GP itself - if that was the case, we could still use the state stored in the GP to implement non-stationary correlation models, right? It seems more explicit to me to keep the state in the class itself, and let the correlations stay as simple as possible. In any case, I will try and work out an example of "stupid" correlations off your branch to see if it is a reasonable approach.

@kastnerkyle
Member

Ultimately, I think my comments are better addressed by a larger refactor of the GP class. Ignore me for now, and we can analyze this PR on its own merits.

In general we should work toward making the GP class more transparent and accessible - I currently find it quite complicated for (possibly) no reason, and much of the class does not adhere to modern sklearn API standards.

Adding a class hierarchy to the kernel methods which is not shared with other kernel classes, pairwise_metrics, etc. doesn't really help this. If we decide to merge this PR, I would prefer to see these implementation details (the two metaclasses and their subclasses) private rather than public, in order to make a larger refactor easier in the future. Making the correlations/kernels/covariances very dumb and keeping as much state contained to the GP class as possible is also a step in this direction. That said, I think this PR can definitely help people work in the GP class now, which is very important.

There is a lot in common between kernel ridge regression and GP/kriging - I do not advocate writing "the class to end all classes" but it is definitely worth seeing if we can merge some common functionality now and in the future.

@larsmans
Member

This is a big PR; let's not start changing the API ("nugget" vs. whatever) if we don't need to. Too many good PRs go stale because too many changes are introduced at the same time and the review process becomes a nightmare.

@kastnerkyle
Member

Good point. If it helps get this merged, let's minimize the surface area and go with what it does now. I think it is very good, but I am unhappy with the GP in general. This should help that :)


@jmetzen
Member Author

jmetzen commented Jul 16, 2014

+1 that the GP class needs to get more transparent and accessible
In my opinion, the best way to do this is to focus on one part at a time (this PR is probably already too big). Maybe we can leave the ML vs. MAP part for later and focus first on getting the correlation models straight. Here are my thoughts on this:

  • correlation_models vs. pairwise_metrics: the strongest overlap here is, in my opinion, the function l1_cross_differences, which is pretty similar to manhattan_distances from pairwise (for Y == None). The main difference seems to be that l1_cross_differences exploits the symmetry of D (when Y == None) and thus has a smaller memory footprint. Another overlap is between SquaredExponential and the rbf_kernel from pairwise; rbf_kernel is restricted to the isotropic case at the moment, however.
  • correlation models as functions versus stateful objects: for the correlation models currently implemented in sklearn, the state consists basically of the pairwise distances of the training data (via D and ij). This could be kept in the GaussianProcess class. I see many use cases for custom correlation models (and have some myself), which need not become part of sklearn itself. It would be sufficient to keep them external but be able to use them with sklearn's GP, as long as they have an interface compatible with sklearn's correlation models. Since some kernels might differ not only in computation but also in their internal state (one of my examples stores a clustering of X internally), stateful objects are in my opinion better suited for this. The specific interface of StationaryCorrelation is certainly not optimal; I am open to suggestions ;-)
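For illustration, a sketch of a symmetry-exploiting l1_cross_differences (an expository reimplementation, not sklearn's internal code):

```python
import numpy as np


def l1_cross_differences_sketch(X):
    """Componentwise |x_i - x_j| for i < j only, exploiting the symmetry
    of the distance matrix: n*(n-1)/2 rows instead of the full n*n
    matrix that a manhattan_distances(X) call would produce."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)
    # D holds one row per unordered pair; ij records which pair each row is.
    return np.abs(X[i] - X[j]), np.column_stack((i, j))
```

The full symmetric matrix is recoverable from (D, ij) when needed, so nothing is lost by storing only the upper triangle.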

@kastnerkyle
Member

Is there any way we could separate the bugfix stuff and the refactoring/API change proposal into two PRs? I really, really like the bugfixes but we will need to have a lot of thought and input about the API stuff. If we can split this into two smaller PRs I think it is more likely to get looked at.

@jmetzen
Member Author

jmetzen commented Jul 21, 2014

I will create a separate PR for the bugfixes and corresponding tests.

@kastnerkyle
Member

Great - there are a lot of fixes here :)

@jmetzen
Member Author

jmetzen commented Jul 22, 2014

Oops, it seems that we fixed the same issue twice. In order to finally get PR #2632 merged, I added a proposal for a regression test in the comments.

@jmetzen
Member Author

jmetzen commented Aug 30, 2014

Now that the bugfixes are merged via PR #3545, what is the general opinion regarding the new features? And using classes for the correlation_models?

If this PR is too big, I can focus on a subset of the features/refactoring and create a smaller PR.

@jmetzen
Member Author

jmetzen commented Dec 19, 2014

I have created a separate PR (PR #3885) for adding the Matern kernel to pairwise.py. Once that is done, I will try refactoring correlation_models.py such that it uses as much from pairwise.py as possible, and close this PR. What is your opinion, @kastnerkyle?

@kastnerkyle
Member

That sounds awesome! Thanks for all your work on this, it is really nice to see the GP getting some fixes - sharing the kernels via pairwise.py should be good for many estimators. It will also provide a clear roadmap for people wanting to use their own kernels.


@jmetzen jmetzen closed this Jan 17, 2015
@jmetzen jmetzen mentioned this pull request Jan 17, 2015
2 tasks
@amueller
Member

@jmetzen you closed this because you want to break it up? Sorry for the lack of feedback before.

@jmetzen
Member Author

jmetzen commented Jan 18, 2015

I plan to split the contents of this PR into several smaller PRs. First step would be to refactor correlation_models.py such that it reuses the kernels from pairwise.py.

@amueller
Member

Btw, did you follow the (somewhat) recent discussion on the mailing list on whether it would be worthwhile to have a language for specifying kernels in sklearn?
When using the pairwise module we might end up with a lot of redundant computation if we combine many kernels, right?
I guess it would still be better than what we currently have, though.

@mblondel
Member

I suspect that we'll need kernel classes anyway for kernel parameter optimization by gradient descent. Also, popular kernels tend to be different for SVMs and GPs.

@amueller
Member

So if we need kernel classes, I'm not sure the proposed refactoring is the best idea.

@mblondel
Member

My advice would be to worry about code factorization after the GP module got rewritten.

@jmetzen
Member Author

jmetzen commented Jan 20, 2015

Makes sense to defer refactoring until the GP module is rewritten. I also checked how pairwise's kernels could be reused in the GP context; there are at least two drawbacks: pairwise's kernels are restricted to the isotropic case (scalar gamma), and they don't exploit that k(X, X) is symmetric - in contrast to the current GP implementation. Thus, code reuse between pairwise and correlation_models would probably also require some non-trivial refactoring of pairwise.
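Both points can be made concrete in a small sketch (illustrative code, not a patch proposal for pairwise): an anisotropic squared-exponential kernel that also exploits the symmetry of k(X, X):

```python
import numpy as np


def anisotropic_rbf(X, length_scales):
    """Squared-exponential kernel with one length scale per feature,
    filling only the upper triangle and mirroring it, so each pair is
    computed once.  Contrast: pairwise.rbf_kernel takes a single scalar
    gamma and computes all n*n entries."""
    X = np.asarray(X, dtype=float) / np.asarray(length_scales, dtype=float)
    n = X.shape[0]
    K = np.eye(n)  # k(x, x) = 1 on the diagonal
    i, j = np.triu_indices(n, k=1)
    vals = np.exp(-0.5 * np.sum((X[i] - X[j]) ** 2, axis=1))
    K[i, j] = K[j, i] = vals  # symmetry: compute each pair once
    return K
```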

BTW: Is anyone currently working on the GP refactoring?

@amueller
Member

Unfortunately not.
@dfm said he might have some time to help with improving documentation, but he seems busy finding a new earth ;)
