[WIP] Gaussian Process revisited (refactoring of correlation models + some new features/bugfixes) #3388


Closed
wants to merge 21 commits into from

Conversation

jmetzen
Member

@jmetzen jmetzen commented Jul 15, 2014

This PR contains a proposal for refactoring the gaussian_process subpackage (mainly the correlation_models module), some new features that this refactoring makes possible, and fixes for some subtle bugs. In general, I tried to keep the external interface of GaussianProcess fixed and to change only internals. The main purpose of this PR is to start a discussion on the future of the GP subpackage and to revive its development.

CHANGES:

  • Correlation models are now classes instead of functions and inherit from the abstract base class StationaryCorrelation. This reduces code redundancy (DRY) and separates GPs from their correlation models more strictly. Moreover, non-stationary correlation models become possible (see examples/gaussian_process/plot_gp_nonstationary.py). This was not possible before, since only the differences between datapoints were passed to the correlation models, not the datapoints themselves.
  • The hyperparameters theta are estimated as maximum-a-posteriori (MAP) estimates rather than maximum-likelihood estimates. At the moment, the prior on the hyperparameters theta is uniform, so both result in the same value. However, a non-uniform prior could be used in the future in cases where the risk of overfitting theta to the data is serious (e.g., forcing the factor analysis distance to be close to diagonal).
  • Hyperparameters theta are represented as 1D ndarrays instead of 2D 1xn arrays in GaussianProcess.
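To make the class-based design concrete, here is a minimal sketch of what such a correlation-model hierarchy could look like. All class and attribute names here are illustrative, not the PR's actual code:

```python
from abc import ABC, abstractmethod

import numpy as np


class StationaryCorrelationSketch(ABC):
    """Illustrative base class; names are hypothetical, not the PR's code."""

    def fit(self, X):
        # Store the training data and the componentwise |x_i - x_j|
        # differences once; every later call reuses them.
        self.X = np.asarray(X, dtype=float)
        n = self.X.shape[0]
        i, j = np.triu_indices(n, k=1)
        self.D = np.abs(self.X[i] - self.X[j])  # shape (n*(n-1)/2, n_features)
        return self

    @abstractmethod
    def __call__(self, theta):
        """Correlation values for the stored pairwise differences."""


class SquaredExponentialSketch(StationaryCorrelationSketch):
    def __call__(self, theta):
        theta = np.asarray(theta, dtype=float).ravel()  # 1D theta, as in the PR
        return np.exp(-np.sum(theta * self.D ** 2, axis=1))
```

A non-stationary model could override fit() and __call__ to use self.X directly, which the old purely functional interface (receiving only differences) could not express.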

NEW FEATURES:

  • Matern correlation models for nu=1.5 and nu=2.5 have been added (see https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function). An example script showing the potential benefit of the Matern correlation model compared to squared-exponential and absolute-exponential was added under examples/gaussian_process/plot_matern_kernel.py (see attached image)
  • squared_exponential, absolute_exponential and Matern correlation models support factor analysis distance. This can be seen as an extension of learning dimension-specific length scales in which also correlations of feature dimensions can be taken into account. See Rasmussen and Williams 2006, p107 for details. An example script showing the potential benefit of this extension was added under examples/gaussian_process/plot_gp_learning_curve.py (see attached image).
  • GP supports additional optimizers for ML estimation passed as callables. In addition to 'fmin_cobyla' and 'Welch', GaussianProcess now allows passing other optimizers directly as callables via the parameter optimizer.
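For reference, the Matern correlation has simple closed forms for nu=1.5 and nu=2.5; a minimal sketch (function name and signature are illustrative, not the PR's API):

```python
import numpy as np


def matern_correlation(nu, theta, d):
    """Matern correlation for nu in {1.5, 2.5}.  theta holds inverse
    squared length scales per dimension, d the componentwise |x - x'|
    differences.  The closed forms are the standard ones for these
    two nu values (see the Wikipedia article linked above)."""
    d = np.asarray(d, dtype=float)
    r = np.sqrt(np.sum(np.asarray(theta, dtype=float) * d ** 2, axis=-1))
    if nu == 1.5:
        t = np.sqrt(3.0) * r
        return (1.0 + t) * np.exp(-t)                 # once-differentiable samples
    elif nu == 2.5:
        t = np.sqrt(5.0) * r
        return (1.0 + t + t ** 2 / 3.0) * np.exp(-t)  # twice-differentiable samples
    raise ValueError("only nu = 1.5 and nu = 2.5 have these closed forms")
```

Both variants interpolate between the rough absolute-exponential (nu=0.5) and the very smooth squared-exponential (nu=inf) models, which is what the example script illustrates.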

BUGFIXES

  • Setting optimal_rlf_value correctly in GPs and actually maximizing the likelihood over several random starts. The sign of optimal_rlf_value was set wrongly in _arg_max_reduced_likelihood_function(), since it was set to the negative of the return value of reduced_likelihood_function(). This error was probably caused by confusing reduced_likelihood_function(theta) with minus_reduced_likelihood_function(log10t). As a result, _arg_max_reduced_likelihood_function() returned the worst local maximum instead of the best one when performing several random starts.
  • A typo in gp_diabetes_dataset.py was fixed which caused a crash of the example (gp.theta_ versus gp.theta)
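The effect of the sign error can be reproduced in a toy sketch (illustrative code, not the sklearn internals):

```python
import numpy as np


def best_theta(rlf, candidate_thetas, negate_bug=False):
    """Keep the theta with the highest reduced likelihood over several
    starts.  negate_bug=True records the value with the wrong sign,
    reproducing the flaw fixed here: the comparison then keeps the
    worst local optimum instead of the best.  (Toy illustration only.)"""
    best_value, best = -np.inf, None
    for theta in candidate_thetas:
        value = rlf(theta)
        if negate_bug:
            value = -value  # the bug: sign flipped as if still minimizing
        if value > best_value:
            best_value, best = value, theta
    return best


def rlf(theta):
    # A toy "reduced likelihood", maximized at theta = 2.
    return -(theta - 2.0) ** 2


starts = [0.0, 2.0, 5.0]
```

With the correct sign, theta = 2.0 wins; with the flipped sign, the worst start (theta = 5.0) is returned.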

To be discussed:

  • The factor analysis distance has hyperparameters that can take on arbitrary real values (not just positive ones). Since sklearn enforces the hyperparameters theta to be positive, this is handled internally by taking the log of the corresponding components of theta. Are there better ideas?
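A sketch of how this log-encoding can round-trip real-valued loadings through a strictly positive theta. The parameter layout and helper names are hypothetical, not the PR's exact encoding:

```python
import numpy as np


def unpack_theta(theta, n_dim, rank):
    """Split a strictly positive theta into positive length scales
    (used as-is) and real-valued factor-loading entries recovered as
    log(theta), i.e. theta = exp(param) internally."""
    theta = np.asarray(theta, dtype=float)
    length_scales = theta[:n_dim]                          # stay positive
    loadings = np.log(theta[n_dim:]).reshape(n_dim, rank)  # arbitrary reals
    return length_scales, loadings


def factor_analysis_sqdist(x, y, length_scales, loadings):
    """Squared distance under M = Lambda Lambda^T + diag(length_scales),
    cf. Rasmussen & Williams (2006), p. 107."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    M = loadings @ loadings.T + np.diag(length_scales)
    return float(diff @ M @ diff)
```

The obvious drawback is that a loading of exactly zero maps to theta = 1, so bounds and optimizer steps on theta act on an exponential scale, which is part of why better ideas are solicited.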

[Attached images: plot_gp_learning_curve, plot_matern_kernel]

Jan Hendrik Metzen added 21 commits July 4, 2014 09:53
…lation (work-in-progress)

Advantages
 * GPs and their correlation models more strictly separated
 * Allow non-stationary correlation models in the future
 * Reduce code redundancy
…g likelihood over several random starts.

The sign of the optimal_rlf_value was set wrongly in 
_arg_max_reduced_likelihood_function() since it was set to the negative
of the return of reduced_likelihood_function(). This error was probably
caused by confusing reduced_likelihood_function(theta) 
with minus_reduced_likelihood_function(log10t). It resulted, however,
in _arg_max_reduced_likelihood_function() returning the worst local
maxima instead of the best one when performing several random_starts.
…s callables

In addition to 'fmin_cobyla' and 'Welch', GaussianProcess now allows
passing other optimizers directly as callables via the parameter optimizer.
This allows calling __init__ of more complex correlation models outside
of the GP, before the training data X used in fit() is known.
…ead of ML

Instead of choosing the hyperparameters that maximize the likelihood,
the maximum-a-posteriori is used. While the prior is assumed to be
uniform for the implemented stationary correlation models, more 
complex correlation models with more hyperparameters may use an
appropriate prior to reduce the risk of overfitting when the training
set is small.
@jmetzen
Member Author

jmetzen commented Jul 15, 2014

This PR might be interesting for @dubourg, @agramfort, @larsmans, @JohnFNovak, and @AlexanderFabisch

raise Exception("Multiple input features cannot have the same"
" value.")

def __call__(self, theta, X=None):

Is there any way we could avoid overriding the call method and instead call transform() explicitly, setting properties on calls to fit()?

@kastnerkyle
Member

Several things here:

  1. While we are refactoring, is there a way we can deprecate the term "nugget"? I really, really dislike the name - there has to be a clearer name for general users.
  2. I don't know about the two abstract base classes. On the one hand, I see the logic in doing it this way, but I also feel that explicit independent functions for each would make things more modular (which is the reasoning behind the change), and would lend a hand toward using these kernels in an eventual KernelMixin class, which was discussed by @mblondel somewhere around here. At the very least, I currently don't see why overriding call is preferable to simply making that functionality live in .transform() or another method and setting the hyperparameters on fit().

@kastnerkyle
Member

Actually, I am not 100% sure why these classes need all the fit() machinery/stateful behavior at all. It seems like there isn't much need for the class structure, and they could be implemented as pure functions that also use some shared functions from the module level. Maybe I am missing something crucial?

@jmetzen
Member Author

jmetzen commented Jul 15, 2014

Regarding 1): I also dislike nugget, but it is part of the external GaussianProcess API, so I didn't change it.
Regarding 2): I am open to using a different interface than call; I mainly used call because it allows using the correlation model in a similar way to the old functional correlation models. I don't know about calling it transform; it is not really transforming the data and can even be called without data X (computing the autocorrelation on self.X). Regarding splitting the abstract base class into several independent functions: I am not sure this could easily be passed to the GP via the corr parameter of the constructor.

@kastnerkyle
Member

On the first, I am glad to see that I am not the only person who is bugged by the name :) . It would be a solid opportunity to deprecate nugget at the API level, then introduce a new parameter that we expose as well and use internally, though I am not sure of a good name yet.

I will try to work up a skeleton example off your branch to show what I mean on the second count. Thanks for the work so far - GP definitely needs some love!

@jmetzen
Member Author

jmetzen commented Jul 15, 2014

Regarding the fit() machinery/stateful behavior: the l1_cross_differences stored in StationaryCorrelation.D need to be computed only once. Formerly, they were stored in the GP itself (where they are not really required). The stateful behavior would also be required for implementing some non-stationary correlation models from the literature (not part of this PR).

@kastnerkyle
Member

On the l1_cross_differences, it would be pretty easy to do an "if None" check to avoid recomputing.

I would actually prefer storing state in the GP itself - if that was the case, we could still use the state stored in the GP to implement non-stationary correlation models, right? It seems more explicit to me to keep the state in the class itself, and let the correlations stay as simple as possible. In any case, I will try and work out an example of "stupid" correlations off your branch to see if it is a reasonable approach.

@kastnerkyle
Member

Ultimately, I think my comments are better addressed by a larger refactor of the GP class. Ignore me for now, and we can analyze this PR on its own merits.

In general we should work toward making the GP class more transparent and accessible - I currently find it quite complicated for (possibly) no reason, and much of the class does not adhere to modern sklearn API standards.

Adding a class hierarchy to the kernel methods which is not shared with other kernel classes, pairwise_metrics, etc. doesn't really help this. If we decide to merge this PR, I would prefer to see these implementation details (the two metaclasses and their subclasses) private rather than public, in order to make a larger refactor easier in the future. Making the correlations/kernels/covariances very dumb and keeping as much state contained to the GP class as possible is also a step in this direction. That said, I think this PR can definitely help people work in the GP class now, which is very important.

There is a lot in common between kernel ridge regression and GP/kriging - I do not advocate writing "the class to end all classes" but it is definitely worth seeing if we can merge some common functionality now and in the future.

@larsmans
Member

This is a big PR; let's not start changing the API ("nugget" vs. whatever) if we don't need to. Too many good PRs go stale because too many changes are introduced at the same time and the review process becomes a nightmare.

@kastnerkyle
Member

Good point. If it helps get this merged, let's minimize the surface area and go with what it does now. I think it is very good, but I am unhappy with the GP in general. This should help that :)


@jmetzen
Member Author

jmetzen commented Jul 16, 2014

+1 that the GP class needs to get more transparent and accessible
In my opinion, the best way to do this is to focus on one part at a time (this PR is probably already too big). Maybe we can leave the ML vs. MAP part for later and focus first on getting the correlation models straight. Here are my thoughts on this:

  • correlation_models vs. pairwise_metrics: the strongest overlap here is, in my opinion, the function l1_cross_differences, which is pretty similar to manhattan_distances from pairwise (for Y == None). The main difference seems to be that l1_cross_differences exploits the symmetry of D (when Y == None) and thus has a smaller memory footprint. Another overlap is between SquaredExponential and the rbf_kernel from pairwise; rbf_kernel is restricted to the isotropic case at the moment, however.
  • correlation models as functions versus stateful objects: for the correlation models currently implemented in sklearn, the state consists basically of the pairwise distances of the training data (via D and ij). This could be kept in the GaussianProcess class. I see many use cases for custom correlation models (and have some myself), which need not become part of sklearn itself. It would be sufficient to keep them external but be able to use them with sklearn's GP, as long as they have an interface compatible with sklearn's correlation models. Since some kernels might differ not only in computation but also in their internal state (one of my examples stores a clustering of X internally), stateful objects are in my opinion better suited for this. The specific interface of StationaryCorrelation is certainly not optimal; I am open to suggestions ;-)
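For illustration, a sketch of a symmetry-exploiting l1_cross_differences (an expository reimplementation, not sklearn's internal code):

```python
import numpy as np


def l1_cross_differences_sketch(X):
    """Componentwise |x_i - x_j| for i < j only, exploiting the symmetry
    of the distance matrix: n*(n-1)/2 rows instead of the full n*n
    matrix that a manhattan_distances(X) call would produce."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)
    # D holds one row per unordered pair; ij records which pair each row is.
    return np.abs(X[i] - X[j]), np.column_stack((i, j))
```

The full symmetric matrix is recoverable from (D, ij) when needed, so nothing is lost by storing only the upper triangle.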

@kastnerkyle
Member

Is there any way we could separate the bugfix stuff and the refactoring/API change proposal into two PRs? I really, really like the bugfixes but we will need to have a lot of thought and input about the API stuff. If we can split this into two smaller PRs I think it is more likely to get looked at.

@jmetzen
Member Author

jmetzen commented Jul 21, 2014

I will create a separate PR for the bugfixes and corresponding tests.

@kastnerkyle
Member

Great - there are a lot of fixes here :)

@jmetzen
Member Author

jmetzen commented Jul 22, 2014

Oops, it seems that we fixed the same issue twice. In order to finally get PR #2632 merged, I added a proposal for a regression test in the comments.

@jmetzen
Member Author

jmetzen commented Aug 30, 2014

Now that the bugfixes are merged via PR #3545, what is the general opinion regarding the new features? And using classes for the correlation_models?

If this PR is too big, I can focus on a subset of the features/refactoring and create a smaller PR.

@jmetzen
Member Author

jmetzen commented Dec 19, 2014

I have created a separate PR (PR #3885) for adding the Matern kernel to pairwise.py. Once that is done, I will try refactoring correlation_models.py such that it uses as much from pairwise.py as possible, and close this PR. What is your opinion, @kastnerkyle?

@kastnerkyle
Member

That sounds awesome! Thanks for all your work on this, it is really nice to see the GP getting some fixes - sharing the kernels via pairwise.py should be good for many estimators. It will also provide a clear roadmap for people wanting to use their own kernels.


@jmetzen jmetzen closed this Jan 17, 2015
@jmetzen jmetzen mentioned this pull request Jan 17, 2015
2 tasks
@amueller
Member

@jmetzen you closed this because you want to break it up? Sorry for the lack of feedback before.

@jmetzen
Member Author

jmetzen commented Jan 18, 2015

I plan to split the contents of this PR into several smaller PRs. First step would be to refactor correlation_models.py such that it reuses the kernels from pairwise.py.

@amueller
Member

Btw, did you follow the (somewhat) recent discussion on the mailing list on whether it would be worthwhile to have a language for specifying kernels in sklearn?
When using the pairwise module we might end up with a lot of redundant computation if we combine many kernels, right?
I guess it would still be better than what we currently have, though.

@mblondel
Member

I suspect that we'll need kernel classes anyway for kernel parameter optimization by gradient descent. Also, popular kernels tend to be different for SVMs and GPs.

@amueller
Member

So if we need kernel classes, I'm not sure the proposed refactoring is the best idea.

@mblondel
Member

My advice would be to worry about code factorization after the GP module got rewritten.

@jmetzen
Member Author

jmetzen commented Jan 20, 2015

Makes sense to defer refactoring until the GP module is rewritten. I also checked how pairwise's kernels could be reused in the GP context; there are at least two drawbacks: pairwise's kernels are restricted to the isotropic case (scalar gamma), and they don't exploit that k(X, X) is symmetric - in contrast to the current GP implementation. Thus, code reuse between pairwise and correlation_models would probably also require some non-trivial refactoring of pairwise.
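Both points can be made concrete in a small sketch (illustrative code, not a patch proposal for pairwise): an anisotropic squared-exponential kernel that also exploits the symmetry of k(X, X):

```python
import numpy as np


def anisotropic_rbf(X, length_scales):
    """Squared-exponential kernel with one length scale per feature,
    filling only the upper triangle and mirroring it, so each pair is
    computed once.  Contrast: pairwise.rbf_kernel takes a single scalar
    gamma and computes all n*n entries."""
    X = np.asarray(X, dtype=float) / np.asarray(length_scales, dtype=float)
    n = X.shape[0]
    K = np.eye(n)  # k(x, x) = 1 on the diagonal
    i, j = np.triu_indices(n, k=1)
    vals = np.exp(-0.5 * np.sum((X[i] - X[j]) ** 2, axis=1))
    K[i, j] = K[j, i] = vals  # symmetry: compute each pair once
    return K
```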

BTW: Is anyone currently working on the GP refactoring?

@amueller
Member

Unfortunately not.
@dfm said he might have some time to help with improving documentation, but he seems busy finding a new earth ;)
