Skip to content

[MRG+2] Allowing Gaussian process kernels on structured data - updated #15557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

yhtang
Copy link
Contributor

@yhtang yhtang commented Nov 7, 2019

Fixes #13936.

This PR is a from-scratch refresh of #13954 (with all previous modifications incorporated) based on the latest master branch.

Made the Gaussian process regressor and classifier compatible with generic kernels on arbitrary data. Specifically:

  1. Added a requires_vector_input property to all the built-in GP kernels in gaussian_process/kernels.py.
  2. Added a GenericKernelMixin base classes to overload the requires_vector_input property for kernels that can work on structured and/or generic data.
  3. Let the GP regressor and classifier use different check_array and check_X_y parameters for vector/generic input. I.e., do not force the X and Y array to be at least 2D and numeric if the kernel can operate on generic input. Updated docstrings accordingly.
  4. Provided a minimal example of GP regression and classification using a sequence kernel on variable-length strings in plot_gpr_on_structured_data.py.

@jnothman
Copy link
Member

jnothman commented Nov 7, 2019

Why do you regard this as WIP? What work remains other than review?

@yhtang
Copy link
Contributor Author

yhtang commented Nov 7, 2019

Why do you regard this as WIP? What work remains other than review?

I have one more commit that updates the changelog that has not yet been pushed : ) Should be there in a few minutes.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking good to me

@yhtang yhtang changed the title [WIP] Allowing Gaussian process kernels on structured data - updated [MRG] Allowing Gaussian process kernels on structured data - updated Nov 7, 2019
@yhtang
Copy link
Contributor Author

yhtang commented Nov 7, 2019

@jnothman How could I restart a failed pipeline caused by network error?

@yhtang
Copy link
Contributor Author

yhtang commented Nov 7, 2019

@jnothman Could you please let me know the next steps to do?

@jnothman
Copy link
Member

jnothman commented Nov 7, 2019

Await another reviewer... unfortunately, this may miss the cutoff for 0.22.

@yhtang
Copy link
Contributor Author

yhtang commented Nov 8, 2019

Await another reviewer... unfortunately, this may miss the cutoff for 0.22.

I see... Do I need to update the PR name to MRG + 1 etc.?
Thanks very much anyway!

Copy link
Member

@TomDLT TomDLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicks, thanks for this nice contribution !

yhtang and others added 4 commits November 15, 2019 14:25
Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
@yhtang
Copy link
Contributor Author

yhtang commented Nov 15, 2019

Only nitpicks, thanks for this nice contribution !

Thanks for reviewing!

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
@yhtang yhtang changed the title [MRG] Allowing Gaussian process kernels on structured data - updated [MRG+2] Allowing Gaussian process kernels on structured data - updated Nov 16, 2019
@TomDLT TomDLT merged commit 8360786 into scikit-learn:master Nov 16, 2019
@jnothman
Copy link
Member

@adrinjalali are you keeping track of things that will either need their change log entries moved to 0.23 or will need to be picked into 0.22 final? (You don't necessarily need to keep track, just to be aware when selecting commits for release)

@NicolasHug
Copy link
Member

@jnothman @TomDLT , this PR allows passing sequences of different lengths to the GP estimators.

The HMM module was deprecated because supporting sequences was considered out of scope for scikit-learn (API compatibility issues, etc).

Can you explain why you think it's worth supporting this for GPs?

@yhtang
Copy link
Contributor Author

yhtang commented Jan 15, 2020

@NicolasHug Hi Nicolas, the primary motivation that I brought up the PR was due to a need to perform Gaussian process regressions on an ensemble of graphs. While the PR intends to bridge scikit-learn's GPR module to data including and beyond variable-length sequences, the actual changes involves no more than allowing non-vectorial data to be passed through the GP regressors/classifiers --- without being touched at all.

Thanks to the kernel trick, as discussed in the original issue, the logic of computing the kernel matrix from samples is delegated to a kernel, which will be user-supplied for sequence and generic data. As such, I would argue that this introduces a very minimal amount of burden to the development and maintenance of the GP module and does not disrupt the API and/or future development, while at the same time greatly extends the applicability of the module.

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GaussianProcessRegressor unnecessarily forcing training samples to be array_like
4 participants