[MRG+2] Allowing Gaussian process kernels on structured data - updated #15557

yhtang · 2019-11-07T00:03:59Z

This PR is a from-scratch refresh of #13954 (with all previous modifications incorporated) based on the latest master branch.

Made the Gaussian process regressor and classifier compatible with generic kernels on arbitrary data. Specifically:

Added a requires_vector_input property to all the built-in GP kernels in gaussian_process/kernels.py.
Added a GenericKernelMixin base class that overrides the requires_vector_input property for kernels that can work on structured and/or generic data.
Let the GP regressor and classifier use different check_array and check_X_y parameters for vector and generic input; i.e., do not force the X and Y arrays to be at least 2D and numeric if the kernel can operate on generic input. Updated the docstrings accordingly.
Provided a minimal example of GP regression and classification using a sequence kernel on variable-length strings in plot_gpr_on_structured_data.py.
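
For readers skimming the thread, here is a rough sketch of how the pieces listed above might fit together. It is not the code from plot_gpr_on_structured_data.py: the SequenceKernel class, its baseline_similarity parameter, and the toy data are illustrative assumptions. Only Kernel, GenericKernelMixin, Hyperparameter, RBF, and GaussianProcessRegressor are taken from scikit-learn, and the sketch assumes a release that contains this PR.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF,
    GenericKernelMixin,
    Hyperparameter,
    Kernel,
)


class SequenceKernel(GenericKernelMixin, Kernel):
    """Toy kernel on variable-length strings (illustrative assumption)."""

    def __init__(self, baseline_similarity=0.5,
                 baseline_similarity_bounds=(1e-5, 1)):
        self.baseline_similarity = baseline_similarity
        self.baseline_similarity_bounds = baseline_similarity_bounds

    @property
    def hyperparameter_baseline_similarity(self):
        return Hyperparameter(
            "baseline_similarity", "numeric", self.baseline_similarity_bounds
        )

    def _similarity(self, s1, s2):
        # Sum a per-character similarity over all pairs of characters.
        return sum(
            1.0 if c1 == c2 else self.baseline_similarity
            for c1 in s1 for c2 in s2
        )

    def _mismatches(self, s1, s2):
        # Number of character pairs that differ.
        return sum(1.0 for c1 in s1 for c2 in s2 if c1 != c2)

    def __call__(self, X, Y=None, eval_gradient=False):
        if Y is None:
            Y = X
        K = np.array([[self._similarity(x, y) for y in Y] for x in X])
        if eval_gradient:
            # Gradient with respect to log(baseline_similarity).
            grad = self.baseline_similarity * np.array(
                [[[self._mismatches(x, y)] for y in Y] for x in X]
            )
            return K, grad
        return K

    def diag(self, X):
        return np.array([self._similarity(x, x) for x in X])

    def is_stationary(self):
        return False


# Built-in kernels still ask for numeric 2D input; the mixin opts out.
print(RBF().requires_vector_input)
print(SequenceKernel().requires_vector_input)

# Strings of different lengths are passed to the estimator as a 1D array.
X_train = np.array(["AGCT", "AGC", "AAA"])
y_train = np.array([1.0, 1.0, 3.0])

kernel = SequenceKernel()
K = kernel(X_train)

# optimizer=None keeps the sketch deterministic (no hyperparameter search).
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, optimizer=None)
gpr.fit(X_train, y_train)
pred = gpr.predict(X_train)
```

Because the kernel reports requires_vector_input as False, the estimator skips the "at least 2D and numeric" validation and hands the strings to the kernel unchanged.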

jnothman · 2019-11-07T07:39:44Z

Why do you regard this as WIP? What work remains other than review?

yhtang · 2019-11-07T07:41:02Z

> Why do you regard this as WIP? What work remains other than review?

I have one more commit that updates the changelog, but it has not been pushed yet : ) It should be there in a few minutes.

jnothman

this is looking good to me

examples/gaussian_process/plot_gpr_on_structured_data.py

yhtang · 2019-11-07T08:19:26Z

@jnothman How can I restart a pipeline that failed because of a network error?

yhtang · 2019-11-07T21:36:42Z

@jnothman Could you please let me know what the next steps are?

jnothman · 2019-11-07T22:35:06Z

Await another reviewer... unfortunately, this may miss the cutoff for 0.22.

yhtang · 2019-11-08T00:03:54Z

> Await another reviewer... unfortunately, this may miss the cutoff for 0.22.

I see... Do I need to update the PR name to MRG + 1 etc.?
Thanks very much anyway!

TomDLT

Only nitpicks, thanks for this nice contribution !

examples/gaussian_process/plot_gpr_on_structured_data.py

sklearn/base.py

sklearn/gaussian_process/_gpc.py

examples/gaussian_process/plot_gpr_on_structured_data.py

sklearn/gaussian_process/_gpr.py

sklearn/gaussian_process/kernels.py

sklearn/gaussian_process/tests/_mini_sequence_kernel.py

sklearn/gaussian_process/tests/test_kernels.py

sklearn/gaussian_process/tests/_mini_sequence_kernel.py

yhtang · 2019-11-15T23:48:17Z

> Only nitpicks, thanks for this nice contribution !

Thanks for reviewing!

jnothman · 2019-11-17T07:43:00Z

@adrinjalali are you keeping track of things that will either need their change log entries moved to 0.23 or will need to be picked into 0.22 final? (You don't necessarily need to keep track, just to be aware when selecting commits for release)

NicolasHug · 2020-01-14T14:42:53Z

@jnothman @TomDLT, this PR allows passing sequences of different lengths to the GP estimators.

The HMM module was deprecated because supporting sequences was considered out of scope for scikit-learn (API compatibility issues, etc).

Can you explain why you think it's worth supporting this for GPs?

yhtang · 2020-01-15T19:01:17Z

@NicolasHug Hi Nicolas, my primary motivation for this PR was the need to perform Gaussian process regression on an ensemble of graphs. While the PR intends to bridge scikit-learn's GPR module to data including and beyond variable-length sequences, the actual changes involve no more than allowing non-vectorial data to be passed through the GP regressors/classifiers without being touched at all.

Thanks to the kernel trick, as discussed in the original issue, the logic of computing the kernel matrix from samples is delegated to a kernel, which is user-supplied for sequence and generic data. As such, I would argue that this adds very little burden to the development and maintenance of the GP module and does not disrupt the API or future development, while greatly extending the applicability of the module.
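
To make the delegation argument concrete, the following is a minimal NumPy-only sketch, not scikit-learn's implementation. The edge-overlap kernel, the frozenset-of-edges graph encoding, and the function names are all hypothetical. The point it tries to show is that the GP posterior mean touches the samples only through kernel evaluations, so the samples can be arbitrary Python objects.

```python
import numpy as np


def edge_overlap_kernel(g1, g2):
    """Hypothetical user-supplied kernel: number of shared edges.

    Graphs are encoded as frozensets of (u, v) tuples, so this is an inner
    product of edge-indicator vectors and hence positive semi-definite.
    """
    return float(len(g1 & g2))


def gram(kernel, X, Y):
    # The only place where samples are inspected: pairwise kernel calls.
    return np.array([[kernel(x, y) for y in Y] for x in X])


def gp_posterior_mean(kernel, X_train, y_train, X_test, alpha=1e-6):
    # Standard GP regression algebra on the kernel matrix alone.
    K = gram(kernel, X_train, X_train) + alpha * np.eye(len(X_train))
    L = np.linalg.cholesky(K)
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return gram(kernel, X_test, X_train) @ weights


graphs = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(0, 1), (2, 3)}),
    frozenset({(1, 2), (2, 3), (0, 3)}),
]
targets = np.array([0.5, -1.0, 2.0])

mean_at_train = gp_posterior_mean(edge_overlap_kernel, graphs, targets, graphs)
```

Swapping in a different object type only requires a different kernel function; the regression code itself stays the same.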

yhtang added 12 commits November 6, 2019 16:01

verify CI

685d980

verify CI

797cab5

Implemented mixins and property propagations for differentiating between kernels that operate on explicit feature vectors or structured objects.

492a557

name changes and updates to tests

0196550

updated the GPR and GPC models for generic kernels

1429e63

updated corresponding tests

fixed some lint errors

8cb0fc8

adding a missed new file...

62d35b3

added an example of using structured kernel to train GPR/GPC models on variable-length sequences

aad9cbd

Let the requires_vector_input() method return True by default in the base `Kernel` class. Removed `VectorKernelMixin` which is redundant in presence of a default `requires_vector_input()` method.

1386a3f

use or instead of np.any for binary comparisons

7356de3

made requires_vector_input a property instead of a method

3249778

added a record of this new feature to the documentation

90c3807

yhtang mentioned this pull request Nov 7, 2019

[MRG]Allow Gaussian process kernels on structured data #13936 #13954

Closed

jnothman reviewed Nov 7, 2019

yhtang and others added 6 commits November 6, 2019 23:51

updating docstrings

f89bb5f

Case change in docstring.

17d82ad

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

used rst's footnote syntax for references in docstring.

0595126

added section headings to examples

b704da6

sticks to the state-machine interface of MPL in the new example.

8391b56

fixed some lint errors.

8ecdf0a

jnothman approved these changes Nov 7, 2019

yhtang changed the title ~~[WIP] Allowing Gaussian process kernels on structured data - updated~~ [MRG] Allowing Gaussian process kernels on structured data - updated Nov 7, 2019

yhtang added 2 commits November 7, 2019 00:58

fixed MPL xticks() compatibility issue.

f93f5c4

documentation update

6003644

jnothman added the Waiting for Reviewer label Nov 7, 2019

TomDLT approved these changes Nov 15, 2019

yhtang and others added 4 commits November 15, 2019 14:25

use non-transparent scatter points

d193041

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

Apply suggestions from code review

c016981

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

Address discussion at scikit-learn#15557 (comment)

57da4d6

lint

208b5ce

TomDLT approved these changes Nov 15, 2019

sklearn/gaussian_process/tests/_mini_sequence_kernel.py (outdated)

Apply suggestions from code review

28904c0

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

yhtang changed the title ~~[MRG] Allowing Gaussian process kernels on structured data - updated~~ [MRG+2] Allowing Gaussian process kernels on structured data - updated Nov 16, 2019

TomDLT merged commit 8360786 into scikit-learn:master Nov 16, 2019

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 18, 2019

ENH Allowing Gaussian process kernels on structured data (scikit-learn#15557)

60df404

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 18, 2019

ENH Allowing Gaussian process kernels on structured data (scikit-learn#15557)

b36c673

adrinjalali pushed a commit that referenced this pull request Nov 19, 2019

ENH Allowing Gaussian process kernels on structured data (#15557)

d970dae

NicolasHug mentioned this pull request Jan 14, 2020

MNT Introduction of n_features_in_ attr with _validate_data mtd #16112

Merged

adrinjalali mentioned this pull request Jan 15, 2020

API: do we allow a list of generic objects as X? #16130

Open

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

ENH Allowing Gaussian process kernels on structured data (scikit-learn#15557)

400eeab

ogrisel mentioned this pull request May 6, 2025

Fix ConvergenceWarning in plot_gpr_on_structured_data.py example #31164

Open
