DOC add cross-reference to examples instead of duplicating content for GPR #20003
@@ -1,5 +1,3 @@
 
 .. _gaussian_process:
 
 ==================
@@ -8,7 +6,7 @@ Gaussian Processes
 
 .. currentmodule:: sklearn.gaussian_process
 
-**Gaussian Processes (GP)** are a generic supervised learning method designed
+**Gaussian Processes (GP)** are a nonparametric supervised learning method used
 to solve *regression* and *probabilistic classification* problems.
 
 The advantages of Gaussian processes are:
@@ -27,8 +25,8 @@ The advantages of Gaussian processes are:
 
 The disadvantages of Gaussian processes include:
 
-- They are not sparse, i.e., they use the whole samples/features information to
-  perform the prediction.
+- Our implementation is not sparse, i.e., they use the whole samples/features
+  information to perform the prediction.

Review comment: Sparse Gaussian Processes are a thing, just not in scikit-learn. One defines a set of inducing points (smaller than the data) and uses them for learning the GP instead of the full data set.

 - They lose efficiency in high dimensional spaces -- namely when the number
   of features exceeds a few dozens.
@@ -42,31 +40,44 @@ Gaussian Process Regression (GPR)
 
 .. currentmodule:: sklearn.gaussian_process
 
 The :class:`GaussianProcessRegressor` implements Gaussian processes (GP) for
-regression purposes. For this, the prior of the GP needs to be specified. The
-prior mean is assumed to be constant and zero (for ``normalize_y=False``) or the
-training data's mean (for ``normalize_y=True``). The prior's
-covariance is specified by passing a :ref:`kernel <gp_kernels>` object. The
-hyperparameters of the kernel are optimized during fitting of
-GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based
-on the passed ``optimizer``. As the LML may have multiple local optima, the
-optimizer can be started repeatedly by specifying ``n_restarts_optimizer``. The
-first run is always conducted starting from the initial hyperparameter values
-of the kernel; subsequent runs are conducted from hyperparameter values
-that have been chosen randomly from the range of allowed values.
-If the initial hyperparameters should be kept fixed, `None` can be passed as
-optimizer.
+regression purposes. For this, the prior of the GP needs to be specified. GP
+will combine this prior and the likelihood function based on training samples.
+It allows to give a probabilistic approach to prediction by giving the mean and
+standard deviation as output when predicting.

-The noise level in the targets can be specified by passing it via the
-parameter ``alpha``, either globally as a scalar or per datapoint.
-Note that a moderate noise level can also be helpful for dealing with numeric
-issues during fitting as it is effectively implemented as Tikhonov
-regularization, i.e., by adding it to the diagonal of the kernel matrix. An
-alternative to specifying the noise level explicitly is to include a
-WhiteKernel component into the kernel, which can estimate the global noise
-level from the data (see example below).
+.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_gpr_noisy_targets_002.png
+   :target: ../auto_examples/gaussian_process/plot_gpr_noisy_targets.html
+   :align: center
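A minimal sketch of the probabilistic prediction described above, with toy one-dimensional data and an assumed RBF kernel (both illustrative, not taken from the linked example)::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # toy training data: a noiseless sine curve
    X_train = np.linspace(0, 10, 20).reshape(-1, 1)
    y_train = np.sin(X_train).ravel()

    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_train, y_train)

    # the posterior predictive mean and standard deviation at new points
    X_test = np.linspace(0, 10, 100).reshape(-1, 1)
    y_mean, y_std = gpr.predict(X_test, return_std=True)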
 
+The prior mean is assumed to be constant and zero (for `normalize_y=False`) or
+the training data's mean (for `normalize_y=True`). The prior's covariance is
+specified by passing a :ref:`kernel <gp_kernels>` object. The hyperparameters
+of the kernel are optimized when fitting the :class:`GaussianProcessRegressor`
+by maximizing the log-marginal-likelihood (LML) based on the passed
+`optimizer`. As the LML may have multiple local optima, the optimizer can be
+started repeatedly by specifying `n_restarts_optimizer`. The first run is
+always conducted starting from the initial hyperparameter values of the kernel;
+subsequent runs are conducted from hyperparameter values that have been chosen
+randomly from the range of allowed values. If the initial hyperparameters
+should be kept fixed, `None` can be passed as optimizer.
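A short sketch of the two optimizer settings described above, with placeholder data and an assumed RBF kernel::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, (30, 1))
    y = np.sin(3 * X).ravel()

    # restart the LML optimization from 5 additional random initializations
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                   n_restarts_optimizer=5,
                                   random_state=0).fit(X, y)
    print(gpr.kernel_)        # kernel with optimized hyperparameters

    # pass optimizer=None to keep the initial hyperparameters fixed
    gpr_fixed = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                         optimizer=None).fit(X, y)
    print(gpr_fixed.kernel_)  # unchanged: RBF(length_scale=1)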
 
+The noise level in the targets can be specified by passing it via the parameter
+`alpha`, either globally as a scalar or per datapoint. Note that a moderate
+noise level can also be helpful for dealing with numeric instabilities during
+fitting as it is effectively implemented as Tikhonov regularization, i.e., by
+adding it to the diagonal of the kernel matrix. An alternative to specifying
+the noise level explicitly is to include a
+:class:`~sklearn.gaussian_process.kernels.WhiteKernel` component into the
+kernel, which can estimate the global noise level from the data (see example
+below). The figure below shows the effect of noisy target handled by setting
+the parameter `alpha`.
+
+.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_gpr_noisy_targets_003.png
+   :target: ../auto_examples/gaussian_process/plot_gpr_noisy_targets.html
+   :align: center
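Both ways of handling the noise level can be sketched as follows; the noise level and toy data are illustrative assumptions::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, (40, 1))
    noise_std = 0.5
    y = np.sin(3 * X).ravel() + rng.normal(0, noise_std, X.shape[0])

    # known noise level passed via `alpha` (added to the diagonal of the kernel matrix)
    gpr_alpha = GaussianProcessRegressor(kernel=RBF(), alpha=noise_std**2).fit(X, y)

    # unknown noise level estimated from the data by a WhiteKernel component
    kernel = RBF() + WhiteKernel(noise_level=1.0)
    gpr_white = GaussianProcessRegressor(kernel=kernel).fit(X, y)
    print(gpr_white.kernel_.k2.noise_level)  # fitted global noise level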
 
 The implementation is based on Algorithm 2.1 of [RW2006]_. In addition to
-the API of standard scikit-learn estimators, GaussianProcessRegressor:
+the API of standard scikit-learn estimators, :class:`GaussianProcessRegressor`:
 
 * allows prediction without prior fitting (based on the GP prior)
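For illustration, an unfitted :class:`GaussianProcessRegressor` already returns predictions from the GP prior; a minimal sketch with an arbitrary kernel and query grid::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    X_plot = np.linspace(0, 5, 50).reshape(-1, 1)
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))

    # no call to `fit`: mean and standard deviation come from the prior
    prior_mean, prior_std = gpr.predict(X_plot, return_std=True)

    # draws from the prior distribution over functions
    prior_samples = gpr.sample_y(X_plot, n_samples=3, random_state=0)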
@@ -77,149 +88,12 @@ the API of standard scikit-learn estimators, GaussianProcessRegressor:
 externally for other ways of selecting hyperparameters, e.g., via
 Markov chain Monte Carlo.
 
 .. topic:: Examples

Review comment: maybe rename to

Review comment: This is a topic and not the name of the section. I don't think you can use it for cross-referencing.

-GPR examples
-============

-GPR with noise-level estimation
--------------------------------
-This example illustrates that GPR with a sum-kernel including a WhiteKernel can
-estimate the noise level of data. An illustration of the
-log-marginal-likelihood (LML) landscape shows that there exist two local
-maxima of LML.

-.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_gpr_noisy_003.png
-   :target: ../auto_examples/gaussian_process/plot_gpr_noisy.html
-   :align: center

-The first corresponds to a model with a high noise level and a
-large length scale, which explains all variations in the data by noise.

-.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_gpr_noisy_004.png
-   :target: ../auto_examples/gaussian_process/plot_gpr_noisy.html
-   :align: center

-The second one has a smaller noise level and shorter length scale, which explains
-most of the variation by the noise-free functional relationship. The second
-model has a higher likelihood; however, depending on the initial value for the
-hyperparameters, the gradient-based optimization might also converge to the
-high-noise solution. It is thus important to repeat the optimization several
-times for different initializations.

-.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_gpr_noisy_005.png
-   :target: ../auto_examples/gaussian_process/plot_gpr_noisy.html
-   :align: center
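The removed prose above describes the plot_gpr_noisy example; a rough sketch of that setup, with placeholder data and initial hyperparameters rather than the example's exact settings, could look like::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, (30, 1))
    y = 0.5 * np.sin(3 * X).ravel() + rng.normal(0, 0.3, X.shape[0])

    # sum-kernel with a WhiteKernel component to estimate the noise level
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)

    # several restarts reduce the risk of ending up in the high-noise local maximum of the LML
    gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9,
                                   random_state=0).fit(X, y)
    print(gpr.kernel_)                                     # fitted kernel incl. noise level
    print(gpr.log_marginal_likelihood(gpr.kernel_.theta))  # LML at the fitted hyperparameters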
 
-Comparison of GPR and Kernel Ridge Regression
----------------------------------------------

-Both kernel ridge regression (KRR) and GPR learn
-a target function by employing internally the "kernel trick". KRR learns a
-linear function in the space induced by the respective kernel which corresponds
-to a non-linear function in the original space. The linear function in the
-kernel space is chosen based on the mean-squared error loss with
-ridge regularization. GPR uses the kernel to define the covariance of
-a prior distribution over the target functions and uses the observed training
-data to define a likelihood function. Based on Bayes theorem, a (Gaussian)
-posterior distribution over target functions is defined, whose mean is used
-for prediction.

-A major difference is that GPR can choose the kernel's hyperparameters based
-on gradient-ascent on the marginal likelihood function while KRR needs to
-perform a grid search on a cross-validated loss function (mean-squared error
-loss). A further difference is that GPR learns a generative, probabilistic
-model of the target function and can thus provide meaningful confidence
-intervals and posterior samples along with the predictions while KRR only
-provides predictions.

-The following figure illustrates both methods on an artificial dataset, which
-consists of a sinusoidal target function and strong noise. The figure compares
-the learned model of KRR and GPR based on a ExpSineSquared kernel, which is
-suited for learning periodic functions. The kernel's hyperparameters control
-the smoothness (length_scale) and periodicity of the kernel (periodicity).
-Moreover, the noise level
-of the data is learned explicitly by GPR by an additional WhiteKernel component
-in the kernel and by the regularization parameter alpha of KRR.

-.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_compare_gpr_krr_005.png
-   :target: ../auto_examples/gaussian_process/plot_compare_gpr_krr.html
-   :align: center

-The figure shows that both methods learn reasonable models of the target
-function. GPR provides reasonable confidence bounds on the prediction which are not
-available for KRR. A major difference between the two methods is the time
-required for fitting and predicting: while fitting KRR is fast in principle,
-the grid-search for hyperparameter optimization scales exponentially with the
-number of hyperparameters ("curse of dimensionality"). The gradient-based
-optimization of the parameters in GPR does not suffer from this exponential
-scaling and is thus considerably faster on this example with 3-dimensional
-hyperparameter space. The time for predicting is similar; however, generating
-the variance of the predictive distribution of GPR takes considerably longer
-than just predicting the mean.
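The removed comparison above can be sketched as follows; the sinusoidal toy dataset and the search grids are illustrative assumptions, not the exact settings of plot_compare_gpr_krr::

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

    rng = np.random.RandomState(0)
    X = 15 * rng.rand(100, 1)
    y = np.sin(X).ravel() + rng.normal(0, 0.5, X.shape[0])

    # KRR: hyperparameters selected by cross-validated grid search
    krr = GridSearchCV(
        KernelRidge(kernel=ExpSineSquared()),
        param_grid={"alpha": [1e0, 1e-1, 1e-2],
                    "kernel__length_scale": [0.1, 1.0, 10.0],
                    "kernel__periodicity": [1.0, 3.0, 6.0]},
    ).fit(X, y)

    # GPR: hyperparameters (including the noise level) fitted by maximizing the LML
    kernel = ExpSineSquared(length_scale=1.0, periodicity=3.0) + WhiteKernel(noise_level=1.0)
    gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

    # GPR additionally returns an uncertainty estimate; KRR's predict gives only the mean
    y_mean, y_std = gpr.predict(X, return_std=True)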
 
-GPR on Mauna Loa CO2 data
--------------------------

-This example is based on Section 5.4.3 of [RW2006]_.
-It illustrates an example of complex kernel engineering and
-hyperparameter optimization using gradient ascent on the
-log-marginal-likelihood. The data consists of the monthly average atmospheric
-CO2 concentrations (in parts per million by volume (ppmv)) collected at the
-Mauna Loa Observatory in Hawaii, between 1958 and 1997. The objective is to
-model the CO2 concentration as a function of the time t.

-The kernel is composed of several terms that are responsible for explaining
-different properties of the signal:

-- a long term, smooth rising trend is to be explained by an RBF kernel. The
-  RBF kernel with a large length-scale enforces this component to be smooth;
-  it is not enforced that the trend is rising which leaves this choice to the
-  GP. The specific length-scale and the amplitude are free hyperparameters.

-- a seasonal component, which is to be explained by the periodic
-  ExpSineSquared kernel with a fixed periodicity of 1 year. The length-scale
-  of this periodic component, controlling its smoothness, is a free parameter.
-  In order to allow decaying away from exact periodicity, the product with an
-  RBF kernel is taken. The length-scale of this RBF component controls the
-  decay time and is a further free parameter.

-- smaller, medium term irregularities are to be explained by a
-  RationalQuadratic kernel component, whose length-scale and alpha parameter,
-  which determines the diffuseness of the length-scales, are to be determined.
-  According to [RW2006]_, these irregularities can better be explained by
-  a RationalQuadratic than an RBF kernel component, probably because it can
-  accommodate several length-scales.

-- a "noise" term, consisting of an RBF kernel contribution, which shall
-  explain the correlated noise components such as local weather phenomena,
-  and a WhiteKernel contribution for the white noise. The relative amplitudes
-  and the RBF's length scale are further free parameters.

-Maximizing the log-marginal-likelihood after subtracting the target's mean
-yields the following kernel with an LML of -83.214:

-::

-    34.4**2 * RBF(length_scale=41.8)
-    + 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44,
-                                                       periodicity=1)
-    + 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
-    + 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)
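The kernel quoted above can be composed directly from the kernel classes; a sketch of the composition, reusing the fitted values printed above::

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import (RBF, ExpSineSquared,
                                                  RationalQuadratic, WhiteKernel)

    # long-term smooth rising trend
    long_term = 34.4**2 * RBF(length_scale=41.8)
    # seasonal component, allowed to decay away from exact periodicity
    seasonal = 3.27**2 * RBF(length_scale=180.0) * ExpSineSquared(length_scale=1.44,
                                                                  periodicity=1.0)
    # smaller, medium-term irregularities
    irregularities = 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
    # correlated noise plus white noise
    noise = 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)

    kernel = long_term + seasonal + irregularities + noise
    gpr = GaussianProcessRegressor(kernel=kernel)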
 
-Thus, most of the target signal (34.4ppm) is explained by a long-term rising
-trend (length-scale 41.8 years). The periodic component has an amplitude of
-3.27ppm, a decay time of 180 years and a length-scale of 1.44. The long decay
-time indicates that we have a locally very close to periodic seasonal
-component. The correlated noise has an amplitude of 0.197ppm with a length
-scale of 0.138 years and a white-noise contribution of 0.197ppm. Thus, the
-overall noise level is very small, indicating that the data can be very well
-explained by the model. The figure shows also that the model makes very
-confident predictions until around 2015.

-.. figure:: ../auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_003.png
-   :target: ../auto_examples/gaussian_process/plot_gpr_co2.html
-   :align: center
+* :ref:`sphx_glr_auto_examples_gaussian_process_plot_gpr_noisy_targets.py`
+* :ref:`sphx_glr_auto_examples_gaussian_process_plot_gpr_noisy.py`
+* :ref:`sphx_glr_auto_examples_gaussian_process_plot_compare_gpr_krr.py`
+* :ref:`sphx_glr_auto_examples_gaussian_process_plot_gpr_co2.py`
 
 .. _gpc:
Review comment: I would strongly encourage including the term NONPARAMETRIC when discussing GPs. I would rephrase the opening (lines 9-11):

   Gaussian Processes (GP) are a nonparametric supervised learning method used
   to solve regression and probabilistic classification problems.