Commit 62372a9

StefanieSenger authored and jeremiedbb committed
DOC Notebook style and enhanced descriptions in SVM kernels example (#26805)
1 parent c36eab0 commit 62372a9

3 files changed: +274 -80 lines changed

doc/tutorial/statistical_inference/supervised_learning.rst (+13 -3)
@@ -465,7 +465,7 @@ Linear kernel

    >>> svc = svm.SVC(kernel='linear')

-.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_001.png
+.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_002.png
   :target: ../../auto_examples/svm/plot_svm_kernels.html

Polynomial kernel
@@ -477,7 +477,7 @@ Polynomial kernel
    ... degree=3)
    >>> # degree: polynomial degree

-.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_002.png
+.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_003.png
   :target: ../../auto_examples/svm/plot_svm_kernels.html

RBF kernel (Radial Basis Function)
@@ -489,7 +489,17 @@ RBF kernel (Radial Basis Function)
    >>> # gamma: inverse of size of
    >>> # radial kernel

-.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_003.png
+.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_004.png
+   :target: ../../auto_examples/svm/plot_svm_kernels.html
+
+Sigmoid kernel
+^^^^^^^^^^^^^^
+
+::
+
+    >>> svc = svm.SVC(kernel='sigmoid')
+
+.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_005.png
   :target: ../../auto_examples/svm/plot_svm_kernels.html

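The doc snippet above exercises `svm.SVC` with the four kernel names that the updated example now covers. As a quick, self-contained illustration (an editor's sketch, not part of this commit), the code below fits an SVC with each of those kernels on a synthetic dataset generated with `make_classification`; the dataset and the printed training scores are assumptions for demonstration only.

# Editor's sketch: fit an SVC with each kernel named in the doc change above.
from sklearn import datasets, svm

# Hypothetical toy data; the example file itself uses a hand-crafted 16-sample set.
X, y = datasets.make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = svm.SVC(kernel=kernel).fit(X, y)
    print(f"{kernel:>8}: training accuracy = {clf.score(X, y):.2f}")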
examples/svm/plot_svm_kernels.py (+250 -70)
@@ -1,93 +1,273 @@
"""
=========================================================
-SVM-Kernels
+Plot classification boundaries with different SVM Kernels
=========================================================
+This example shows how different kernels in a :class:`~sklearn.svm.SVC` (Support Vector
+Classifier) influence the classification boundaries in a binary, two-dimensional
+classification problem.

-Three different types of SVM-Kernels are displayed below.
-The polynomial and RBF are especially useful when the
-data-points are not linearly separable.
+SVCs aim to find a hyperplane that effectively separates the classes in their training
+data by maximizing the margin between the outermost data points of each class. This is
+achieved by finding the best weight vector :math:`w` that defines the decision boundary
+hyperplane and minimizes the sum of hinge losses for misclassified samples, as measured
+by the :func:`~sklearn.metrics.hinge_loss` function. By default, regularization is
+applied with the parameter `C=1`, which allows for a certain degree of misclassification
+tolerance.

+If the data is not linearly separable in the original feature space, a non-linear kernel
+parameter can be set. Depending on the kernel, the process involves adding new features
+or transforming existing features to enrich and potentially add meaning to the data.
+When a kernel other than `"linear"` is set, the SVC applies the `kernel trick
+<https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick>`__, which
+computes the similarity between pairs of data points using the kernel function without
+explicitly transforming the entire dataset. The kernel trick surpasses the otherwise
+necessary matrix transformation of the whole dataset by only considering the relations
+between all pairs of data points. The kernel function maps two vectors (each pair of
+observations) to their similarity using their dot product.

+The hyperplane can then be calculated using the kernel function as if the dataset were
+represented in a higher-dimensional space. Using a kernel function instead of an
+explicit matrix transformation improves performance, as the kernel function has a time
+complexity of :math:`O({n}^2)`, whereas matrix transformation scales according to the
+specific transformation being applied.
+
+In this example, we compare the most common kernel types of Support Vector Machines: the
+linear kernel (`"linear"`), the polynomial kernel (`"poly"`), the radial basis function
+kernel (`"rbf"`) and the sigmoid kernel (`"sigmoid"`).
"""

# Code source: Gaël Varoquaux
# License: BSD 3 clause

+# %%
+# Creating a dataset
+# ------------------
+# We create a two-dimensional classification dataset with 16 samples and two classes. We
+# plot the samples with the colors matching their respective targets.
import matplotlib.pyplot as plt
import numpy as np

+X = np.array(
+    [
+        [0.4, -0.7],
+        [-1.5, -1.0],
+        [-1.4, -0.9],
+        [-1.3, -1.2],
+        [-1.1, -0.2],
+        [-1.2, -0.4],
+        [-0.5, 1.2],
+        [-1.5, 2.1],
+        [1.0, 1.0],
+        [1.3, 0.8],
+        [1.2, 0.5],
+        [0.2, -2.0],
+        [0.5, -2.4],
+        [0.2, -2.3],
+        [0.0, -2.7],
+        [1.3, 2.1],
+    ]
+)
+
+y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
+
+# Plotting settings
+fig, ax = plt.subplots(figsize=(4, 3))
+x_min, x_max, y_min, y_max = -3, 3, -3, 3
+ax.set(xlim=(x_min, x_max), ylim=(y_min, y_max))
+
+# Plot samples by color and add legend
+scatter = ax.scatter(X[:, 0], X[:, 1], s=150, c=y, label=y, edgecolors="k")
+ax.legend(*scatter.legend_elements(), loc="upper right", title="Classes")
+ax.set_title("Samples in two-dimensional feature space")
+_ = plt.show()
+
+# %%
+# We can see that the samples are not clearly separable by a straight line.
+#
+# Training SVC model and plotting decision boundaries
+# ---------------------------------------------------
+# We define a function that fits a :class:`~sklearn.svm.SVC` classifier,
+# allowing the `kernel` parameter as an input, and then plots the decision
+# boundaries learned by the model using
+# :class:`~sklearn.inspection.DecisionBoundaryDisplay`.
+#
+# Notice that for the sake of simplicity, the `C` parameter is set to its
+# default value (`C=1`) in this example and the `gamma` parameter is set to
+# `gamma=2` across all kernels, although it is automatically ignored for the
+# linear kernel. In a real classification task, where performance matters,
+# parameter tuning (by using :class:`~sklearn.model_selection.GridSearchCV` for
+# instance) is highly recommended to capture different structures within the
+# data.
+#
+# Setting `response_method="predict"` in
+# :class:`~sklearn.inspection.DecisionBoundaryDisplay` colors the areas based
+# on their predicted class. Using `response_method="decision_function"` allows
+# us to also plot the decision boundary and the margins to both sides of it.
+# Finally, the support vectors used during training (which always lie on the
+# margins) are identified by means of the `support_vectors_` attribute of
+# the trained SVCs, and plotted as well.
from sklearn import svm
+from sklearn.inspection import DecisionBoundaryDisplay
+
+
+def plot_training_data_with_decision_boundary(kernel):
+    # Train the SVC
+    clf = svm.SVC(kernel=kernel, gamma=2).fit(X, y)
+
+    # Settings for plotting
+    _, ax = plt.subplots(figsize=(4, 3))
+    x_min, x_max, y_min, y_max = -3, 3, -3, 3
+    ax.set(xlim=(x_min, x_max), ylim=(y_min, y_max))
+
+    # Plot decision boundary and margins
+    common_params = {"estimator": clf, "X": X, "ax": ax}
+    DecisionBoundaryDisplay.from_estimator(
+        **common_params,
+        response_method="predict",
+        plot_method="pcolormesh",
+        alpha=0.3,
+    )
+    DecisionBoundaryDisplay.from_estimator(
+        **common_params,
+        response_method="decision_function",
+        plot_method="contour",
+        levels=[-1, 0, 1],
+        colors=["k", "k", "k"],
+        linestyles=["--", "-", "--"],
+    )

-# Our dataset and targets
-X = np.c_[
-    (0.4, -0.7),
-    (-1.5, -1),
-    (-1.4, -0.9),
-    (-1.3, -1.2),
-    (-1.1, -0.2),
-    (-1.2, -0.4),
-    (-0.5, 1.2),
-    (-1.5, 2.1),
-    (1, 1),
-    # --
-    (1.3, 0.8),
-    (1.2, 0.5),
-    (0.2, -2),
-    (0.5, -2.4),
-    (0.2, -2.3),
-    (0, -2.7),
-    (1.3, 2.1),
-].T
-Y = [0] * 8 + [1] * 8
-
-# figure number
-fignum = 1
-
-# fit the model
-for kernel in ("linear", "poly", "rbf"):
-    clf = svm.SVC(kernel=kernel, gamma=2)
-    clf.fit(X, Y)
-
-    # plot the line, the points, and the nearest vectors to the plane
-    plt.figure(fignum, figsize=(4, 3))
-    plt.clf()
-
-    plt.scatter(
+    # Plot bigger circles around samples that serve as support vectors
+    ax.scatter(
        clf.support_vectors_[:, 0],
        clf.support_vectors_[:, 1],
-        s=80,
+        s=250,
        facecolors="none",
-        zorder=10,
        edgecolors="k",
    )
-    plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired, edgecolors="k")
-
-    plt.axis("tight")
-    x_min = -3
-    x_max = 3
-    y_min = -3
-    y_max = 3
-
-    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
-    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
-
-    # Put the result into a color plot
-    Z = Z.reshape(XX.shape)
-    plt.figure(fignum, figsize=(4, 3))
-    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
-    plt.contour(
-        XX,
-        YY,
-        Z,
-        colors=["k", "k", "k"],
-        linestyles=["--", "-", "--"],
-        levels=[-0.5, 0, 0.5],
-    )
+    # Plot samples by color and add legend
+    ax.scatter(X[:, 0], X[:, 1], c=y, s=150, edgecolors="k")
+    ax.legend(*scatter.legend_elements(), loc="upper right", title="Classes")
+    ax.set_title(f" Decision boundaries of {kernel} kernel in SVC")
+
+    _ = plt.show()
+
+
+# %%
+# Linear kernel
+# *************
+# Linear kernel is the dot product of the input samples:
+#
+# .. math:: K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^\top \mathbf{x}_2
+#
+# It is then applied to any combination of two data points (samples) in the
+# dataset. The dot product of the two points determines the
+# :func:`~sklearn.metrics.pairwise.cosine_similarity` between both points. The
+# higher the value, the more similar the points are.
+plot_training_data_with_decision_boundary("linear")
+
+# %%
+# Training a :class:`~sklearn.svm.SVC` on a linear kernel results in an
+# untransformed feature space, where the hyperplane and the margins are
+# straight lines. Due to the lack of expressivity of the linear kernel, the
+# trained classes do not perfectly capture the training data.
+#
+# Polynomial kernel
+# *****************
+# The polynomial kernel changes the notion of similarity. The kernel function
+# is defined as:
+#
+# .. math::
+#    K(\mathbf{x}_1, \mathbf{x}_2) = (\gamma \cdot \
+#    \mathbf{x}_1^\top\mathbf{x}_2 + r)^d
+#
+# where :math:`{d}` is the degree (`degree`) of the polynomial, :math:`{\gamma}`
+# (`gamma`) controls the influence of each individual training sample on the
+# decision boundary and :math:`{r}` is the bias term (`coef0`) that shifts the
+# data up or down. Here, we use the default value for the degree of the
+# polynomial in the kernel function (`degree=3`). When `coef0=0` (the default),
+# the data is only transformed, but no additional dimension is added. Using a
+# polynomial kernel is equivalent to creating
+# :class:`~sklearn.preprocessing.PolynomialFeatures` and then fitting a
+# :class:`~sklearn.svm.SVC` with a linear kernel on the transformed data,
+# although this alternative approach would be computationally expensive for most
+# datasets.
+plot_training_data_with_decision_boundary("poly")
+
+# %%
+# The polynomial kernel with `gamma=2` adapts well to the training data,
+# causing the margins on both sides of the hyperplane to bend accordingly.
+#
+# RBF kernel
+# **********
+# The radial basis function (RBF) kernel, also known as the Gaussian kernel, is
+# the default kernel for Support Vector Machines in scikit-learn. It measures
+# similarity between two data points in infinite dimensions and then approaches
+# classification by majority vote. The kernel function is defined as:
+#
+# .. math::
+#    K(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\gamma \cdot
+#    {\|\mathbf{x}_1 - \mathbf{x}_2\|^2}\right)
+#
+# where :math:`{\gamma}` (`gamma`) controls the influence of each individual
+# training sample on the decision boundary.
+#
+# The larger the Euclidean distance between two points
+# :math:`\|\mathbf{x}_1 - \mathbf{x}_2\|^2`,
+# the closer the kernel function is to zero. This means that two points far away
+# are more likely to be dissimilar.
+plot_training_data_with_decision_boundary("rbf")
+
+# %%
+# In the plot we can see how the decision boundaries tend to contract around
+# data points that are close to each other.
+#
+# Sigmoid kernel
+# **************
+# The sigmoid kernel function is defined as:
+#
+# .. math::
+#    K(\mathbf{x}_1, \mathbf{x}_2) = \tanh(\gamma \cdot
+#    \mathbf{x}_1^\top\mathbf{x}_2 + r)
+#
+# where the kernel coefficient :math:`{\gamma}` (`gamma`) controls the influence
+# of each individual training sample on the decision boundary and :math:`{r}` is
+# the bias term (`coef0`) that shifts the data up or down.
+#
+# In the sigmoid kernel, the similarity between two data points is computed
+# using the hyperbolic tangent function (:math:`\tanh`). The kernel function
+# scales and possibly shifts the dot product of the two points
+# (:math:`\mathbf{x}_1` and :math:`\mathbf{x}_2`).

-    plt.xlim(x_min, x_max)
-    plt.ylim(y_min, y_max)
+plot_training_data_with_decision_boundary("sigmoid")

-    plt.xticks(())
-    plt.yticks(())
-    fignum = fignum + 1
-plt.show()
+# %%
+# We can see that the decision boundaries obtained with the sigmoid kernel
+# appear curved and irregular. The decision boundary tries to separate the
+# classes by fitting a sigmoid-shaped curve, resulting in a complex boundary
+# that may not generalize well to unseen data. From this example it becomes
+# obvious that the sigmoid kernel has very specific use cases when dealing
+# with data that exhibits a sigmoidal shape. In this example, careful fine
+# tuning might find more generalizable decision boundaries. Because of its
+# specificity, the sigmoid kernel is less commonly used in practice compared to
+# other kernels.
+#
+# Conclusion
+# ----------
+# In this example, we have visualized the decision boundaries trained with the
+# provided dataset. The plots serve as an intuitive demonstration of how
+# different kernels utilize the training data to determine the classification
+# boundaries.
+#
+# The hyperplanes and margins, although computed indirectly, can be imagined as
+# planes in the transformed feature space. However, in the plots, they are
+# represented relative to the original feature space, resulting in curved
+# decision boundaries for the polynomial, RBF, and sigmoid kernels.
+#
+# Please note that the plots do not evaluate the individual kernel's accuracy or
+# quality. They are intended to provide a visual understanding of how the
+# different kernels use the training data.
+#
+# For a comprehensive evaluation, fine-tuning of :class:`~sklearn.svm.SVC`
+# parameters using techniques such as
+# :class:`~sklearn.model_selection.GridSearchCV` is recommended to capture the
+# underlying structures within the data.
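The docstring added in this commit states that a non-linear SVC relies on the kernel trick, i.e. it only ever needs pairwise similarities between samples. A small check of that claim (an editor's sketch, not part of the commit) is to precompute the RBF Gram matrix with `sklearn.metrics.pairwise.rbf_kernel` and fit `SVC(kernel="precomputed")` on it; with the same `gamma`, it should reproduce the predictions of `SVC(kernel="rbf")` on the example's 16-sample dataset.

# Editor's sketch: verify the kernel-trick claim with a precomputed Gram matrix.
import numpy as np

from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Same 16 samples and targets as in the updated example.
X = np.array(
    [
        [0.4, -0.7], [-1.5, -1.0], [-1.4, -0.9], [-1.3, -1.2],
        [-1.1, -0.2], [-1.2, -0.4], [-0.5, 1.2], [-1.5, 2.1],
        [1.0, 1.0], [1.3, 0.8], [1.2, 0.5], [0.2, -2.0],
        [0.5, -2.4], [0.2, -2.3], [0.0, -2.7], [1.3, 2.1],
    ]
)
y = np.array([0] * 8 + [1] * 8)

gamma = 2

# Let SVC evaluate the RBF kernel internally.
clf_rbf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Precompute the Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) and fit on it.
K = rbf_kernel(X, X, gamma=gamma)
clf_precomputed = SVC(kernel="precomputed").fit(K, y)

# The two models should make identical predictions on the training points.
print(np.array_equal(clf_rbf.predict(X), clf_precomputed.predict(K)))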
