"""
=========================================================
Plot classification boundaries with different SVM Kernels
=========================================================
This example shows how different kernels in a :class:`~sklearn.svm.SVC` (Support Vector
Classifier) influence the classification boundaries in a binary, two-dimensional
classification problem.

SVCs aim to find a hyperplane that effectively separates the classes in their training
data by maximizing the margin between the outermost data points of each class. This is
achieved by finding the best weight vector :math:`w` that defines the decision boundary
hyperplane and minimizes the sum of hinge losses for misclassified samples, as measured
by the :func:`~sklearn.metrics.hinge_loss` function. By default, regularization is
applied with the parameter `C=1`, which allows for a certain degree of misclassification
tolerance.

If the data is not linearly separable in the original feature space, a non-linear kernel
parameter can be set. Depending on the kernel, the process involves adding new features
or transforming existing features to enrich and potentially add meaning to the data.
When a kernel other than `"linear"` is set, the SVC applies the `kernel trick
<https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick>`__, which
computes the similarity between pairs of data points using the kernel function without
explicitly transforming the entire dataset. The kernel trick avoids the otherwise
necessary explicit matrix transformation of the whole dataset by only considering the
relations between all pairs of data points. The kernel function maps two vectors (each
pair of observations) to their similarity using their dot product.

The hyperplane can then be calculated using the kernel function as if the dataset were
represented in a higher-dimensional space. Using a kernel function instead of an
explicit matrix transformation improves performance, as the kernel function has a time
complexity of :math:`O({n}^2)`, whereas matrix transformation scales according to the
specific transformation being applied.

In this example, we compare the most common kernel types of Support Vector Machines: the
linear kernel (`"linear"`), the polynomial kernel (`"poly"`), the radial basis function
kernel (`"rbf"`) and the sigmoid kernel (`"sigmoid"`).
"""

# Code source: Gaël Varoquaux
# License: BSD 3 clause

# %%
# Creating a dataset
# ------------------
# We create a two-dimensional classification dataset with 16 samples and two classes. We
# plot the samples with the colors matching their respective targets.
import matplotlib.pyplot as plt
import numpy as np

X = np.array(
    [
        [0.4, -0.7],
        [-1.5, -1.0],
        [-1.4, -0.9],
        [-1.3, -1.2],
        [-1.1, -0.2],
        [-1.2, -0.4],
        [-0.5, 1.2],
        [-1.5, 2.1],
        [1.0, 1.0],
        [1.3, 0.8],
        [1.2, 0.5],
        [0.2, -2.0],
        [0.5, -2.4],
        [0.2, -2.3],
        [0.0, -2.7],
        [1.3, 2.1],
    ]
)

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

# Plotting settings
fig, ax = plt.subplots(figsize=(4, 3))
x_min, x_max, y_min, y_max = -3, 3, -3, 3
ax.set(xlim=(x_min, x_max), ylim=(y_min, y_max))

# Plot samples by color and add legend
scatter = ax.scatter(X[:, 0], X[:, 1], s=150, c=y, label=y, edgecolors="k")
ax.legend(*scatter.legend_elements(), loc="upper right", title="Classes")
ax.set_title("Samples in two-dimensional feature space")
_ = plt.show()

# %%
# We can see that the samples are not clearly separable by a straight line.
#
# Training SVC model and plotting decision boundaries
# ---------------------------------------------------
# We define a function that fits a :class:`~sklearn.svm.SVC` classifier,
# allowing the `kernel` parameter as an input, and then plots the decision
# boundaries learned by the model using
# :class:`~sklearn.inspection.DecisionBoundaryDisplay`.
#
# Notice that for the sake of simplicity, the `C` parameter is set to its
# default value (`C=1`) in this example and the `gamma` parameter is set to
# `gamma=2` across all kernels, although it is automatically ignored for the
# linear kernel. In a real classification task, where performance matters,
# parameter tuning (by using :class:`~sklearn.model_selection.GridSearchCV` for
# instance) is highly recommended to capture different structures within the
# data.
#
# Setting `response_method="predict"` in
# :class:`~sklearn.inspection.DecisionBoundaryDisplay` colors the areas based
# on their predicted class. Using `response_method="decision_function"` allows
# us to also plot the decision boundary and the margins to both sides of it.
# Finally, the support vectors used during training (which lie on the margins
# or violate them) are identified by means of the `support_vectors_` attribute
# of the trained SVCs, and plotted as well.
from sklearn import svm
from sklearn.inspection import DecisionBoundaryDisplay


def plot_training_data_with_decision_boundary(kernel):
    # Train the SVC
    clf = svm.SVC(kernel=kernel, gamma=2).fit(X, y)

    # Settings for plotting
    _, ax = plt.subplots(figsize=(4, 3))
    x_min, x_max, y_min, y_max = -3, 3, -3, 3
    ax.set(xlim=(x_min, x_max), ylim=(y_min, y_max))

    # Plot decision boundary and margins
    common_params = {"estimator": clf, "X": X, "ax": ax}
    DecisionBoundaryDisplay.from_estimator(
        **common_params,
        response_method="predict",
        plot_method="pcolormesh",
        alpha=0.3,
    )
    DecisionBoundaryDisplay.from_estimator(
        **common_params,
        response_method="decision_function",
        plot_method="contour",
        levels=[-1, 0, 1],
        colors=["k", "k", "k"],
        linestyles=["--", "-", "--"],
    )

    # Plot bigger circles around samples that serve as support vectors
    ax.scatter(
        clf.support_vectors_[:, 0],
        clf.support_vectors_[:, 1],
        s=250,
        facecolors="none",
        edgecolors="k",
    )
    # Plot samples by color and add legend
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, s=150, edgecolors="k")
    ax.legend(*scatter.legend_elements(), loc="upper right", title="Classes")
    ax.set_title(f"Decision boundaries of {kernel} kernel in SVC")

    _ = plt.show()


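# %%
# Before comparing the kernels, we can make the hinge loss mentioned in the
# introduction concrete. As a small illustration (computed on the training data
# itself, so not a proper evaluation), we fit a linear-kernel SVC and report the
# average hinge loss of its decision function values. A value above zero
# reflects samples that are misclassified or lie within the margin, which is
# expected here since the classes are not linearly separable.
from sklearn.metrics import hinge_loss

clf_linear = svm.SVC(kernel="linear").fit(X, y)
print(f"Average hinge loss: {hinge_loss(y, clf_linear.decision_function(X)):.3f}")
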
# %%
# Linear kernel
# *************
# The linear kernel is the dot product of the input samples:
#
# .. math:: K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^\top \mathbf{x}_2
#
# It is then applied to any combination of two data points (samples) in the
# dataset. The dot product of the two points equals their
# :func:`~sklearn.metrics.pairwise.cosine_similarity` scaled by the product of
# their norms. The higher the value, the more similar the points are.
plot_training_data_with_decision_boundary("linear")
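
# %%
# To connect the plot with the formula above, we can evaluate the kernel
# explicitly. A quick check using :func:`~sklearn.metrics.pairwise.linear_kernel`,
# which computes exactly this dot product, shows that the classifier only needs
# the pairwise similarities, one entry per pair of samples:
from sklearn.metrics.pairwise import linear_kernel

gram_linear = linear_kernel(X)
print(gram_linear.shape)  # one similarity value for each pair of the 16 samples
print(np.allclose(gram_linear, X @ X.T))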

# %%
# Training a :class:`~sklearn.svm.SVC` on a linear kernel results in an
# untransformed feature space, where the hyperplane and the margins are
# straight lines. Due to the lack of expressivity of the linear kernel, the
# trained classes do not perfectly capture the training data.
#
# Polynomial kernel
# *****************
# The polynomial kernel changes the notion of similarity. The kernel function
# is defined as:
#
# .. math::
#   K(\mathbf{x}_1, \mathbf{x}_2) = (\gamma \cdot \
#       \mathbf{x}_1^\top\mathbf{x}_2 + r)^d
#
# where :math:`{d}` is the degree (`degree`) of the polynomial, :math:`{\gamma}`
# (`gamma`) controls the influence of each individual training sample on the
# decision boundary and :math:`{r}` is the bias term (`coef0`) that shifts the
# data up or down. Here, we use the default value for the degree of the
# polynomial in the kernel function (`degree=3`). When `coef0=0` (the default),
# the data is only transformed, but no additional dimension is added. Using a
# polynomial kernel is equivalent to creating
# :class:`~sklearn.preprocessing.PolynomialFeatures` and then fitting a
# :class:`~sklearn.svm.SVC` with a linear kernel on the transformed data,
# although this alternative approach would be computationally expensive for most
# datasets.
plot_training_data_with_decision_boundary("poly")
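
# %%
# To relate the plot to the formula above, we can evaluate the polynomial kernel
# directly with :func:`~sklearn.metrics.pairwise.polynomial_kernel`, passing the
# same parameters as used by the classifier (`degree=3`, `gamma=2` and the
# default `coef0=0`):
from sklearn.metrics.pairwise import polynomial_kernel

gram_poly = polynomial_kernel(X, degree=3, gamma=2, coef0=0)
# Matches (gamma * x1.T @ x2 + r) ** d evaluated for every pair of samples
print(np.allclose(gram_poly, (2 * (X @ X.T)) ** 3))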

# %%
# The polynomial kernel with `gamma=2` adapts well to the training data,
# causing the margins on both sides of the hyperplane to bend accordingly.
#
# RBF kernel
# **********
# The radial basis function (RBF) kernel, also known as the Gaussian kernel, is
# the default kernel for Support Vector Machines in scikit-learn. It measures
# the similarity between two data points, implicitly mapping them into an
# infinite-dimensional feature space. The kernel function is defined as:
#
# .. math::
#   K(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\gamma \cdot
#       {\|\mathbf{x}_1 - \mathbf{x}_2\|^2}\right)
#
# where :math:`{\gamma}` (`gamma`) controls the influence of each individual
# training sample on the decision boundary.
#
# The larger the Euclidean distance between two points
# :math:`\|\mathbf{x}_1 - \mathbf{x}_2\|`, the closer the kernel function is to
# zero. This means that two points that are far apart are more likely to be
# dissimilar.
plot_training_data_with_decision_boundary("rbf")
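
# %%
# The RBF kernel values can likewise be computed explicitly with
# :func:`~sklearn.metrics.pairwise.rbf_kernel` and the same `gamma=2` as the
# classifier. The values follow the formula above: they equal 1 for identical
# points and decay towards 0 as the squared Euclidean distance grows:
from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

gram_rbf = rbf_kernel(X, gamma=2)
print(np.allclose(gram_rbf, np.exp(-2 * euclidean_distances(X, squared=True))))
print(np.allclose(np.diag(gram_rbf), 1.0))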

# %%
# In the plot we can see how the decision boundaries tend to contract around
# data points that are close to each other.
#
# Sigmoid kernel
# **************
# The sigmoid kernel function is defined as:
#
# .. math::
#   K(\mathbf{x}_1, \mathbf{x}_2) = \tanh(\gamma \cdot
#       \mathbf{x}_1^\top\mathbf{x}_2 + r)
#
# where the kernel coefficient :math:`{\gamma}` (`gamma`) controls the influence
# of each individual training sample on the decision boundary and :math:`{r}` is
# the bias term (`coef0`) that shifts the data up or down.
#
# In the sigmoid kernel, the similarity between two data points is computed
# using the hyperbolic tangent function (:math:`\tanh`). The kernel function
# scales and possibly shifts the dot product of the two points
# (:math:`\mathbf{x}_1` and :math:`\mathbf{x}_2`).

plot_training_data_with_decision_boundary("sigmoid")
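
# %%
# As with the other kernels, the sigmoid kernel values can be computed directly
# with :func:`~sklearn.metrics.pairwise.sigmoid_kernel`. With `gamma=2` and the
# default `coef0=0` used by the classifier above, they are the hyperbolic
# tangent of the scaled dot products and therefore lie strictly between -1 and 1:
from sklearn.metrics.pairwise import sigmoid_kernel

gram_sigmoid = sigmoid_kernel(X, gamma=2, coef0=0)
print(np.allclose(gram_sigmoid, np.tanh(2 * (X @ X.T))))
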
# %%
# We can see that the decision boundaries obtained with the sigmoid kernel
# appear curved and irregular. The decision boundary tries to separate the
# classes by fitting a sigmoid-shaped curve, resulting in a complex boundary
# that may not generalize well to unseen data. This example makes it clear that
# the sigmoid kernel has very specific use cases, namely data that exhibits a
# sigmoidal shape. Here, careful fine-tuning might find more generalizable
# decision boundaries. Because of its specificity, the sigmoid kernel is less
# commonly used in practice compared to other kernels.
#
# Conclusion
# ----------
# In this example, we have visualized the decision boundaries trained with the
# provided dataset. The plots serve as an intuitive demonstration of how
# different kernels utilize the training data to determine the classification
# boundaries.
#
# The hyperplanes and margins, although computed indirectly, can be imagined as
# planes in the transformed feature space. However, in the plots, they are
# represented relative to the original feature space, resulting in curved
# decision boundaries for the polynomial, RBF, and sigmoid kernels.
#
# Please note that the plots do not evaluate the individual kernels' accuracy or
# quality. They are intended to provide a visual understanding of how the
# different kernels use the training data.
#
# For a comprehensive evaluation, fine-tuning of :class:`~sklearn.svm.SVC`
# parameters using techniques such as
# :class:`~sklearn.model_selection.GridSearchCV` is recommended to capture the
# underlying structures within the data.
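#
# As a minimal, illustrative sketch (the parameter grid below is arbitrary and
# the tiny dataset serves only as a demonstration), such a search could look as
# follows:
from sklearn.model_selection import GridSearchCV

# Candidate values for the kernel and its main parameters; `coef0` is left at
# its default here but could be searched over as well
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": [0.1, 1, 2],
}
search = GridSearchCV(svm.SVC(), param_grid, cv=4).fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")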