scikit-learn · jeremiedbb · Jul 10, 2025
diff --git a/doc/modules/density.rst b/doc/modules/density.rst
@@ -90,7 +90,7 @@ Here we have used ``kernel='gaussian'``, as seen above.
 Mathematically, a kernel is a positive function :math:`K(x;h)`
 which is controlled by the bandwidth parameter :math:`h`.
 Given this kernel form, the density estimate at a point :math:`y` within
-a group of points :math:`x_i; i=1\cdots N` is given by:
+a group of points :math:`x_i; i=1, \cdots, N` is given by:
 
 .. math::
     \rho_K(y) = \sum_{i=1}^{N} K(y - x_i; h)

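Reviewer note: the density estimate in the hunk above, :math:`\rho_K(y) = \sum_{i=1}^{N} K(y - x_i; h)`, can be checked numerically with a minimal sketch. The function names are hypothetical, and the unnormalized one-dimensional Gaussian kernel :math:`\exp(-u^2 / (2h^2))` is an assumption made for illustration; this is not how ``KernelDensity`` is implemented.

```python
import numpy as np


def gaussian_kernel(u, h):
    # Assumed kernel form for illustration: unnormalized 1-D Gaussian,
    # K(u; h) = exp(-u^2 / (2 h^2)).
    return np.exp(-(u ** 2) / (2.0 * h ** 2))


def kernel_density(y, x, h):
    # rho_K(y) = sum_{i=1}^{N} K(y - x_i; h), summed over all points x_i.
    x = np.asarray(x, dtype=float)
    return float(np.sum(gaussian_kernel(y - x, h)))


# Three sample points; evaluate the estimate at y = 1.0 with bandwidth h = 1.0.
points = [0.0, 1.0, 2.0]
rho = kernel_density(1.0, points, h=1.0)
```

The sum runs over every index from 1 to N, which is what the corrected ``i=1, \cdots, N`` notation makes explicit.
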
diff --git a/doc/modules/gaussian_process.rst b/doc/modules/gaussian_process.rst
@@ -337,7 +337,7 @@ of a :class:`Sum` kernel, where it modifies the mean of the Gaussian process.
 It depends on a parameter :math:`constant\_value`. It is defined as:
 
 .. math::
-   k(x_i, x_j) = constant\_value \;\forall\; x_1, x_2
+   k(x_i, x_j) = constant\_value \;\forall\; x_i, x_j
 
 The main use-case of the :class:`WhiteKernel` kernel is as part of a
 sum-kernel where it explains the noise-component of the signal. Tuning its

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
@@ -383,7 +383,7 @@ scikit-learn.
   For a linear Gaussian model, the maximum log-likelihood is defined as:
 
   .. math::
-      \log(\hat{L}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \ln(\sigma^2) - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2}
+      \log(\hat{L}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2}
 
   where :math:`\sigma^2` is an estimate of the noise variance,
   :math:`y_i` and :math:`\hat{y}_i` are respectively the true and predicted

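Reviewer note: replacing ``\ln`` with ``\log`` changes notation only, since both denote the natural logarithm in this formula. A minimal sketch of the expression as written in the hunk above, with a hypothetical function name and ``sigma2`` passed in directly as an assumption (the documentation estimates it separately):

```python
import numpy as np


def gaussian_log_likelihood(y_true, y_pred, sigma2):
    # log(L) = -n/2 * log(2*pi) - n/2 * log(sigma^2) - RSS / (2 * sigma^2)
    # np.log is the natural logarithm, matching both \ln and \log here.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    rss = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    return float(
        -0.5 * n * np.log(2.0 * np.pi)
        - 0.5 * n * np.log(sigma2)
        - rss / (2.0 * sigma2)
    )


# With a perfect fit and sigma2 = 1, only the first term is non-zero.
ll = gaussian_log_likelihood([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], sigma2=1.0)
```
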
diff --git a/doc/modules/neural_networks_supervised.rst b/doc/modules/neural_networks_supervised.rst
@@ -22,7 +22,7 @@ Multi-layer Perceptron
 **Multi-layer Perceptron (MLP)** is a supervised learning algorithm that learns
 a function :math:`f: R^m \rightarrow R^o` by training on a dataset,
 where :math:`m` is the number of dimensions for input and :math:`o` is the
-number of dimensions for output. Given a set of features :math:`X = {x_1, x_2, ..., x_m}`
+number of dimensions for output. Given a set of features :math:`X = \{x_1, x_2, ..., x_m\}`
 and a target :math:`y`, it can learn a non-linear function approximator for either
 classification or regression. It is different from logistic regression, in that
 between the input and the output layer, there can be one or more non-linear
@@ -233,7 +233,7 @@ training.
 
 .. dropdown:: Mathematical formulation
 
-  Given a set of training examples :math:`(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)`
+  Given a set of training examples :math:`\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}`
   where :math:`x_i \in \mathbf{R}^n` and :math:`y_i \in \{0, 1\}`, a one hidden
   layer one hidden neuron MLP learns the function :math:`f(x) = W_2 g(W_1^T x + b_1) + b_2`
   where :math:`W_1 \in \mathbf{R}^m` and :math:`W_2, b_1, b_2 \in \mathbf{R}` are

diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst
@@ -405,7 +405,7 @@ Mathematical formulation
 We describe here the mathematical details of the SGD procedure. A good
 overview with convergence rates can be found in [#6]_.
 
-Given a set of training examples :math:`(x_1, y_1), \ldots, (x_n, y_n)` where
+Given a set of training examples :math:`\{(x_1, y_1), \ldots, (x_n, y_n)\}` where
 :math:`x_i \in \mathbf{R}^m` and :math:`y_i \in \mathbf{R}`
 (:math:`y_i \in \{-1, 1\}` for classification),
 our goal is to learn a linear scoring function