DEP PassiveAggressiveClassifier and PassiveAggressiveRegressor #29097

Merged · 36 commits · Aug 12, 2025

Commits
d26659a
DEP PassiveAggressiveClassifier and PassiveAggressiveRegressor
lorentzenchr May 24, 2024
fa3d842
ENH add C to SGD init
lorentzenchr May 25, 2024
7bdf688
DOC equivalent estimator
lorentzenchr May 25, 2024
0111071
MNT redundant parameter validation
lorentzenchr May 25, 2024
3051404
DOC whatsnew 1.6
lorentzenchr May 25, 2024
3af1f2f
Merge branch 'main' into dep_passive_aggressive
lorentzenchr May 25, 2024
4f23bc0
MNT after merging main
lorentzenchr May 25, 2024
eb1c83e
Merge branch 'main' into dep_passive_aggressive
lorentzenchr Oct 22, 2024
df3825e
DOC add new whatsnew
lorentzenchr Oct 22, 2024
c24c95f
CLN make C private again
lorentzenchr Oct 22, 2024
306a5fa
FIX signature of deprecated classes
lorentzenchr Oct 23, 2024
c367698
FIX _parameter_constraints in SGD classes
lorentzenchr Oct 23, 2024
67f6404
FIX numpydoc GL09 placement of deprecation in docstring
lorentzenchr Oct 24, 2024
246cd04
TST remove PA from other tests
lorentzenchr Oct 24, 2024
05b7bbd
FIX tests by using learning_rate="pa1"
lorentzenchr Oct 25, 2024
28e59b1
Merge branch 'main' into dep_passive_aggressive
lorentzenchr Jul 29, 2025
5d02438
MNT remove type: ignore
lorentzenchr Jul 29, 2025
dc1369a
MNT rename C to PA_C and improve docstring
lorentzenchr Jul 29, 2025
aea16e6
MNT remove PA_C from method signatures and use self.PA_C instead
lorentzenchr Jul 29, 2025
260efc4
Merge branch 'main' into dep_passive_aggressive
lorentzenchr Jul 29, 2025
3f8268f
MNT C instead of PA_C in PassiveAggressiveClassifier
lorentzenchr Jul 29, 2025
f339ea1
ENH add PA to SGD
lorentzenchr Jul 29, 2025
1e260bf
DOC deprecate in 1.8 remove in 1.10 and whatsnew
lorentzenchr Jul 31, 2025
e5b01ab
MNT parameter constraint for PA_C
lorentzenchr Jul 31, 2025
28eb112
DOC use .. code-block::
lorentzenchr Jul 31, 2025
15c7b8f
TST add filterwarnings ignore of PA estimators in pyproject.toml
lorentzenchr Jul 31, 2025
64fde84
DOC add filterwarnings to doc/conf.py
lorentzenchr Jul 31, 2025
16da746
DOC fix typo in docstring
lorentzenchr Jul 31, 2025
4e69d78
DOC/MNT remove/replace PA from user guide and place TODO(1.10) for la…
lorentzenchr Jul 31, 2025
06e40e5
Merge branch 'main' into dep_passive_aggressive
lorentzenchr Jul 31, 2025
66280c7
MNT add PassiveAggressive in _get_warnings_filters_info_list
lorentzenchr Jul 31, 2025
1e70f8c
DOC fix typo in user guide
lorentzenchr Jul 31, 2025
d0290fc
DOC fix docstring of _plain_sgd
lorentzenchr Aug 1, 2025
08ca849
MNT remove filterwarnings from pyproject.toml
lorentzenchr Aug 1, 2025
414fa30
Merge branch 'main' into dep_passive_aggressive
OmarManzoor Aug 12, 2025
b2144ee
DOC correct docstring about available loss for pa2
lorentzenchr Aug 12, 2025
4 changes: 2 additions & 2 deletions doc/api_reference.py
@@ -587,7 +587,7 @@ def _get_submodule(module_name, submodule_name):
"autosummary": [
"LogisticRegression",
"LogisticRegressionCV",
"PassiveAggressiveClassifier",
"PassiveAggressiveClassifier", # TODO(1.10): remove
"Perceptron",
"RidgeClassifier",
"RidgeClassifierCV",
@@ -672,7 +672,7 @@ def _get_submodule(module_name, submodule_name):
{
"title": "Miscellaneous",
"autosummary": [
"PassiveAggressiveRegressor",
"PassiveAggressiveRegressor", # TODO(1.10): remove
"enet_path",
"lars_path",
"lars_path_gram",
7 changes: 3 additions & 4 deletions doc/computing/computational_performance.rst
@@ -154,10 +154,9 @@ prediction latency too much. We will now review this idea for different
families of supervised models.

For :mod:`sklearn.linear_model` (e.g. Lasso, ElasticNet,
SGDClassifier/Regressor, Ridge & RidgeClassifier,
PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression...) the
decision function that is applied at prediction time is the same (a dot product)
, so latency should be equivalent.
SGDClassifier/Regressor, Ridge & RidgeClassifier, LinearSVC, LogisticRegression...) the
decision function that is applied at prediction time is the same (a dot product), so
latency should be equivalent.

Here is an example using
:class:`~linear_model.SGDClassifier` with the
6 changes: 2 additions & 4 deletions doc/computing/scaling_strategies.rst
@@ -63,11 +63,9 @@ Here is a list of incremental estimators for different tasks:
+ :class:`sklearn.naive_bayes.BernoulliNB`
+ :class:`sklearn.linear_model.Perceptron`
+ :class:`sklearn.linear_model.SGDClassifier`
+ :class:`sklearn.linear_model.PassiveAggressiveClassifier`
+ :class:`sklearn.neural_network.MLPClassifier`
- Regression
+ :class:`sklearn.linear_model.SGDRegressor`
+ :class:`sklearn.linear_model.PassiveAggressiveRegressor`
+ :class:`sklearn.neural_network.MLPRegressor`
- Clustering
+ :class:`sklearn.cluster.MiniBatchKMeans`
Expand All @@ -91,7 +89,7 @@ classes to the first ``partial_fit`` call using the ``classes=`` parameter.
Another aspect to consider when choosing a proper algorithm is that not all of
them put the same importance on each example over time. Namely, the
``Perceptron`` is still sensitive to badly labeled examples even after many
examples whereas the ``SGD*`` and ``PassiveAggressive*`` families are more
examples whereas the ``SGD*`` family is more
robust to this kind of artifacts. Conversely, the latter also tend to give less
importance to remarkably different, yet properly labeled examples when they
come late in the stream as their learning rate decreases over time.
@@ -130,7 +128,7 @@ Notes
......

.. [1] Depending on the algorithm the mini-batch size can influence results or
not. SGD*, PassiveAggressive*, and discrete NaiveBayes are truly online
not. SGD* and discrete NaiveBayes are truly online
and are not affected by batch size. Conversely, MiniBatchKMeans
convergence rate is affected by the batch size. Also, its memory
footprint can vary dramatically with batch size.
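
To ground the ``partial_fit`` discussion above (including the requirement to pass all
classes to the first call via ``classes=``), here is a minimal editorial sketch, not
part of the diff; the random data and batch layout are assumptions:

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    classes = np.array([0, 1])
    clf = SGDClassifier(loss="hinge", random_state=0)

    for _ in range(10):  # stream of mini-batches
        X = rng.randn(100, 5)
        y = rng.randint(0, 2, size=100)
        # All classes must be declared on the first partial_fit call.
        clf.partial_fit(X, y, classes=classes)
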
2 changes: 2 additions & 0 deletions doc/conf.py
@@ -866,6 +866,8 @@ def setup(app):
" non-GUI backend, so cannot show the figure."
),
)
# TODO(1.10): remove PassiveAggressive
warnings.filterwarnings("ignore", category=FutureWarning, message="PassiveAggressive")
if os.environ.get("SKLEARN_WARNINGS_AS_ERRORS", "0") != "0":
turn_warnings_into_errors()
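
An editorial note on the filter added above: the ``message`` argument of
``warnings.filterwarnings`` is a regular expression matched against the start of the
warning text, so ``"PassiveAggressive"`` silences exactly the warnings whose message
begins with that prefix. A self-contained sketch:

.. code-block:: python

    import warnings

    warnings.filterwarnings(
        "ignore", category=FutureWarning, message="PassiveAggressive"
    )
    # Suppressed: the message starts with the pattern.
    warnings.warn("PassiveAggressiveClassifier was deprecated", FutureWarning)
    # Still shown: the pattern only occurs mid-message.
    warnings.warn("Deprecated: PassiveAggressiveRegressor", FutureWarning)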

2 changes: 1 addition & 1 deletion doc/modules/feature_extraction.rst
@@ -846,7 +846,7 @@ text classification tasks.

Note that the dimensionality does not affect the CPU training time of
algorithms which operate on CSR matrices (``LinearSVC(dual=True)``,
``Perceptron``, ``SGDClassifier``, ``PassiveAggressive``) but it does for
``Perceptron``, ``SGDClassifier``) but it does for
algorithms that work with CSC matrices (``LinearSVC(dual=False)``, ``Lasso()``,
etc.).
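
An editorial aside on why dimensionality does not affect CSR training time: with the
hashing trick, raising ``n_features`` widens the matrix but leaves the number of stored
non-zeros, which is what CSR-based solvers iterate over, essentially unchanged. A
minimal sketch (the toy corpus is an assumption):

.. code-block:: python

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = ["the quick brown fox", "jumps over the lazy dog"]
    for n in (2**10, 2**20):
        X = HashingVectorizer(n_features=n).fit_transform(docs)  # CSR output
        # The width grows, but the non-zero count stays (about) the same.
        print(X.shape, X.nnz)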

28 changes: 14 additions & 14 deletions doc/modules/linear_model.rst
@@ -1335,10 +1335,10 @@ You can refer to the dedicated :ref:`sgd` documentation section for more details
.. _perceptron:

Perceptron
==========
----------

The :class:`Perceptron` is another simple classification algorithm suitable for
large scale learning. By default:
large scale learning and derives from SGD. By default:

- It does not require a learning rate.

Expand All @@ -1358,18 +1358,18 @@ for more details.
.. _passive_aggressive:

Passive Aggressive Algorithms
=============================

The passive-aggressive algorithms are a family of algorithms for large-scale
learning. They are similar to the Perceptron in that they do not require a
learning rate. However, contrary to the Perceptron, they include a
regularization parameter ``C``.

For classification, :class:`PassiveAggressiveClassifier` can be used with
``loss='hinge'`` (PA-I) or ``loss='squared_hinge'`` (PA-II). For regression,
:class:`PassiveAggressiveRegressor` can be used with
``loss='epsilon_insensitive'`` (PA-I) or
``loss='squared_epsilon_insensitive'`` (PA-II).
-----------------------------

The passive-aggressive (PA) algorithms are another family of two algorithms (PA-I and
PA-II) for large-scale online learning, both derived from SGD. They are similar to the
Perceptron in that they do not require a learning rate. However, contrary to the
Perceptron, they include a regularization parameter ``PA_C``.

For classification,
:class:`SGDClassifier(loss="hinge", penalty=None, learning_rate="pa1", PA_C=1.0)` can
be used for PA-I or with ``learning_rate="pa2"`` for PA-II. For regression,
:class:`SGDRegressor(loss="epsilon_insensitive", penalty=None, learning_rate="pa1",
PA_C=1.0)` can be used for PA-I or with ``learning_rate="pa2"`` for PA-II.
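
To make the equivalence concrete, a minimal editorial sketch, not part of the diff (the
toy data is an assumption):

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    pa1 = SGDClassifier(
        loss="hinge",
        penalty=None,         # PA algorithms have no penalty term
        learning_rate="pa1",  # "pa2" selects PA-II
        PA_C=1.0,             # aggressiveness parameter, the former `C`
    )
    pa1.fit(X, y)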

.. dropdown:: References

1 change: 0 additions & 1 deletion doc/modules/multiclass.rst
@@ -90,7 +90,6 @@ can provide additional strategies beyond what is built-in:
- :class:`linear_model.LogisticRegressionCV` (most solvers)
- :class:`linear_model.SGDClassifier`
- :class:`linear_model.Perceptron`
- :class:`linear_model.PassiveAggressiveClassifier`


- **Support multilabel:**
@@ -0,0 +1,6 @@
- `PassiveAggressiveClassifier` and `PassiveAggressiveRegressor` are deprecated
  and will be removed in 1.10. Equivalent estimators are available with `SGDClassifier`
  and `SGDRegressor`, both of which expose the options `learning_rate="pa1"` and
  `"pa2"` as well as the new parameter `PA_C`, the aggressiveness parameter of the
  passive-aggressive algorithms.
  By :user:`Christian Lorentzen <lorentzenchr>`.
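
A hedged before/after sketch of the regression migration (the parameter values are
illustrative):

.. code-block:: python

    # Before (deprecated in 1.8, removed in 1.10):
    #     reg = PassiveAggressiveRegressor(C=0.5, loss="squared_epsilon_insensitive")
    # After: PA-II is selected via the learning rate, not the loss.
    from sklearn.linear_model import SGDRegressor

    reg = SGDRegressor(
        loss="epsilon_insensitive",
        penalty=None,
        learning_rate="pa2",  # "pa1" for PA-I
        PA_C=0.5,             # the former aggressiveness parameter `C`
    )
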
6 changes: 4 additions & 2 deletions examples/applications/plot_out_of_core_classification.py
@@ -33,7 +33,7 @@

from sklearn.datasets import get_data_home
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, Perceptron, SGDClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.naive_bayes import MultinomialNB


@@ -208,7 +208,9 @@ def progress(blocknum, bs, size):
"SGD": SGDClassifier(max_iter=5),
"Perceptron": Perceptron(),
"NB Multinomial": MultinomialNB(alpha=0.01),
"Passive-Aggressive": PassiveAggressiveClassifier(),
"Passive-Aggressive": SGDClassifier(
loss="hinge", penalty=None, learning_rate="pa1", PA_C=1.0
),
}


5 changes: 2 additions & 3 deletions sklearn/feature_selection/tests/test_from_model.py
@@ -20,7 +20,6 @@
LassoCV,
LinearRegression,
LogisticRegression,
PassiveAggressiveClassifier,
SGDClassifier,
)
from sklearn.pipeline import make_pipeline
@@ -393,8 +392,8 @@ def test_2d_coef():


def test_partial_fit():
est = PassiveAggressiveClassifier(
random_state=0, shuffle=False, max_iter=5, tol=None
est = SGDClassifier(
random_state=0, shuffle=False, max_iter=5, tol=None, learning_rate="pa1"
)
transformer = SelectFromModel(estimator=est)
transformer.partial_fit(data, y, classes=np.unique(y))
65 changes: 55 additions & 10 deletions sklearn/linear_model/_passive_aggressive.py
@@ -9,18 +9,41 @@
BaseSGDClassifier,
BaseSGDRegressor,
)
from sklearn.utils import deprecated
from sklearn.utils._param_validation import Interval, StrOptions


# TODO(1.10): Remove
@deprecated(
"this is deprecated in version 1.8 and will be removed in 1.10. "
"Use `SGDClassifier(loss='hinge', penalty=None, learning_rate='pa1', PA_C=1.0)` "
"instead."
)
class PassiveAggressiveClassifier(BaseSGDClassifier):
"""Passive Aggressive Classifier.

.. deprecated:: 1.8
The whole class `PassiveAggressiveClassifier` was deprecated in version 1.8
and will be removed in 1.10. Instead use:

.. code-block:: python

clf = SGDClassifier(
loss="hinge",
penalty=None,
learning_rate="pa1", # or "pa2"
PA_C=1.0, # for parameter C
)

Read more in the :ref:`User Guide <passive_aggressive>`.

Parameters
----------
C : float, default=1.0
Maximum step size (regularization). Defaults to 1.0.
Aggressiveness parameter for the passive-aggressive algorithm, see [1].
For PA-I it is the maximum step size. For PA-II it regularizes the
step size (the smaller `PA_C` the more it regularizes).
As a general rule-of-thumb, `PA_C` should be small when the data is noisy.

fit_intercept : bool, default=True
Whether the intercept should be estimated or not. If False, the
@@ -154,9 +177,9 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):

References
----------
Online Passive-Aggressive Algorithms
<http://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf>
K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR (2006)
.. [1] Online Passive-Aggressive Algorithms
<http://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf>
K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR (2006)

Examples
--------
@@ -212,6 +235,7 @@ def __init__(
verbose=verbose,
random_state=random_state,
eta0=1.0,
PA_C=C,
warm_start=warm_start,
class_weight=class_weight,
average=average,
@@ -262,12 +286,13 @@ def partial_fit(self, X, y, classes=None):
"parameter."
)

# For an explanation, see
# https://github.com/scikit-learn/scikit-learn/pull/1259#issuecomment-9818044
lr = "pa1" if self.loss == "hinge" else "pa2"
return self._partial_fit(
X,
y,
alpha=1.0,
C=self.C,
loss="hinge",
learning_rate=lr,
max_iter=1,
@@ -307,24 +332,45 @@ def fit(self, X, y, coef_init=None, intercept_init=None):
X,
y,
alpha=1.0,
C=self.C,
loss="hinge",
learning_rate=lr,
coef_init=coef_init,
intercept_init=intercept_init,
)


# TODO(1.10): Remove
@deprecated(
"this is deprecated in version 1.8 and will be removed in 1.10. "
"Use `SGDRegressor(loss='epsilon_insensitive', penalty=None, learning_rate='pa1', "
"PA_C = 1.0)` instead."
)
class PassiveAggressiveRegressor(BaseSGDRegressor):
"""Passive Aggressive Regressor.

.. deprecated:: 1.8
The whole class `PassiveAggressiveRegressor` was deprecated in version 1.8
and will be removed in 1.10. Instead use:

.. code-block:: python

reg = SGDRegressor(
loss="epsilon_insensitive",
penalty=None,
learning_rate="pa1", # or "pa2"
PA_C=1.0, # for parameter C
)

Read more in the :ref:`User Guide <passive_aggressive>`.

Parameters
----------

C : float, default=1.0
Maximum step size (regularization). Defaults to 1.0.
Aggressiveness parameter for the passive-aggressive algorithm, see [1].
For PA-I it is the maximum step size. For PA-II it regularizes the
step size (the smaller `PA_C` the more it regularizes).
As a general rule-of-thumb, `PA_C` should be small when the data is noisy.

fit_intercept : bool, default=True
Whether the intercept should be estimated or not. If False, the
@@ -486,10 +532,12 @@ def __init__(
average=False,
):
super().__init__(
loss=loss,
penalty=None,
l1_ratio=0,
epsilon=epsilon,
eta0=1.0,
PA_C=C,
fit_intercept=fit_intercept,
max_iter=max_iter,
tol=tol,
Expand All @@ -503,7 +551,6 @@ def __init__(
average=average,
)
self.C = C
self.loss = loss

@_fit_context(prefer_skip_nested_validation=True)
def partial_fit(self, X, y):
Expand All @@ -530,7 +577,6 @@ def partial_fit(self, X, y):
X,
y,
alpha=1.0,
C=self.C,
loss="epsilon_insensitive",
learning_rate=lr,
max_iter=1,
@@ -569,7 +615,6 @@ def fit(self, X, y, coef_init=None, intercept_init=None):
X,
y,
alpha=1.0,
C=self.C,
loss="epsilon_insensitive",
learning_rate=lr,
coef_init=coef_init,
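
An editorial note on the ``@deprecated`` decorator used in this file: applied to a
class, it makes instantiation emit a ``FutureWarning`` carrying the given message. A
minimal sketch (the ``Toy`` class is a made-up example):

.. code-block:: python

    import warnings

    from sklearn.utils import deprecated

    @deprecated("use NewToy instead.")
    class Toy:
        pass

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        Toy()  # instantiation triggers the deprecation warning
    print(caught[0].category.__name__, caught[0].message)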