DEP PassiveAggressiveClassifier and PassiveAggressiveRegressor #29097


Merged (36 commits, Aug 12, 2025)

Conversation

lorentzenchr
Member

@lorentzenchr lorentzenchr commented May 24, 2024

Reference Issues/PRs

Solves #15161. See also #29088 (comment).

What does this implement/fix? Explain your changes.

This PR deprecates the two classes PassiveAggressiveClassifier and PassiveAggressiveRegressor, which were introduced in #1259. A user can easily use SGDClassifier and SGDRegressor instead, and I cannot see the added value of these two classes.
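
A rough sketch of the kind of replacement this has in mind, assuming the learning_rate="pa1" option that this PR exposes; the exact parameter mapping is worked out later in this thread:

from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier

# Before: PA-I with aggressiveness parameter C
pa = PassiveAggressiveClassifier(C=1.0, loss="hinge")

# After (sketch): SGD with the PA-I learning-rate schedule and no penalty
sgd = SGDClassifier(loss="hinge", penalty=None, learning_rate="pa1", eta0=1.0)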

@scikit-learn/core-devs ping for decision (in particular in case you are against the deprecation)


github-actions bot commented May 24, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b2144ee.

@adrinjalali
Member

I'm +1 on this:

  • I haven't seen anybody use it (except maybe @amueller?)
  • If we deprecate and people start complaining, we can revert the deprecation. So in a sense, deprecation is a way to check usage here.
  • The deprecation message should always clearly point to an alternative, which I think is the case here.

@glemaitre
Member

The deprecation message should always clearly point to an alternative, which I think is the case here.

I think we should do a bit more and specify how to get a similar behaviour by setting the right parameters.

@lorentzenchr
Member Author

In fa3d842, I added C to SGDClassifier and SGDRegressor to make it more accessible.
I could imagine putting the equivalent PA estimators into the SGD docstrings, maybe once the deprecation is carried out.

@adrinjalali
Member

Nice. We need to add more information to the docstring about C, though. It's very short, and it's not clear what it does when I read it. Otherwise LGTM.

@lorentzenchr lorentzenchr added this to the 1.6 milestone Jun 17, 2024
@adrinjalali
Member

@lorentzenchr getting close to the release, would you mind resolving issues here?

@lorentzenchr
Member Author

@lorentzenchr getting close to the release, would you mind resolving issues here?

I can resolve merge conflicts, but I won't do this:

We need to add more information to the docstring about C, though.

SGDRegressor(
    penalty=None,
    alpha=1.0,
    C=1.0,
)
Member

I must be missing something; our SGDRegressor doesn't have a C param.

Member Author

BaseSGDClassifier has an attribute C. This PR exposes this attribute as a parameter of SGDClassifier (same for the regressors).

I don't really like that. Proposition: I revert exposing C and instead write in the deprecation note, using somewhat private API:

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    penalty=None,
    alpha=1.0,
    eta0=1.0,
    learning_rate="pa1",
    loss="hinge",
)
clf.C = 1.0  # THIS IS USING PRIVATE API

Member Author

See c24c95f. I can revert if you dislike it.

Member

If we're deprecating these estimators and telling users to use SGDRegressor and SGDClassifier instead, it makes sense to actually have a public way of doing that. So I'm in favor of exposing C in both SGD estimators, documenting it, and then here simply creating an equivalent estimator using it.

Member Author

How about leaving them private for the time being?

I would like to have a dedicated discussion on whether to

  • expose C in SGD; xor
  • remove C and all passive-aggressive things altogether

Member

I think this PR is the right place to have that discussion. I'd like others to weigh in about this point.

maybe @ogrisel @amueller @GaelVaroquaux @adam2392 would have an opinion?

Member

I'm not super familiar with passive-aggressive vs. SGD, but a loose read suggests passive-aggressive allows more online updating. Though I thought SGD easily allows online training too, so why would someone use one over the other?

Is it true that the only parameter that makes these different is the "C" step-size parameter?

Pragmatically: if no one uses "C", then I kind of agree it might be simplest to just remove it. If there's something I'm missing, though, let me know.

@lorentzenchr lorentzenchr Oct 24, 2024

PassiveAggressiveClassifier was first implemented in #1259 in 2012. I have never seen it used anywhere, and I think our SGD estimators offer enough for online learning (partial_fit), see the sketch below. So I would be fine with removing it completely, i.e. also removing C in SGD, unless someone steps in to keep it.
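
A minimal sketch of that online-learning workflow with SGDClassifier and partial_fit (synthetic data, illustrative only):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(loss="hinge")
classes = np.array([0, 1])  # all classes must be declared on the first call

# Stream mini-batches and update the model online
for _ in range(10):
    X_batch = rng.randn(32, 5)
    y_batch = rng.randint(0, 2, size=32)
    clf.partial_fit(X_batch, y_batch, classes=classes)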

Member Author

Having read through the code and parts of the papers, I could also live with making learning_rate="pa1" and "pa2", as well as C, public in SGD.... In that case, I would rename C to something like pa_C.

Member

I am partially in favor of just simplifying: deprecating those parameters / the PA classifiers altogether, unless someone has an actual use case for them.

@lorentzenchr
Member Author

The deprecated decorator interferes with the signature, e.g. in check_do_not_raise_errors_in_init_or_set_params:

    params = signature(Estimator).parameters

I need help with avoiding that (I won't fix deprecated), or rather with (temporarily) switching off those tests for PassiveAggressiveXX.
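
A minimal sketch of the kind of failure meant here, assuming the deprecated wrapper does not forward the wrapped class's __init__ signature:

from inspect import signature

from sklearn.linear_model import PassiveAggressiveClassifier

# If the decorator's wrapper hides the original __init__, this may report
# the wrapper's generic (*args, **kwargs) instead of the real parameters,
# breaking checks that introspect the constructor signature.
params = signature(PassiveAggressiveClassifier).parameters
print(list(params))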

@lorentzenchr lorentzenchr force-pushed the dep_passive_aggressive branch from 7b21cfb to c367698 Compare October 23, 2024 21:20
@lorentzenchr
Member Author

lorentzenchr commented Oct 24, 2024

Note that 306a5fa should be made a PR of its own in case this PR is not merged.

Edit: I opened #30145.

@glemaitre glemaitre removed this from the 1.6 milestone Nov 25, 2024
@lorentzenchr
Member Author

In both our pytest and doc build, we can set the warning filter for this specific warning to be ignored. That should fix the CI.

Where can I filter warnings for the doc build?

@adrinjalali
Member

scikit-learn/doc/conf.py

Lines 861 to 868 in ae9d088

warnings.filterwarnings(
    "ignore",
    category=UserWarning,
    message=(
        "Matplotlib is currently using agg, which is a"
        " non-GUI backend, so cannot show the figure."
    ),
)

@lorentzenchr lorentzenchr force-pushed the dep_passive_aggressive branch from a9e758b to 64fde84 Compare July 31, 2025 14:06
@lorentzenchr
Member Author

While the CI failure of Ubuntu_Jammy_Jellyfish pymin_conda_forge_openblas_ubuntu_2204 is clearer because it sets "SKLEARN_WARNINGS_AS_ERRORS": "1" (though it is not 100% clear which option supersedes which), the failure of ci/circleci: doc is not clear to me. I added the filterwarning, but it still errors. It does not print where, and I can't reproduce it locally, so I need help with this.

@lesteve
Member

lesteve commented Jul 31, 2025

Maybe add your warning to

# Plotly deprecated something which we're not using, but internally it's used
# and needs to be fixed on their side.
# https://github.com/plotly/plotly.py/issues/4997
WarningInfo(
    "ignore",
    message=".+scattermapbox.+deprecated.+scattermap.+instead",
    category=DeprecationWarning,
),

This is used to centralize the warnings that we cannot do much about, for both the doc build and the tests.
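
Following that suggestion, an entry for this PR's warning could look like the sketch below; the exact message pattern and warning category are assumptions:

# Hypothetical entry for the PassiveAggressive deprecation warning
WarningInfo(
    "ignore",
    message=".*PassiveAggressive(Classifier|Regressor) was deprecated.*",
    category=FutureWarning,
),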

@lorentzenchr
Member Author

lorentzenchr commented Aug 1, 2025

Over the course of this PR, I have become something of an expert on the Passive-Aggressive algorithms, with two insights:

  1. Interestingly, PA can be generalized to all loss functions $\ell$ with minimum zero, not just the hinge and epsilon-insensitive losses (this was unclear before, see WIP: Adding Passive Aggressive learning rates #1259 (comment)).
  2. It can be formulated as a normal SGD scheme ($\tau$ in the Online Passive-Aggressive Algorithms paper becomes $\eta$ in SGD). This way it becomes clear that it actually minimizes a loss (unclear before, see WIP: Adding Passive Aggressive learning rates #1259 (comment) and WIP: Adding Passive Aggressive learning rates #1259 (comment)).

Let $\ell(x\cdot w, y)$ be the loss for a single sample vector $x$ with weights $w$ and observed class $y$, with $\min \ell(x\cdot w, y) = 0$.

The main trick is to use a linear approximation of the loss, $\ell(x\cdot w, y) \approx \ell(x\cdot w_t, y) + (w - w_t)\cdot x \ell^\prime(x\cdot w_t, y)$ (derivative with respect to the first argument), which seems appropriate for the purpose of SGD.
Note that $\frac{d}{dw}\ell(x\cdot w, y)\big|_{w=w_t} = x \ell^\prime(x\cdot w_t, y)$.

PA-I

(Eq. 6) Update current weights $w_t$ as $w_{t+1} = \mathrm{argmin}_w \frac{1}{2}\lVert w - w_t\rVert^2 + C\xi$ such that $\ell(x\cdot w, y)\leq\xi$ and $0\leq\xi$. C is a free positive parameter controlling the step size.

Solution: $w_{t+1} = w_t - \eta x \ell^\prime(x\cdot w_t, y)$ with $\eta=\min(C, \frac{\ell(x\cdot w_t, y)}{\lVert x\rVert^2\ell^\prime(x\cdot w_t, y)^2})$.

For the hinge loss with $y=\pm 1$, one has $\ell^{\prime 2} = y^2 = 1$ and recovers $\eta=\min(C, \frac{\ell(x\cdot w_t, y)}{\lVert x\rVert^2})$ from the paper (there called $\tau$), with update $w_{t+1} = w_t + \eta x y$.

PA-II

(Eq. 7) Update current weights $w_t$ as $w_{t+1} = \mathrm{argmin}_w \frac{1}{2}\lVert w - w_t\rVert^2 + C\xi^2$ such that $\ell(x\cdot w, y)\leq\xi$.

Solution: $w_{t+1} = w_t - \eta x \ell^\prime(x\cdot w_t, y)$ with $\eta=\frac{\ell(x\cdot w_t, y)}{\lVert x\rVert^2\ell^\prime(x\cdot w_t, y)^2 + \frac{1}{2C}}$.
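
As a quick check of this step size, substitute the SGD ansatz into the linearized objective and minimize over $\eta$ (a verification sketch in the notation above): with $w = w_t - \eta x \ell^\prime$ the linearized loss becomes $\xi \approx \ell_t - \eta \lVert x\rVert^2 \ell^{\prime 2}$ (writing $\ell_t = \ell(x\cdot w_t, y)$ and $\ell^\prime = \ell^\prime(x\cdot w_t, y)$), so the PA-II objective reads

$$\frac{1}{2}\eta^2\lVert x\rVert^2\ell^{\prime 2} + C\left(\ell_t - \eta\lVert x\rVert^2\ell^{\prime 2}\right)^2.$$

Setting the derivative with respect to $\eta$ to zero gives $\eta\left(\lVert x\rVert^2\ell^{\prime 2} + \frac{1}{2C}\right) = \ell_t$, i.e. exactly the $\eta$ above.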

I might consider a follow-up PR that would generalize, but actually simplify, the SGD code.
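
A minimal sketch of what such a unified update could look like for the hinge loss, following the derivation above (illustrative only, not scikit-learn's actual implementation):

import numpy as np

def pa_step(w, x, y, C, variant="pa1"):
    """One Passive-Aggressive update written as an SGD step with an
    adaptive step size, for the hinge loss with labels y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(x, w))  # hinge loss, minimum is zero
    if loss == 0.0:
        return w  # "passive": no update when the loss is already zero
    # For the hinge loss, x * l'(x.w, y) = -y * x and l'^2 = y^2 = 1
    grad = -y * x
    sq_norm = np.dot(x, x)  # ||x||^2
    if variant == "pa1":
        eta = min(C, loss / sq_norm)  # PA-I step size (tau in the paper)
    else:
        eta = loss / (sq_norm + 1.0 / (2.0 * C))  # PA-II step size
    return w - eta * grad  # "aggressive": plain SGD-style update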


@adrinjalali adrinjalali left a comment


This looks pretty awesome!

@lorentzenchr
Member Author

lorentzenchr commented Aug 5, 2025

Maybe for another PR: we could avoid introducing the new parameter PA_C and instead repurpose an existing parameter, e.g., eta0.

@lorentzenchr lorentzenchr added the Waiting for Second Reviewer First reviewer is done, need a second one! label Aug 11, 2025
@lorentzenchr
Member Author

If any of the participating possible reviewers could give a second approval, that would be great. I would really like to get this over the finish line before merge conflicts arise (for example, from #31856).

When it's merged, I might open another PR to use eta0 instead of PA_C. Then we have time until the 1.8 release, but I only have time until the end of August.

@adrinjalali
Member

@adam2392 or @OmarManzoor could maybe have a look?


@OmarManzoor OmarManzoor left a comment


Thank you for the PR @lorentzenchr


@OmarManzoor OmarManzoor left a comment


LGTM. Thank you @lorentzenchr

@OmarManzoor OmarManzoor enabled auto-merge (squash) August 12, 2025 13:23
@lorentzenchr lorentzenchr disabled auto-merge August 12, 2025 13:28
@lorentzenchr
Copy link
Member Author

@OmarManzoor You spotted an error in the docstring of pa2. It should be corrected with b2144ee. Could you (auto-)merge again?

@OmarManzoor OmarManzoor enabled auto-merge (squash) August 12, 2025 13:38
@OmarManzoor OmarManzoor merged commit 3c74809 into scikit-learn:main Aug 12, 2025
36 checks passed
@lorentzenchr lorentzenchr deleted the dep_passive_aggressive branch August 12, 2025 15:06
Labels
module:linear_model, Waiting for Second Reviewer (First reviewer is done, need a second one!)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do we want to keep the Passive Agressive estimators?
9 participants