
ENH add gap safe screening rules to enet_coordinate_descent #31882

Open: wants to merge 19 commits into main

Conversation


@lorentzenchr (Member) commented Aug 5, 2025

Reference Issues/PRs

Solves #229. The 2nd oldest open issue at the time of opening this PR!!!

What does this implement/fix? Explain your changes.

This PR adds Gap Safe Screening Rules for the private function enet_coordinate_descent.

To make it testable by others, I added a private attribute _do_screening (bool) to several classes like Lasso.

Any other comments?

Following https://arxiv.org/abs/1802.07481, but without acceleration or working sets, to keep it simple.
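
For context, in standard notation (a summary of the rule, not text from the paper), the gap safe sphere test for the Lasso reads: with primal objective $P(\beta) = \tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$, a dual feasible point $\theta$ (e.g. a rescaled residual), and duality gap $G(\beta, \theta) = P(\beta) - D(\theta)$, feature $j$ can be safely discarded whenever

$$|x_j^\top \theta| + \lVert x_j\rVert_2 \,\sqrt{2\, G(\beta, \theta)} < \lambda \;\implies\; \hat{\beta}_j = 0.$$

Since the duality gap is computed anyway for the stopping criterion, the screening test comes almost for free.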

For reviewers: Please first merge #31906 and #31905.

@lorentzenchr changed the title from "Gap safe" to "ENH add gap safe screening rules to enet_coordinate_descent" on Aug 5, 2025
github-actions bot commented Aug 5, 2025: ✔️ Linting Passed. All linting checks passed (generated for commit 537b41a).

@lorentzenchr (Member, Author) commented Aug 5, 2025

A little benchmark on the leukemia dataset

1. Full Lasso Path with 100 steps

n_samples = 72, n_features = 7129

Starting path computation...
Lasso without screening - time: 0.18 s
Lasso with screening    - time: 0.10 s
Lasso without screening - time: 1.66 s
Lasso with screening    - time: 0.53 s
Lasso without screening - time: 9.75 s
Lasso with screening    - time: 1.62 s
[Figure: bar chart of path computation time vs stopping tolerance, with and without screening]

Code

# https://github.com/mathurinm/celer/blob/main/examples/plot_leukemia_path.py
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import lasso_path
from sklearn.datasets import fetch_openml

print("Loading data...")
dataset = fetch_openml("leukemia")
X = np.asfortranarray(dataset.data.astype(float))
y = 2 * ((dataset.target != "AML") - 0.5)
n_samples = len(y)

y -= np.mean(y)
y /= np.std(y)

print("Starting path computation...")
alpha_max = np.max(np.abs(X.T.dot(y))) / n_samples

n_alphas = 100
alphas = alpha_max * np.geomspace(1, 0.01, n_alphas)

tols = [1e-2, 1e-3, 1e-4]
results = np.zeros([2, len(tols)])
for tol_ix, tol in enumerate(tols):
    t0 = time.time()
    # do_screening is the private flag added by this PR
    _, coefs, gaps = lasso_path(
        X, y, alphas=alphas, tol=tol, max_iter=10_000, do_screening=False)
    results[0, tol_ix] = time.time() - t0
    print('Lasso without screening - time: %.2f s' % results[0, tol_ix])

    t0 = time.time()
    _, coefs, dual_gaps = lasso_path(
        X, y, alphas=alphas, tol=tol, max_iter=10_000, do_screening=True)
    results[1, tol_ix] = time.time() - t0
    print('Lasso with screening    - time: %.2f s' % results[1, tol_ix])

df = pd.DataFrame(results.T, columns=["no screening", "with screening"])
df.index = [str(tol) for tol in tols]
df.plot.bar(rot=0)
plt.xlabel("stopping tolerance")
plt.ylabel("path computation time (s)")
plt.tight_layout()
plt.show()

2. Single alpha

[Figure: bar charts of single-alpha computation time vs tolerance, one panel per alpha scale]

Code

# For single values of alpha
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import lasso_path
from sklearn.datasets import fetch_openml


print("Loading data...")
dataset = fetch_openml("leukemia")
X = np.asfortranarray(dataset.data.astype(float))
y = 2 * ((dataset.target != "AML") - 0.5)
n_samples = len(y)

y -= np.mean(y)
y /= np.std(y)

print("Starting path computation...")
alpha_max = np.max(np.abs(X.T.dot(y))) / n_samples

tols = [1e-2, 1e-3, 1e-4]
results = []  # scale_alpha_max, tol, do_screening, time
penalty_scales = [1e-1, 1e-2, 1e-5]

for scale_alpha_max in penalty_scales:
    alpha = [alpha_max * scale_alpha_max]
    print(f"alpha = alpha_max * {scale_alpha_max}")
    for tol_ix, tol in enumerate(tols):
        t0 = time.time()
        _, coefs, gaps = lasso_path(
            X, y, alphas=alpha, tol=tol, max_iter=10_000, do_screening=False)
        t = time.time() - t0
        results += [[scale_alpha_max, tol, False, t]] 
        print('  Lasso without screening - time: %.2f s' % t)

        t0 = time.time()
        _, coefs, dual_gaps = lasso_path(
            X, y, alphas=alpha, tol=tol, max_iter=10_000, do_screening=True)
        t = time.time() - t0
        results += [[scale_alpha_max, tol, True, t]] 
        print('  Lasso with screening    - time: %.2f s' % t)

df = pd.DataFrame(
    results,
    columns=["scale_alpha_max", "tol", "do_screening", "time [s]"],
)
print(df)

sns.catplot(
    df, x="tol", y="time [s]", hue="do_screening",
    col="scale_alpha_max", kind="bar", errorbar=None,
    facet_kws={"sharey": False},
)
plt.show()

@ogrisel self-assigned this Aug 6, 2025
@ogrisel (Member) commented Aug 6, 2025

Based on the build failure of some CI runners, it seems that this code requires bumping the minimum Cython version. I think that is fine because Cython is not a runtime dependency.

@ogrisel (Member) commented Aug 6, 2025

Any idea why this PR would lead to a change of predicted values in this case?

Examples
--------
>>> from sklearn.linear_model import LassoCV
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(noise=4, random_state=0)
>>> reg = LassoCV(cv=5, random_state=0).fit(X, y)
>>> reg.score(X, y)
0.9993
>>> reg.predict(X[:1,])
Expected:
    array([-78.4951])
Got:
    array([-79.4755331])

Maybe we have nearly tied choices for the optimal value of alpha on this data?

@lorentzenchr (Member, Author) replied:

> Any idea why this PR would lead to a change of predicted values in this case?

The solution is within the uncertainty of tol. One of the main changes is that this PR adds an additional convergence check before the main loop: if the provided (warm-started) coefficients already satisfy the stopping criterion, then stop. In main, it always computes at least one complete cycle over the coordinates/features.

# warm-started fit that is already converged: the new pre-loop check
# should stop after the first gap evaluation (n_iter_ == 1)
model.set_params(warm_start=True)
model.fit(X, y)
n_iter_warm_start = model.n_iter_
assert n_iter_warm_start == 1

Review thread on the diff context:

if sparse_X:
    # TODO: sparse_enet_coordinate_descent is not yet updated.
Member review comment:
Similarly we should address this TODO before merging this PR.

@lorentzenchr (Member, Author) replied:
No, for two reasons:

  1. Let's keep this PR small.
  2. Testing: if we only touch this one CD function, our tests can compare it against the others and guarantee the same results with high precision (small tol).

@ogrisel (Member) commented Aug 7, 2025

Also: could you update the docstrings and the user guide to reference the gap safe screening rule paper?

@ogrisel (Member) left a review comment:

From a cursory glance at the diff and the paper, this PR implements some basic active set based on the gap safe rule written in the Celer paper; however, it does not seem to implement the precise active set construction described in Algorithm 4. On top of this, the Celer project implements other things, such as the "dual extrapolation" from https://arxiv.org/abs/1907.05830, which probably amplifies the expected speed-ups. So I suspect that the performance of this PR is still far from the potential gains one could obtain from the full-fledged Celer implementation. Have you tried to run some benchmark against it to assess the remaining difference?

I don't mean to block this PR for that, because it seems to already provide a net improvement over main, and I am fine with proceeding iteratively, but we should not imply in the doc or changelog that scikit-learn implements all the things described in the Celer paper (and even less so if you consider the dual extrapolation paper on top).

Could you please add a changelog entry to describe the contents of this PR?

Once done, and once #31882 (comment) (both with cyclic and shuffling CD) and #31882 (comment) are addressed, LGTM.

@lorentzenchr (Member, Author) commented Aug 8, 2025

First of all, @ogrisel, thanks for your review, much appreciated.

> From a cursory glance at the diff and the paper, this PR implements some basic active set based on the gap safe rule written in the Celer paper [...] Have you tried to run some benchmark against it to assess the remaining difference?

And I have nowhere claimed to have added working sets or dual extrapolation; see the PR description. Again, this was (under 200 lines of code originally, now over 500) a pretty neat and small PR, introducing screening rules for a single Cython function that already gives a massive (but not maximal) speedup.

The neat thing is that it is minimal:

  • Add a check of the stopping criterion before entering the main loop.
  • Whenever the stopping criterion is calculated, we already have the dual gap; therefore we then screen variables (this step could later be fine-tuned). See the sketch below.
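
To make the two bullet points concrete, here is a hypothetical pure-Python sketch of the modified CD loop (the actual change lives in the Cython function enet_coordinate_descent; the function name and the unscaled objective 0.5*||y - Xw||^2 + lam*||w||_1 are mine, chosen for readability):

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd_gap_safe(X, y, lam, tol=1e-6, max_iter=1000):
    """Cyclic CD for 0.5*||y - Xw||^2 + lam*||w||_1 with gap safe screening."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    active = np.ones(n_features, dtype=bool)   # features not yet screened out
    col_norms2 = np.einsum("ij,ij->j", X, X)
    col_norms = np.sqrt(col_norms2)
    R = y - X @ w                              # residual, kept up to date

    def gap_and_screening_mask():
        # rescale the residual into a dual feasible point: ||X^T theta||_inf <= lam
        theta = R / max(1.0, np.max(np.abs(X.T @ R)) / lam)
        primal = 0.5 * R @ R + lam * np.abs(w).sum()
        dual = 0.5 * (y @ y) - 0.5 * (y - theta) @ (y - theta)
        gap = primal - dual
        # gap safe sphere test: |x_j^T theta| + r * ||x_j|| < lam  =>  w_j = 0
        radius = np.sqrt(2.0 * max(gap, 0.0))
        return gap, np.abs(X.T @ theta) + radius * col_norms < lam

    # change 1: check the stopping criterion before entering the main loop, so an
    # already-converged warm start costs one gap evaluation, not a full cycle
    gap, screened = gap_and_screening_mask()
    if gap <= tol:
        return w
    active &= ~screened

    for _ in range(max_iter):
        for j in np.flatnonzero(active):
            if col_norms2[j] == 0.0:
                continue
            w_j_old = w[j]
            # exact minimization along coordinate j (smooth part is quadratic)
            w[j] = soft_threshold(w_j_old + X[:, j] @ R / col_norms2[j],
                                  lam / col_norms2[j])
            if w[j] != w_j_old:
                R += X[:, j] * (w_j_old - w[j])
        # change 2: whenever the gap is computed for the stopping criterion,
        # screen features at no extra cost
        gap, screened = gap_and_screening_mask()
        if gap <= tol:
            break
        active &= ~screened
    return w

With the same tol, runs with and without screening reach the same duality gap, so the solutions agree up to the stopping tolerance.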

> I don't mean to block this PR for that [...] but we should not imply in the doc or changelog that scikit-learn implements all the things described in the Celer paper (and even less so if you consider the dual extrapolation paper on top).

It nowhere does!

> Also: could you update the docstrings and the user guide to reference the gap safe screening rule paper?

This took me a whole day; that's longer than the implementation of the screening rule itself!

@ogrisel (Member) commented Aug 11, 2025

> And I have nowhere claimed to have added working sets or dual extrapolation; see the PR description.

Sorry, I did not mean to imply that you made that claim in the PR; I was just pointing out that we should not leave the impression that we implement the full Celer algorithm when referencing the Celer paper in the docstrings of the estimators (or the user guide).

@ogrisel (Member) left a review comment:

A few typos, but otherwise, LGTM.

Note: the lockfiles have been concurrently updated this morning, hence the conflict.

@lorentzenchr (Member, Author) commented Aug 11, 2025

Task list:

Coordinate Descent with Gap Safe Screening Rules
------------------------------------------------

Coordinate descent (CD) is a strategy to solve the minimization problem that considers a
Contributor review comment:

I know this is the way it is presented in sklearn, but just for completeness, CD is not usually exact minimization along coordinate $j$. It rather corresponds to taking a proximal gradient descent step to approximately minimize the function wrt its j-th coordinate.
Here the smooth part of the objective function is quadratic, so this proximal step turns out to exactly minimize the function. But that's a very particular case. CD for logistic regression, for example, does not exactly minimize the function along coordinate j.
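
For readers less familiar with this viewpoint, the generic proximal coordinate step being referred to is (standard formulation, not a quote from the PR), with $f$ the smooth part of the objective and $L_j$ a Lipschitz constant of $\nabla_j f$:

$$\beta_j \leftarrow \operatorname{prox}_{(\lambda/L_j)\,\lvert\cdot\rvert}\Big(\beta_j - \tfrac{1}{L_j}\nabla_j f(\beta)\Big) = \operatorname{sign}(z)\,\max\big(\lvert z\rvert - \lambda/L_j,\; 0\big), \qquad z = \beta_j - \tfrac{1}{L_j}\nabla_j f(\beta).$$

For the quadratic $f(\beta) = \tfrac{1}{2}\lVert y - X\beta\rVert_2^2$ we have $\nabla_j f(\beta) = -x_j^\top(y - X\beta)$ and $L_j = \lVert x_j\rVert_2^2$, so this single step exactly minimizes over $\beta_j$; for, e.g., the logistic loss it only approximately does.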

@lorentzenchr (Member, Author) replied:

Thanks for the wider context.

For the historical context - and my education - didn't proximal gradient thinking come later than CD?

@lorentzenchr (Member, Author):

@mathurinm Thanks for your review. Very valuable!

@lorentzenchr (Member, Author):

If CD pipelines are green, this is ready for merge.

Note that I removed all the private do_screening things. Only in enet_path can one still pass do_screening as a private (and undocumented) parameter (via params), as sketched below.
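
For reference, exercising the flag would look roughly like this (a sketch assuming the final state of this PR; do_screening is private, undocumented, and may change):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path

X, y = make_regression(n_samples=50, n_features=200, random_state=0)
X = np.asfortranarray(X)  # the CD solver works on Fortran-ordered arrays

# do_screening is forwarded via **params to the coordinate descent solver
alphas, coefs, dual_gaps = enet_path(X, y, l1_ratio=1.0, do_screening=True)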

@lorentzenchr (Member, Author):

We are back at a neat PR: without the addition to the user guide, the changed lines of code are +278 -78.

Merging this pull request may close issue #229: Strong Rules / Gap Safe Screening for Discarding Predictors in Lasso-type Problems.