ENH Add trust-ncg solver to LogisticRegression #22236


Closed
wants to merge 41 commits into from

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Jan 17, 2022

Reference Issues/PRs
Fixes #17125
Picks up stalled PR #17877

What does this implement/fix? Explain your changes.
PR #17877: Implements trust-ncg option in LogisticRegression when multi_class = 'multinomial'.
This PR: Addresses remaining review comments, including generating benchmarks against the new solver. Reconciles implementation with overhauled loss module implementation. Expands test coverage.

Any other comments?
This picks up the wonderful work done by @rithvikrao and @rubywerman

@Micky774
Contributor Author

@thomasjpfan
Okay, so it looks like in all of my initial tests, using datasets created via datasets.make_classification(...) with varying arguments, trust-ncg never converges to the same results as the other solvers -- if it converges at all. I haven't yet counted how often it does vs. does not converge. The lack of convergence was confirmed by comparing coef_ values, but it is also readily visible in graphs of the NLL. See here:
image

All other solvers overlap on the graph, since they converge to the same solution. Before mentioning any other metrics or analysis, it's worth qualifying that none of these datasets are in sparse format, even the so-called "sparse"-style dataset -- that name refers to the density of useful vs. noisy features. I'll run a second batch of tests on a dense synthetic dataset, and on the sparse news dataset.

For now, I also observed that two things were almost always true:

  1. trust-ncg failed to converge because "...a bad approximation caused failure to predict improvement..."
  2. trust-ncg had n_iter_ <= 10 after fit

During the next phase of benchmarking, I'll use the news group dataset mentioned above, as well as generate some memory-usage stats.

@Micky774
Contributor Author

@thomasjpfan @ogrisel

Member

@thomasjpfan thomasjpfan left a comment


We need to investigate why trust-ncg does not converge. Here is a simple test that should pass:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from numpy.testing import assert_allclose

X, y = make_classification(n_features=10, n_informative=5, random_state=0)
log_reg_trust = LogisticRegression(solver="trust-ncg", fit_intercept=False)
log_reg_trust.fit(X, y)

log_reg_lb = LogisticRegression(solver="lbfgs", fit_intercept=False)
log_reg_lb.fit(X, y)

# Roughly equivalent to decimal=3, as in test_logistic_regression_solvers
assert_allclose(log_reg_trust.coef_, log_reg_lb.coef_, atol=1.5e-3)

I have not dived deeply into the error, but I see it is triggered here:

https://github.com/scipy/scipy/blob/4871f3d1c61bdb296ae03e3480f5f584f5c67256/scipy/optimize/_trustregion.py#L239-L241
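One way to probe the failure independently of this PR is to hand scipy's trust-ncg an unpenalized logistic NLL directly, bypassing scikit-learn entirely. This is a hypothetical minimal sketch (not code from the PR); whether scipy reports the "bad approximation" failure here depends on the problem instance:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from sklearn.datasets import make_classification

X, y = make_classification(n_features=10, n_informative=5, random_state=0)
n = len(y)

def nll(w):
    # Mean negative log-likelihood of logistic regression (no penalty).
    p = expit(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad(w):
    return X.T @ (expit(X @ w) - y) / n

def hessp(w, v):
    # Hessian-vector product: X^T diag(p(1-p)) X v / n
    p = expit(X @ w)
    return X.T @ (p * (1 - p) * (X @ v)) / n

res = minimize(nll, np.zeros(X.shape[1]), method="trust-ncg",
               jac=grad, hessp=hessp)
print(res.status, res.message)
```

If this standalone setup converges cleanly, the problem is more likely in how the solver is wired into `_logistic_regression_path` than in scipy's trust-region code itself.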

@Micky774
Contributor Author

Micky774 commented Jan 28, 2022

I'm also now getting an overflow error in my follow-up benchmark:

C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\optimize\_trustregion_ncg.py:107: RuntimeWarning: overflow encountered in multiply
  z_next = z + alpha * d
Traceback (most recent call last):
  File "D:\work\scikit-learn\tmp_bench\benchmark_sparsity.py", line 159, in <module>
    main()
  File "D:\work\scikit-learn\tmp_bench\benchmark_sparsity.py", line 102, in main
    data.extend([build_row(X, y, conf, dset) for conf in CONFIG])
  File "D:\work\scikit-learn\tmp_bench\benchmark_sparsity.py", line 102, in <listcomp>
    data.extend([build_row(X, y, conf, dset) for conf in CONFIG])
  File "D:\work\scikit-learn\tmp_bench\benchmark_sparsity.py", line 83, in build_row
    lr.fit(X, y)
  File "d:\work\scikit-learn\sklearn\linear_model\_logistic.py", line 1615, in fit
    fold_coefs_ = Parallel(
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\parallel.py", line 1046, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "d:\work\scikit-learn\sklearn\utils\fixes.py", line 216, in __call__
    return self.function(*args, **kwargs)
  File "d:\work\scikit-learn\sklearn\linear_model\_logistic.py", line 830, in _logistic_regression_path
    opt_res = optimize.minimize(
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\optimize\_minimize.py", line 641, in minimize
    return _minimize_trust_ncg(fun, x0, args, jac, hess, hessp,
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\optimize\_trustregion_ncg.py", line 37, in _minimize_trust_ncg
    return _minimize_trust_region(fun, x0, args=args, jac=jac, hess=hess,
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\optimize\_trustregion.py", line 207, in _minimize_trust_region
    p, hits_boundary = m.solve(trust_radius)
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\optimize\_trustregion_ncg.py", line 108, in solve
    if scipy.linalg.norm(z_next) >= trust_radius:
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\linalg\misc.py", line 145, in norm
    a = np.asarray_chkfinite(a)
  File "C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\numpy\lib\function_base.py", line 488, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

I don't yet have a minimal reproduction for this. The dataset and logistic regression configuration are as follows:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = datasets.make_classification(n_samples=15000, n_informative=15, n_classes=15)
X = StandardScaler().fit_transform(X)
lr = LogisticRegression(solver='trust-ncg', multi_class='ovr', penalty='none', max_iter=1000, verbose=1)
lr.fit(X, y)

yet this code doesn't produce the same error. I'm a bit puzzled. Regardless, this provides more evidence that something is wrong in the trust-region calculation.

Edit: I randomly got this as well, which narrows it down a bit:

C:\Users\Meekail\Anaconda3\envs\scikit-dev\lib\site-packages\scipy\optimize\_trustregion_ncg.py:106: RuntimeWarning: overflow encountered in double_scalars
  alpha = r_squared / dBd

@lorentzenchr
Member

@Micky774 Could you merge with main? The implementation of the log-loss function has changed. I would be interested to see whether the conclusions remain the same with those changes.

@Micky774 Micky774 changed the title [WIP] ENH Add trust-ncg option to LogisticRegression ENH Add trust-ncg solver to LogisticRegression Jun 8, 2022
@Micky774
Contributor Author

Micky774 commented Jun 8, 2022

New benchmarks with this script (run in Jupyter). Tested with X arrays of shape (n_samples, 100) and y of shape (n_samples,) ranging over 4 classes. All data are dense. Repeated 20 times.
output

@thomasjpfan
Member

For dense data, it looks like lbfgs is always better than trust-ncg. From #17125, we were trying to improve the situation for sparse data. Is the benchmark different with sparse data?

@Micky774
Contributor Author

Micky774 commented Jun 9, 2022

Additional benchmarks with sparse X at density=0.1:

sparse_output

Overall, it seems lbfgs > trust-ncg? Haven't tested memory footprint yet, will probably do so tomorrow.
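For reference, the sparse benchmark setup can be sketched roughly like this (an illustrative reconstruction, not the actual benchmark script; `"trust-ncg"` as a solver value only exists on this PR's branch, so the sketch uses lbfgs):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# CSR design matrix with density 0.1, matching the sparse benchmark above.
rng = np.random.default_rng(0)
X = sp.random(5000, 100, density=0.1, format="csr", random_state=0)
y = rng.integers(0, 4, size=5000)  # 4 classes, as in the dense benchmark

# On the PR branch, solver="trust-ncg" would be benchmarked the same way.
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)
print(clf.coef_.shape)  # (4, 100)
```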

@lorentzenchr
Member

@Micky774 Thanks for those benchmarks. In my own attempts at benchmarking GLMs, I noticed large differences between synthetic and real data. Could you also run timings on the kicks dataset as in #23314 (comment) for lbfgs and trust-ncg (no need for sag or saga)? There, too, one could run on dense and on sparse X.

Comment on lines +437 to +441
# TODO: remove local LinearModelLoss after renaming `loss`
elif solver == "trust-ncg":
loss_ = LinearModelLoss(
base_loss=HalfBinomialLoss(), fit_intercept=fit_intercept
)
Contributor Author


The local instantiation of LinearModelLoss works around a bug caused by the reassignment of loss to a scalar value later in the code (at the end of the loop). In a separate PR, we should rename either the loss that refers to the internal loss module or the loss that refers to the scalar loss value.

@Micky774
Contributor Author

Micky774 commented Jun 9, 2022

Script for reference

@lorentzenchr The time taken by trust-ncg is significantly worse than for lbfgs:

Plot

9de6d83d-edb8-44bd-b5c8-50bd95ce9fef

Interestingly, it seems that trust-ncg does outperform lbfgs when multi_class=multinomial and penalty="l2"

Plot

8181002e-992d-4fb7-b048-a53fe9e40e37

@lorentzenchr
Member

@Micky774 Thanks again for these detailed analyses: much appreciated! Given the results, I guess we can close this PR, since unfortunately there is no improvement over the existing solvers.

@ogrisel
Member

ogrisel commented Jun 13, 2022

It would be more informative to re-run this benchmark with various tol values, because they do not necessarily mean the same thing for different solvers. It would also be interesting to try on the same dataset as in:

#23314 (comment)
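The suggested experiment can be sketched as a tol sweep recording train time, iteration count, and final objective per solver. This is an illustrative outline, not the benchmark script itself, and `"trust-ncg"` as a solver value only exists on this PR's branch:

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

results = []
for tol in (1e-2, 1e-4, 1e-6, 1e-8):
    # On the PR branch, "trust-ncg" would be added to this tuple.
    for solver in ("lbfgs",):
        clf = LogisticRegression(solver=solver, tol=tol, max_iter=10_000)
        t0 = perf_counter()
        clf.fit(X, y)
        results.append((solver, tol, perf_counter() - t0, clf.n_iter_[0]))

for row in results:
    print(row)
```

Plotting final loss against train time for each (solver, tol) pair then gives a fair time-to-suboptimality comparison regardless of what tol means internally to each solver.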

@Micky774
Contributor Author

Micky774 commented Jun 20, 2022

It would be more informative to re-run this benchmark with various tol values, because they do not necessarily mean the same thing for different solvers. It would also be interesting to try on the same dataset as in:

#23314 (comment)

Initial results using this script:

Plots

Iterations vs suboptimality

ccc00d35-b6dd-49b8-9ca4-111ae96db186

Train time vs suboptimality

c0cf96f5-4242-4825-b069-ababba81c50c

Despite outperforming with respect to n_iter, it falls behind significantly when measured by train time. It also apparently converges to a worse solution when measured with the HalfBinomial loss; however, all solvers apparently converge to the same NLL value, so I'm not sure how significant that is.

@rubywerman
Contributor

yay!! cheers

@ogrisel
Member

ogrisel commented Jun 27, 2022

Thanks for running the benchmarks with various tol values. Indeed, it does not seem to be good enough to be worth exposing in the scikit-learn API.


Successfully merging this pull request may close these issues.

LogisticRegression memory consumption goes crazy on 0.22+