
ENH save memory with LinearLoss #23090

Closed

Conversation

@lorentzenchr (Member) commented Apr 9, 2022

Reference Issues/PRs

Follow up of #21808 and #22548.

What does this implement/fix? Explain your changes.

This PR enables allocating ndarrays and reusing them in LinearModelLoss. This improves the memory footprint of:

  • LogisticRegression with solver "lbfgs" and "newton-cg"
  • TweedieRegressor, PoissonRegressor, GammaRegressor

Any other comments?

One could also provide pre-allocated arrays for the actual gradient (w.r.t. the coefficients). This one has shape=coef.shape.

If lbfgs, for instance, does 100 iterations, then the current implementation allocates 2*100 temporary arrays for gradient and loss. In particular for multiclass problems, these gradient arrays have shape=(n_samples, n_classes).
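
For a rough sense of the sizes involved, here is a minimal sketch (the n_samples and n_classes values are taken from the benchmark script further down in this thread; float64 is assumed):

import numpy as np

n_samples, n_classes = 100_000, 50
# One per-sample gradient buffer of shape (n_samples, n_classes) in float64.
# On main such a buffer is allocated on every loss/gradient evaluation; with
# this PR it would be allocated once and reused.
per_sample_gradient = np.empty((n_samples, n_classes), dtype=np.float64)
print(per_sample_gradient.nbytes / 1e6)  # ~40 MB per temporary array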

@thomasjpfan (Member) left a comment


Thanks for the PR!

Are there benchmarks showing the memory improvements?

@lorentzenchr (Member, Author) commented

An alternative to explicitly passing such temporary arrays to the methods, like

def gradient(..., per_sample_gradient_out=None):

would be to let class LinearModelLoss handle this. For example, one could implement a method:

def set_temporary_arrays(self, n_samples, n_classes, dtype):
    self.per_sample_gradient_out = np.empty(...)

And then use those temporaries implicitly.
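
A rough sketch of what that could look like (clear_temporary_arrays is hypothetical and only illustrates how the buffers could be dropped again after fit, cf. the concern about object size below; the real class also carries base_loss, fit_intercept, etc.):

import numpy as np


class LinearModelLoss:
    # Sketch: only the buffer handling is shown.

    def set_temporary_arrays(self, n_samples, n_classes, dtype):
        # The per-sample loss has shape (n_samples,); the per-sample gradient
        # has shape (n_samples, n_classes) for multiclass problems and
        # (n_samples,) otherwise.
        self.per_sample_loss_out = np.empty(n_samples, dtype=dtype)
        if n_classes > 2:
            self.per_sample_gradient_out = np.empty(
                (n_samples, n_classes), dtype=dtype, order="C"
            )
        else:
            self.per_sample_gradient_out = np.empty(n_samples, dtype=dtype)

    def clear_temporary_arrays(self):
        # Hypothetical: release the buffers at the end of fit so that a
        # stored instance does not keep them alive.
        self.per_sample_loss_out = None
        self.per_sample_gradient_out = None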

@thomasjpfan (Member) commented Apr 10, 2022

I'm +0 on having LinearModelLoss handle the temporary arrays. I think adding state complicates the object a little and will make it slightly harder for future maintainers.

Also, we would need to be careful about the size of LinearModelLoss itself. The loss object is a private attribute in GLM:

self._linear_loss = LinearModelLoss(

The temporary arrays would need to be removed once they are no longer needed, so that they do not take up unnecessary memory after fitting.

@lorentzenchr (Member, Author) commented

I'm +0 on having LinearModelLoss handle the temporary arrays.

I just wanted to point out other options. Thanks for your insights on the trade-offs.

@ogrisel (Member) commented Apr 13, 2022

Do we have any reason to keep a long-lived _linear_loss attribute on an estimator after a call to fit?

@ogrisel (Member) commented Apr 13, 2022

I am not sure it will have that much of a memory usage impact, as I expect malloc to recycle recently freed memory buffers anyway. It could, however, improve speed by avoiding too many calls to malloc.

That said, in an LBFGS run I expect around 100 calls to the loss function object, so 100 extra malloc + free pairs might be invisible.

Could you please run a quick benchmark with %timeit and memory_profiler (mprof run / mprof plot) to see the impact of this PR on a typical use case?

@lorentzenchr (Member, Author) commented

I ran the simple script under details. %timeit gives no significant difference. But I'm puzzled by the result of mprof run; I would have expected the opposite.

This PR
[mprof plot: logistic_memprof_PR]

main
[mprof plot: logistic_memprof_MAIN]

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


alpha = 0.01
n_samples, n_features, n_classes = 100_000, 100, 50

X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_features,
    n_redundant=0,
    n_classes=n_classes,
)
clf = LogisticRegression(C=1 / alpha)
clf.fit(X, y)
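
For completeness, the peak memory of the fit can also be captured in-process with memory_profiler's memory_usage helper (a sketch assuming memory_profiler is installed; the plots above come from mprof run / mprof plot on the script as-is):

from memory_profiler import memory_usage

# Re-run the fit under the profiler and record the peak resident memory in MiB.
peak = memory_usage((clf.fit, (X, y), {}), max_usage=True)
print("peak memory during fit (MiB):", peak)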

@ogrisel (Member) commented Apr 14, 2022

Is this behavior deterministic?

@lorentzenchr (Member, Author) commented Apr 14, 2022

Is this behavior deterministic?

Yes, it seems so.

jjerphan previously approved these changes Nov 18, 2022

@jjerphan (Member) left a comment


Thanks for exploring this, @lorentzenchr!

Here are a few comments that I started months ago. I guess this PR can be closed, as memory performance currently seems to be better on main. Or do you think there might be another pattern that would improve memory usage?

Comment on lines +419 to +446

if solver == "lbfgs":
    # To save some memory, we preallocate a ndarray used as per row loss and
    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
    # others).
    per_sample_loss_out = np.empty_like(target)
    if linear_loss.base_loss.is_multiclass:
        per_sample_gradient_out = np.empty(
            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
        )
    else:
        per_sample_gradient_out = np.empty_like(target, order="C")

    func = functools.partial(
        linear_loss.loss_gradient,
        per_sample_loss_out=per_sample_loss_out,
        per_sample_gradient_out=per_sample_gradient_out,
    )
elif solver == "newton-cg":
    # To save some memory, we preallocate a ndarray used as per row loss and
    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
    # others).
    per_sample_loss_out = np.empty_like(target)
    if linear_loss.base_loss.is_multiclass:
        per_sample_gradient_out = np.empty(
            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
        )
    else:
        per_sample_gradient_out = np.empty_like(target, order="C")

Can this be boiled down to this? Note that I have also specified dtype=X.dtype when creating per_sample_gradient_out and changed the comment.

Suggested change

-if solver == "lbfgs":
-    # To save some memory, we preallocate a ndarray used as per row loss and
-    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
-    # others).
-    per_sample_loss_out = np.empty_like(target)
-    if linear_loss.base_loss.is_multiclass:
-        per_sample_gradient_out = np.empty(
-            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
-        )
-    else:
-        per_sample_gradient_out = np.empty_like(target, order="C")
-    func = functools.partial(
-        linear_loss.loss_gradient,
-        per_sample_loss_out=per_sample_loss_out,
-        per_sample_gradient_out=per_sample_gradient_out,
-    )
-elif solver == "newton-cg":
-    # To save some memory, we preallocate a ndarray used as per row loss and
-    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
-    # others).
-    per_sample_loss_out = np.empty_like(target)
-    if linear_loss.base_loss.is_multiclass:
-        per_sample_gradient_out = np.empty(
-            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
-        )
-    else:
-        per_sample_gradient_out = np.empty_like(target, order="C")
+# To save some memory, we preallocate two ndarrays used respectively
+# as per row loss, gradient inside of LinearLoss by several methods
+# e.g. by LinearLoss.base_loss.{loss,gradient,gradient_hessian_product}.
+per_sample_loss_out = np.empty_like(target)
+if linear_loss.base_loss.is_multiclass:
+    per_sample_gradient_out = np.empty(
+        shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
+    )
+else:
+    per_sample_gradient_out = np.empty_like(target, dtype=X.dtype, order="C")
+if solver == "lbfgs":
+    func = functools.partial(
+        linear_loss.loss_gradient,
+        per_sample_loss_out=per_sample_loss_out,
+        per_sample_gradient_out=per_sample_gradient_out,
+    )
+elif solver == "newton-cg":

Comment on lines +456 to +460

hess = functools.partial(
    linear_loss.gradient_hessian_product,  # hess = [gradient, hessp]
    per_sample_gradient_out=per_sample_gradient_out,
    per_sample_hessian_out=per_sample_hessian_out,
)

Suggested change

-hess = functools.partial(
-    linear_loss.gradient_hessian_product,  # hess = [gradient, hessp]
-    per_sample_gradient_out=per_sample_gradient_out,
-    per_sample_hessian_out=per_sample_hessian_out,
-)
+# hess = [gradient, hessp]
+hess = functools.partial(
+    linear_loss.gradient_hessian_product,
+    per_sample_gradient_out=per_sample_gradient_out,
+    per_sample_hessian_out=per_sample_hessian_out,
+)

Comment on lines +274 to +276

# To save some memory, we preallocate a ndarray used as per row loss and
# gradient inside of LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
# others).

Is it worth being a bit more explicit?

Suggested change

-# To save some memory, we preallocate a ndarray used as per row loss and
-# gradient inside of LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
-# others).
+# To save some memory, we preallocate two ndarrays used respectively
+# as per row loss, gradient inside of LinearLoss by several methods
+# e.g. by LinearLoss.base_loss.{loss,gradient,gradient_hessian_product}.

@jjerphan dismissed their stale review November 18, 2022 08:21

Oops, I only wanted to comment but I misclicked.

@lorentzenchr deleted the linear_loss_array_reuse branch November 18, 2022 10:41