ENH save memory with LinearLoss #23090
Conversation
Thanks for the PR!
Are there benchmarks showing the memory improvements?
An alternative to explicitly calling the methods with such temporary arrays, like `def gradient(..., per_sample_gradient_out=None)`, would be to let `def set_temporary_arrays(self, n_samples, n_classes, type)` do `self.per_sample_gradient_out = np.empty(...)` and then use those temporaries implicitly.
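For illustration only, a minimal sketch of what that implicit-temporaries alternative could look like; the class and the `set_temporary_arrays`/`clear_temporary_arrays` names are hypothetical, not an actual scikit-learn API:

```python
import numpy as np


class LossWithTemporaries:
    """Hypothetical sketch of the implicit-temporaries alternative."""

    def set_temporary_arrays(self, n_samples, n_classes, dtype):
        # Allocate the per-sample buffers once, up front.
        self.per_sample_loss_out = np.empty(n_samples, dtype=dtype)
        if n_classes > 2:
            self.per_sample_gradient_out = np.empty(
                (n_samples, n_classes), dtype=dtype, order="C"
            )
        else:
            self.per_sample_gradient_out = np.empty(n_samples, dtype=dtype)

    def gradient(self, coef, X, y):
        # Would write into self.per_sample_gradient_out implicitly instead of
        # taking a per_sample_gradient_out argument on every call.
        ...

    def clear_temporary_arrays(self):
        # Drop references after fitting so the buffers can be freed.
        del self.per_sample_loss_out, self.per_sample_gradient_out
```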
I'm +0 on having that. Also, we would need to be careful about the size here: `scikit-learn/sklearn/linear_model/_glm/glm.py`, line 233 in bd9336d.
The temporary arrays would need to be removed once they are no longer needed, so they do not take up unnecessary memory after fitting.
I just wanted to point out other options. Thanks for your insights on the trade-offs.
Do we have any reason to keep a long-lived [...]?
I am not sure it will have that much of a memory usage impact, as I expect malloc to recycle recently freed memory buffers anyway. However, it could improve speed by avoiding too many calls to malloc. That said, in an LBFGS call I expect there should be ~100 calls to the loss function object, so 100 extra malloc + free calls might be invisible. Could you please run a quick benchmark with `%timeit` and [...]?
I ran the simple script under details.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

alpha = 0.01
n_samples, n_features, n_classes = 100_000, 100, 50
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_features,
    n_redundant=0,
    n_classes=n_classes,
)
clf = LogisticRegression(C=1 / alpha)
clf.fit(X, y)
```
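For completeness, one possible way to pair the timing with a peak-memory measurement is sketched below; the use of `memory_profiler` is an assumption here, since the tool requested above is not named:

```python
# Assumption: memory_profiler is installed (pip install memory_profiler).
from timeit import timeit

from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

alpha = 0.01
n_samples, n_features, n_classes = 100_000, 100, 50
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_features,
    n_redundant=0,
    n_classes=n_classes,
)
clf = LogisticRegression(C=1 / alpha)

# Peak resident memory (in MiB) sampled by memory_profiler while fitting.
peak = max(memory_usage((clf.fit, (X, y))))
print(f"peak memory during fit: {peak:.0f} MiB")

# Wall-clock time of a single fit (analogous to %timeit in IPython).
print(f"fit time: {timeit(lambda: clf.fit(X, y), number=1):.1f} s")
```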
Is this behavior deterministic?
Yes. It seems so.
Thanks for exploring this, @lorentzenchr!
Here are a few comments that I started months ago. I guess this PR can be closed, as memory performance is currently better on main. Or do you think there might be another pattern that would improve memory usage?
```python
if solver == "lbfgs":
    # To save some memory, we preallocate a ndarray used as per row loss and
    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
    # others).
    per_sample_loss_out = np.empty_like(target)
    if linear_loss.base_loss.is_multiclass:
        per_sample_gradient_out = np.empty(
            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
        )
    else:
        per_sample_gradient_out = np.empty_like(target, order="C")

    func = functools.partial(
        linear_loss.loss_gradient,
        per_sample_loss_out=per_sample_loss_out,
        per_sample_gradient_out=per_sample_gradient_out,
    )
elif solver == "newton-cg":
    # To save some memory, we preallocate a ndarray used as per row loss and
    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
    # others).
    per_sample_loss_out = np.empty_like(target)
    if linear_loss.base_loss.is_multiclass:
        per_sample_gradient_out = np.empty(
            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
        )
    else:
        per_sample_gradient_out = np.empty_like(target, order="C")
```
Can this be boiled down to this? Note that I have also specified `dtype=X.dtype` when creating `per_sample_gradient_out` and changed the comment.
```diff
-if solver == "lbfgs":
-    # To save some memory, we preallocate a ndarray used as per row loss and
-    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
-    # others).
-    per_sample_loss_out = np.empty_like(target)
-    if linear_loss.base_loss.is_multiclass:
-        per_sample_gradient_out = np.empty(
-            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
-        )
-    else:
-        per_sample_gradient_out = np.empty_like(target, order="C")
-    func = functools.partial(
-        linear_loss.loss_gradient,
-        per_sample_loss_out=per_sample_loss_out,
-        per_sample_gradient_out=per_sample_gradient_out,
-    )
-elif solver == "newton-cg":
-    # To save some memory, we preallocate a ndarray used as per row loss and
-    # gradient inside od LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
-    # others).
-    per_sample_loss_out = np.empty_like(target)
-    if linear_loss.base_loss.is_multiclass:
-        per_sample_gradient_out = np.empty(
-            shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
-        )
-    else:
-        per_sample_gradient_out = np.empty_like(target, order="C")
+# To save some memory, we preallocate two ndarrays used respectively
+# as per row loss, gradient inside of LinearLoss by several methods
+# e.g. by LinearLoss.base_loss.{loss,gradient,gradient_hessian_product}.
+per_sample_loss_out = np.empty_like(target)
+if linear_loss.base_loss.is_multiclass:
+    per_sample_gradient_out = np.empty(
+        shape=(X.shape[0], classes.size), dtype=X.dtype, order="C"
+    )
+else:
+    per_sample_gradient_out = np.empty_like(target, dtype=X.dtype, order="C")
+if solver == "lbfgs":
+    func = functools.partial(
+        linear_loss.loss_gradient,
+        per_sample_loss_out=per_sample_loss_out,
+        per_sample_gradient_out=per_sample_gradient_out,
+    )
+elif solver == "newton-cg":
```
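As a side note, the reuse pattern this suggestion builds on can be illustrated with a tiny self-contained example: a buffer is allocated once, bound to the objective via `functools.partial`, and then reused by every L-BFGS iteration. The quadratic loss below is a toy stand-in, not scikit-learn's `LinearModelLoss`:

```python
import functools

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
b = rng.standard_normal(1000)

# Per-sample buffer allocated once, outside the optimization loop.
residual_out = np.empty_like(b)


def loss_gradient(coef, A, b, residual_out=None):
    """Least-squares loss and gradient; reuses residual_out when provided."""
    if residual_out is None:  # fallback: a fresh allocation on every call
        residual_out = np.empty_like(b)
    np.dot(A, coef, out=residual_out)  # A @ coef written into the buffer
    residual_out -= b
    loss = 0.5 * (residual_out @ residual_out)
    grad = A.T @ residual_out
    return loss, grad


# Bind the preallocated buffer once; every L-BFGS iteration then writes its
# per-sample residuals into the same memory.
func = functools.partial(loss_gradient, A=A, b=b, residual_out=residual_out)
res = minimize(func, x0=np.zeros(A.shape[1]), method="L-BFGS-B", jac=True)
print(res.fun, res.nit)
```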
```python
hess = functools.partial(
    linear_loss.gradient_hessian_product,  # hess = [gradient, hessp]
    per_sample_gradient_out=per_sample_gradient_out,
    per_sample_hessian_out=per_sample_hessian_out,
)
```
```diff
-hess = functools.partial(
-    linear_loss.gradient_hessian_product,  # hess = [gradient, hessp]
-    per_sample_gradient_out=per_sample_gradient_out,
-    per_sample_hessian_out=per_sample_hessian_out,
-)
+# hess = [gradient, hessp]
+hess = functools.partial(
+    linear_loss.gradient_hessian_product,
+    per_sample_gradient_out=per_sample_gradient_out,
+    per_sample_hessian_out=per_sample_hessian_out,
+)
```
```python
# To save some memory, we preallocate a ndarray used as per row loss and
# gradient inside of LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
# others).
```
Is it worth being a bit more explicit?
```diff
-# To save some memory, we preallocate a ndarray used as per row loss and
-# gradient inside of LinearLoss, e.g. by LinearLoss.base_loss.gradient (and
-# others).
+# To save some memory, we preallocate two ndarrays used respectively
+# as per row loss, gradient inside of LinearLoss by several methods
+# e.g. by LinearLoss.base_loss.{loss,gradient,gradient_hessian_product}.
```
Oops, I only wanted to comment but I misclicked.
Reference Issues/PRs
Follow-up of #21808 and #22548.
What does this implement/fix? Explain your changes.
This PR enables allocating ndarrays once and reusing them in `LinearModelLoss`. This improves the memory footprint of:
- `LogisticRegression` with solvers `"lbfgs"` and `"newton-cg"`
- `TweedieRegressor`, `PoissonRegressor`, `GammaRegressor`
Any other comments?
One could also provide pre-allocated arrays for the actual gradient (w.r.t. the coefficients); that one has `shape=coef.shape`. If lbfgs, for instance, does 100 iterations, then the current implementation allocates 2 * 100 temporary arrays for gradient and loss. In particular, for multiclass problems these gradient arrays have `shape=(n_samples, n_classes)`.
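A rough back-of-the-envelope check of those numbers, assuming the benchmark settings from the script above (100_000 samples, 50 classes, float64 data) and ~100 lbfgs iterations:

```python
import numpy as np

n_samples, n_classes, n_iterations = 100_000, 50, 100
itemsize = np.dtype(np.float64).itemsize  # 8 bytes

# One multiclass per-sample gradient array of shape (n_samples, n_classes).
gradient_bytes = n_samples * n_classes * itemsize
print(f"one gradient array: {gradient_bytes / 1e6:.0f} MB")  # ~40 MB

# Without reuse, each iteration allocates a fresh per-sample loss and
# gradient array, i.e. roughly 2 * n_iterations temporaries per fit.
loss_bytes = n_samples * itemsize
total = n_iterations * (gradient_bytes + loss_bytes)
print(f"cumulative temporary allocations: {total / 1e9:.1f} GB")  # ~4.1 GB

# As noted earlier in the discussion, malloc typically recycles these
# buffers, so peak memory stays far below the cumulative total.
```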