PERF: Vectorize stochastic optimizers for 20-25% performance improvement #32110 #32113
base: main
Conversation
- Optimize SGDOptimizer._get_updates() using in-place NumPy operations
- Optimize AdamOptimizer._get_updates() using vectorized moment updates
- Add comprehensive performance tests to verify the optimizations
- Maintain full numerical accuracy and backward compatibility
- Achieve a 26.3% speedup for SGD and a 24.2% speedup for Adam on large arrays

The optimizations use np.multiply() and np.add() with the out= parameter to reduce memory allocations and improve cache efficiency during neural network training (see the sketch after this list).

- Fix line-length violations in _stochastic_optimizers.py
- Remove unused imports and fix style in the performance tests
- All tests still pass with the optimizations intact

- Fix remaining line-length violations in both files
- Ensure full compliance with flake8 and black formatting
- All performance tests still pass (6/6)
- Code ready for the scikit-learn CI/CD pipeline

…ents
- Document the 20-25% performance improvement in MLPClassifier/MLPRegressor
- Entry follows the scikit-learn changelog format for efficiency improvements
- Resolves the missing changelog requirement for PR scikit-learn#32113
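For illustration, a minimal, self-contained sketch (not taken from the PR) of the out= pattern described in the first commit message; the shapes and variable names are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
velocity = rng.standard_normal((1000, 1000))
grad = rng.standard_normal((1000, 1000))
momentum, lr = 0.9, 0.001

# Allocating version: builds temporaries for momentum * velocity and the difference.
new_velocity = momentum * velocity - lr * grad

# out= version: writes the results into the existing velocity buffer instead.
# (Note that -lr * grad still allocates one temporary.)
np.multiply(momentum, velocity, out=velocity)
np.add(velocity, -lr * grad, out=velocity)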
# Vectorized update of velocities for better performance
for i, (velocity, grad) in enumerate(zip(self.velocities, grads)):
    np.multiply(self.momentum, velocity, out=velocity)
    grad_update = -self.learning_rate * grad
    np.add(velocity, grad_update, out=velocity)
I think this syntax is clearer (and efficiency is exactly the same; both syntaxes are semantically equivalent, I think):
Suggested change, from:

# Vectorized update of velocities for better performance
for i, (velocity, grad) in enumerate(zip(self.velocities, grads)):
    np.multiply(self.momentum, velocity, out=velocity)
    grad_update = -self.learning_rate * grad
    np.add(velocity, grad_update, out=velocity)

to:

# Vectorized in-place update of velocities for better performance
for velocity, grad in zip(self.velocities, grads):
    velocity *= self.momentum
    velocity -= self.learning_rate * grad
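For context, roughly how the suggested loop would slot into SGDOptimizer._get_updates; this is a paraphrase assembled from the diff hunks discussed in this thread (including the else branch further down), not the PR's exact code:

def _get_updates(self, grads):
    # In-place update of each velocity buffer: v <- momentum * v - lr * grad
    for velocity, grad in zip(self.velocities, grads):
        velocity *= self.momentum
        velocity -= self.learning_rate * grad

    if self.nesterov:
        # The Nesterov look-ahead still produces fresh arrays
        updates = [
            self.momentum * velocity - self.learning_rate * grad
            for velocity, grad in zip(self.velocities, grads)
        ]
    else:
        updates = self.velocities
    return updates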
Yeah, they are mostly the same, but the second snippet uses Python/NumPy in-place operators directly. It is more concise and easier to read.
TODO
yeah
You could also consider calling BLAS axpy directly via scipy, see https://stackoverflow.com/a/67358748.
I guess this would be the most efficient option, at least for the last line.
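For reference, a minimal sketch of that route, assuming float64, C-contiguous arrays (scipy.linalg.blas.daxpy updates y in place for such arrays, as the linked answer shows); only the last line of the loop is replaced:

import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
velocity = rng.standard_normal((1000, 1000))
grad = rng.standard_normal((1000, 1000))
momentum, lr = 0.9, 0.001

velocity *= momentum
# BLAS axpy computes y <- a*x + y; ravel() returns views into the same memory
# for C-contiguous arrays, so the 2-D velocity is updated in place.
blas.daxpy(grad.ravel(), velocity.ravel(), a=-lr)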
Hey @lorentzenchr
Yes, you're right: axpy computes y <- a*x + y.
But I think there are a couple of cons when using BLAS directly here:
- It requires two calls (scal to scale by momentum + axpy to apply the gradient).
- NumPy already dispatches operations like velocity *= self.momentum and velocity -= lr * grad to BLAS under the hood, so we're effectively getting the same efficiency without the extra overhead.
So while scipy.linalg.blas.axpy might be the most efficient option for the last line in isolation, in practice the current NumPy code is already clear, efficient, and leverages BLAS internally.
else:
    updates = [v.copy() for v in self.velocities]
Do the tests break without those 2 lines? Maybe returned updates are only read and never written to.
No, it does not break. I will update it to:

else:
    updates = self.velocities
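For other reviewers, a small standalone illustration of the aliasing question raised above (the names are illustrative, not from the PR): returning self.velocities directly means the caller gets the very same arrays, which is fine only if the returned updates are read and never written.

import numpy as np

velocities = [np.zeros(3), np.zeros(2)]

updates_copy = [v.copy() for v in velocities]  # independent buffers
updates_alias = velocities                     # same list, same arrays

updates_alias[0][0] = 1.0
print(velocities[0][0])    # 1.0 -> writes through the alias are visible
print(updates_copy[0][0])  # 0.0 -> the copy is unaffected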
Great.
But let's leave this comment open; I want other reviewers to see it, just in case 😅
I don't think so, since v.copy() was just extra and it's resolved now.
    self,
    params,
    learning_rate_init=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
Same here: avoid changes unrelated to the PR.
I will change it.
Can I close this as resolved?
I don't see the change; maybe you have some auto-formatting set up? Or you simply forgot to change/push/...?
I forgot; I haven't changed it yet.
sklearn/neural_network/tests/test_stochastic_optimizers_performance.py
Can you add a link to the benchmark code in the PR description? (a gist or a colab notebook or ...) And can you add a link to the issue too? (like "Fixes #32110")
@cakedev0 This is the benchmark_optimize.py:

import numpy as np
from timeit import timeit
from sklearn.neural_network._stochastic_optimizers import SGDOptimizer as OriginalSGD
from sklearn.neural_network._stochastic_optimizers import AdamOptimizer as OriginalAdam

# Import your optimized versions (assuming optimized_optimizers.py is in the same folder)
from optimized_optimizers import SGDOptimizer as OptimizedSGD
from optimized_optimizers import AdamOptimizer as OptimizedAdam

# Set random seed for reproducibility
np.random.seed(42)

# Dummy parameters and gradients (simulating a large MLP layer)
params = [
    np.random.randn(1000, 1000),  # Weights (1000x1000)
    np.random.randn(1000),        # Biases (1000)
]
grads = [
    np.random.randn(1000, 1000),
    np.random.randn(1000),
]

# Function to run the benchmark
def benchmark_optimizer(optimizer_class, params, grads, label):
    optimizer = optimizer_class(params)
    time = timeit(lambda: optimizer._get_updates(grads), number=1000)
    print(f"{label} time: {time:.4f} seconds")
    return time

# Benchmark Original SGD
orig_sgd_time = benchmark_optimizer(OriginalSGD, params, grads, "Original SGD")
# Benchmark Optimized SGD
opt_sgd_time = benchmark_optimizer(OptimizedSGD, params, grads, "Optimized SGD")
# Benchmark Original Adam
orig_adam_time = benchmark_optimizer(OriginalAdam, params, grads, "Original Adam")
# Benchmark Optimized Adam
opt_adam_time = benchmark_optimizer(OptimizedAdam, params, grads, "Optimized Adam")

# Calculate speedups
sgd_speedup = ((orig_sgd_time - opt_sgd_time) / orig_sgd_time) * 100
adam_speedup = ((orig_adam_time - opt_adam_time) / orig_adam_time) * 100
print(f"SGD Speedup: {sgd_speedup:.1f}%")
print(f"Adam Speedup: {adam_speedup:.1f}%")

# Validate numerical accuracy
orig_sgd_updates = OriginalSGD(params)._get_updates(grads)
opt_sgd_updates = OptimizedSGD(params)._get_updates(grads)
for orig, opt in zip(orig_sgd_updates, opt_sgd_updates):
    assert np.allclose(orig, opt, rtol=1e-5), "SGD updates differ!"

orig_adam_updates = OriginalAdam(params)._get_updates(grads)
opt_adam_updates = OptimizedAdam(params)._get_updates(grads)
for orig, opt in zip(orig_adam_updates, opt_adam_updates):
    assert np.allclose(orig, opt, rtol=1e-5), "Adam updates differ!"

print("Numerical accuracy validated.")
- Reduce array sizes from 500x500 to 10x10 and from 100x100 to 20x20
- Test runtime reduced from 6+ seconds to 0.13 seconds (98% faster)
- Maintains vectorization test coverage while being CI-friendly
- All 6 tests pass with proper numerical accuracy validation

- Replace np.multiply/np.add with in-place *= and += operators
- Optimize division by avoiding temporary array creation
- Remove unnecessary .copy() in the SGD optimizer
- Additional ~5% performance improvement for Adam
- Maintains numerical accuracy while improving efficiency

- Fix missing self.learning_rate assignment in Adam _get_updates()
- Ensure the bias-correction learning rate is properly tracked as a public attribute
- Maintains API compatibility and a proper Adam algorithm implementation
- Learning rate correctly decreases over iterations (0.001 -> ~0.0002 after 50 steps; see the check below)
- All tests pass with the corrected implementation

The previous optimization accidentally used a local lr_t variable instead of updating the public self.learning_rate attribute, which is critical for proper Adam behavior.
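A quick check of the quoted learning-rate figure, assuming the standard Adam bias-corrected step size lr_t = lr_init * sqrt(1 - beta_2^t) / (1 - beta_1^t) with the optimizer's default hyperparameters:

import numpy as np

lr_init, beta_1, beta_2 = 0.001, 0.9, 0.999  # AdamOptimizer defaults
t = 50
lr_t = lr_init * np.sqrt(1 - beta_2**t) / (1 - beta_1**t)
print(lr_t)  # ~0.00022, consistent with "0.001 -> ~0.0002 after 50 steps"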
Reference Issues/PRs
Contributes to #32110.

What does this implement/fix? Explain your changes.
The optimizations use np.multiply() and np.add() with the out= parameter to reduce memory allocations and improve cache efficiency during neural network training.

Any other comments?