MNT: Remove duplicated data validation done in internally used BinaryTrees #19418
Conversation
I get a strange behaviour when running the tests. From this PR's base commit and after having applied some of the first commit's patches (1828e00), if I run all the tests at once using pytest, some tests which aren't related to those changes fail. Those tests don't fail if I run the suite module-wise, for instance: find sklearn -maxdepth 1 -type d -wholename "*/*" -exec pytest {} \; Still, the failing tests are related to input checks. Might it be linked to the configuration's thread-unsafety? @jnothman: do you have any idea? 🙂 Edit: tests are passing using a module-wise run as of 32a5cff.
Can you see what happens when you include #18736? That PR should make the configuration thread-safe.
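For illustration, here is a minimal sketch (my own, not code from the PR) of the thread-safety concern: before #18736 the settings lived in a single process-global dict shared by all threads, so a config_context entered in one thread could be observed from another.

# Minimal illustration (assumes the pre-#18736 behaviour where the sklearn
# configuration is a single process-global dict shared by all threads).
import threading
from sklearn import config_context, get_config

observed = []

def reader():
    # With a process-global config, this can see assume_finite=True if it
    # runs while another thread is inside the context manager below.
    observed.append(get_config()["assume_finite"])

with config_context(assume_finite=True):
    t = threading.Thread(target=reader)
    t.start()
    t.join()

print(observed)  # [True] with a shared global config; [False] once the config is thread-local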
At a glance, this should speed up the computation. Do we have benchmarks that demonstrate the improvement?
We remove the validation of c done at the beginning of KDTree.__init__ and KDTree.query_radius as c already got validated at the beginning of feature_selection._estimate_mi.
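For context, the change follows roughly this pattern (a simplified sketch of my own, not the PR's actual diff; the real call sites live in feature_selection's mutual information helpers):

# Simplified sketch of the pattern (not the actual diff): skip the duplicated
# finiteness check inside KDTree by declaring the data already validated.
import numpy as np
from sklearn import config_context
from sklearn.neighbors import KDTree

def query_radius_without_revalidation(c, radius):
    # c is assumed to have been validated upstream (e.g. in _estimate_mi),
    # so the internal check_array calls can skip the finiteness scan.
    with config_context(assume_finite=True):
        tree = KDTree(c)
        return tree.query_radius(c, radius, count_only=True)

counts = query_radius_without_revalidation(np.random.RandomState(0).rand(1000, 1), 0.01)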
I am currently performing some benchmarks on the public interfaces impacted by those changes. A CI test is failing, but it seems unrelated to the changes here: is that the case? 🤔
It is unrelated; I opened #20071 to adjust the atol based on 32-bitness.
I am currently performing some benchmarks on the public interfaces impacted by those changes.
Could you post a summary here? I wasn't sure which commits were being compared.
I was under the impression that checking for finite values is expensive, but I'm no longer so sure:
In [1]: from sklearn.utils import check_array
In [2]: import numpy as np
In [3]: x = np.random.RandomState(0).rand(10000, 500)
In [4]: %timeit x + 1
6.98 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit x.copy()
6.33 ms ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit check_array(x)
2.16 ms ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here is an extract of the benchmarks' report. My machines are currently busy running other benchmarks, but I'd like to get back to this soon with a better benchmark plan for the internal APIs. This PR's changes may give no improvement.
The overhead of the finite check appears with slightly bigger datasets:

from sklearn.utils import check_array
import numpy as np

x = np.random.RandomState(0).rand(100_000, 500)

%timeit check_array(x, force_all_finite=False)
# 14.3 µs ± 85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit check_array(x, force_all_finite=True)
# 31.5 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
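The gap is presumably because with force_all_finite=False check_array can return the float64 array essentially unchanged, whereas the finite check has to scan all 100_000 × 500 values.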
I reran those benchmarks. It looks like performance decreased generally.
Full report
· Creating environments
· Discovering benchmarks
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[ 0.00%] · For scikit-learn commit 903125f2 (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 8.33%] ··· knn.KernelDensityRemovedCheck.time_query                      ok
[ 8.33%] ···
             ======= ============ ============
             --                  d
             ------- -------------------------
                n         10          100
             ======= ============ ============
               100    1.05±0.2ms   5.35±0.2ms
              1000    146±10ms     344±2ms
             10000    10.9±0.6s    48.7±2s
             ======= ============ ============
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
What do you think @thomasjpfan and @rth? Should we explore it differently?
I reran them on a machine with many more cores and got a different summary:
Full report
→ asv continuous -b RemovedCheck main remove_validation_internal_trees
· Creating environments
· Discovering benchmarks.
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl...
·· Installing 903125f2 into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl..
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[ 0.00%] · For scikit-learn commit 903125f2 (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 8.33%] ··· 19418.KernelDensityRemovedCheck.time_query                    ok
[ 8.33%] ···
             ======= ============= ============
             --                   d
             ------- --------------------------
                n          10          100
             ======= ============= ============
               100    1.10±0.01ms   2.58±0.1ms
              1000    99.8±0.6ms    269±2ms
             10000    10.6±0.04s    22.3±0.01s
             ======= ============= ============
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
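For reference, the actual benchmark file behind KernelDensityRemovedCheck.time_query is not shown in this thread; an asv benchmark of that shape would look roughly like this (a hypothetical sketch following standard asv conventions, not the file used for the runs above):

# Hypothetical sketch of an asv benchmark similar to the one referenced above;
# the actual benchmark used for these runs is not included in this thread.
import numpy as np
from sklearn.neighbors import KernelDensity

class KernelDensityRemovedCheck:
    params = ([100, 1000, 10000], [10, 100])
    param_names = ["n", "d"]

    def setup(self, n, d):
        rng = np.random.RandomState(0)
        self.X = rng.rand(n, d)
        self.estimator = KernelDensity().fit(self.X)

    def time_query(self, n, d):
        # The query is the part affected by the removed validation.
        self.estimator.score_samples(self.X)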
Thank you for providing the benchmarking updates!
I reran this on my machine for c882688 and found that the benchmarks haven't changed significantly.
If there is no measurable impact, I would say we keep the files unchanged and close this PR?
I profiled it with viztracer, using the setup from this blog post.
It looks like the overhead is negligible compared to the rest of the runtime.
Script for the other, non-annotated checks:
from sklearn.datasets import make_regression, make_classification
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
from sklearn.neighbors import NearestNeighbors, KernelDensity


def main(args=None):
    X_train, y_train = make_classification(n_samples=10_000, n_features=10)
    X_test, _ = make_classification(n_samples=10_000, n_features=10)

    mutual_info_regression(X_train, y_train, discrete_features=False)
    mutual_info_classif(X_train, y_train, discrete_features=False)

    kde = KernelDensity().fit(X_train, y_train)
    kde.score_samples(X_train)


if __name__ == "__main__":
    main()
Let's close.
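To reproduce the trace, assuming the script above is saved as, say, profile_mi.py (a file name I'm making up here), it can be run under viztracer with something like viztracer profile_mi.py and the resulting result.json inspected with vizviewer result.json.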
with config_context(assume_finite=True):
    # We remove the validation of the query points
    # (in *parallel_kwargs) done at the beginning of
    # BinaryTree.query as those points already got
    # validated in the caller.
The checks' running time (represented by the small spike at the beginning of the trace) is negligible here.
Script
# profile.py
from sklearn.datasets import make_regression
from sklearn.neighbors import NearestNeighbors


def main(args=None):
    X_train, _ = make_regression(n_samples=100_000, n_features=10)
    X_test, _ = make_regression(n_samples=100_000, n_features=10)

    nn = NearestNeighbors(algorithm='kd_tree').fit(X_train)
    nn.kneighbors(X_test, n_neighbors=2)


if __name__ == "__main__":
    main()

giltracer --state-detect profile.py
with config_context(assume_finite=True):
    # We remove the validation of the query points
    # (in *parallel_kwargs) done at the beginning of
    # BinaryTree.query_radius as those points already
    # got validated in the caller.
Similarly, the checks' running time is negligible here.
Script
# profile.py
from sklearn.datasets import make_regression
from sklearn.neighbors import NearestNeighbors


def main(args=None):
    X_train, _ = make_regression(n_samples=100_000, n_features=10)
    X_test, _ = make_regression(n_samples=100_000, n_features=10)

    nn = NearestNeighbors(algorithm='kd_tree').fit(X_train)
    nn.kneighbors(X_test, n_neighbors=2)


if __name__ == "__main__":
    main()

giltracer --state-detect profile.py
with config_context(assume_finite=True):
    # In the following cases, we remove the validation of X done at
    # the beginning of the BinaryTree's constructors as X already got
    # validated when calling this method, NeighborsBase._fit.
The checks' running time (represented by the small spike at the beginning) is negligible here.
Script
# profile.py
from sklearn.datasets import make_regression
from sklearn.neighbors import NearestNeighbors


def main(args=None):
    X_train, _ = make_regression(n_samples=100_000, n_features=10)
    X_test, _ = make_regression(n_samples=100_000, n_features=10)

    nn = NearestNeighbors(algorithm='kd_tree').fit(X_train)
    nn.kneighbors(X_test, n_neighbors=2)


if __name__ == "__main__":
    main()

giltracer --state-detect profile.py
Reference Issues/PRs
Fixes #18749.
What does this implement/fix? Explain your changes.
BinaryTrees validate the data in some of their methods. Those interfaces are used internally by some algorithms which already validate the data.
Hence, this PR proposes to remove the extra, duplicated data validation.
Any other comments?
This uses the global context manager for configuration, as suggested by @cmarmo in #18749.
I am not entirely sure if I've used it in the intended way, especially as this context manager gets called in joblib.delayed calls, which themselves get serialized.
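To make that concern concrete, here is a minimal sketch (my illustration, not the PR's code): the context manager is entered in the parent, but the delayed calls may execute in worker processes where the global configuration is not automatically propagated.

# Minimal illustration (not the PR's code) of the propagation question:
# does a worker spawned by joblib see the parent's assume_finite setting?
from joblib import Parallel, delayed
from sklearn import config_context, get_config

def read_assume_finite():
    # With a process-based backend this may not reflect the parent's context.
    return get_config()["assume_finite"]

with config_context(assume_finite=True):
    results = Parallel(n_jobs=2)(delayed(read_assume_finite)() for _ in range(2))

print(results)  # e.g. [False, False] if the configuration does not propagate to workers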