Skip to content

Avoid extra copy when using astype in sparsefuncs_fast #11966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 2, 2018

Conversation

massich
Copy link
Contributor

@massich massich commented Sep 1, 2018

Reference Issues/PRs

What does this implement/fix? Explain your changes.

It avoids getting an extra copy of the data in the following case:

X = np.empty([10,10], dtype=np.float64)
X = X.astype(np.float64)

Files to review:

  • sklearn/cluster/hierarchical.py
  • sklearn/cluster/k_means_.py
  • sklearn/datasets/base.py
  • sklearn/datasets/covtype.py
  • sklearn/datasets/kddcup99.py
  • sklearn/datasets/openml.py
  • sklearn/datasets/rcv1.py
  • sklearn/datasets/samples_generator.py
  • sklearn/datasets/twenty_newsgroups.py
  • sklearn/ensemble/gradient_boosting.py
  • sklearn/externals/_pilutil.py
  • sklearn/externals/joblib/init.py
  • sklearn/feature_extraction/_hashing.pyx
  • sklearn/feature_extraction/text.py
  • sklearn/feature_selection/mutual_info_.py
  • sklearn/gaussian_process/kernels.py
  • sklearn/impute.py
  • sklearn/linear_model/base.py
  • sklearn/linear_model/logistic.py
  • sklearn/manifold/t_sne.py
  • sklearn/metrics/cluster/expected_mutual_info_fast.pyx
  • sklearn/metrics/cluster/supervised.py
  • sklearn/model_selection/_split.py
  • sklearn/model_selection/_validation.py
  • sklearn/multiclass.py
  • sklearn/naive_bayes.py
  • sklearn/neighbors/approximate.py
  • sklearn/neighbors/kde.py
  • sklearn/preprocessing/_encoders.py
  • sklearn/preprocessing/base.py
  • sklearn/preprocessing/data.py
  • sklearn/preprocessing/imputation.py
  • sklearn/random_projection.py
  • sklearn/svm/base.py
  • sklearn/tree/export.py
  • sklearn/utils/class_weight.py
  • sklearn/utils/estimator_checks.py
  • sklearn/utils/fixes.py
  • sklearn/utils/linear_assignment_.py
  • sklearn/utils/multiclass.py
  • sklearn/utils/validation.py
  • sklearn/utils/sparsefuncs_fast.pyx

Any other comments?

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @massich !

Waiting for CI..

@massich
Copy link
Contributor Author

massich commented Sep 1, 2018

Actually, the list of files can be ignored right now and we can do it in a subsequent PR

@rth
Copy link
Member

rth commented Sep 1, 2018

For a random sparse CSR array of shape (5000, 40000) with a 0.01 sparsity, this reduces the runtime of csr_row_norms by ~50% which is pretty nice (I have not checked the other methods). Thanks!

@rth rth changed the title Avoid extra copy when using astype Avoid extra copy when using astype in sparsefuncs_fast Sep 1, 2018
@lesteve
Copy link
Member

lesteve commented Sep 2, 2018

I timed the modified functions following @rth comment and here are the results:

import numpy as np
from scipy.sparse import random
from sklearn.utils.sparsefuncs_fast import (
    csr_row_norms, csr_mean_variance_axis0,
    csc_mean_variance_axis0, incr_mean_variance_axis0)

csr = random(5000, 40000, format='csr')
csc = csr.asformat('csc')

print('csr_row_norms')
%timeit csr_row_norms(csr)

print('csr_mean_variance_axis0')
%timeit csr_mean_variance_axis0(csr)

print('csc_mean_variance_axis0')
%timeit csc_mean_variance_axis0(csc)

print('incr_mean_variance_axis0')
%timeit incr_mean_variance_axis0(csr, np.zeros(40000), np.ones(40000), np.array([1]))

Master:

csr_row_norms
6.7 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
csr_mean_variance_axis0
21.2 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
csc_mean_variance_axis0
15.3 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
incr_mean_variance_axis0
22 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This PR:

csr_row_norms
2.05 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
csr_mean_variance_axis0
17.1 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
csc_mean_variance_axis0
10.8 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
incr_mean_variance_axis0
17.9 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@lesteve lesteve merged commit 51b1b7c into scikit-learn:master Sep 2, 2018
@lesteve
Copy link
Member

lesteve commented Sep 2, 2018

Merging, thanks a lot @massich and nice tracking down of something that should have failed but did not in #11964 @rth!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants