Skip to content

[MRG] Add support for 64 bit indices in CSR array normalization #9663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 5, 2017

Conversation

rth
Copy link
Member

@rth rth commented Aug 31, 2017

This adds fused type to support 64 bit indices in csr_row_norms, inplace_csr_row_normalize_l1 and inplace_csr_row_normalize_l2. Fixes #6468 and begins to adress #2969 .

In this PR, for CSR array indices I define a fused integral type ( int, long long) which should produce either 32 bit and 64 bit respectively on Linux / Windows and 64bit / 32bit platforms as far as I understand. The array indices variables are always np.npy_intp, while the array shapes (M, N) are long long to support up to 64 bit array sizes. Please let me know if I missed something.

cc @jnothman @lesteve @ogrisel

Also tested that the performance is not affected (with 32bit CSR indices) using the following script (needs to be run with ipython),

import numpy as np
import scipy.sparse as sp
from IPython import get_ipython

from sklearn.utils.extmath import row_norms
from sklearn.utils.sparsefuncs_fast import (inplace_csr_row_normalize_l1,
                                            inplace_csr_row_normalize_l2)

ipython = get_ipython()
np.random.seed(43)
Xcsr = sp.random(10000, 10000, format='csr')

print('row_norm(Xcsr): ', end='')
ipython.magic("%timeit row_norms(Xcsr)")
print('inplace_csr_row_normalize_l1(Xcsr): ', end='')
ipython.magic("%timeit inplace_csr_row_normalize_l1(Xcsr)")
print('inplace_csr_row_normalize_l2(Xcsr): ', end='')
ipython.magic("%timeit inplace_csr_row_normalize_l2(Xcsr)")

@rth rth changed the title Add fused type to support 64 bit indices in CSR array normalizations [MRG] Add support for 64 bit indices in CSR array normalization Aug 31, 2017
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Please add to what's new... ideally listing which estimators can now handle the large CSR data, but perhaps then we need a test for them. Hmm.

@rth
Copy link
Member Author

rth commented Sep 1, 2017

@jnothman Thanks for the review! Maybe we could add the "what's new" entry once we actually have a mechanism for testing estimators with 64bit csr as you suggested here #2969 (comment) . Working on that..

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 as well for waiting to use that in actual estimator before adding an entry in changelog.

@rth
Copy link
Member Author

rth commented Sep 5, 2017

@jnothman Please let me know if you have other comments on this PR. Thanks!

@jnothman
Copy link
Member

jnothman commented Sep 5, 2017

Okay. You just make sure the changelog is appropriately attributed, etc. Thanks again @rth

@jnothman jnothman merged commit d551227 into scikit-learn:master Sep 5, 2017
massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017
@rth rth deleted the fused-types-l2-norm branch October 6, 2017 15:13
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants