PERF: Use python integer on _count_reduce_items #21465
Merged
I am half curious now what would happen if we use `items = nt.intp(items)` at the very end. It adds the overhead of creating the scalar, but besides the unfortunate `__array_ufunc__` overhead, it might actually be faster and also a bit cleaner.
IIRC, that code was tricky to get right for all the corner cases. That type conversion was probably there for a reason.
Yeah, it seems later code expects a NumPy object (with a `.ndim` attribute) in some branches. gh-20175 might help a bit here in any case (it should be merged soon). Otherwise, we would have to check whether it is still worthwhile to convert later, or whether the using code can be refactored easily. It may well make sense to convert to `np.array(items, dtype=np.intp)` here, actually!
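A small sketch (my own, not code from the PR) of why downstream branches that check `.ndim` care about this conversion: a plain Python `int` has no `.ndim`, while the 0-d array produced by `np.array(items, dtype=np.intp)` does.

```python
import numpy as np

items = 3                                  # plain Python int
as_intp = np.array(items, dtype=np.intp)   # 0-d NumPy array

# Later branches reportedly expect a NumPy object exposing .ndim;
# a plain int has no such attribute.
print(hasattr(items, "ndim"))   # False
print(as_intp.ndim)             # 0
```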
This one is tricky to get right. We seem to agree the calculation inside `_count_reduce_items` can be done in plain `int`, but the output should be an `nt.intp` or an `nt.ndarray(..., dtype=nt.intp)`. The reason is that if the `where` argument is not `True`, the output of `_count_reduce_items` is not a scalar but an array.

Some timings for the calculation (the third one is the suggestion by seberg):

So unfortunately, the main performance win is because the `true_divide` in `_mean` is not very efficient for NumPy scalar type inputs, while it is for `int` type input. Benchmark to confirm:

results in
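The original benchmark snippet and its output were not captured here. A minimal sketch of such a timing comparison (the array size and repeat count are illustrative choices of mine, not from the PR) might look like:

```python
import timeit
import numpy as np

a = np.arange(100.0)

# Compare dividing by a plain Python int with dividing by a NumPy intp
# scalar; the PR discussion suggests the intp path was noticeably slower
# in the NumPy version under discussion.
t_int = timeit.timeit(lambda: a / 10, number=10_000)
t_intp = timeit.timeit(lambda: a / np.intp(10), number=10_000)
print(f"int divisor:  {t_int:.4f}s")
print(f"intp divisor: {t_intp:.4f}s")
```

Both divisions produce the same values; only the scalar type of the divisor differs.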
Casting the result to `nt.intp` and casting back to `int` in the case `where != True` and `rcount.ndim == 0` is not a solution. This code:

(which does not even handle the `where` case!) already loses performance.

The best solution would be to make `true_divide` efficient for `np.intp` input as well, but that is outside the scope of this PR.

I updated the PR to keep the output of `_count_reduce_items` an `int`, but handle the corner cases and make the tests pass. The performance gain is still there, and the code is not much worse, so it might be worthwhile to merge. The alternative is to put this PR on hold and first see whether we can improve the performance by other means.

One of the corner cases is input of an array with `dtype=object`. If `rcount` is not cast to `nt.intp`, we end up with a pure integer division, which results in a scalar and not an array.

Output:
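As a hedged illustration of the object-dtype corner case (my own sketch, not the PR's snippet): summing an object array yields a plain Python scalar, and the type of the division result then depends on whether the divisor is a Python `int` or an `np.intp`.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=object)
s = a.sum()  # for dtype=object this is a plain Python int

# Dividing by a plain int stays in pure Python and yields a Python float;
# dividing by an intp scalar goes through NumPy and yields np.float64.
print(type(s / 3))           # <class 'float'>
print(type(s / np.intp(3)))  # <class 'numpy.float64'>
```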
@eendebakpt you may want to apply gh-21188 before spending too much time on certain things. If `np.float64(3.) / 3` is much faster than `np.float64(3.) / np.intp(3)`, then that PR should fix at least that part. (The PR only affects scalar code paths, though.)
@seberg With recent PRs the performance gap is smaller, so there is no need for a special case with `int` any more. There is still a performance gap, though, so if you could enable the fast paths you mentioned or perform the "fixing", that would be great.
I was thinking of this diff:

because I thought a good deal of time in the type resolver ended up being spent here. But it doesn't seem to make a huge difference now, so I am not sure it is actually worth it right now (I have to check more).
OK, that diff is wrong, because the important part is the one where `0` is returned (the path can be adapted for the interesting case by checking `casting == NPY_SAFE_CASTING` and then returning the table value). But I am still not really seeing any worthwhile speedup; maybe I am just testing the wrong thing tonight.

EDIT: Fixed the diff (not that it matters much; even for `arr / 0darr` or `arr + 0darr` I don't really see it making a big difference):
@seberg I found the path that triggers the legacy type resolution, for both `np.array([1., 2.]) / 10` and `np.array([1., 2.]) / np.intp(10)`:

- `ufunc_generic_fastcall` calls `convert_ufunc_arguments`.
- `convert_ufunc_arguments` is called with `nin=2`, `nout=1` and `allow_legacy_promotion=1`. The first argument is a float64 array, the second a float64 scalar.
- In `convert_ufunc_arguments` this results in `all_scalar=0, any_scalar=1`, which sets `force_legacy_promotion` to one.

Is this something we can improve?
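A rough illustration (my own sketch, not code from the PR) of the value-based behavior that the scalar/array mix has to preserve: deciding the result dtype for an array combined with a Python scalar depends on the scalar itself, which is why such calls are routed to the legacy promotion path.

```python
import numpy as np

a = np.array([1, 2], dtype=np.int8)

# A Python int scalar does not upcast the array: the result stays int8.
# Determining this requires looking at the scalar operand, so a
# scalar/array mix cannot use the plain dtype-only promotion table.
print((a + 1).dtype)  # int8
```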
Yes, but I wrote a 2800-line file on that general problem ;).

More seriously: on the path towards the above TODO, there may be a middle ground that just makes the "obvious" paths fast (this one happens to be obvious). I.e. once the TODO is half gone, `force_legacy_promotion` could probably be removed, as long as we take some care later (and make the paths where it has to kick in fast enough!). That may fall out naturally, but maybe it would just add complexity. So overall, I wouldn't really aim for speeding things up, except by saying: hopefully that NEP will be implemented soon enough, and then we can make sure things are fast.