Py: Complete implementation of mpz_ and or xor for negative arguments #1810
Conversation
Thanks! It increases code size on the Thumb2 arch by 536 bytes. Can you think of any tricks to reduce that? These are uncommon operations (evidenced by the fact that they have not been needed so far) so their implementation can be slow to save code size.
There are 7 instances in mpz_int of the same pattern, in the operations sub, and, xor, and_neg, and_neg2, or_neg, and xor_neg. This could be made into a subroutine (tail) call and used in all seven places, or just in the _neg cases if performance of the others is crucial. It should also be possible to combine your original and_neg, which handled two cases, and my new and_neg2, which handles the third case, at some runtime cost, similar to what I did with xor and or. I don't compile for Thumb2 presently... is there a quick and easy way for me to see the effect of these changes? Something like Travis CI? Or should I just try it on unix and push it again?
I'll have to think a bit about the best way to reduce the code. How about implementing (-a) & (-b) using the positive and function? You can test code size on any arch and the reduction should scale roughly proportionally (eg if you can halve code size on x86, it will roughly halve on thumb2). Just remember to compile with debugging disabled so assertions don't muck up the size metrics (ie use -Os and -DNDEBUG).
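For reference, a brute-force check of the standard two's-complement route for that case (my own sketch in plain Python; whether this is exactly the reduction meant above is an assumption): since -x == ~x + 1, we get ~(-x) == x - 1, so the negative-negative and reduces to a positive or.

# -a & -b == ~(a-1) & ~(b-1) == ~((a-1) | (b-1)) == -(((a-1) | (b-1)) + 1)
for a in range(1, 256):
    for b in range(1, 256):
        assert -a & -b == -(((a - 1) | (b - 1)) + 1)
print("identity holds for all tested magnitudes")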
Here are some candidate size optimizations that should only affect and_neg, or_neg, and xor_neg performance.
It now adds 356 bytes on Thumb2, which is not too bad.
    i = (j & (-k)) = (j & (~k + 1))
    i = ((-j) & k) = ((~j + 1) & k)
    computes general form:
    i = (im ^ (((j ^ jm) + jc) & ((k ^ km) + kc))) + ki   where Xm = Xc == 0 ? 0 : DIG_MASK
I don't see ic (carryi) in this formula...?
The comment should say

    i = (im ^ (((j ^ jm) + jc) & ((k ^ km) + kc))) + ic

where ic is carryi in the code.
I assume that mpn_or can be replaced by a call to mpn_or_neg, with a certain choice of input carry values, is that correct? And same for "and" and "xor"? If so then we should benchmark these functions to see if it's really important to have specialised versions for non-neg args, or if we can just have the one function for everything (and hence smaller code).
Yes, that can work as-is in some cases, and with more work in others. E.g.,
The new MICROPY_MPZ_BITWISE_SPEEDIER compile flag gives you the option to use the previous version (speedier), or this version, which uses the generic code for the positive arguments case, too.
Without MICROPY_MPZ_BITWISE_SPEEDIER defined on unix x86_64, the mpz_neg_and_or_xor branch __text is 208 bytes larger than master, and __cstring is 96 bytes smaller.
With the "slow" version (all or/xor/and calls go through the same function) it adds 80 bytes to Thumb2. With the "fast" version, it adds 376 bytes to Thumb2. No benchmarking done yet.
I'd suggest going for something simpler, like MICROPY_MPZ_BITWISE_FAST.
@pfalcon, simpler is good; I suspect after benchmarking there's a good chance we'll be deleting one of the options... so I'll wait a while before changing it.
I did some benchmarking on unix x86_64. Summary: for small (< 64 bit) positive arguments, the general (small code size) versions are about 16% slower than the versions optimized for positive arguments (larger overall code size). The results in each section are for null loop, and, or, xor:
Thanks @dcurrie for the analysis. I'd say that for ports that can afford the (small) increase in code size, faster bitwise operations are preferable because such operations are common on microcontrollers. And then there'll be the option of reduced code size for ports that can't afford it.
@pfalcon suggested MICROPY_MPZ_BITWISE_FAST, would you like me to make that change... and should it be set in mpconfigport.mk files? Or should we invert the compile time logic (is there a naming convention for _SMALL_CODE)?
Yes please. Keep that name and make it disabled by default.
@dpgeorge, Done, assuming I understood you correctly.
With new config vars like this, they should also go in py/mpconfig.h (in this case disabled by default means set to 0 if not already defined). But I can do it, and I'll also squash your commits together and rebase. That means this PR will end up being in a "closed" state and not "merged", but of course it will be merged :)
@dcurrie I'm not sure how you managed to benchmark with that script: it's only making integers up to 31-bits long, which will never create an mpz on a 64-bit arch! Small ints (62 or 30 bits long) are stuffed into a pointer. Only integers bigger than that use the mpz implementation.
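For illustration, based on those thresholds (my own sketch):

small = 1 << 60  # well within 62 bits: stays a tagged-pointer small int on a 64-bit build
big = 1 << 70    # exceeds the small-int range: forces the mpz representation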
Oops. I tried bigger numbers, but got errors from random. Based on your earlier concern that "DIG_SIZE is usually the number of bits in mpz_dig_t (eg 16, uint16_t)", I guessed that even 31 bit numbers would use mpz. My concern about performance was with smaller numbers, since I suspect those are most common. I wonder why performance was consistently different between the two mpz implementations even with small numbers... ?
Probably cache effects. I'm running some benchmarks now on the pyboard (Thumb2 arch, no cache, nice and quiet).
For what it's worth, the revised script using larger numbers on x86_64 shows more of a difference between versions; the fast versions are about 42% faster on average.
Thanks @dcurrie. My tests on pyboard show about 10-20% speed up, which I think is worth the cost of the code space (on pyboard at least).
Merged in 2e2e15c. Changes from original PR:
My tests gave the following:
Fast version costs 252 bytes more than slow version. Thanks @dcurrie for the patch, it was much needed!
For the record, my test script is:

import utime as time
import urandom as random

def make_nums(n):
    # build n random ints of 65..320 bits (large enough to force mpz)
    nums = []
    for _ in range(n):
        r = 1
        for i in range(64 + random.getrandbits(8)):
            r = r << 1 | random.getrandbits(1)
        nums.append(r)
    return nums

def nullbench(nums):
    # loop overhead only, for a baseline
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                pass

def andbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] & nums[j] & nums[k]

def orbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] | nums[j] | nums[k]

def xorbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] ^ nums[j] ^ nums[k]

def run(nums, f):
    # print elapsed milliseconds for one call of f
    t = time.ticks_ms()
    f(nums)
    t = time.ticks_diff(t, time.ticks_ms())
    print(t)

def main():
    random.seed(1)  # start at a known, repeatable place
    nums = make_nums(75)
    print(sum(nums))
    run(nums, nullbench)
    run(nums, andbench)
    run(nums, orbench)
    run(nums, xorbench)

main()
@dpgeorge, probably a vocational disease: whenever I see numbers I have to verify them, and recently a primitive benchmark result grabbed my attention. So I just pasted your script into a PYB REPL and compared results, which were ~10% better. That's ok, I usually sacrifice code space in favor of speed, especially if it's otherwise unused. Going back to the default -Os still gave consistently ~2% better speed results. May I ask what toolchain you use? Thanks!
@chuckbook I'm using arm-none-eabi-gcc 5.3.0 at the moment. I'm not sure what you mean by your comment regarding -Os and 2% better results. I always use -Os, never -O3 (and not in this test). The tests I did for the new mpz code are with the option MICROPY_OPT_MPZ_BITWISE enabled (fast and about 250 bytes larger, but still -Os) versus this config disabled (slow and smaller, but still -Os).
Sorry for being imprecise. I'm using arm-none-eabi-gcc 4.9.3 -Os and this gives ~2% better results, regardless of the MICROPY_OPT_MPZ_BITWISE flag.
You mean that with that version of the compiler your numbers are about 2% better than the ones I give above? For both cases of BITWISE enabled and disabled?
Yes!
There could be some noise due to interrupts that is different on our PCs (eg due to USB). Want to modify the test so it can run with irqs disabled (and use a timer to time)?
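A possible shape for that (an untested sketch of my own; the Timer setup follows the pyboard docs' microsecond-counter example, and the 1 MHz rate assumes the usual 84 MHz timer clock):

import pyb

def run_irq_off(nums, f):
    tim = pyb.Timer(2, prescaler=83, period=0x3fffffff)  # free-running 32-bit counter at 1 MHz
    state = pyb.disable_irq()  # returns the previous IRQ state
    tim.counter(0)
    f(nums)
    elapsed_us = tim.counter()
    pyb.enable_irq(state)  # restore IRQs before printing
    print(elapsed_us, "us")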
Let me just verify your results. Is the current github version ok?
Yes, use master.
Results for 4.9.3 vs. 5.2.1, each about 5 runs:
So it is (probably; I couldn't get hold of 5.3.0) the toolchain.
Interesting. Newer toolchain produces smaller code (by a decent margin) but slower code. Well, we could look in more detail at the generated assembly, but that's a lot of work. Also, other benchmarks might be faster with newer toolchain...?
The implementations of and, or, and xor were incomplete (marked TODO) for negative arguments, so here they are.
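For concreteness, the behaviour to match is Python's arbitrary-precision two's complement, eg:

assert -6 & 13 == 8   # ...11111010 & ...00001101
assert -6 | 13 == -1
assert -6 ^ 13 == -9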