Py: Complete implementation of mpz_ and or xor for negative arguments #1810


Closed · wants to merge 7 commits

Conversation

@dcurrie (Contributor) commented Jan 31, 2016

The implementations of and, or, and xor were incomplete (marked TODO) for negative arguments, so here they are.

@dpgeorge (Member):

Thanks!

It increases code size on the Thumb2 arch by 536 bytes. Can you think of any tricks to reduce that? These are uncommon operations (evidenced by the fact that they have not been needed so far), so their implementation can be slow to save code size.

@dcurrie (Contributor Author) commented Jan 31, 2016

There are 7 instances in mpz_int of

    for (--idig; idig >= oidig && *idig == 0; --idig) {
    }

    return idig + 1 - oidig;

in the operations sub, and, xor, and_neg, and_neg2, or_neg, and xor_neg. This could be made into a subroutine (tail) call, and used in the seven places, or just in the _neg cases if performance of the others is crucial.
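As a sketch, that shared helper could look like the following (hedged: mpz_dig_t is shown as uint16_t purely for illustration, and I'm using the name the helper ended up with when merged, mpn_remove_trailing_zeros):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t mpz_dig_t; // illustration only; the real digit type is configurable

// Trim high zero digits from the little-endian digit buffer [oidig, idig)
// and return the remaining length, exactly as the seven inlined loops do.
size_t mpn_remove_trailing_zeros(const mpz_dig_t *oidig, const mpz_dig_t *idig) {
    for (--idig; idig >= oidig && *idig == 0; --idig) {
    }
    return (size_t)(idig + 1 - oidig);
}
```

Each call site then reduces to `return mpn_remove_trailing_zeros(oidig, idig);`.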

It should also be possible to combine your original and_neg, which handled two cases, and my new and_neg2, which handles the third case, at some runtime cost, similar to what I did with xor and or.

I don't compile for thumb2 presently... is there a quick and easy way for me to see the effect of these changes? Something like Travis CI? Or should I just try it on unix and push it again?

@dpgeorge (Member):

I'll have to think a bit about what's the best way to reduce the code. How about implementing (-a) & (-b) using the positive and function?

You can test code size on any arch and the reduction should be roughly proportionally correct (eg if you can halve code size on x86, it will roughly halve on thumb2). Just remember to compile with debugging disabled so assertions don't muck up the size metrics (ie use -Os and -DNDEBUG).
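As a quick sanity check of that suggestion (my own derivation in Python, not code from the PR): in two's complement, -x == ~(x - 1), so the doubly-negative AND reduces to a positive OR:

```python
def and_neg_neg(a, b):
    # (-a) & (-b) for positive a, b, computed from non-negative values only:
    # (-a) & (-b) == ~(a - 1) & ~(b - 1) == ~((a - 1) | (b - 1))
    #            == -(((a - 1) | (b - 1)) + 1)
    assert a > 0 and b > 0
    return -(((a - 1) | (b - 1)) + 1)

# Exhaustive check over a small range against Python's big-int semantics
for a in range(1, 200):
    for b in range(1, 200):
        assert and_neg_neg(a, b) == (-a) & (-b)
```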

@dcurrie (Contributor Author) commented Feb 1, 2016

Here are some candidate size optimizations that should only affect and_neg, or_neg, and xor_neg performance.

@dpgeorge (Member) commented Feb 1, 2016

It now adds 356 bytes on Thumb2, which is not too bad.

    i = (j & (-k)) = (j & (~k + 1))
    i = ((-j) & k) = ((~j + 1) & k)
    computes general form:
    i = (im ^ (((j ^ jm) + jc) & ((k ^ km) + kc))) + ki   where Xm = Xc == 0 ? 0 : DIG_MASK
@dpgeorge (Member) commented on the diff:

I don't see ic (carryi) in this formula...?

@dcurrie (Contributor Author) replied:

The comment should say

    i = (im ^ (((j ^ jm) + jc) & ((k ^ km) + kc))) + ic

Where ic is carryi in the code.
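To convince myself the corrected general form is right, here is a standalone Python rendering of it (my own sketch, not the PR's C code), assuming DIG_SIZE = 16 and sign/magnitude operands: each negative input is converted to two's complement digit by digit via (x ^ DIG_MASK) + carry, and a negative result is converted back the same way.

```python
DIG_SIZE = 16
DIG_MASK = (1 << DIG_SIZE) - 1

def general_and(j_mag, j_neg, k_mag, k_neg):
    # Split magnitudes into little-endian 16-bit digits, with one digit of
    # headroom so the two's-complement carries have room to settle.
    n = max(j_mag.bit_length(), k_mag.bit_length()) // DIG_SIZE + 1
    jd = [(j_mag >> (DIG_SIZE * d)) & DIG_MASK for d in range(n)]
    kd = [(k_mag >> (DIG_SIZE * d)) & DIG_MASK for d in range(n)]
    i_neg = j_neg and k_neg  # an AND result is negative iff both inputs are
    jc, kc, ic = int(j_neg), int(k_neg), int(i_neg)
    jm, km, im = (DIG_MASK if c else 0 for c in (jc, kc, ic))
    mag = 0
    for pos in range(n):
        jc += jd[pos] ^ jm                              # (j ^ jm) + jc
        kc += kd[pos] ^ km                              # (k ^ km) + kc
        ic += im ^ ((jc & DIG_MASK) & (kc & DIG_MASK))  # (im ^ (.. & ..)) + ic
        mag |= (ic & DIG_MASK) << (DIG_SIZE * pos)
        jc >>= DIG_SIZE
        kc >>= DIG_SIZE
        ic >>= DIG_SIZE
    return -mag if i_neg else mag

# Check against Python's built-in big-int AND for all sign combinations
for a in range(-100, 100):
    for b in range(-100, 100):
        assert general_and(abs(a), a < 0, abs(b), b < 0) == a & b
```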

@dpgeorge (Member) commented Feb 1, 2016

I assume that mpn_or can be replaced by a call to mpn_or_neg, with certain choice for the input carry values, is that correct? And same for "and" and "xor"? If so then we should benchmark these functions to see if it's really important to have specialised versions for non-neg args, or if we can just have the one function for everything (and hence smaller code).

@dcurrie (Contributor Author) commented Feb 1, 2016

I assume that mpn_or can be replaced by a call to mpn_or_neg, with certain choice for the input carry values, is that correct?

Yes, that can work as is in some cases, and with more work in others. E.g., mpn_or_neg always assumes its result will be negated, which isn't true for mpn_or, so mpn_or_neg will need an imask like mpn_and_neg has. The other two look general now. Your initial mpn_and is especially time efficient since it has an early out on the shorter operand; it could be hard to replicate that in the general case.
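For reference, the negative-argument cases of or and xor reduce to magnitude-only operations via two's-complement identities (my own Python summary of the relationships the _neg routines implement, hedged; not the PR's C code):

```python
# All functions take positive magnitudes a, b > 0 and rely on -x == ~(x - 1).
def or_neg_neg(a, b):   # (-a) | (-b) == ~((a - 1) & (b - 1))
    return -(((a - 1) & (b - 1)) + 1)

def or_neg_pos(a, b):   # (-a) | b == ~((a - 1) & ~b)
    # (a - 1) & ~b equals (a - 1) - ((a - 1) & b), so it stays non-negative
    return -(((a - 1) & ~b) + 1)

def xor_neg_neg(a, b):  # (-a) ^ (-b) is non-negative: the signs cancel
    return (a - 1) ^ (b - 1)

def xor_neg_pos(a, b):  # (-a) ^ b == ~((a - 1) ^ b)
    return -(((a - 1) ^ b) + 1)

# Check every identity against Python's big-int semantics
for a in range(1, 100):
    for b in range(1, 100):
        assert or_neg_neg(a, b) == (-a) | (-b)
        assert or_neg_pos(a, b) == (-a) | b
        assert xor_neg_neg(a, b) == (-a) ^ (-b)
        assert xor_neg_pos(a, b) == (-a) ^ b
```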

@dcurrie (Contributor Author) commented Feb 1, 2016

The new MICROPY_MPZ_BITWISE_SPEEDIER compile flag gives you the option to use the previous version (speedier), or this version, which uses the generic code for the positive arguments case, too.

@dcurrie (Contributor Author) commented Feb 1, 2016

Without MICROPY_MPZ_BITWISE_SPEEDIER defined on unix x86_64 the mpz_neg_and_or_xor branch __text is 208 bytes larger than master, and __cstring is 96 bytes smaller.

@dpgeorge (Member) commented Feb 1, 2016

With the "slow" version (all or/xor/and calls go through the same function) it adds 80 bytes to Thumb2. With the "fast" version, it adds 376 bytes to Thumb2. No benchmarking done yet.

@pfalcon (Contributor) commented Feb 1, 2016

MICROPY_MPZ_BITWISE_SPEEDIER

I'd suggest going for something simpler, like MICROPY_MPZ_BITWISE_FAST.

@dcurrie (Contributor Author) commented Feb 1, 2016

@pfalcon, simpler is good; I suspect after benchmarking there's a good chance we'll be deleting one of the options... so I'll wait a while before changing it.

@dcurrie (Contributor Author) commented Feb 2, 2016

I did some benchmarking on unix x86_64. Summary: for small (< 64 bit) positive arguments, the general (small code size) versions are about 16% slower than the optimized for positive arguments (larger overall code size) versions.

The results in each section are for null loop, and, or, xor:

[benchmark results image]

mpz_bitwise.py.txt

@dpgeorge (Member) commented Feb 2, 2016

Thanks @dcurrie for the analysis. I'd say that for ports that can afford the (small) increase in code size, then faster bitwise operations are preferable because such operations are common on microcontrollers. And then there'll be the option to have reduced code size for ports that can't afford it.

@dcurrie (Contributor Author) commented Feb 2, 2016

@pfalcon suggested MICROPY_MPZ_BITWISE_FAST; would you like me to make that change... and should it be set in the mpconfigport.mk files? Or should we invert the compile-time logic (is there a naming convention for _SMALL_CODE)?

@dpgeorge (Member) commented Feb 2, 2016

would you like me to make that change

Yes please. Keep that name and make it disabled by default.

@dcurrie (Contributor Author) commented Feb 2, 2016

@dpgeorge, Done, assuming I understood you correctly.

@dpgeorge (Member) commented Feb 2, 2016

With new config vars like this, they should also go in py/mpconfig.h (in this case disabled by default means set to 0 if not already defined). But I can do it, and I'll also squash your commits together and rebase. That means this PR will end up being in a "closed" state and not "merged", but of course it will be merged :)
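The py/mpconfig.h convention described here would look roughly like this (a sketch using the flag name under discussion at this point; per the merge notes below it was ultimately renamed MICROPY_OPT_MPZ_BITWISE):

```c
// Sketch of the mpconfig.h pattern: give the new option a default of 0
// (disabled) only if a port's config hasn't already defined it.
#ifndef MICROPY_MPZ_BITWISE_FAST
#define MICROPY_MPZ_BITWISE_FAST (0)
#endif
```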

@dpgeorge (Member) commented Feb 3, 2016

@dcurrie I'm not sure how you managed to benchmark with that script: it's only making integers up to 31-bits long, which will never create an mpz on a 64-bit arch! Small ints (62 or 30 bits long) are stuffed into a pointer. Only integers bigger than that use the mpz implementation.

@dcurrie (Contributor Author) commented Feb 3, 2016

Oops. I tried bigger numbers, but got errors from random. Based on your earlier concern about "DIG_SIZE is usually the number of bits in mpz_dig_t (eg 16, uint16_t)" I guessed that even 31 bit numbers would use mpz. My concern about performance was smaller numbers since that is what I suspect is most common. I wonder why performance was consistently different between the two mpz implementations even with small numbers... ?

@dpgeorge (Member) commented Feb 3, 2016

I wonder why performance was consistently different between the two mpz implementations even with small numbers... ?

Probably cache effects.

I'm running some benchmarks now on the pyboard (Thumb2 arch, no cache, nice and quiet).

@dcurrie (Contributor Author) commented Feb 3, 2016

For what it's worth, the revised script using larger numbers on x86_64 has more of a difference between versions; the fast versions are about 42% faster on average.

[benchmark results image]

mpz_bitwise.py.txt

@dpgeorge (Member) commented Feb 3, 2016

Thanks @dcurrie. My tests on pyboard show about 10-20% speed up, which I think is worth the cost of the code space (on pyboard at least).

@dpgeorge (Member) commented Feb 3, 2016

Merged in 2e2e15c.

Changes from original PR:

  • config option is MICROPY_OPT_MPZ_BITWISE to fit with other optimisation options
  • shuffled the functions around in mpz.c to be in the right order (mpn_or with mpn_or_neg)
  • renamed mpn_trimmed_length to mpn_remove_trailing_zeros
  • used mpn_remove_trailing_zeros in all possible locations (reduces code size by 44 bytes on Thumb2 and makes almost no difference to speed)

My tests gave the following:

fast:
null: 211ms
and:  1885ms
or:   1939ms
xor:  1955ms

slow but small:
null: 210ms
and:  2274ms
or:   2282ms
xor:  2170ms

Fast version costs 252 bytes more than slow version.

Thanks @dcurrie for the patch, it was much needed!

@dpgeorge dpgeorge closed this Feb 3, 2016
@dpgeorge (Member) commented Feb 3, 2016

For the record, my test script is:

import utime as time
import urandom as random

def make_nums(n):
    nums = []
    for _ in range(n):
        r = 1 
        for i in range(64 + random.getrandbits(8)):
            r = r << 1 | random.getrandbits(1)
        nums.append(r)
    return nums

def nullbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                pass

def andbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] & nums[j] & nums[k]

def orbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] | nums[j] | nums[k]

def xorbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] ^ nums[j] ^ nums[k]

def run(nums, f): 
    t = time.ticks_ms()
    f(nums)
    t = time.ticks_diff(t, time.ticks_ms())
    print(t)

def main():
    random.seed(1) # start at a known, repeatable place
    nums = make_nums(75)
    print(sum(nums))
    run(nums, nullbench)
    run(nums, andbench)
    run(nums, orbench)
    run(nums, xorbench)

main()

@chuckbook:

@dpgeorge, probably a vocational disease: whenever I see numbers I have to verify them. Recently a primitive benchmark result grabbed my attention, so I just pasted your script into a PYB REPL and compared results, which were ~10% better. That's ok; I usually sacrifice code space in favor of speed, especially if it's otherwise unused. Going back to the default -Os still gave consistently ~2% better speed results. May I ask what toolchain you use? Thanks!

@dpgeorge (Member) commented Feb 4, 2016

@chuckbook I'm using arm-none-eabi-gcc 5.3.0 at the moment.

I'm not sure what you mean by your comment regarding -Os and 2% better results. I always use -Os, never -O3 (and not in this test). The tests I did for the new mpz code are with the option MICROPY_OPT_MPZ_BITWISE enabled (fast and about 250 bytes larger, but still -Os) versus this config disabled (slow and smaller, but still -Os).

@chuckbook:

Sorry for being imprecise. I'm using arm-none-eabi-gcc 4.9.3 -Os and this gives ~2% better results, regardless of the MICROPY_OPT_MPZ_BITWISE flag.

@dpgeorge (Member) commented Feb 4, 2016

I'm using arm-none-eabi-gcc 4.9.3 -Os and this gives ~2% better results, regardless of the MICROPY_OPT_MPZ_BITWISE flag.

You mean that with that version of the compiler your numbers are about 2% better than the ones I give above? For both cases of BITWISE enabled and disabled?

@chuckbook:

yes!

@dpgeorge (Member) commented Feb 4, 2016

There could be some noise due to interrupts that is different on our PCs (eg due to USB). Want to modify the test so it can run with irqs disabled (and use a timer to time)?

@chuckbook:

Let me just verify your results. Is the actual github release ok?
Is there any reason not to use 4.9.x?

@dpgeorge (Member) commented Feb 4, 2016

Is the actual github release ok?

Yes, use master.

@chuckbook:

Results for 4.9.3 vs. 5.2.1, each about 5 runs:

4.9.3:
   text    data     bss     dec     hex filename
 284680     336   28436  313452   4c86c build-PYBV11/firmware.elf
202 +/-0
1874 +/-1
1910 +/-0
1929 +/-0

5.2.1:
   text    data     bss     dec     hex filename
 283824     336   28432  312592   4c510 build-PYBV11/firmware.elf
218
1888 +/-1
1941 +/-0
1958 +/-1

So it is probably the toolchain (I couldn't get hold of 5.3.0 to confirm).

@dpgeorge (Member) commented Feb 4, 2016

Interesting. Newer toolchain produces smaller code (by a decent margin) but slower code. Well, we could look in more detail at the generated assembly, but that's a lot of work. Also, other benchmarks might be faster with newer toolchain...?

tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Apr 18, 2019