Py: Complete implementation of mpz_ and or xor for negative arguments #1810
Conversation
Thanks! It increases code size on the Thumb2 arch by 536 bytes. Can you think of any tricks to reduce that? These are uncommon operations (evidenced by the fact that they have not been needed so far) so their implementation can be slow to save code size.
There are 7 instances in mpz_int of the same pattern, in the operations sub, and, xor, and_neg, and_neg2, or_neg, and xor_neg. This could be made into a subroutine (tail) call and used in all seven places, or just in the _neg cases if performance of the others is crucial. It should also be possible to combine your original and_neg, which handled two cases, and my new and_neg2, which handles the third case, at some runtime cost, similar to what I did with xor and or. I don't compile for Thumb2 presently... is there a quick and easy way for me to see the effect of these changes? Something like Travis CI? Or should I just try it on unix and push it again?
I'll have to think a bit about the best way to reduce the code. How about implementing (-a) & (-b) using the positive and function? You can test code size on any arch and the reduction should scale roughly proportionally (eg if you can halve code size on x86, it will roughly halve on thumb2). Just remember to compile with debugging disabled so assertions don't muck up the size metrics (ie use -Os and -DNDEBUG).
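For reference, a brute-force check of the standard two's-complement route for that case (my own sketch in plain Python; whether this is exactly the reduction meant above is an assumption): since -x == ~x + 1, we get ~(-x) == x - 1, so the negative-negative and reduces to a positive or.

# -a & -b == ~(a-1) & ~(b-1) == ~((a-1) | (b-1)) == -(((a-1) | (b-1)) + 1)
for a in range(1, 256):
    for b in range(1, 256):
        assert -a & -b == -(((a - 1) | (b - 1)) + 1)
print("identity holds for all tested magnitudes")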
Here are some candidate size optimizations that should only affect and_neg, or_neg, and xor_neg performance.
It now adds 356 bytes on Thumb2, which is not too bad.
    i = (j & (-k)) = (j & (~k + 1))
    i = ((-j) & k) = ((~j + 1) & k)
    computes general form:
    i = (im ^ (((j ^ jm) + jc) & ((k ^ km) + kc))) + ki   where Xm = Xc == 0 ? 0 : DIG_MASK
I don't see ic (carryi) in this formula...?
The comment should say

    i = (im ^ (((j ^ jm) + jc) & ((k ^ km) + kc))) + ic

where ic is carryi in the code.
I assume that mpn_or can be replaced by a call to mpn_or_neg, with a certain choice of input carry values, is that correct? And same for "and" and "xor"? If so then we should benchmark these functions to see if it's really important to have specialised versions for non-neg args, or if we can just have the one function for everything (and hence smaller code).
Yes, that can work as-is in some cases, and with more work in others. E.g.,
The new MICROPY_MPZ_BITWISE_SPEEDIER compile flag gives you the option to use the previous version (speedier), or this version, which uses the generic code for the positive arguments case, too.
Without MICROPY_MPZ_BITWISE_SPEEDIER defined on unix x86_64, the mpz_neg_and_or_xor branch __text is 208 bytes larger than master, and __cstring is 96 bytes smaller.
With the "slow" version (all or/xor/and calls go through the same function) it adds 80 bytes to Thumb2. With the "fast" version, it adds 376 bytes to Thumb2. No benchmarking done yet.
I'd suggest going for something simpler, like MICROPY_MPZ_BITWISE_FAST.
@pfalcon, simpler is good; I suspect after benchmarking there's a good chance we'll be deleting one of the options... so I'll wait a while before changing it.
I did some benchmarking on unix x86_64. Summary: for small (< 64 bit) positive arguments, the general (small code size) versions are about 16% slower than the versions optimized for positive arguments (larger overall code size). The results in each section are for null loop, and, or, xor:
Thanks @dcurrie for the analysis. I'd say that for ports that can afford the (small) increase in code size, faster bitwise operations are preferable because such operations are common on microcontrollers. And then there'll be the option of reduced code size for ports that can't afford it.
@pfalcon suggested MICROPY_MPZ_BITWISE_FAST, would you like me to make that change... and should it be set in mpconfigport.mk files? Or should we invert the compile time logic (is there a naming convention for _SMALL_CODE)?
Yes please. Keep that name and make it disabled by default.
@dpgeorge, Done, assuming I understood you correctly.
With new config vars like this, they should also go in py/mpconfig.h (in this case disabled by default means set to 0 if not already defined). But I can do it, and I'll also squash your commits together and rebase. That means this PR will end up being in a "closed" state and not "merged", but of course it will be merged :)
@dcurrie I'm not sure how you managed to benchmark with that script: it's only making integers up to 31-bits long, which will never create an mpz on a 64-bit arch! Small ints (62 or 30 bits long) are stuffed into a pointer. Only integers bigger than that use the mpz implementation.
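For illustration, based on those thresholds (my own sketch):

small = 1 << 60  # well within 62 bits: stays a tagged-pointer small int on a 64-bit build
big = 1 << 70    # exceeds the small-int range: forces the mpz representation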
Oops. I tried bigger numbers, but got errors from random. Based on your earlier concern that "DIG_SIZE is usually the number of bits in mpz_dig_t (eg 16, uint16_t)", I guessed that even 31 bit numbers would use mpz. My concern about performance was with smaller numbers, since I suspect those are most common. I wonder why performance was consistently different between the two mpz implementations even with small numbers... ?
Probably cache effects. I'm running some benchmarks now on the pyboard (Thumb2 arch, no cache, nice and quiet).
For what it's worth, the revised script using larger numbers on x86_64 shows more of a difference between versions; the fast versions are about 42% faster on average.
Thanks @dcurrie. My tests on pyboard show about 10-20% speed up, which I think is worth the cost of the code space (on pyboard at least).
Merged in 2e2e15c. Changes from original PR:
My tests gave the following:
Fast version costs 252 bytes more than slow version. Thanks @dcurrie for the patch, it was much needed!
For the record, my test script is:

import utime as time
import urandom as random

def make_nums(n):
    # build n random ints of 65..320 bits (large enough to force mpz)
    nums = []
    for _ in range(n):
        r = 1
        for i in range(64 + random.getrandbits(8)):
            r = r << 1 | random.getrandbits(1)
        nums.append(r)
    return nums

def nullbench(nums):
    # loop overhead only, for a baseline
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                pass

def andbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] & nums[j] & nums[k]

def orbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] | nums[j] | nums[k]

def xorbench(nums):
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            for k in range(j, len(nums)):
                nums[i] ^ nums[j] ^ nums[k]

def run(nums, f):
    # print elapsed milliseconds for one call of f
    t = time.ticks_ms()
    f(nums)
    t = time.ticks_diff(t, time.ticks_ms())
    print(t)

def main():
    random.seed(1)  # start at a known, repeatable place
    nums = make_nums(75)
    print(sum(nums))
    run(nums, nullbench)
    run(nums, andbench)
    run(nums, orbench)
    run(nums, xorbench)

main()
@dpgeorge, probably a vocational disease: whenever I see numbers I have to verify them, and recently a primitive benchmark result grabbed my attention. So I just pasted your script into a PYB REPL and compared results, which were ~10% better. That's ok, I usually sacrifice code space in favor of speed, especially if it's otherwise unused. Going back to the default -Os still gave consistently ~2% better speed results. May I ask what toolchain you use? Thanks!
@chuckbook I'm using arm-none-eabi-gcc 5.3.0 at the moment. I'm not sure what you mean by your comment regarding -Os and 2% better results. I always use -Os, never -O3 (and not in this test). The tests I did for the new mpz code are with the option MICROPY_OPT_MPZ_BITWISE enabled (fast and about 250 bytes larger, but still -Os) versus this config disabled (slow and smaller, but still -Os).
Sorry for being imprecise. I'm using arm-none-eabi-gcc 4.9.3 -Os and this gives ~2% better results, regardless of the MICROPY_OPT_MPZ_BITWISE flag.
You mean that with that version of the compiler your numbers are about 2% better than the ones I give above? For both cases of BITWISE enabled and disabled?
Yes!
There could be some noise due to interrupts that is different on our PCs (eg due to USB). Want to modify the test so it can run with irqs disabled (and use a timer to time)?
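A possible shape for that (an untested sketch of my own; the Timer setup follows the pyboard docs' microsecond-counter example, and the 1 MHz rate assumes the usual 84 MHz timer clock):

import pyb

def run_irq_off(nums, f):
    tim = pyb.Timer(2, prescaler=83, period=0x3fffffff)  # free-running 32-bit counter at 1 MHz
    state = pyb.disable_irq()  # returns the previous IRQ state
    tim.counter(0)
    f(nums)
    elapsed_us = tim.counter()
    pyb.enable_irq(state)  # restore IRQs before printing
    print(elapsed_us, "us")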
Let me just verify your results. Is the current github version ok?
Yes, use master.
Results for 4.9.3 vs. 5.2.1, each about 5 runs:
So it is (probably; I couldn't get hold of 5.3.0) the toolchain.
Interesting. Newer toolchain produces smaller code (by a decent margin) but slower code. Well, we could look in more detail at the generated assembly, but that's a lot of work. Also, other benchmarks might be faster with newer toolchain...?
The implementations of and, or, and xor were incomplete (marked TODO) for negative arguments, so here they are.
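For concreteness, the behaviour to match is Python's arbitrary-precision two's complement, eg:

assert -6 & 13 == 8   # ...11111010 & ...00001101
assert -6 | 13 == -1
assert -6 ^ 13 == -9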