gh-118164: Break a loop between _pydecimal and _pylong and optimize int to str conversion #118483


Merged (11 commits, May 5, 2024)

Conversation

serhiy-storchaka

@serhiy-storchaka serhiy-storchaka commented May 1, 2024

The underlying issue here: for converting large ints to strings, CPython invokes a function in _pylong.py, which uses the decimal module to implement an asymptotically waaaaay sub-quadratic algorithm.

But if the C decimal module isn't available, CPython uses _pydecimal.py instead. Which in turn frequently does str(int). If the int is very large, _pylong ends up doing the work, which in turn asks decimal to do "big" arithmetic, which in turn calls str(big_int), which in turn ... it can become infinite mutual recursion.

This PR introduces a different int->str function that doesn't use decimal. It's asymptotically worse than the decimal-based method, but runs in "Karatsuba time" rather than quadratic time, so it's still a huge improvement over the naive approach. The hope (well, my - Tim's - hope) is that _pylong can switch to that when the C decimal isn't available.
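The decimal-free approach can be sketched as a recursive decimal split. This is a simplified illustration, not the PR's actual code (in particular, the real version caches the powers of 10 it splits on instead of recomputing them):

```python
def int_to_str(n: int) -> str:
    """Convert n to its decimal string without using the decimal module."""
    def inner(n, digits):
        # digits is an upper bound on the decimal digits in n
        if digits <= 100:
            return str(n)                 # small enough for the builtin
        half = digits // 2
        hi, lo = divmod(n, 10 ** half)
        # the low half is zero-padded to exactly `half` digits; any
        # leading zeros on the full result are stripped at top level
        return inner(hi, digits - half) + inner(lo, half).rjust(half, "0")

    if n < 0:
        return "-" + int_to_str(-n)
    # 10/33 > log10(2), so this is always an upper bound on the digit count
    digits = n.bit_length() * 10 // 33 + 1
    return inner(n, digits).lstrip("0") or "0"
```

The cost is dominated by the divmod on huge operands, which in CPython is ultimately bounded by Karatsuba multiplication speed, hence the "Karatsuba time" label.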

@tim-one

tim-one commented May 1, 2024

See extensive comments about the approach on the issue report. Note that I did not have in mind changing _pylong at all - decimal arithmetic is "the right" way to do this, and the superior asymptotics of libmpdec multiplication are bound to favor the division-free code already there as argument size increases.

There is no problem to be solved here in an intact CPython distribution, so changes should be limited to where the problem arises: in _pydecimal, which CPython doesn't use at all. _pydecimal should use "something like this" for its own int->str conversions. Yes, that's annoying 😦.

@tim-one

tim-one commented May 1, 2024

the superior asymptotics of libmpdec multiplication is bound to favor the division-free code already there as argument size increases

Indeed, at much larger sizes than you tried, the current code is much faster. "The problem" is that CPython's sub-quadratic division is eventually dominated by the speed of CPython's Karatsuba multiplication. That's sub-quadratic, but doubling the number of bits "only" triples the time instead of quadrupling it. libmpdec has much fancier multiplication schemes.
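The "doubling only triples" behavior falls out of Karatsuba's recurrence: one n-bit multiply becomes three n/2-bit multiplies plus linear work, so T(n) = 3T(n/2) + O(n), which solves to O(n**log2(3)). A quick check of the arithmetic:

```python
import math

# Karatsuba: three half-size products per level.
# T(n) = 3*T(n/2) + O(n)  =>  T(n) = O(n ** log2(3))
exponent = math.log2(3)        # ~ 1.585, the "sub-quadratic" exponent
growth = 2 ** exponent         # cost ratio when the bit length doubles
# growth is 3: doubling the bits triples the time, not quadruples it
```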

For example, at 32_768_000 bits, the current code took about 6 seconds on my box, but the code I showed in the issue report (derived from yours) took about 69 seconds. At 65_536_000 bits, this code took about 3x longer (over 200 seconds), but the current code only about 2.2x longer (about 13 seconds).

So -1 on changing the _pylong algorithm.


@tim-one tim-one left a comment


As noted in my most recent comment, this approach has badly worse asymptotic behavior than the current approach, so I'm opposed to changing the current approach. As explained in my comment, this is "deep", not shallow: any approach not using libmpdec will eventually be limited by the asymptotics of CPython's Karatsuba multiplication (O(n**1.585)).

@tim-one

tim-one commented May 2, 2024

A different idea: if you know how to detect whether the libmpdec library is installed, then _pylong could use its current int_to_decimal_string(). else use the new one.

I confess I don't know all the relevant details. For example, perhaps it's enough if _pylong.py, after doing its import decimal, just checked to see whether "_decimal" in sys.modules?

If that could be made to work, then nothing in _pydecimal would need to change, and _pylong would only need to decide once which version to use, at the time it's imported.
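A sketch of that one-shot check (the `HAVE_C_DECIMAL` name is hypothetical, not from the PR):

```python
import sys
import decimal  # pulls in the C _decimal if available, else _pydecimal

# If decimal was satisfied by the C accelerator, "_decimal" appears in
# sys.modules after the import; otherwise the pure-Python _pydecimal
# fallback was loaded instead.
HAVE_C_DECIMAL = "_decimal" in sys.modules
```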

@serhiy-storchaka serhiy-storchaka force-pushed the int_to_decimal_string branch from 6f270bd to 627e99a Compare May 2, 2024 13:30
@serhiy-storchaka serhiy-storchaka changed the title gh-118164: Optimize int to str conversion gh-118164: Break a loop between _pydecimal and _pylong and optimize int to str conversion May 2, 2024
@serhiy-storchaka serhiy-storchaka marked this pull request as ready for review May 2, 2024 13:32
@serhiy-storchaka serhiy-storchaka force-pushed the int_to_decimal_string branch from 627e99a to cd108bf Compare May 2, 2024 13:34
@tim-one

tim-one commented May 2, 2024

One way to detect whether _pylong imported the C version of decimal:

if hasattr(decimal, '__libmpdec_version__'):
    ...  # We have the C version.
else:
    ...  # We have the Python version.

@serhiy-storchaka

Unfortunately __libmpdec_version__ is defined in the Python version too.

__libmpdec_version__ = "2.4.2" # compatible libmpdec version

@tim-one

tim-one commented May 2, 2024

Dang - I could have sworn I tried that ☹️

I asked on Discourse, and Steve Dower suggested this, which is pretty straightforward:

import decimal
try:
    import _decimal
except ImportError:
    ...  # decimal-free implementation
else:
    ...  # the current code

@tim-one

tim-one commented May 2, 2024

EDIT: when doing those timings, I fiddled the new code so that it always used the new algorithm.

"Tuning" is a cross-platform nightmare, which is one reason why we did little of it when writing _pylong to begin with. On my box (Win 10, Python 3.12.3), the cutoff is apparently lower than on your box. The first column shows seconds for the current code, the second column seconds for the code here right now, and the third column is the second divided by the first (so > 1 means the code here was slower, < 1 that it was faster). 3 randomized test cases at each bit length:

bits 500
   0.000 0.000 1.7368400897623442
   0.000 0.000 2.315791892285187
   0.000 0.000 2.333323632130384
bits 1_000
   0.000 0.000 4.533343358006966
   0.000 0.000 1.571434511002318
   0.000 0.000 1.5000051971270283
bits 2_000
   0.000 0.000 1.1139230480176243
   0.000 0.000 1.13580164849774
   0.000 0.000 1.1250009094943938
bits 4_000
   0.000 0.000 1.6111110790301728
   0.000 0.000 1.133857545940217
   0.000 0.000 1.0806451896824572
bits 8_000
   0.000 0.000 0.8698555378454198
   0.000 0.000 0.8741648245854122
   0.000 0.000 0.8747203761491171
bits 16_000
   0.000 0.000 0.789428104667197
   0.000 0.000 0.8258850697689581
   0.000 0.000 0.7243478000457225
bits 32_000
   0.012 0.001 0.07901368319826614
   0.002 0.001 0.38462661929131853
   0.002 0.001 0.41094582951200515
bits 64_000
   0.005 0.002 0.4628701938271759
   0.005 0.002 0.4466761323446579
   0.005 0.003 0.4859597097819747
bits 128_000
   0.011 0.008 0.6727132938489294
   0.011 0.007 0.6584790173071843
   0.011 0.007 0.6372988191245362
bits 256_000
   0.024 0.022 0.8989868890274216
   0.024 0.022 0.9402834947005769
   0.024 0.022 0.9067376976902499
bits 512_000
   0.054 0.066 1.2281905012459542
   0.052 0.065 1.257452886915317
   0.053 0.066 1.2450761337289526
bits 1_024_000
   0.116 0.202 1.7351631350039387
   0.115 0.201 1.7405417023701295
   0.117 0.203 1.736367334443194
bits 2_048_000
   0.257 0.597 2.321592911307228
   0.258 0.598 2.3151434216679183
   0.256 0.602 2.350132284001856
bits 4_096_000
   0.565 1.819 3.220948368507521
   0.571 1.813 3.1757651529396647
   0.564 1.818 3.220911829487828
bits 8_192_000
   1.244 5.398 4.337127233176792
   1.245 5.398 4.335028138454718
   1.247 5.411 4.338126648649704
bits 16_384_000
   2.724 16.659 6.114568992708456
   2.723 16.184 5.942620462918925
   2.720 16.152 5.93778361889991
bits 32_768_000
   5.935 49.575 8.35272163522155
   5.916 49.403 8.35108832237374
   5.928 49.974 8.430256359001

and I killed it then because I don't want to wait minutes for each new output line. So the new code was actually slower at the smallest lengths tried, and very much slower at the largest. The "sweet spot" (where it's faster) ranged from about 8K to somewhere well under 500K bits.

Part of the GMP release process is consuming weeks of CPU-time to find cutoffs unique to each major platform they run on. Since CPython will never have the bandwidth for that, "don't bother at all" is my preference - but suit yourself. Just be aware that, e.g., you're making str(int_with_512K_bits) about 20% slower on my box than it is today. I don't really care - but someone else might. So at least be very conservative when picking cutoffs - don't push to the very limit you see on your box 😄.

@tim-one

tim-one commented May 3, 2024

Note that there are already two "cache powers" functions in the file, w2pow() and w5pow(), very similar. Why invent another way? In fact, you could reuse w5pow() (exploiting that 10**w == 5**w << w, replacing some of the multiply effort with a blazing fast shift).

These functions are much more efficient "in the large". For example, if the power is 40, they compute the 20th power first and just square it. If you do base**40 directly, under the covers Python will have to do 5 multiplications (squares) to build up all the power-of-2 powers <= 40, and another multiply for each bit set in 40. All the intermediate results are thrown away. If you ask w5pow() for the 40th power, and the 20th power was already cached, one square and it's done. Or if the 20th power wasn't cached, but the 19th was, two multiplies and it's done. Etc.

If you're only doing one power, it's much the same effort overall. But when computing powers multiple times, w5pow() can reuse the smaller cached powers over and over to build up the next power asked for.
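A simplified sketch of that caching idea (illustrative only; _pylong's actual w2pow()/w5pow() helpers differ in detail, e.g. they also try neighboring cached exponents):

```python
def make_cached_pow(base):
    """Return a function computing base**e that reuses cached smaller powers."""
    cache = {}

    def cached_pow(e):
        if e in cache:
            return cache[e]
        if e <= 2:
            result = base ** e
        elif e & 1:
            # odd exponent: one extra multiply off a cached even power
            result = cached_pow(e - 1) * base
        else:
            # even exponent: square the (possibly already cached) half power
            half = cached_pow(e >> 1)
            result = half * half
        cache[e] = result
        return result

    return cached_pow

w5pow = make_cached_pow(5)
# 10**w == 5**w << w, so a power of 10 costs one cached power of 5
# plus a blazing fast shift:
assert w5pow(40) << 40 == 10 ** 40
```

After the call above, asking for the 40th power again (or the 41st) is one dictionary lookup (plus at most one multiply), instead of rebuilding the whole power from scratch.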

@tim-one

tim-one commented May 3, 2024

FYI, copying in the existing w5pow() and replacing the code computing d with d = w5pow(w2) << w2 does yield a modest speedup on my box. For example, the 32M-bit cases improved from about 8.37x slower to about 7.93x slower. No more than a 5% speedup, though.

Unfortunately, it's significantly slower to do the equivalent (which also removes the divmod()):

            d = w5pow(w2)
            hi = (n >> w2) // d
            lo = n - (hi * d << w2)

Computing hi goes faster then, but having to compute lo in a separate operation is much slower than having divmod() compute both (divmod() effectively gets lo "for free", as a natural consequence of computing the quotient).

I don't have a strong opinion about this. The potential 5% isn't that impressive. A better reason to adopt it would be for internal consistency (since it's how all other functions in the file already cache powers).
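The equivalence of the two splits rests on 10**w2 == 5**w2 << w2 and on nested floor division; a quick sketch (n and w2 here are arbitrary example values, not the PR's):

```python
n = 123456789123456789
w2 = 8                  # split point: low chunk gets w2 decimal digits
d = 5 ** w2             # so d << w2 == 10**w2

# divmod form: one pass computes both halves
hi1, lo1 = divmod(n, d << w2)

# shifted form: the quotient is cheaper, but lo needs its own multiply
hi2 = (n >> w2) // d    # floor(floor(n / 2**w2) / 5**w2) == n // 10**w2
lo2 = n - (hi2 * d << w2)

assert (hi1, lo1) == (hi2, lo2) == (n // 10 ** w2, n % 10 ** w2)
```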

@tim-one

tim-one commented May 3, 2024

In other news 😉, I'm dropping my suggestion that it's better to do just one mass concatenation at the end. It is better "in theory", but cutting an O(d log d) subsystem to amortized O(d) just doesn't matter in an O(d**1.58) algorithm. Plus it makes avoiding leading zeroes in the result "a puzzle" instead of straightforward - and the coding needed to solve that puzzle also costs cycles.

@serhiy-storchaka

About determining the sweet spot.

Did you test all of this on an optimized build? On my computer the cutoff is around 900_000 bits on an optimized build and 450_000 bits on a debug build. From your data, it is between 256_000 and 512_000 bits, but if that was a debug build, it matches my data. If you used an optimized build, I'll reduce the upper limit to 450_000.

Did you compare the new implementation with _pylong.int_to_decimal_string(n) or with str(n)? The former is only called from the latter for more than 30_000 bits; below this the C implementation is used. From your data, the lower limit is between 4_000 and 8_000 bits. On my computer it is around 7_000 on an optimized build and 15_000 on a debug build. I think it is safe to lower the cutoff between the C and the Python implementations from 30_000 to 15_000 bits.

About the powers calculation. I tested several schemes before, including schemes that precompute all necessary powers of 10 (we can minimize the number of different powers, so only 10**(wmin*2**k) need be calculated), and finally your suggestion with w5pow(). Fancier schemes make no significant difference at best, and the variant with w5pow() is 30% slower for the 32M-bit cases. It also makes the implementation much larger. The simple cache implementation in this PR adds only 3 lines of code and makes it ~5-10% faster in the range from the 30K-bit to the 30M-bit cases.

About concatenating the chunks. In my implementation I used more straightforward code to omit padding in the first chunk. I also tested your implementation; it does not make any difference. O(d log d) should play a role for larger d, but it is dwarfed by the complexity of the main algorithm. And for smaller d, the fancy concatenation adds O(d) overhead, so it does not have a sweet spot.

@tim-one

tim-one commented May 3, 2024

All my testing was against Python str(int), under the released Python 3.12.3 for 64-bit Windows from python.org. That's as optimized as Steve Dower knows how to make it for Windows 😄.

You can suit yourself for fine-tuning of input ranges. My preference is not to bother with it, because I already pissed away too many years of my life shooting at moving targets. "The best" ranges can - and do - vary with the specific CPU, compiler, compiler release, and compiler options used. In CPython, such cutoff points get hard-coded once, and are typically never reviewed again, no matter how many years pass and how much technology (HW and compiler) changes. One reason why we'll never be competitive with GMP (which puts major, ongoing effort into updating its many "cutoff points", unique to each major platform).

For this specific problem, anyone who cares a lot about conversion speed will use gmpy2 instead. Even including the overhead of converting a Python int to a gmpy2.mpz, conversion to a decimal string is generally at least 4 times faster than our fastest (libmpdec) method. In some ranges, like around 256K bits, it's 10x faster on my box. At around 32K bits, 20x faster.

So I see Python's own int->str efforts as being overwhelmingly just about avoiding quadratic-time disasters.

serhiy-storchaka and others added 2 commits May 4, 2024 22:31
@tim-one

tim-one commented May 4, 2024

There's another insecurity, but I'm not inclined to "do something" about it. We're running Python code here, so other threads can run while our conversion code is running. There's nothing to stop them from doing, say, sys.set_int_max_str_digits(5) in the middle of what we're doing. It's a global setting, so the next time we do str() it can affect us too - we may not be able to complete what we started.

I expect that making that bulletproof would be way more trouble than it's worth (approximately nothing).

@serhiy-storchaka serhiy-storchaka merged commit 711c80b into python:main May 5, 2024
@serhiy-storchaka serhiy-storchaka deleted the int_to_decimal_string branch May 5, 2024 05:20
@serhiy-storchaka

serhiy-storchaka commented May 5, 2024

In the normal case it only calls str() for integers of fewer than 1000 decimal digits (3321 bits). Due to floating-point limitations, it could call str() for significantly larger integers, but only if the original input had thousands of times more than 2**52 bits, which exceeds the 52-bit physical address limit of both Intel64 and AMD64.

@serhiy-storchaka

Thank you for your review and discussion, @tim-one. I included the modified PR description as the commit message.

@serhiy-storchaka serhiy-storchaka added the needs backport to 3.12 label May 5, 2024
@miss-islington-app

Thanks @serhiy-storchaka for the PR 🌮🎉. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 5, 2024
…mize int to str conversion (pythonGH-118483)

For converting large ints to strings, CPython invokes a function in _pylong.py,
which uses the decimal module to implement an asymptotically waaaaay
sub-quadratic algorithm. But if the C decimal module isn't available, CPython
uses _pydecimal.py instead. Which in turn frequently does str(int). If the int
is very large, _pylong ends up doing the work, which in turn asks decimal to do
"big" arithmetic, which in turn calls str(big_int), which in turn ... it can
become infinite mutual recursion.

This change introduces a different int->str function that doesn't use decimal.
It's asymptotically worse, "Karatsuba time" instead of quadratic time, so
still a huge improvement. _pylong switches to that when the C decimal isn't
available. It is also used for not too large integers (less than 450_000 bits),
where it is faster (up to 2 times for 30_000 bits) than the asymptotically
better implementation that uses the C decimal.

(cherry picked from commit 711c80b)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Tim Peters <tim.peters@gmail.com>
@bedevere-app

bedevere-app bot commented May 5, 2024

GH-118590 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 label May 5, 2024
@serhiy-storchaka

I think it will not harm to backport this one too. The people who encounter this issue either have no control over the availability of _decimal, or have reasons for using the Python implementation.

serhiy-storchaka added a commit that referenced this pull request May 6, 2024
…imize int to str conversion (GH-118483) (GH-118590)

SonicField pushed a commit to SonicField/cpython that referenced this pull request May 8, 2024
…mize int to str conversion (pythonGH-118483)
