gh-118164: Break a loop between _pydecimal and _pylong and optimize int to str conversion #118483


Merged (11 commits, May 5, 2024)

Conversation

serhiy-storchaka

@serhiy-storchaka serhiy-storchaka commented May 1, 2024

The underlying issue here: for converting large ints to strings, CPython invokes a function in _pylong.py, which uses the decimal module to implement an asymptotically waaaaay sub-quadratic algorithm.

But if the C decimal module isn't available, CPython uses _pydecimal.py instead. Which in turn frequently does str(int). If the int is very large, _pylong ends up doing the work, which in turn asks decimal to do "big" arithmetic, which in turn calls str(big_int), which in turn ... it can become infinite mutual recursion.

This PR introduces a different int->str function that doesn't use decimal. It's asymptotically worse than the decimal-based method, but runs in "Karatsuba time" rather than quadratic time, so it's still a huge improvement over the naive approach. The hope (well, my - Tim's - hope) is that _pylong can switch to that when the C decimal isn't available.
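The decimal-free approach can be sketched as a recursive decimal split. This is a simplified illustration, not the PR's actual code (in particular, the real version caches the powers of 10 it splits on instead of recomputing them):

```python
def int_to_str(n: int) -> str:
    """Convert n to its decimal string without using the decimal module."""
    def inner(n, digits):
        # digits is an upper bound on the decimal digits in n
        if digits <= 100:
            return str(n)                 # small enough for the builtin
        half = digits // 2
        hi, lo = divmod(n, 10 ** half)
        # the low half is zero-padded to exactly `half` digits; any
        # leading zeros on the full result are stripped at top level
        return inner(hi, digits - half) + inner(lo, half).rjust(half, "0")

    if n < 0:
        return "-" + int_to_str(-n)
    # 10/33 > log10(2), so this is always an upper bound on the digit count
    digits = n.bit_length() * 10 // 33 + 1
    return inner(n, digits).lstrip("0") or "0"
```

The cost is dominated by the divmod on huge operands, which in CPython is ultimately bounded by Karatsuba multiplication speed, hence the "Karatsuba time" label.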

@tim-one

tim-one commented May 1, 2024

See extensive comments about the approach on the issue report. Note that I did not have in mind changing _pylong at all - decimal arithmetic is "the right" way to do this, and the superior asymptotics of libmpdec multiplication are bound to favor the division-free code already there as argument size increases.

There is no problem to be solved here in an intact CPython distribution, so changes should be limited to where the problem arises: in _pydecimal, which CPython doesn't use at all. _pydecimal should use "something like this" for its own int->str conversions. Yes, that's annoying 😦.

@tim-one

tim-one commented May 1, 2024

the superior asymptotics of libmpdec multiplication is bound to favor the division-free code already there as argument size increases

Indeed, at much larger sizes than you tried, the current code is much faster. "The problem" is that CPython's sub-quadratic division is eventually dominated by the speed of CPython's Karatsuba multiplication. That's sub-quadratic, but doubling the number of bits "only" triples the time instead of quadrupling it. libmpdec has much fancier multiplication schemes.
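The "doubling only triples" behavior falls out of Karatsuba's recurrence: one n-bit multiply becomes three n/2-bit multiplies plus linear work, so T(n) = 3T(n/2) + O(n), which solves to O(n**log2(3)). A quick check of the arithmetic:

```python
import math

# Karatsuba: three half-size products per level.
# T(n) = 3*T(n/2) + O(n)  =>  T(n) = O(n ** log2(3))
exponent = math.log2(3)        # ~ 1.585, the "sub-quadratic" exponent
growth = 2 ** exponent         # cost ratio when the bit length doubles
# growth is 3: doubling the bits triples the time, not quadruples it
```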

For example, at 32_768_000 bits, the current code took about 6 seconds on my box, but the code I showed in the issue report (derived from yours) took about 69 seconds. At 65_536_000 bits, this code took about 3x longer (over 200 seconds), but the current code only about 2.2x longer (about 13 seconds).

So -1 on changing the _pylong algorithm.


@tim-one tim-one left a comment


As noted in my most recent comment, this approach has badly worse asymptotic behavior than the current approach, so I'm opposed to changing the current approach. As explained in my comment, this is "deep", not shallow: any approach not using libmpdec will eventually be limited by the asymptotics of CPython's Karatsuba multiplication (O(n**1.585)).

@tim-one

tim-one commented May 2, 2024

A different idea: if you know how to detect whether the libmpdec library is installed, then _pylong could use its current int_to_decimal_string(). else use the new one.

I confess I don't know all the relevant details. For example, perhaps it's enough if _pylong.py, after doing its import decimal, just checked to see whether "_decimal" in sys.modules?

If that could be made to work, then nothing in _pydecimal would need to change, and _pylong would only need to decide once which version to use, at the time it's imported.
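A sketch of that one-shot check (the `HAVE_C_DECIMAL` name is hypothetical, not from the PR):

```python
import sys
import decimal  # pulls in the C _decimal if available, else _pydecimal

# If decimal was satisfied by the C accelerator, "_decimal" appears in
# sys.modules after the import; otherwise the pure-Python _pydecimal
# fallback was loaded instead.
HAVE_C_DECIMAL = "_decimal" in sys.modules
```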

@serhiy-storchaka serhiy-storchaka force-pushed the int_to_decimal_string branch from 6f270bd to 627e99a Compare May 2, 2024 13:30
@serhiy-storchaka serhiy-storchaka changed the title gh-118164: Optimize int to str conversion gh-118164: Break a loop between _pydecimal and _pylong and optimize int to str conversion May 2, 2024
@serhiy-storchaka serhiy-storchaka marked this pull request as ready for review May 2, 2024 13:32
@serhiy-storchaka serhiy-storchaka force-pushed the int_to_decimal_string branch from 627e99a to cd108bf Compare May 2, 2024 13:34
@tim-one

tim-one commented May 2, 2024

One way to detect whether _pylong imported the C version of decimal:

if hasattr(decimal, '__libmpdec_version__'):
    ...  # We have the C version.
else:
    ...  # We have the Python version.

@serhiy-storchaka

Unfortunately __libmpdec_version__ is defined in the Python version too.

__libmpdec_version__ = "2.4.2" # compatible libmpdec version

@tim-one

tim-one commented May 2, 2024

Dang - I could have sworn I tried that ☹️

I asked on Discourse, and Steve Dower suggested this, which is pretty straightforward:

import decimal
try:
    import _decimal
except ImportError:
    ...  # decimal-free implementation
else:
    ...  # the current code

@tim-one

tim-one commented May 2, 2024

EDIT: when doing those timings, I fiddled the new code so that it always used the new algorithm.

"Tuning" is a cross-platform nightmare, which is one reason why we did little of it when writing _pylong to begin with. On my box (Win 10, Python 3.12.3), the cutoff is apparently lower than on your box. The first column shows seconds for the current code, the second column seconds for the code here right now, and the third column is the second divided by the first (so > 1 means the code here was slower, < 1 that it was faster). 3 randomized test cases at each bit length:

bits 500
   0.000 0.000 1.7368400897623442
   0.000 0.000 2.315791892285187
   0.000 0.000 2.333323632130384
bits 1_000
   0.000 0.000 4.533343358006966
   0.000 0.000 1.571434511002318
   0.000 0.000 1.5000051971270283
bits 2_000
   0.000 0.000 1.1139230480176243
   0.000 0.000 1.13580164849774
   0.000 0.000 1.1250009094943938
bits 4_000
   0.000 0.000 1.6111110790301728
   0.000 0.000 1.133857545940217
   0.000 0.000 1.0806451896824572
bits 8_000
   0.000 0.000 0.8698555378454198
   0.000 0.000 0.8741648245854122
   0.000 0.000 0.8747203761491171
bits 16_000
   0.000 0.000 0.789428104667197
   0.000 0.000 0.8258850697689581
   0.000 0.000 0.7243478000457225
bits 32_000
   0.012 0.001 0.07901368319826614
   0.002 0.001 0.38462661929131853
   0.002 0.001 0.41094582951200515
bits 64_000
   0.005 0.002 0.4628701938271759
   0.005 0.002 0.4466761323446579
   0.005 0.003 0.4859597097819747
bits 128_000
   0.011 0.008 0.6727132938489294
   0.011 0.007 0.6584790173071843
   0.011 0.007 0.6372988191245362
bits 256_000
   0.024 0.022 0.8989868890274216
   0.024 0.022 0.9402834947005769
   0.024 0.022 0.9067376976902499
bits 512_000
   0.054 0.066 1.2281905012459542
   0.052 0.065 1.257452886915317
   0.053 0.066 1.2450761337289526
bits 1_024_000
   0.116 0.202 1.7351631350039387
   0.115 0.201 1.7405417023701295
   0.117 0.203 1.736367334443194
bits 2_048_000
   0.257 0.597 2.321592911307228
   0.258 0.598 2.3151434216679183
   0.256 0.602 2.350132284001856
bits 4_096_000
   0.565 1.819 3.220948368507521
   0.571 1.813 3.1757651529396647
   0.564 1.818 3.220911829487828
bits 8_192_000
   1.244 5.398 4.337127233176792
   1.245 5.398 4.335028138454718
   1.247 5.411 4.338126648649704
bits 16_384_000
   2.724 16.659 6.114568992708456
   2.723 16.184 5.942620462918925
   2.720 16.152 5.93778361889991
bits 32_768_000
   5.935 49.575 8.35272163522155
   5.916 49.403 8.35108832237374
   5.928 49.974 8.430256359001

and I killed it then because I don't want to wait minutes for each new output line. So the new code was actually slower at the smallest lengths tried, and very much slower at the largest. The "sweet spot" (where it's faster) ranged from about 8K to somewhere well under 500K bits.

Part of the GMP release process is consuming weeks of CPU-time to find cutoffs unique to each major platform they run on. Since CPython will never have the bandwidth for that, "don't bother at all" is my preference - but suit yourself. Just be aware that, e.g., you're making str(int_with_512K_bits) about 20% slower on my box than it is today. I don't really care - but someone else might. So at least be very conservative when picking cutoffs - don't push to the very limit you see on your box 😄.

@tim-one

tim-one commented May 3, 2024

Note that there are already two "cache powers" functions in the file, w2pow() and w5pow(), very similar. Why invent another way? In fact, you could reuse w5pow() (exploiting that 10**w == 5**w << w, replacing some of the multiply effort with a blazing fast shift).

These functions are much more efficient "in the large". For example, if the power is 40, they compute the 20th power first and just square it. If you do base**40 directly, under the covers Python will have to do 5 multiplications (squares) to build up all the power-of-2 powers <= 40, and another multiply for each bit set in 40. All the intermediate results are thrown away. If you ask w5pow() for the 40th power, and the 20th power was already cached, one square and it's done. Or if the 20th power wasn't cached, but the 19th was, two multiplies and it's done. Etc.

If you're only doing one power, it's much the same effort overall. But when computing powers multiple times, w5pow() can reuse the smaller cached powers over and over to build up the next power asked for.
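A simplified sketch of that caching idea (illustrative only; _pylong's actual w2pow()/w5pow() helpers differ in detail, e.g. they also try neighboring cached exponents):

```python
def make_cached_pow(base):
    """Return a function computing base**e that reuses cached smaller powers."""
    cache = {}

    def cached_pow(e):
        if e in cache:
            return cache[e]
        if e <= 2:
            result = base ** e
        elif e & 1:
            # odd exponent: one extra multiply off a cached even power
            result = cached_pow(e - 1) * base
        else:
            # even exponent: square the (possibly already cached) half power
            half = cached_pow(e >> 1)
            result = half * half
        cache[e] = result
        return result

    return cached_pow

w5pow = make_cached_pow(5)
# 10**w == 5**w << w, so a power of 10 costs one cached power of 5
# plus a blazing fast shift:
assert w5pow(40) << 40 == 10 ** 40
```

After the call above, asking for the 40th power again (or the 41st) is one dictionary lookup (plus at most one multiply), instead of rebuilding the whole power from scratch.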

@tim-one

tim-one commented May 3, 2024

FYI, copying in the existing w5pow() and replacing the code computing d with d = w5pow(w2) << w2 does yield a modest speedup on my box. For example, the 32M-bit cases improved from about 8.37x slower to about 7.93x slower. No more than a 5% speedup, though.

Unfortunately, it's significantly slower to do the equivalent (which also removes the divmod()):

            d = w5pow(w2)
            hi = (n >> w2) // d
            lo = n - (hi * d << w2)

Computing hi goes faster then, but having to compute lo in a separate operation is much slower than having divmod() compute both (divmod() effectively gets lo "for free", as a natural consequence of computing the quotient).

I don't have a strong opinion about this. The potential 5% isn't that impressive. A better reason to adopt it would be for internal consistency (since it's how all other functions in the file already cache powers).
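The equivalence of the two splits rests on 10**w2 == 5**w2 << w2 and on nested floor division; a quick sketch (n and w2 here are arbitrary example values, not the PR's):

```python
n = 123456789123456789
w2 = 8                  # split point: low chunk gets w2 decimal digits
d = 5 ** w2             # so d << w2 == 10**w2

# divmod form: one pass computes both halves
hi1, lo1 = divmod(n, d << w2)

# shifted form: the quotient is cheaper, but lo needs its own multiply
hi2 = (n >> w2) // d    # floor(floor(n / 2**w2) / 5**w2) == n // 10**w2
lo2 = n - (hi2 * d << w2)

assert (hi1, lo1) == (hi2, lo2) == (n // 10 ** w2, n % 10 ** w2)
```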

@tim-one

tim-one commented May 3, 2024

In other news 😉, I'm dropping my suggestion that it's better to do just one mass concatenation at the end. It is better "in theory", but cutting an O(d log d) subsystem to amortized O(d) just doesn't matter in an O(d**1.58) algorithm. Plus it makes avoiding leading zeroes in the result "a puzzle" instead of straightforward - and the coding needed to solve that puzzle also costs cycles.

@serhiy-storchaka

About determining the sweet spot.

Did you test all of this on an optimized build? On my computer the cutoff is around 900_000 bits on an optimized build and 450_000 bits on a debug build. From your data, it is between 256_000 and 512_000 bits, but if that was a debug build, it matches my data. If you used an optimized build, I'll reduce the upper limit to 450_000.

Did you compare the new implementation with _pylong.int_to_decimal_string(n) or with str(n)? The former is only called from the latter for more than 30_000 bits; below this the C implementation is used. From your data, the lower limit is between 4_000 and 8_000 bits. On my computer it is around 7_000 on an optimized build and 15_000 on a debug build. I think it is safe to lower the cutoff between the C and the Python implementations from 30_000 to 15_000 bits.

About the powers calculation. I tested several schemes before, including schemes that precompute all necessary powers of 10 (we can minimize the number of different powers, so only 10**(wmin*2**k) need be calculated), and finally your suggestion with w5pow(). Fancier schemes make no significant difference at best, and the variant with w5pow() is 30% slower for the 32M-bit cases. It also makes the implementation much larger. The simple cache implementation in this PR adds only 3 lines of code and makes it ~5-10% faster in the range from the 30K-bit to the 30M-bit cases.

About concatenating the chunks. In my implementation I used more straightforward code to omit padding in the first chunk. I also tested your implementation; it does not make any difference. O(d log d) should play a role for larger d, but it is dwarfed by the complexity of the main algorithm. And for smaller d, the fancy concatenation adds O(d) overhead, so it does not have a sweet spot.

@tim-one

tim-one commented May 3, 2024

All my testing was against Python str(int), under the released Python 3.12.3 for 64-bit Windows from python.org. That's as optimized as Steve Dower knows how to make it for Windows 😄.

You can suit yourself for fine-tuning of input ranges. My preference is not to bother with it, because I already pissed away too many years of my life shooting at moving targets. "The best" ranges can - and do - vary with the specific CPU, compiler, compiler release, and compiler options used. In CPython, such cutoff points get hard-coded once, and are typically never reviewed again, no matter how many years pass and how much technology (HW and compiler) changes. One reason why we'll never be competitive with GMP (which puts major, ongoing effort into updating its many "cutoff points", unique to each major platform).

For this specific problem, anyone who cares a lot about conversion speed will use gmpy2 instead. Even including the overhead of converting a Python int to a gmpy2.mpz, conversion to a decimal string is generally at least 4 times faster than our fastest (libmpdec) method. In some ranges, like around 256K bits, it's 10x faster on my box. At around 32K bits, 20x faster.

So I see Python's own int->str efforts as being overwhelmingly just about avoiding quadratic-time disasters.

serhiy-storchaka and others added 2 commits May 4, 2024 22:31
@tim-one

tim-one commented May 4, 2024

There's another insecurity, but I'm not inclined to "do something" about it. We're running Python code here, so other threads can run while our conversion code is running. There's nothing to stop them from doing, say, sys.set_int_max_str_digits(5) in the middle of what we're doing. It's a global setting, so the next time we do str() it can affect us too - we may not be able to complete what we started.

I expect that making that bulletproof would be way more trouble than it's worth (approximately nothing).

@serhiy-storchaka serhiy-storchaka merged commit 711c80b into python:main May 5, 2024
@serhiy-storchaka serhiy-storchaka deleted the int_to_decimal_string branch May 5, 2024 05:20
@serhiy-storchaka

serhiy-storchaka commented May 5, 2024

In the normal case it only calls str() for integers of fewer than 1000 decimal digits (3321 bits). Due to floating-point limitations, it could call str() for significantly larger integers, but only if the original input had thousands of times more than 2**52 bits, which exceeds the 52-bit physical address limit of both Intel64 and AMD64.

@serhiy-storchaka

Thank you for your review and discussion, @tim-one. I included the modified PR description as the commit message.

@serhiy-storchaka serhiy-storchaka added the needs backport to 3.12 label May 5, 2024
@miss-islington-app

Thanks @serhiy-storchaka for the PR 🌮🎉. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 5, 2024
…mize int to str conversion (pythonGH-118483)

For converting large ints to strings, CPython invokes a function in _pylong.py,
which uses the decimal module to implement an asymptotically waaaaay
sub-quadratic algorithm. But if the C decimal module isn't available, CPython
uses _pydecimal.py instead. Which in turn frequently does str(int). If the int
is very large, _pylong ends up doing the work, which in turn asks decimal to do
"big" arithmetic, which in turn calls str(big_int), which in turn ... it can
become infinite mutual recursion.

This change introduces a different int->str function that doesn't use decimal.
It's asymptotically worse, "Karatsuba time" instead of quadratic time, so
still a huge improvement. _pylong switches to that when the C decimal isn't
available. It is also used for not too large integers (less than 450_000 bits),
where it is faster (up to 2 times for 30_000 bits) than the asymptotically
better implementation that uses the C decimal.

(cherry picked from commit 711c80b)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Tim Peters <tim.peters@gmail.com>
@bedevere-app

bedevere-app bot commented May 5, 2024

GH-118590 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 label May 5, 2024
@serhiy-storchaka

I think it will not harm to backport this one too. The people who encounter this issue either have no control over the availability of _decimal, or have reasons for using the Python implementation.

serhiy-storchaka added a commit that referenced this pull request May 6, 2024
…imize int to str conversion (GH-118483) (GH-118590)

SonicField pushed a commit to SonicField/cpython that referenced this pull request May 8, 2024
…mize int to str conversion (pythonGH-118483)
