gh-118164: Break a loop between _pydecimal and _pylong and optimize int to str conversion #118483
Conversation
See the extensive comments about the approach on the issue report. Note that I did not have in mind changing the existing decimal-based code path. There is no problem to be solved here in an intact CPython distribution, so changes should be limited to where the problem arises: the case where the C decimal module is not available.
Indeed, at much larger sizes than you tried, the current code is much faster. "The problem" is that CPython's sub-quadratic division is eventually dominated by the speed of CPython's Karatsuba multiplication. That's sub-quadratic, but doubling the number of bits "only" triples the time instead of quadrupling it. For example, at 32_768_000 bits, the current code took about 6 seconds on my box, but the code I showed in the issue report (derived from yours) took about 69 seconds. At 65_536_000 bits, this code took about 3x longer (over 200 seconds), but the current code only about 2.2x longer (about 13 seconds). So -1 on changing the current approach.
As noted in my most recent comment, this approach has badly worse asymptotic behavior than the current approach, so I'm opposed to changing the current approach. As explained in my comment, this is "deep", not shallow: any approach not using libmpdec will eventually be limited by the asymptotics of CPython's Karatsuba multiplication (O(n**1.585)).
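A quick back-of-the-envelope check of those growth rates (an illustrative aside, not code from the PR):

```python
import math

# Karatsuba multiplication costs roughly O(n ** 1.585), where 1.585 ~ log2(3).
# So doubling the number of bits should roughly triple the running time,
# while a quadratic algorithm would quadruple it.
karatsuba_exponent = math.log2(3)
print(2 ** karatsuba_exponent)   # ~3.0: time ratio when the bit count doubles
print(2 ** 2)                    # 4: the same ratio for a quadratic algorithm

# This matches the timings above: going from 32_768_000 to 65_536_000 bits,
# the Karatsuba-bound code took ~3x longer, the decimal-based code ~2.2x.
```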
A different idea: if you know how to detect whether the C decimal implementation is in use, the new code could be selected only when it isn't. I confess I don't know all the relevant details. For example, perhaps a simple check at import time is enough. If that could be made to work, then nothing else would need to change.
One way to detect whether we have the C version or the pure-Python one:

```python
if hasattr(decimal, '__libmpdec_version__'):
    # We have the C version.
    ...
else:
    # We have the Python version.
    ...
```
Unfortunately, the pure-Python implementation sets __libmpdec_version__ too (see line 57 of Lib/_pydecimal.py at e54b0c8).
Dang - I could have sworn I tried that. I asked on Discourse, and Steve Dower suggested this, which is pretty straightforward:

```python
import decimal
try:
    import _decimal
except ImportError:
    ...  # decimal-free implementation
else:
    ...  # the current code
```
EDIT: when doing those timings, I fiddled the new code so that it always used the new algorithm.

"Tuning" is a cross-platform nightmare, which is one reason why we did little of it when writing _pylong.py. The timing loop was still running at the largest sizes when I killed it, because I don't want to wait minutes for each new output line. So the new code was actually slower at the smallest lengths tried, and very much slower at the largest. The "sweet spot" (where it's faster) ranged from about 8K to somewhere well under 500K bits.

Part of the GMP release process is consuming weeks of CPU time to find cutoffs unique to each major platform they run on. Since CPython will never have the bandwidth for that, "don't bother at all" is my preference - but suit yourself. Just be aware that, e.g., you're making conversions of very large ints slower.
Note that there are already two "cache powers" functions in the file. These functions are much more efficient "in the large": for example, if the power is 40, they compute the 20th power first and just square it. If you're only doing one power, it's much the same effort overall. But when computing powers multiple times, the squaring-and-caching approach wins.
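A minimal sketch of that squaring-plus-caching idea; the helper name pow5 and its cache are made up for illustration and are not the actual cache code in _pylong.py:

```python
_pow5_cache = {0: 1, 1: 5}

def pow5(e):
    # Compute 5**e by squaring the cached half power: pow5(40) computes
    # pow5(20) once and squares it instead of doing 40 naive
    # multiplications, and repeated queries then hit the cache directly.
    if e not in _pow5_cache:
        half = pow5(e // 2)
        result = half * half
        if e & 1:
            result *= 5
        _pow5_cache[e] = result
    return _pow5_cache[e]

assert pow5(40) == 5 ** 40
```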
FYI, I tried copying in the existing power-caching code. Unfortunately, it's significantly slower to do the equivalent (which also removes the shifting) of

```python
d = w5pow(w2)                # cached 5**w2
hi = (n >> w2) // d          # n // 10**w2, since 10**w2 == 5**w2 << w2
lo = n - (hi * d << w2)      # n % 10**w2
```

I don't have a strong opinion about this. The potential 5% isn't that impressive. A better reason to adopt it would be for internal consistency (since it's how all other functions in the file already cache powers).
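A small sanity check of the identity behind that snippet (n and w2 here are arbitrary test values, not from the PR):

```python
# 10**w2 == (5**w2) << w2, so dividing by 5**w2 after a right shift by w2
# splits n at w2 decimal digits without ever forming 10**w2.
n, w2 = 123456789012345678901234567890, 7
d = 5 ** w2                  # stands in for the cached w5pow(w2)
hi = (n >> w2) // d
lo = n - (hi * d << w2)

assert hi == n // 10 ** w2
assert lo == n % 10 ** w2
print(hi, lo)
```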
In other news 😉, I'm dropping my suggestion that it's better to do just one mass concatenation at the end. It is better "in theory", but it makes no measurable difference in practice at these sizes.
About determining the sweet spot: did you test this all on an optimized build? On my computer the cutoff is around 900_000 bits on an optimized build and 450_000 bits on a debug build. From your data, it is between 256_000 and 512_000 bits, but if that was a debug build, it matches my data. If you used an optimized build, I'll reduce the upper limit to 450_000. Did you compare the new implementation with the decimal-based one?

About the powers calculation: I tested several schemes before, including schemes that precompute all necessary powers of 10 (we can minimize the number of different powers, so only a relatively small number of them is needed).

About concatenating the chunks: in my implementation I used more straightforward code to omit padding in the first chunk. I also tested your implementation. It does not make any difference. O(d log d) should play a role for larger d, but it is dwarfed by the complexity of the main algorithm. And for smaller d, the fancy concatenation adds an overhead of O(d), so it does not have a sweet spot.
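For readers following the padding point, a toy sketch of what "omit padding in the first chunk" means; the helper name and chunk width are made up for illustration and this is not the PR's actual code:

```python
def join_chunks(chunks, width):
    # chunks[0] holds the most significant decimal digits and keeps its
    # natural length; every later chunk is zero-padded to the fixed width
    # so no digits are dropped in the middle of the number.
    parts = [str(chunks[0])]
    parts += [str(c).zfill(width) for c in chunks[1:]]
    return ''.join(parts)

# 1234 * 10**8 + 56 * 10**4 + 789
print(join_chunks([1234, 56, 789], 4))   # -> "123400560789"
```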
All my testing was against a release (optimized) build of Python.

You can suit yourself for fine-tuning of input ranges. My preference is not to bother with it, because I already pissed away too many years of my life shooting at moving targets. "The best" ranges can - and do - vary with the specific CPU, compiler, compiler release, and compiler options used. In CPython, such cutoff points get hard-coded once, and are typically never reviewed again, no matter how many years pass and how much technology (HW and compiler) changes. That is one reason why we'll never be competitive with GMP (which puts major, ongoing effort into updating its many "cutoff points", unique to each major platform).

For this specific problem, anyone who cares a lot about conversion speed will use a GMP-based library instead. So I see Python's own int->str efforts as being overwhelmingly just about avoiding quadratic-time disasters.
Co-authored-by: Tim Peters <tim.peters@gmail.com>
There's another insecurity, but I'm not inclined to "do something" about it. We're running Python code here, so other threads can run while our conversion code is running. There's nothing to stop them from, say, mutating state this code relies on in the middle of a conversion. I expect that making that bulletproof would be way more trouble than it's worth (approximately nothing).
In the normal case it only calls…
Thank you for your review and discussion, @tim-one. I included the modified PR description as the commit message.
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12.
…mize int to str conversion (pythonGH-118483)

For converting large ints to strings, CPython invokes a function in _pylong.py, which uses the decimal module to implement an asymptotically waaaaay sub-quadratic algorithm. But if the C decimal module isn't available, CPython uses _pydecimal.py instead. Which in turn frequently does str(int). If the int is very large, _pylong ends up doing the work, which in turn asks decimal to do "big" arithmetic, which in turn calls str(big_int), which in turn ... it can become infinite mutual recursion.

This change introduces a different int->str function that doesn't use decimal. It's asymptotically worse, "Karatsuba time" instead of quadratic time, so still a huge improvement. _pylong switches to that when the C decimal isn't available. It is also used for not too large integers (less than 450_000 bits), where it is faster (up to 2 times for 30_000 bits) than the asymptotically better implementation that uses the C decimal.

(cherry picked from commit 711c80b)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Tim Peters <tim.peters@gmail.com>
GH-118590 is a backport of this pull request to the 3.12 branch.
I think that it will not harm to backport this one too. The people who encounter this issue either have no control over the availability of the C _decimal module or have intentionally disabled it.
The underlying issue here: for converting large ints to strings, CPython invokes a function in _pylong.py, which uses the decimal module to implement an asymptotically waaaaay sub-quadratic algorithm.

But if the C decimal module isn't available, CPython uses _pydecimal.py instead. Which in turn frequently does str(int). If the int is very large, _pylong ends up doing the work, which in turn asks decimal to do "big" arithmetic, which in turn calls str(big_int), which in turn ... it can become infinite mutual recursion.

This PR introduces a different int->str function that doesn't use decimal. It's asymptotically worse, "Karatsuba time" instead of quadratic time, so still a huge improvement. The hope (well, my - Tim's - hope) is that _pylong can switch to that when the C decimal isn't available.

Linked issue: …_decimal module is missing #118164
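To make the cycle concrete, here is a hedged toy model of the mutual recursion described above; the function names are stand-ins for the real modules, not the actual CPython call chain:

```python
import sys
sys.setrecursionlimit(50)   # keep the demo short

def pylong_like(n):
    # mimics _pylong: hands the "big" arithmetic/formatting to decimal
    return pydecimal_like(n)

def pydecimal_like(n):
    # mimics pure-Python _pydecimal: formats its coefficient via str(int),
    # which for huge ints is delegated back to _pylong
    return pylong_like(n)

try:
    pylong_like(10 ** 100_000)
except RecursionError:
    print("infinite mutual recursion, as described above")
```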