Opt: Branchless addition, subtraction and negation for `RuntimeLong`. #5184

sjrd · 2025-05-29T13:25:10Z

Hacker's Delight offers branchless formulas for double-word addition and subtraction that do not need access to the machine's carry bit.

Although the new formula contains more elementary instructions, removing the branch is significant. When one operand is constant, folding reduces the carry computation to 3 bitwise operations, which is very fast. When both operands are variable, the carry is often close to a 50/50 coin flip, so the branch is unpredictable, and removing it is worth the full 5 bitwise operations anyway.
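For illustration, the carry and borrow computations can be sketched as follows. This is a minimal sketch of the textbook formulas, assuming a 64-bit value is held as a pair of 32-bit `Int` words. `DoubleWordSketch` and its method names are hypothetical, and the actual code in `RuntimeLong` may be written differently.

```scala
object DoubleWordSketch {
  /** Branchless double-word addition on (lo, hi) pairs of 32-bit words. */
  def add(alo: Int, ahi: Int, blo: Int, bhi: Int): (Int, Int) = {
    val lo = alo + blo
    // The msb of this expression is the carry out of the low word.
    val carry = ((alo & blo) | ((alo | blo) & ~lo)) >>> 31
    (lo, ahi + bhi + carry)
  }

  /** Branchless double-word subtraction on (lo, hi) pairs of 32-bit words. */
  def sub(alo: Int, ahi: Int, blo: Int, bhi: Int): (Int, Int) = {
    val lo = alo - blo
    // The msb of this expression is the borrow out of the low word.
    val borrow = ((~alo & blo) | ((~alo | blo) & lo)) >>> 31
    (lo, ahi - bhi - borrow)
  }

  /** Reassembles a Long from its two words (helper for checking only). */
  def toLong(lo: Int, hi: Int): Long =
    (hi.toLong << 32) | (lo.toLong & 0xFFFFFFFFL)

  def addLong(a: Long, b: Long): Long = {
    val (lo, hi) = add(a.toInt, (a >>> 32).toInt, b.toInt, (b >>> 32).toInt)
    toLong(lo, hi)
  }

  def subLong(a: Long, b: Long): Long = {
    val (lo, hi) = sub(a.toInt, (a >>> 32).toInt, b.toInt, (b >>> 32).toInt)
    toLong(lo, hi)
  }
}
```

Comparing `addLong` and `subLong` against native `Long` arithmetic on edge cases such as `-1L + 1L`, where the carry crosses the word boundary, is a quick way to sanity-check the formulas.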

Negation does not need a special code path anymore. Regular folding of the `0L - b` formula yields optimal, branchless code anyway. We remove `RuntimeLong.neg` and its code paths in the optimizer and emitter. While we're there, we also remove `RuntimeLong.not`, since the regular code paths for `-1L ^ b` fold in the same way.
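To make the folding concrete, here is a hedged sketch of what the subtraction and xor formulas reduce to once the left operand is the constant `0L` or `-1L`. It assumes the textbook borrow formula `((~alo & blo) | ((~alo | blo) & lo)) >>> 31`. `NegNotSketch` is a hypothetical name, and the code the optimizer actually emits is not shown in this PR.

```scala
object NegNotSketch {
  /** 0L - b, after folding alo = 0 and ahi = 0 into the subtraction formula. */
  def neg(blo: Int, bhi: Int): (Int, Int) = {
    val lo = -blo
    // (~0 & blo) | ((~0 | blo) & lo) simplifies to blo | lo.
    val borrow = (blo | lo) >>> 31
    (lo, -bhi - borrow)
  }

  /** -1L ^ b: xor with all ones is a bitwise not on each word. */
  def not(blo: Int, bhi: Int): (Int, Int) = (~blo, ~bhi)

  /** Reassembles a Long from its two words (helper for checking only). */
  def toLong(words: (Int, Int)): Long =
    (words._2.toLong << 32) | (words._1.toLong & 0xFFFFFFFFL)
}
```

Neither method needs a branch, which is consistent with the claim that a dedicated negation path is no longer necessary.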

This change reduces execution time of the `sha512` benchmark by a whopping 25-30%.

tanishiking

Looks good from my side, nice bit hacking optimizations 😃
For the double-length add/sub, I double-checked against Hacker's Delight, and the implementation looks identical to the book. (It's fun to see the algorithm implemented in the real world.)

linker-private-library/src/main/scala/org/scalajs/linker/runtime/RuntimeLong.scala

tanishiking · 2025-05-29T16:08:12Z

linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala

-            else
-              PreTransBinaryOp(Int_>>, lhs, PreTransLit(IntLiteral(dist)))
+            } else {
+              val lhs2 = simplifyOnlyInterestedInMask(lhs, (-1) << dist)


[note]
The bit pattern of `(-1) << dist`: the upper `32 - dist` bits are 1, and the rest are 0.
The mask therefore selects exactly the upper bits of `lhs` that still affect the result after the shift operation.
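The property described in this note can be checked on plain integers. The sketch below only illustrates why the low `dist` bits are irrelevant to a right shift. It does not call the optimizer's `simplifyOnlyInterestedInMask`, and `ShiftMaskSketch` is a hypothetical name.

```scala
object ShiftMaskSketch {
  /** Mask with the upper (32 - dist) bits set, for dist in 0 to 31. */
  def interestingBits(dist: Int): Int = (-1) << dist

  /** True if clearing the bits outside the mask leaves x >> dist unchanged. */
  def shiftIgnoresLowBits(x: Int, dist: Int): Boolean =
    (x >> dist) == ((x & interestingBits(dist)) >> dist)
}
```

Since the bits outside the mask never reach the result, a simplification of `lhs` is free to replace them with whatever value folds best.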

tanishiking · 2025-05-29T16:19:43Z

linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala

+   *  Likewise, if its msb is 1, we can replace `alo` by -1. That also allows
+   *  to fold the leftmost `&` and the innermost `|` (in different ways).
+   *
+   *  The simplification performed in this method is capable of performing that


[note]
Ah, I see. I understood that it's possible to keep only the bits selected by the mask and replace the other bits with arbitrary values, but I was wondering what the point of the transformation was.

Now I get it: if constants can be replaced with 0 or -1, then further optimizations may become possible in subexpressions. Nice :)
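A small sketch of that effect on the carry computation, assuming the textbook formula `((alo & blo) | ((alo | blo) & ~lo)) >>> 31`. Only the msb of the expression survives the `>>> 31`, so inside it a constant `alo` can stand in as 0 or -1 depending on its msb. `ConstCarrySketch` and its method names are hypothetical.

```scala
object ConstCarrySketch {
  /** Unfolded carry formula, for reference. */
  def carryGeneric(alo: Int, blo: Int): Int = {
    val lo = alo + blo
    ((alo & blo) | ((alo | blo) & ~lo)) >>> 31
  }

  /** alo is a constant whose msb is 0: treat it as 0 inside the carry. */
  def carryMsb0(alo: Int, blo: Int): Int = {
    val lo = alo + blo // the sum itself still uses the real constant
    // (0 & blo) | ((0 | blo) & ~lo) folds to blo & ~lo.
    (blo & ~lo) >>> 31
  }

  /** alo is a constant whose msb is 1: treat it as -1 inside the carry. */
  def carryMsb1(alo: Int, blo: Int): Int = {
    val lo = alo + blo
    // (-1 & blo) | ((-1 | blo) & ~lo) folds to blo | ~lo.
    (blo | ~lo) >>> 31
  }
}
```

In both cases the leftmost `&` and the innermost `|` disappear, which leaves a much shorter carry expression than the generic one.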

gzm0

Nice! Only minor comments.

linker-private-library/src/main/scala/org/scalajs/linker/runtime/RuntimeLong.scala

linker/shared/src/main/scala/org/scalajs/linker/backend/emitter/FunctionEmitter.scala

linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala

gzm0 · 2025-06-01T06:24:52Z

linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala

@@ -4187,6 +4179,11 @@ private[optimizer] abstract class OptimizerCore(
              PreTransBinaryOp(Int_-, PreTransLit(IntLiteral(y)), z)) =>
            foldBinaryOp(Int_+, PreTransLit(IntLiteral(x - y)), z)

+          // x - (y >> 31) -->  x + (y >>> 31)  and  x - (y >>> 31) --> x + (y >> 31)


This looks correct, but it is unclear to me why it is useful to apply this rewrite. Could you add an explanation?

sjrd · 2025-06-01

Addition generally folds better than subtraction. Here, this is particularly motivated by the case where `x == 0`, in which we end up removing a negation.

This comes up in `RuntimeLong.sub`. But now that I look at it again, we could also directly change the source of `RuntimeLong.sub` to use `+` and `>> 31` rather than `-` and `>>> 31`. Making that change and removing this rewrite rule does not change the output at all. That alters the algorithm found in Hacker's Delight, though. I guess they were not concerned about the case where the hi parts fold away to 0, so the regularity with respect to addition was more valuable.

WDYT? (The last commit is there to show that alternative.)

gzm0 · 2025-06-01T10:49:28Z

I think we should include the last commit: if we can adjust our library code to remove a special case in the optimizer that exists just for it, I think that is better overall (less overall complexity).

If you feel that this case might appear elsewhere in the wild, we can of course keep it in the optimizer (just add a comment that `+` is better than `-` :P).
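The identity behind the rewrite discussed in this thread can be checked with a short sketch: `y >>> 31` is 0 or 1, `y >> 31` is 0 or -1, and each is the negation of the other. `ShiftSignSketch` and its method names are hypothetical and do not come from the PR.

```scala
object ShiftSignSketch {
  /** Form with subtraction and a logical shift. */
  def subLogical(x: Int, y: Int): Int = x - (y >>> 31)

  /** Equivalent form with addition and an arithmetic shift. */
  def addArithmetic(x: Int, y: Int): Int = x + (y >> 31)

  /** With x == 0 the subtraction form is a negation; the addition form has none. */
  def negatedSignBit(y: Int): Int = -(y >>> 31)
  def signMask(y: Int): Int = y >> 31
}
```

When `x` folds away to 0, the addition form leaves just `y >> 31`, whereas the subtraction form leaves a negation behind.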

linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala

gzm0

LGTM after squashing.

sjrd requested a review from gzm0 May 29, 2025 13:25

tanishiking approved these changes May 29, 2025

sjrd force-pushed the rt-long-branchless-add-sub-neg branch from ecb38a9 to c1cfbac Compare May 30, 2025 15:27

gzm0 requested changes Jun 1, 2025

sjrd force-pushed the rt-long-branchless-add-sub-neg branch 3 times, most recently from 147c816 to 1926666 Compare June 3, 2025 07:52

sjrd requested a review from gzm0 June 3, 2025 08:57

gzm0 approved these changes Jun 7, 2025

sjrd force-pushed the rt-long-branchless-add-sub-neg branch from 1926666 to 83cdafc Compare June 7, 2025 07:42

sjrd enabled auto-merge June 7, 2025 07:44

sjrd merged commit 052d861 into scala-js:main Jun 7, 2025
3 checks passed
