Skip to content

Opt: Branchless addition, subtraction and negation for RuntimeLong. #5184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 7, 2025

Conversation

sjrd
Copy link
Member

@sjrd sjrd commented May 29, 2025

Hacker's Delight offers branchless formulas for double-word additions and subtraction, without access to the machine's carry bit.

Although the new formula contains more elementary instructions, removal of the branch is significant. When one operand is constant, folding reduces to 3 bitwise operations to compute the carry, which is very fast. When both operands are variable, then the carry is often 50/50 unpredictable, which means the branch is unpredictable, and removing it is worth the full 5 bitwise operations anyway.

Negation does not need a special code path anymore. Regular folding of the 0L - b formula yields optimal, branchless code anyway. We remove RuntimeLong.neg and its code paths in the optimizer and emitter. While we're there, we also remove RuntimeLong.not, since the regular code paths for -1L ^ b fold in the same way.

This change reduces execution time of the sha512 benchmark by a whopping 25-30%.

@sjrd sjrd requested a review from gzm0 May 29, 2025 13:25
Copy link
Contributor

@tanishiking tanishiking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side, nice bit hacking optimizations 😃
For double-length add/sub, double checked hacker's delight, and it looks identical to the book. (it's fun to see the algorithm is implemented in real world).

else
PreTransBinaryOp(Int_>>, lhs, PreTransLit(IntLiteral(dist)))
} else {
val lhs2 = simplifyOnlyInterestedInMask(lhs, (-1) << dist)
Copy link
Contributor

@tanishiking tanishiking May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[note]
The bit pattern of (-1) << dist: upper 32 - dist bits are 1, and rest are 0.
That masks bit patterns in lhs (upper bits) that still affect the result after the shift operation.

* Likewise, if its msb is 1, we can replace `alo` by -1. That also allows
* to fold the leftmost `&` and the innermost `|` (in different ways).
*
* The simplification performed in this method is capable of performing that
Copy link
Contributor

@tanishiking tanishiking May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[note]
Ah, I see. I understand that it's possible to mask the necessary bit pattern and replace the other parts with arbitrary values. and I was wondering what the point of the transformation was.

Now I get it, if constants can be replaced with 0 or -1, then further optimization can possibly be performed in subexpressions. nice :)

@sjrd sjrd force-pushed the rt-long-branchless-add-sub-neg branch from ecb38a9 to c1cfbac Compare May 30, 2025 15:27
Copy link
Contributor

@gzm0 gzm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Only minor comments.

@@ -4187,6 +4179,11 @@ private[optimizer] abstract class OptimizerCore(
PreTransBinaryOp(Int_-, PreTransLit(IntLiteral(y)), z)) =>
foldBinaryOp(Int_+, PreTransLit(IntLiteral(x - y)), z)

// x - (y >> 31) --> x + (y >>> 31) and x - (y >>> 31) --> x + (y >> 31)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct, but it is unclear to me why it is useful to apply this rewrite. Could you add an explanation?

Copy link
Member Author

@sjrd sjrd Jun 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addition generally folds better than subtraction. Here, this is particularly motivated by the case where x == 0, in which we end up removing a negation.

This comes up in RuntimeLong.sub. But now that I look at it again, we could also directly change the source of RuntimeLong.sub to use + and >> 31 rather than - and >>> 31. Doing that change and removing this rewrite rule changes nothing to the output. That alters the algorithm found in Hacker's Delight, though. I guess they were not concerned about the case where the hi parts fold away to 0, so the regularity wrt. addition was more valuable.

WDYT? (the last commit is there to show that alternative)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should include the last commit: If we can adjust our library code to remove a special case in the optimizer just for that, I think that is better overall (less overall complexity).

If you feel that this case might appear elsewhere in the wild, we can keep it in the optimizer of course. (just add a comment that + is better than - :P).

@sjrd sjrd force-pushed the rt-long-branchless-add-sub-neg branch 3 times, most recently from 147c816 to 1926666 Compare June 3, 2025 07:52
@sjrd sjrd requested a review from gzm0 June 3, 2025 08:57
Copy link
Contributor

@gzm0 gzm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after squashing.

@@ -4187,6 +4179,11 @@ private[optimizer] abstract class OptimizerCore(
PreTransBinaryOp(Int_-, PreTransLit(IntLiteral(y)), z)) =>
foldBinaryOp(Int_+, PreTransLit(IntLiteral(x - y)), z)

// x - (y >> 31) --> x + (y >>> 31) and x - (y >>> 31) --> x + (y >> 31)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should include the last commit: If we can adjust our library code to remove a special case in the optimizer just for that, I think that is better overall (less overall complexity).

If you feel that this case might appear elsewhere in the wild, we can keep it in the optimizer of course. (just add a comment that + is better than - :P).

Hacker's Delight offers branchless formulas for double-word
additions and subtraction, without access to the machine's carry
bit.

Although the new formula contains more elementary instructions,
removal of the branch is significant. When one operand is
constant, folding reduces to 3 bitwise operations to compute the
carry, which is very fast. When both operands are variable, then
the carry is often 50/50 unpredictable, which means the branch
is unpredictable, and removing it is worth the full 5 bitwise
operations anyway.

Negation does not need a special code path anymore. Regular
folding of the `0L - b` formula yields optimal, branchless code
anyway. We remove `RuntimeLong.neg` and its code paths in the
optimizer and emitter. While we're there, we also remove
`RuntimeLong.not`, since the regular code paths for `-1L ^ b`
fold in the same way.

This change reduces execution time of the `sha512` benchmark by a
whopping 25-30%.
@sjrd sjrd force-pushed the rt-long-branchless-add-sub-neg branch from 1926666 to 83cdafc Compare June 7, 2025 07:42
@sjrd sjrd enabled auto-merge June 7, 2025 07:44
@sjrd sjrd merged commit 052d861 into scala-js:main Jun 7, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants