-
Notifications
You must be signed in to change notification settings - Fork 396
Opt: Branchless addition, subtraction and negation for RuntimeLong
.
#5184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good from my side, nice bit hacking optimizations 😃
For double-length add/sub, double checked hacker's delight, and it looks identical to the book. (it's fun to see the algorithm is implemented in real world).
linker-private-library/src/main/scala/org/scalajs/linker/runtime/RuntimeLong.scala
Outdated
Show resolved
Hide resolved
else | ||
PreTransBinaryOp(Int_>>, lhs, PreTransLit(IntLiteral(dist))) | ||
} else { | ||
val lhs2 = simplifyOnlyInterestedInMask(lhs, (-1) << dist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[note]
The bit pattern of (-1) << dist
: upper 32 - dist
bits are 1, and rest are 0.
That masks bit patterns in lhs
(upper bits) that still affect the result after the shift operation.
* Likewise, if its msb is 1, we can replace `alo` by -1. That also allows | ||
* to fold the leftmost `&` and the innermost `|` (in different ways). | ||
* | ||
* The simplification performed in this method is capable of performing that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[note]
Ah, I see. I understand that it's possible to mask the necessary bit pattern and replace the other parts with arbitrary values. and I was wondering what the point of the transformation was.
Now I get it, if constants can be replaced with 0 or -1, then further optimization can possibly be performed in subexpressions. nice :)
ecb38a9
to
c1cfbac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Only minor comments.
linker-private-library/src/main/scala/org/scalajs/linker/runtime/RuntimeLong.scala
Show resolved
Hide resolved
linker-private-library/src/main/scala/org/scalajs/linker/runtime/RuntimeLong.scala
Outdated
Show resolved
Hide resolved
linker/shared/src/main/scala/org/scalajs/linker/backend/emitter/FunctionEmitter.scala
Show resolved
Hide resolved
linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala
Outdated
Show resolved
Hide resolved
linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala
Show resolved
Hide resolved
@@ -4187,6 +4179,11 @@ private[optimizer] abstract class OptimizerCore( | |||
PreTransBinaryOp(Int_-, PreTransLit(IntLiteral(y)), z)) => | |||
foldBinaryOp(Int_+, PreTransLit(IntLiteral(x - y)), z) | |||
|
|||
// x - (y >> 31) --> x + (y >>> 31) and x - (y >>> 31) --> x + (y >> 31) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct, but it is unclear to me why it is useful to apply this rewrite. Could you add an explanation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addition generally folds better than subtraction. Here, this is particularly motivated by the case where x == 0
, in which we end up removing a negation.
This comes up in RuntimeLong.sub
. But now that I look at it again, we could also directly change the source of RuntimeLong.sub
to use +
and >> 31
rather than -
and >>> 31
. Doing that change and removing this rewrite rule changes nothing to the output. That alters the algorithm found in Hacker's Delight, though. I guess they were not concerned about the case where the hi parts fold away to 0, so the regularity wrt. addition was more valuable.
WDYT? (the last commit is there to show that alternative)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should include the last commit: If we can adjust our library code to remove a special case in the optimizer just for that, I think that is better overall (less overall complexity).
If you feel that this case might appear elsewhere in the wild, we can keep it in the optimizer of course. (just add a comment that + is better than - :P).
linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala
Outdated
Show resolved
Hide resolved
147c816
to
1926666
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after squashing.
@@ -4187,6 +4179,11 @@ private[optimizer] abstract class OptimizerCore( | |||
PreTransBinaryOp(Int_-, PreTransLit(IntLiteral(y)), z)) => | |||
foldBinaryOp(Int_+, PreTransLit(IntLiteral(x - y)), z) | |||
|
|||
// x - (y >> 31) --> x + (y >>> 31) and x - (y >>> 31) --> x + (y >> 31) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should include the last commit: If we can adjust our library code to remove a special case in the optimizer just for that, I think that is better overall (less overall complexity).
If you feel that this case might appear elsewhere in the wild, we can keep it in the optimizer of course. (just add a comment that + is better than - :P).
linker/shared/src/main/scala/org/scalajs/linker/frontend/optimizer/OptimizerCore.scala
Show resolved
Hide resolved
Hacker's Delight offers branchless formulas for double-word additions and subtraction, without access to the machine's carry bit. Although the new formula contains more elementary instructions, removal of the branch is significant. When one operand is constant, folding reduces to 3 bitwise operations to compute the carry, which is very fast. When both operands are variable, then the carry is often 50/50 unpredictable, which means the branch is unpredictable, and removing it is worth the full 5 bitwise operations anyway. Negation does not need a special code path anymore. Regular folding of the `0L - b` formula yields optimal, branchless code anyway. We remove `RuntimeLong.neg` and its code paths in the optimizer and emitter. While we're there, we also remove `RuntimeLong.not`, since the regular code paths for `-1L ^ b` fold in the same way. This change reduces execution time of the `sha512` benchmark by a whopping 25-30%.
1926666
to
83cdafc
Compare
Hacker's Delight offers branchless formulas for double-word additions and subtraction, without access to the machine's carry bit.
Although the new formula contains more elementary instructions, removal of the branch is significant. When one operand is constant, folding reduces to 3 bitwise operations to compute the carry, which is very fast. When both operands are variable, then the carry is often 50/50 unpredictable, which means the branch is unpredictable, and removing it is worth the full 5 bitwise operations anyway.
Negation does not need a special code path anymore. Regular folding of the
0L - b
formula yields optimal, branchless code anyway. We removeRuntimeLong.neg
and its code paths in the optimizer and emitter. While we're there, we also removeRuntimeLong.not
, since the regular code paths for-1L ^ b
fold in the same way.This change reduces execution time of the
sha512
benchmark by a whopping 25-30%.