-
Notifications
You must be signed in to change notification settings - Fork 396
Fix #1201: Correct regex support. #4455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a382b3c
to
5f24a09
Compare
Might be a stupid idea, but could one maybe transform the string to match in such a way that it can be more simply matched by a JS regex? |
If this has a non-negligible impact on resulting JS sizes, can we please also have a setting to opt-out of this new behaviour? |
I had thought about that to implement support for
I haven't looked at the code size impact yet. Given the size of the implementation in Scala (1.5 kLoC), I guess it will be 3-4 kB of fastOpt code. But no, we won't be able to have a switch to disable the behavior. This is no different from any other method of the JDK. If you don't want its code size, don't use it and use |
3b86df5
to
bfabfa9
Compare
bfabfa9
to
8dd831f
Compare
With the current state of this PR, where I have made some effort not to produce unreasonable code by accident, the size increases are of
Looks like I was quite off the mark for my wild estimation above. That said, I don't think it fundamentally changes the problem. As I've been working on this, I noticed how broken blindly using |
8dd831f
to
55909f6
Compare
55909f6
to
64b0614
Compare
I think we should chose safety before generating stuff that doesn't work. Basically:
(our usual approach here is auto-detect and polyfill; that is OK IMO as long as we actually polyfill, which here is probably not up for discussion). So if somebody ends up using a feature, they simply need to acknowledge in their build once that they are OK with that ES version.
While I agree with this reasoning, I think there is another point we need to consider: For all practical intents and purposes, bumping the codesize of a thing by that much is a backwards incompatible change (this is also where the argument is imprecise IMO: for most javalib stuff, we just add new things, so use it or leave it, existing things remain unchanged or get bugfixes that don't affect actual usage for the worse). With such a big bump, maintainers of codebases that heavily relied on regexes (and have worked "just fine" for ~7 years), would now be faced with the choice of either:
I do not think we should leave our users in this situation. I do not have a solution for this right now (that doesn't involve dependency hellscape for at least one side of the equation), so I'll keep thinking. |
64b0614
to
26ac524
Compare
Supposedly, a codebase such that adjusting to js.RegExp would be a significant cost is also a large enough codebase that 17 kB won't make a real difference, would it? I'm more worried about the actual correctness issues, in terms of breaking changes. It's not impossible that there are codebases that actually rely on |
Hmmm... That's a good point. Although I'm not sure that's the case. So for example, a case I was concerned about is usage of // from the doc of scala.util.matching.Regex
val date = raw"(\d{4})-(\d{2})-(\d{2})".r
"2004-01-20" match {
case date(year, month, day) => s"$year was a good year for PLs."
} Will have to be transformed into: val date = new JSExtractor(new js.RegExp("(\d{4})-(\d{2})-(\d{2})"))
"2004-01-20" match {
case date(year, month, day) => s"$year was a good year for PLs."
}
class JSExtractor(...) {
def unapply(...) = ...
} But I have no clue how to quantify cost of transformation with library size :-/
That is true. OTOH though, as the referenced bugs on this PR prove, it has been considered a bug by many and so far, no fix towards the Java behavior has been filed as a bug / regression. |
26ac524
to
fb42a4b
Compare
I pushed new commits that implement that strategy. As a bonus, this brings the fullOpt increase down from 17 KB to 15 kB in the default configuration (producing ES 2015 code). That's because the code dealing with |
fb42a4b
to
19fa403
Compare
Anything that has new configuration will require a minor version bump. Since we don't have anything else in flight that can go in a patch release, I suggest we release 1.5.1 now. @gzm0 If you agree, I prepared a version commit here: sjrd@3746189 |
d42bab9
to
2a9f870
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I think I have addressed all the comments so far, except in the readme. I need to do a second pass for the readme.
I'm pretty sure that, besides rebases, the only thing that has changed over the last month or so is the following diff: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Now I think I have addressed all the comments in the readme as well.
b701816
to
86cd7e9
Compare
f2604f5
to
2d6b118
Compare
@gzm0 I see that you've marked a lot of comments as Resolved, but haven't submitted a new review. Is it just that you accidentally left a Pending Review without submitting it, or you truly haven't finished it yet? |
I'm not done yet. Unfortunately resolved state is not part of the review batch (but I absolutely need it, otherwise I have no clue what I still need to look at :P). |
Ah ah, yes, that makes sense :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second round except for the tests. I'll start on them now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
That |
03f7b46
to
0221e33
Compare
It turns out the issues were related to the version of Scala being used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finish line :) Only minor stuff on the tests. Everything else LGTM.
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Outdated
Show resolved
Hide resolved
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Adding more tests for comments, following your review, made me realize that comments are also supposed to work in character classes. So there is one more comment that implements that.
test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala
Show resolved
Hide resolved
Instead, use the default one from the sbt build. This version number has been out of date several times in the past, because we always forget to update it. It is easier not to have to update it at all.
This avoids repeating them in several places the build, potentially creating mismatches.
For the upcoming changes to `java.util.regex.*`.
Previously, `java.util.regex.Pattern` was implemented without much concern for correctness wrt. the Java semantics of regular expressions. Patterns were passed through to the native `RegExp` with minimal preprocessing. This could cause several kinds of incompatibilities: - Throwing `ParseError`s for features not supported by JS regexes, - Or worse, silently compile with different semantics. In this commit, we correctly implement virtually all the features of Java regular expressions by compiling Java patterns down to JS patterns with the same semantics. This change introduces a significant code size regression for code bases that were already using `Pattern` and/or Scala's `Regex`. Therefore, we went to great lengths to optimize the compiler for code size, in particular in the default ES 2015 configuration. This means that some code is not as idiomatic as it could be. The impact of this commit on a typical output is approximately 65 KB for fastOpt and 12 KB for fullOpt. The `README.md` file contains an extensive explanation of the design of the compiler, and of how it compiles each kind of Java pattern. In addition to fixing the mega-issue scala-js#1201, this commit fixes the smaller issues scala-js#105, scala-js#1677, scala-js#1847, scala-js#2082 and scala-js#3959, which had been closed as duplicates of scala-js#1201.
0a436a6
to
ed71c83
Compare
Thanks! 🎉 |
This PR correctly implements
java.util.regex
by compiling Java regexes down to JavaScript RegExp's that do the same thing.Until now, we basically passed regexes through to JavaScript's
RegExp
, not caring at all whether the behavior was going to be as specified or not. This could result in:SyntaxError
s, for features not supported by JSIn this commit, we compile everything we can compile, and throw
PatternSyntaxException
s for things that we cannot faithfully compile.Some features require the underlying JS
RegExp
to support ES 2018:MULTILINE
andUNICODE_CHARACTER_CLASS
flags\p{}
classes for Unicode propertiesIn addition,
UNICODE_CASE
requires support for theu
flag, which is ES 2015.CANON_EQ
is never supported.Everything else is supported in all cases. Even correct handling of surrogate pairs is supported in ES 5.1, which does not support the
u
flag.My main open question is: what do we do about the features that require ES 2018 support? As currently implemented, they will throw
PatternSyntaxException
s at run-time if the JS engine does not support the required features. This could result in surprises if the developers test using a recent Node.js version, but then deploy in environments that are not guaranteed to support ES 2018. This can even happen in unexpected situations, for example theMULTILINE
flag requires look-behind assertions, even though JS'RegExp
has a support for anm
flag even in ES 5.1. The problem is that JS' notion ofm
is not equivalent to Java's notion ofMULTILINE
, so we can't delegate to it.Of course, run-time errors depending on newer ES features can already happen, but never through usage of JDK APIs. It can happen through direct usages of JavaScript APIs, even some declared in our standard library, but they are marked as such in the Scaladoc.
I don't think we have precedent for JDK features that are conditional on ES features. How should we address that? Possible ideas are:
PatternSyntaxException
s for features that would need the ES 2018 features.@gzm0 WDYT?
To-do
[...]
character classes