Fix #1201: Correct regex support. #4455

sjrd · 2021-03-26T18:31:06Z

This PR correctly implements java.util.regex by compiling Java regexes down to JavaScript RegExp's that do the same thing.

Until now, we basically passed regexes through to JavaScript's RegExp, not caring at all whether the behavior was going to be as specified or not. This could result in:

SyntaxErrors, for features not supported by JS
or worse, regexes that successfully compile, but behave differently than their spec

In this commit, we compile everything we can compile, and throw PatternSyntaxExceptions for things that we cannot faithfully compile.

Some features require the underlying JS RegExp to support ES 2018:

the MULTILINE and UNICODE_CHARACTER_CLASS flags
look-behind assertions
\p{} classes for Unicode properties

In addition, UNICODE_CASE requires support for the u flag, which is ES 2015. CANON_EQ is never supported.

Everything else is supported in all cases. Even correct handling of surrogate pairs is supported in ES 5.1, which does not support the u flag.

My main open question is: what do we do about the features that require ES 2018 support? As currently implemented, they will throw PatternSyntaxExceptions at run-time if the JS engine does not support the required features. This could result in surprises if the developers test using a recent Node.js version, but then deploy in environments that are not guaranteed to support ES 2018. This can even happen in unexpected situations, for example the MULTILINE flag requires look-behind assertions, even though JS' RegExp has a support for an m flag even in ES 5.1. The problem is that JS' notion of m is not equivalent to Java's notion of MULTILINE, so we can't delegate to it.

Of course, run-time errors depending on newer ES features can already happen, but never through usage of JDK APIs. It can happen through direct usages of JavaScript APIs, even some declared in our standard library, but they are marked as such in the Scaladoc.

I don't think we have precedent for JDK features that are conditional on ES features. How should we address that? Possible ideas are:

Do nothing, and accept that run-time crashes may happen; developers should test under the environments that they care about.
Provide an opt-in link-time config to enable the ES 2018-dependent features; if it is not set, we would unconditionally throw PatternSyntaxExceptions for features that would need the ES 2018 features.
Not provide those features at all (I don't think this is a good idea)

@gzm0 WDYT?

To-do

Flesh out the open question above
More tests (what I have so far is quite decent, but there really are a lot of possible combinations that need testing)
Clean up, especially the compilation of [...] character classes
Better document the implementation

nightscape · 2021-03-26T21:17:53Z

Might be a stupid idea, but could one maybe transform the string to match in such a way that it can be more simply matched by a JS regex?

japgolly · 2021-03-26T21:26:06Z

If this has a non-negligible impact on resulting JS sizes, can we please also have a setting to opt-out of this new behaviour?

sjrd · 2021-03-26T22:40:51Z

Might be a stupid idea, but could one maybe transform the string to match in such a way that it can be more simply matched by a JS regex?

I had thought about that to implement support for CANON_EQ, but it doesn't work, because it also changes the captured substrings, and that is not correct behavior. So no, we cannot do that. Anyway, the only other feature for which it might help would be ASCII case insensitive, but that's not really hard to implement by transforming the pattern instead, so we don't gain anything.

If this has a non-negligible impact on resulting JS sizes, can we please also have a setting to opt-out of this new behaviour?

I haven't looked at the code size impact yet. Given the size of the implementation in Scala (1.5 kLoC), I guess it will be 3-4 kB of fastOpt code. But no, we won't be able to have a switch to disable the behavior. This is no different from any other method of the JDK. If you don't want its code size, don't use it and use js.RegExp instead. Just like if you don't want the Unicode database embedded in your program, don't use c.isDigit but c >= '0' && c <= '9'.

sjrd · 2021-03-28T11:25:06Z

With the current state of this PR, where I have made some effort not to produce unreasonable code by accident, the size increases are of

about 100 kB for fastOpt
about 17 kB for fullOpt

Looks like I was quite off the mark for my wild estimation above. That said, I don't think it fundamentally changes the problem. As I've been working on this, I noticed how broken blindly using js.RegExp was; virtually nothing behaves the same except the most simple combinators. Even simple things like \s or . are incorrect.

gzm0 · 2021-03-29T17:38:23Z

I don't think we have precedent for JDK features that are conditional on ES features. How should we address that? Possible ideas are:

I think we should chose safety before generating stuff that doesn't work. Basically:

Introduce ESFeatures.useECMAScript20{16-21} and LinkingInfo.assumingES20{16-21} (maybe parametrize?)
Use these to link-time DCE the JDK autodetect.
Throw when we cannot translate a construct, have a nice error message if an ES feature upgrade could allow the translation.

(our usual approach here is auto-detect and polyfill; that is OK IMO as long as we actually polyfill, which here is probably not up for discussion).

So if somebody ends up using a feature, they simply need to acknowledge in their build once that they are OK with that ES version.

Looks like I was quite off the mark for my wild estimation above. That said, I don't think it fundamentally changes the problem.

While I agree with this reasoning, I think there is another point we need to consider: For all practical intents and purposes, bumping the codesize of a thing by that much is a backwards incompatible change (this is also where the argument is imprecise IMO: for most javalib stuff, we just add new things, so use it or leave it, existing things remain unchanged or get bugfixes that don't affect actual usage for the worse).

With such a big bump, maintainers of codebases that heavily relied on regexes (and have worked "just fine" for ~7 years), would now be faced with the choice of either:

Stalling Scala.js upgrades (nobody wants that, not us, not our users).
Investing a considerable amount of effort to migrate off Java regexes (possible, but might involve a lot of new wrapper classes etc. if code is cross compiling).

I do not think we should leave our users in this situation.

I do not have a solution for this right now (that doesn't involve dependency hellscape for at least one side of the equation), so I'll keep thinking.

sjrd · 2021-03-29T18:18:24Z

With such a big bump, maintainers of codebases that heavily relied on regexes (and have worked "just fine" for ~7 years), would now be faced with the choice of either:

Supposedly, a codebase such that adjusting to js.RegExp would be a significant cost is also a large enough codebase that 17 kB won't make a real difference, would it?

I'm more worried about the actual correctness issues, in terms of breaking changes. It's not impossible that there are codebases that actually rely on Pattern behaving like JavaScript, and not as documented in the JavaDoc. Unlike our other fixes to JDK stuff, this one has been documented to behave as it does for 7 years. 😕

gzm0 · 2021-03-29T19:25:42Z

Supposedly, a codebase such that adjusting to js.RegExp would be a significant cost is also a large enough codebase that 17 kB won't make a real difference, would it?

Hmmm... That's a good point. Although I'm not sure that's the case. So for example, a case I was concerned about is usage of Pattern in some larger data structures / across library interfaces / underpinning of a library. For example, the scalalib itself is actually an example here:

// from the doc of scala.util.matching.Regex
val date = raw"(\d{4})-(\d{2})-(\d{2})".r

"2004-01-20" match {
  case date(year, month, day) => s"$year was a good year for PLs."
}

Will have to be transformed into:

val date = new JSExtractor(new js.RegExp("(\d{4})-(\d{2})-(\d{2})"))

"2004-01-20" match {
  case date(year, month, day) => s"$year was a good year for PLs."
}

class JSExtractor(...) {
  def unapply(...) = ...
}

But I have no clue how to quantify cost of transformation with library size :-/

Unlike our other fixes to JDK stuff, this one has been documented to behave as it does for 7 years

That is true. OTOH though, as the referenced bugs on this PR prove, it has been considered a bug by many and so far, no fix towards the Java behavior has been filed as a bug / regression.

sjrd · 2021-03-30T14:18:41Z

I think we should chose safety before generating stuff that doesn't work. Basically:

Introduce ESFeatures.useECMAScript20{16-21} and LinkingInfo.assumingES20{16-21} (maybe parametrize?)

Use these to link-time DCE the JDK autodetect.

Throw when we cannot translate a construct, have a nice error message if an ES feature upgrade could allow the translation.

(our usual approach here is auto-detect and polyfill; that is OK IMO as long as we actually polyfill, which here is probably not up for discussion).

So if somebody ends up using a feature, they simply need to acknowledge in their build once that they are OK with that ES version.

I pushed new commits that implement that strategy.

As a bonus, this brings the fullOpt increase down from 17 KB to 15 kB in the default configuration (producing ES 2015 code). That's because the code dealing with \p{} character classes, including the predefinedCharacterClass map, can be dead-code-eliminated. (which is actually quite remarkable)

sjrd · 2021-03-31T07:49:02Z

Anything that has new configuration will require a minor version bump. Since we don't have anything else in flight that can go in a patch release, I suggest we release 1.5.1 now.

@gzm0 If you agree, I prepared a version commit here: sjrd@3746189

sjrd

Updated. I think I have addressed all the comments so far, except in the readme. I need to do a second pass for the readme.

javalib/src/main/scala/java/util/regex/Pattern.scala

project/Build.scala

scala-test-suite/src/test/resources/2.12.11/BlacklistedTests.txt

scala-test-suite/src/test/resources/2.13.5/BlacklistedTests.txt

javalib/src/main/scala/java/util/regex/PatternCompiler.scala

sjrd · 2021-06-15T10:24:52Z

@sjrd if you actually changed something in these, please LMK.

I'm pretty sure that, besides rebases, the only thing that has changed over the last month or so is the following diff:
https://github.com/scala-js/scala-js/compare/24ce7cbb67416f1a04f0d39622002b8b9ea23272..b1aa90d795231f4caa446666947884c4b90cd189

sjrd

Updated. Now I think I have addressed all the comments in the readme as well.

javalib/src/main/scala/java/util/regex/README.md

sjrd · 2021-06-28T14:01:31Z

@gzm0 I see that you've marked a lot of comments as Resolved, but haven't submitted a new review. Is it just that you accidentally left a Pending Review without submitting it, or you truly haven't finished it yet?

gzm0 · 2021-06-28T14:03:41Z

I'm not done yet. Unfortunately resolved state is not part of the review batch (but I absolutely need it, otherwise I have no clue what I still need to look at :P).

sjrd · 2021-06-28T14:11:27Z

Ah ah, yes, that makes sense :)

gzm0

Second round except for the tests. I'll start on them now :)

javalib/src/main/scala/java/util/regex/PatternCompiler.scala

javalib/src/main/scala/java/util/regex/README.md

project/Build.scala

javalib/src/main/scala/java/util/regex/README.md

javalib/src/main/scala/java/util/regex/PatternCompiler.scala

sjrd

Updated.

javalib/src/main/scala/java/util/regex/README.md

javalib/src/main/scala/java/util/regex/PatternCompiler.scala

sjrd · 2021-07-05T13:51:59Z

That LibrarySizeTest seems to be very sensitive to the environment. The results on Windows are significantly different than on Linux. I'm not sure why yet. :-s

sjrd · 2021-07-06T08:46:20Z

It turns out the issues were related to the version of Scala being used.
Note the first two commits in the PR, which are new (and if you want I can submit them as a separate, preliminary PR).

gzm0

Finish line :) Only minor stuff on the tests. Everything else LGTM.

test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala

sjrd

Updated.

Adding more tests for comments, following your review, made me realize that comments are also supposed to work in character classes. So there is one more comment that implements that.

test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala

Instead, use the default one from the sbt build. This version number has been out of date several times in the past, because we always forget to update it. It is easier not to have to update it at all.

This avoids repeating them in several places the build, potentially creating mismatches.

For the upcoming changes to `java.util.regex.*`.

Previously, `java.util.regex.Pattern` was implemented without much concern for correctness wrt. the Java semantics of regular expressions. Patterns were passed through to the native `RegExp` with minimal preprocessing. This could cause several kinds of incompatibilities: - Throwing `ParseError`s for features not supported by JS regexes, - Or worse, silently compile with different semantics. In this commit, we correctly implement virtually all the features of Java regular expressions by compiling Java patterns down to JS patterns with the same semantics. This change introduces a significant code size regression for code bases that were already using `Pattern` and/or Scala's `Regex`. Therefore, we went to great lengths to optimize the compiler for code size, in particular in the default ES 2015 configuration. This means that some code is not as idiomatic as it could be. The impact of this commit on a typical output is approximately 65 KB for fastOpt and 12 KB for fullOpt. The `README.md` file contains an extensive explanation of the design of the compiler, and of how it compiles each kind of Java pattern. In addition to fixing the mega-issue scala-js#1201, this commit fixes the smaller issues scala-js#105, scala-js#1677, scala-js#1847, scala-js#2082 and scala-js#3959, which had been closed as duplicates of scala-js#1201.

sjrd · 2021-07-07T21:58:33Z

Thanks! 🎉
Rebased and squashed.

sjrd mentioned this pull request Mar 26, 2021

Regex differences #1201

Closed

sjrd force-pushed the almost-correct-regex branch from a382b3c to 5f24a09 Compare March 26, 2021 18:43

sjrd force-pushed the almost-correct-regex branch 2 times, most recently from 3b86df5 to bfabfa9 Compare March 27, 2021 17:41

sjrd linked an issue Mar 28, 2021 that may be closed by this pull request

String.split(x: Array[Char]) produces bad regexes #105

Closed

sjrd mentioned this pull request Mar 28, 2021

String.split(x: Array[Char]) produces bad regexes #105

Closed

sjrd force-pushed the almost-correct-regex branch from bfabfa9 to 8dd831f Compare March 28, 2021 11:16

sjrd force-pushed the almost-correct-regex branch from 8dd831f to 55909f6 Compare March 28, 2021 15:43

sjrd removed a link to an issue Mar 29, 2021

Pattern matcher groupCount throws #1727

Closed

sjrd force-pushed the almost-correct-regex branch from 55909f6 to 64b0614 Compare March 29, 2021 16:17

sjrd force-pushed the almost-correct-regex branch from 64b0614 to 26ac524 Compare March 29, 2021 17:59

sjrd force-pushed the almost-correct-regex branch from 26ac524 to fb42a4b Compare March 30, 2021 14:16

sjrd force-pushed the almost-correct-regex branch from fb42a4b to 19fa403 Compare March 30, 2021 17:17

sjrd force-pushed the almost-correct-regex branch 2 times, most recently from d42bab9 to 2a9f870 Compare March 31, 2021 09:09

sjrd commented Jun 15, 2021

View reviewed changes

javalib/src/main/scala/java/util/regex/README.md Outdated Show resolved Hide resolved

sjrd requested a review from gzm0 June 15, 2021 16:17

sjrd force-pushed the almost-correct-regex branch from b701816 to 86cd7e9 Compare June 16, 2021 08:14

sjrd force-pushed the almost-correct-regex branch 2 times, most recently from f2604f5 to 2d6b118 Compare June 28, 2021 13:52

gzm0 requested changes Jul 4, 2021

View reviewed changes

javalib/src/main/scala/java/util/regex/PatternCompiler.scala Outdated Show resolved Hide resolved

javalib/src/main/scala/java/util/regex/PatternCompiler.scala Show resolved Hide resolved

sjrd commented Jul 5, 2021

View reviewed changes

javalib/src/main/scala/java/util/regex/README.md Outdated Show resolved Hide resolved

javalib/src/main/scala/java/util/regex/README.md Outdated Show resolved Hide resolved

javalib/src/main/scala/java/util/regex/PatternCompiler.scala Show resolved Hide resolved

sjrd force-pushed the almost-correct-regex branch 2 times, most recently from 03f7b46 to 0221e33 Compare July 6, 2021 08:32

sjrd requested a review from gzm0 July 6, 2021 08:46

gzm0 requested changes Jul 7, 2021

View reviewed changes

sjrd commented Jul 7, 2021

View reviewed changes

test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/util/regex/RegexEngineTest.scala Show resolved Hide resolved

sjrd requested a review from gzm0 July 7, 2021 14:00

gzm0 approved these changes Jul 7, 2021

View reviewed changes

sjrd added 4 commits July 7, 2021 23:55

Do not explicitly set the Scala version in the AppVeyor build.

799f50d

Instead, use the default one from the sbt build. This version number has been out of date several times in the past, because we always forget to update it. It is easier not to have to update it at all.

Introduce build-level vals for the default Scala versions.

ecb47df

This avoids repeating them in several places the build, potentially creating mismatches.

Bump the version to 1.7.0-SNAPSHOT.

df81197

For the upcoming changes to `java.util.regex.*`.

sjrd force-pushed the almost-correct-regex branch from 0a436a6 to ed71c83 Compare July 7, 2021 21:58

sjrd merged commit 3c58b2c into scala-js:master Jul 8, 2021

sjrd deleted the almost-correct-regex branch July 8, 2021 03:17

Fix #1201: Correct regex support. #4455

Fix #1201: Correct regex support. #4455

Uh oh!

Conversation

sjrd commented Mar 26, 2021 • edited by gzm0 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nightscape commented Mar 26, 2021

Uh oh!

japgolly commented Mar 26, 2021

Uh oh!

sjrd commented Mar 26, 2021

Uh oh!

sjrd commented Mar 28, 2021

Uh oh!

gzm0 commented Mar 29, 2021

Uh oh!

sjrd commented Mar 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gzm0 commented Mar 29, 2021

Uh oh!

sjrd commented Mar 30, 2021

Uh oh!

sjrd commented Mar 31, 2021

Uh oh!

sjrd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sjrd commented Jun 15, 2021

Uh oh!

sjrd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sjrd commented Jun 28, 2021

Uh oh!

gzm0 commented Jun 28, 2021

Uh oh!

sjrd commented Jun 28, 2021

Uh oh!

gzm0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sjrd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sjrd commented Jul 5, 2021

Uh oh!

sjrd commented Jul 6, 2021

Uh oh!

gzm0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sjrd commented Mar 26, 2021 •

edited by gzm0

Loading

sjrd commented Mar 29, 2021 •

edited

Loading