Skip to content

Fix #1201: Correct regex support. #4455

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 8, 2021
Merged

Conversation

sjrd
Copy link
Member

@sjrd sjrd commented Mar 26, 2021

This PR correctly implements java.util.regex by compiling Java regexes down to JavaScript RegExp's that do the same thing.

Until now, we basically passed regexes through to JavaScript's RegExp, not caring at all whether the behavior was going to be as specified or not. This could result in:

  • SyntaxErrors, for features not supported by JS
  • or worse, regexes that successfully compile, but behave differently than their spec

In this commit, we compile everything we can compile, and throw PatternSyntaxExceptions for things that we cannot faithfully compile.

Some features require the underlying JS RegExp to support ES 2018:

  • the MULTILINE and UNICODE_CHARACTER_CLASS flags
  • look-behind assertions
  • \p{} classes for Unicode properties

In addition, UNICODE_CASE requires support for the u flag, which is ES 2015. CANON_EQ is never supported.

Everything else is supported in all cases. Even correct handling of surrogate pairs is supported in ES 5.1, which does not support the u flag.


My main open question is: what do we do about the features that require ES 2018 support? As currently implemented, they will throw PatternSyntaxExceptions at run-time if the JS engine does not support the required features. This could result in surprises if the developers test using a recent Node.js version, but then deploy in environments that are not guaranteed to support ES 2018. This can even happen in unexpected situations, for example the MULTILINE flag requires look-behind assertions, even though JS' RegExp has a support for an m flag even in ES 5.1. The problem is that JS' notion of m is not equivalent to Java's notion of MULTILINE, so we can't delegate to it.

Of course, run-time errors depending on newer ES features can already happen, but never through usage of JDK APIs. It can happen through direct usages of JavaScript APIs, even some declared in our standard library, but they are marked as such in the Scaladoc.

I don't think we have precedent for JDK features that are conditional on ES features. How should we address that? Possible ideas are:

  • Do nothing, and accept that run-time crashes may happen; developers should test under the environments that they care about.
  • Provide an opt-in link-time config to enable the ES 2018-dependent features; if it is not set, we would unconditionally throw PatternSyntaxExceptions for features that would need the ES 2018 features.
  • Not provide those features at all (I don't think this is a good idea)

@gzm0 WDYT?


To-do

  • Flesh out the open question above
  • More tests (what I have so far is quite decent, but there really are a lot of possible combinations that need testing)
  • Clean up, especially the compilation of [...] character classes
  • Better document the implementation

@sjrd sjrd mentioned this pull request Mar 26, 2021
@sjrd sjrd force-pushed the almost-correct-regex branch from a382b3c to 5f24a09 Compare March 26, 2021 18:43
@nightscape
Copy link
Contributor

Might be a stupid idea, but could one maybe transform the string to match in such a way that it can be more simply matched by a JS regex?

@japgolly
Copy link
Contributor

If this has a non-negligible impact on resulting JS sizes, can we please also have a setting to opt-out of this new behaviour?

@sjrd
Copy link
Member Author

sjrd commented Mar 26, 2021

Might be a stupid idea, but could one maybe transform the string to match in such a way that it can be more simply matched by a JS regex?

I had thought about that to implement support for CANON_EQ, but it doesn't work, because it also changes the captured substrings, and that is not correct behavior. So no, we cannot do that. Anyway, the only other feature for which it might help would be ASCII case insensitive, but that's not really hard to implement by transforming the pattern instead, so we don't gain anything.

If this has a non-negligible impact on resulting JS sizes, can we please also have a setting to opt-out of this new behaviour?

I haven't looked at the code size impact yet. Given the size of the implementation in Scala (1.5 kLoC), I guess it will be 3-4 kB of fastOpt code. But no, we won't be able to have a switch to disable the behavior. This is no different from any other method of the JDK. If you don't want its code size, don't use it and use js.RegExp instead. Just like if you don't want the Unicode database embedded in your program, don't use c.isDigit but c >= '0' && c <= '9'.

@sjrd sjrd force-pushed the almost-correct-regex branch 2 times, most recently from 3b86df5 to bfabfa9 Compare March 27, 2021 17:41
@sjrd sjrd linked an issue Mar 28, 2021 that may be closed by this pull request
@sjrd sjrd force-pushed the almost-correct-regex branch from bfabfa9 to 8dd831f Compare March 28, 2021 11:16
@sjrd
Copy link
Member Author

sjrd commented Mar 28, 2021

With the current state of this PR, where I have made some effort not to produce unreasonable code by accident, the size increases are of

  • about 100 kB for fastOpt
  • about 17 kB for fullOpt

Looks like I was quite off the mark for my wild estimation above. That said, I don't think it fundamentally changes the problem. As I've been working on this, I noticed how broken blindly using js.RegExp was; virtually nothing behaves the same except the most simple combinators. Even simple things like \s or . are incorrect.

@gzm0
Copy link
Contributor

gzm0 commented Mar 29, 2021

I don't think we have precedent for JDK features that are conditional on ES features. How should we address that? Possible ideas are:

I think we should chose safety before generating stuff that doesn't work. Basically:

  1. Introduce ESFeatures.useECMAScript20{16-21} and LinkingInfo.assumingES20{16-21} (maybe parametrize?)
  2. Use these to link-time DCE the JDK autodetect.
  3. Throw when we cannot translate a construct, have a nice error message if an ES feature upgrade could allow the translation.

(our usual approach here is auto-detect and polyfill; that is OK IMO as long as we actually polyfill, which here is probably not up for discussion).

So if somebody ends up using a feature, they simply need to acknowledge in their build once that they are OK with that ES version.

Looks like I was quite off the mark for my wild estimation above. That said, I don't think it fundamentally changes the problem.

While I agree with this reasoning, I think there is another point we need to consider: For all practical intents and purposes, bumping the codesize of a thing by that much is a backwards incompatible change (this is also where the argument is imprecise IMO: for most javalib stuff, we just add new things, so use it or leave it, existing things remain unchanged or get bugfixes that don't affect actual usage for the worse).

With such a big bump, maintainers of codebases that heavily relied on regexes (and have worked "just fine" for ~7 years), would now be faced with the choice of either:

  • Stalling Scala.js upgrades (nobody wants that, not us, not our users).
  • Investing a considerable amount of effort to migrate off Java regexes (possible, but might involve a lot of new wrapper classes etc. if code is cross compiling).

I do not think we should leave our users in this situation.

I do not have a solution for this right now (that doesn't involve dependency hellscape for at least one side of the equation), so I'll keep thinking.

@sjrd sjrd force-pushed the almost-correct-regex branch from 64b0614 to 26ac524 Compare March 29, 2021 17:59
@sjrd
Copy link
Member Author

sjrd commented Mar 29, 2021

With such a big bump, maintainers of codebases that heavily relied on regexes (and have worked "just fine" for ~7 years), would now be faced with the choice of either:

Supposedly, a codebase such that adjusting to js.RegExp would be a significant cost is also a large enough codebase that 17 kB won't make a real difference, would it?

I'm more worried about the actual correctness issues, in terms of breaking changes. It's not impossible that there are codebases that actually rely on Pattern behaving like JavaScript, and not as documented in the JavaDoc. Unlike our other fixes to JDK stuff, this one has been documented to behave as it does for 7 years. 😕

@gzm0
Copy link
Contributor

gzm0 commented Mar 29, 2021

Supposedly, a codebase such that adjusting to js.RegExp would be a significant cost is also a large enough codebase that 17 kB won't make a real difference, would it?

Hmmm... That's a good point. Although I'm not sure that's the case. So for example, a case I was concerned about is usage of Pattern in some larger data structures / across library interfaces / underpinning of a library. For example, the scalalib itself is actually an example here:

// from the doc of scala.util.matching.Regex
val date = raw"(\d{4})-(\d{2})-(\d{2})".r

"2004-01-20" match {
  case date(year, month, day) => s"$year was a good year for PLs."
}

Will have to be transformed into:

val date = new JSExtractor(new js.RegExp("(\d{4})-(\d{2})-(\d{2})"))

"2004-01-20" match {
  case date(year, month, day) => s"$year was a good year for PLs."
}

class JSExtractor(...) {
  def unapply(...) = ...
}

But I have no clue how to quantify cost of transformation with library size :-/

Unlike our other fixes to JDK stuff, this one has been documented to behave as it does for 7 years

That is true. OTOH though, as the referenced bugs on this PR prove, it has been considered a bug by many and so far, no fix towards the Java behavior has been filed as a bug / regression.

@sjrd sjrd force-pushed the almost-correct-regex branch from 26ac524 to fb42a4b Compare March 30, 2021 14:16
@sjrd
Copy link
Member Author

sjrd commented Mar 30, 2021

I think we should chose safety before generating stuff that doesn't work. Basically:

  1. Introduce ESFeatures.useECMAScript20{16-21} and LinkingInfo.assumingES20{16-21} (maybe parametrize?)
  2. Use these to link-time DCE the JDK autodetect.
  3. Throw when we cannot translate a construct, have a nice error message if an ES feature upgrade could allow the translation.

(our usual approach here is auto-detect and polyfill; that is OK IMO as long as we actually polyfill, which here is probably not up for discussion).

So if somebody ends up using a feature, they simply need to acknowledge in their build once that they are OK with that ES version.

I pushed new commits that implement that strategy.

As a bonus, this brings the fullOpt increase down from 17 KB to 15 kB in the default configuration (producing ES 2015 code). That's because the code dealing with \p{} character classes, including the predefinedCharacterClass map, can be dead-code-eliminated. (which is actually quite remarkable)

@sjrd sjrd force-pushed the almost-correct-regex branch from fb42a4b to 19fa403 Compare March 30, 2021 17:17
@sjrd
Copy link
Member Author

sjrd commented Mar 31, 2021

Anything that has new configuration will require a minor version bump. Since we don't have anything else in flight that can go in a patch release, I suggest we release 1.5.1 now.

@gzm0 If you agree, I prepared a version commit here: sjrd@3746189

@sjrd sjrd force-pushed the almost-correct-regex branch 2 times, most recently from d42bab9 to 2a9f870 Compare March 31, 2021 09:09
Copy link
Member Author

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I think I have addressed all the comments so far, except in the readme. I need to do a second pass for the readme.

@sjrd
Copy link
Member Author

sjrd commented Jun 15, 2021

@sjrd if you actually changed something in these, please LMK.

I'm pretty sure that, besides rebases, the only thing that has changed over the last month or so is the following diff:
https://github.com/scala-js/scala-js/compare/24ce7cbb67416f1a04f0d39622002b8b9ea23272..b1aa90d795231f4caa446666947884c4b90cd189

Copy link
Member Author

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Now I think I have addressed all the comments in the readme as well.

@sjrd sjrd requested a review from gzm0 June 15, 2021 16:17
@sjrd sjrd force-pushed the almost-correct-regex branch from b701816 to 86cd7e9 Compare June 16, 2021 08:14
@sjrd sjrd force-pushed the almost-correct-regex branch 2 times, most recently from f2604f5 to 2d6b118 Compare June 28, 2021 13:52
@sjrd
Copy link
Member Author

sjrd commented Jun 28, 2021

@gzm0 I see that you've marked a lot of comments as Resolved, but haven't submitted a new review. Is it just that you accidentally left a Pending Review without submitting it, or you truly haven't finished it yet?

@gzm0
Copy link
Contributor

gzm0 commented Jun 28, 2021

I'm not done yet. Unfortunately resolved state is not part of the review batch (but I absolutely need it, otherwise I have no clue what I still need to look at :P).

@sjrd
Copy link
Member Author

sjrd commented Jun 28, 2021

Ah ah, yes, that makes sense :)

Copy link
Contributor

@gzm0 gzm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second round except for the tests. I'll start on them now :)

Copy link
Member Author

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@sjrd
Copy link
Member Author

sjrd commented Jul 5, 2021

That LibrarySizeTest seems to be very sensitive to the environment. The results on Windows are significantly different than on Linux. I'm not sure why yet. :-s

@sjrd sjrd force-pushed the almost-correct-regex branch 2 times, most recently from 03f7b46 to 0221e33 Compare July 6, 2021 08:32
@sjrd
Copy link
Member Author

sjrd commented Jul 6, 2021

It turns out the issues were related to the version of Scala being used.
Note the first two commits in the PR, which are new (and if you want I can submit them as a separate, preliminary PR).

@sjrd sjrd requested a review from gzm0 July 6, 2021 08:46
Copy link
Contributor

@gzm0 gzm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finish line :) Only minor stuff on the tests. Everything else LGTM.

Copy link
Member Author

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Adding more tests for comments, following your review, made me realize that comments are also supposed to work in character classes. So there is one more comment that implements that.

@sjrd sjrd requested a review from gzm0 July 7, 2021 14:00
sjrd added 4 commits July 7, 2021 23:55
Instead, use the default one from the sbt build.

This version number has been out of date several times in the past,
because we always forget to update it. It is easier not to have to
update it at all.
This avoids repeating them in several places the build, potentially
creating mismatches.
For the upcoming changes to `java.util.regex.*`.
Previously, `java.util.regex.Pattern` was implemented without much
concern for correctness wrt. the Java semantics of regular
expressions. Patterns were passed through to the native `RegExp`
with minimal preprocessing.

This could cause several kinds of incompatibilities:

- Throwing `ParseError`s for features not supported by JS regexes,
- Or worse, silently compile with different semantics.

In this commit, we correctly implement virtually all the features
of Java regular expressions by compiling Java patterns down to JS
patterns with the same semantics.

This change introduces a significant code size regression for code
bases that were already using `Pattern` and/or Scala's `Regex`.
Therefore, we went to great lengths to optimize the compiler for
code size, in particular in the default ES 2015 configuration. This
means that some code is not as idiomatic as it could be. The impact
of this commit on a typical output is approximately 65 KB for
fastOpt and 12 KB for fullOpt.

The `README.md` file contains an extensive explanation of the
design of the compiler, and of how it compiles each kind of Java
pattern.

In addition to fixing the mega-issue scala-js#1201, this commit fixes the
smaller issues scala-js#105, scala-js#1677, scala-js#1847, scala-js#2082 and scala-js#3959, which had been
closed as duplicates of scala-js#1201.
@sjrd sjrd force-pushed the almost-correct-regex branch from 0a436a6 to ed71c83 Compare July 7, 2021 21:58
@sjrd
Copy link
Member Author

sjrd commented Jul 7, 2021

Thanks! 🎉
Rebased and squashed.

@sjrd sjrd merged commit 3c58b2c into scala-js:master Jul 8, 2021
@sjrd sjrd deleted the almost-correct-regex branch July 8, 2021 03:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants