[Routing] Fix matching of utf8 params #42159
Hey! I see that this is your first PR. That is great! Welcome! Symfony has a contribution guide which I suggest you read. In short:
Review the GitHub status checks of your pull request and try to solve the reported issues. If some tests are failing, try to see if they are failing because of this change. When two Symfony core team members approve this change, it will be merged and you will become an official Symfony contributor! I am going to sit back now and wait for the reviews. Cheers! Carsonbot
@@ -122,7 +122,8 @@ private static function compilePattern(Route $route, string $pattern, bool $isHo

// Match all variables enclosed in "{}" and iterate over them. But we only want to match the innermost variable
// in case of nested "{}", e.g. {foo{bar}}. This in ensured because \w does not match "{" or "}" itself.
preg_match_all('#\{(!)?(\w+)\}#', $pattern, $matches, \PREG_OFFSET_CAPTURE | \PREG_SET_ORDER);
$routeParamsPattern = $needsUtf8 ? '#\{(!)?([\p{L}_]+)\}#u' : '#\{(!)?(\w+)\}#';
Numbers?
Fixed, thanks
I am not 100% sure, but \w with the u flag should be enough. Am I wrong?
\w doesn't support unicode characters.
Here is an example:
- \w - https://regex101.com/r/p9O5Qd/1
- [\p{L}_] - https://regex101.com/r/QmU3Ru/1
There is no unicode flag in your snippet
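The point about the missing flag can be checked locally. The following is a minimal sketch, assuming a stock PHP build running under the default C locale; the variable names are illustrative and not taken from the PR. It compares \w without the u modifier, \w with it, and the explicit [\p{L}\d_] class.

```php
<?php
// "bär", written with a Unicode escape so the source file encoding does not matter.
$name = "b\u{00E4}r";

// Without the u modifier the subject is read byte by byte and \w uses ASCII tables.
$withoutU = preg_match('#^\w+$#', $name);

// With the u modifier the subject is read as UTF-8 and \w also accepts Unicode letters.
$withU = preg_match('#^\w+$#u', $name);

// The explicit class from the diff, also compiled with the u modifier.
$explicitU = preg_match('#^[\p{L}\d_]+$#u', $name);

var_dump($withoutU, $withU, $explicitU);
```

If this behaves as sketched, only the first call fails to match, which suggests the regex101 result comes from the missing flag rather than from \w itself.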
} catch (ResourceNotFoundException $e) {
}

$this->assertEquals(['_route' => 'foo', 'bär' => 'baz'], $matcher->match('/foo/baz'));
assertSame whenever possible
I'm not sure that it's possible to use assertSame here. $matcher->match() doesn't guarantee the order of elements in the array, and assertEquals ignores that order.
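The ordering concern can be illustrated with plain PHP array comparison, which is a rough stand-in here: == ignores key order much like assertEquals does for arrays, while === (the comparison behind assertSame) does not. This is a sketch with made-up arrays, not output captured from the matcher.

```php
<?php
// Same key/value pairs, different insertion order.
$expected = ['_route' => 'foo', "b\u{00E4}r" => 'baz'];
$actual   = ["b\u{00E4}r" => 'baz', '_route' => 'foo'];

$looselyEqual    = ($expected == $actual);   // key order is ignored
$strictlySameRaw = ($expected === $actual);  // key order matters

// One way to keep a strict comparison: normalize the key order on both sides first.
ksort($expected);
ksort($actual);
$strictlySameSorted = ($expected === $actual);

var_dump($looselyEqual, $strictlySameRaw, $strictlySameSorted);
```

Sorting both arrays by key before the strict check is one option if the reviewers insist on assertSame; whether that is worth the extra lines in the test is a judgment call.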
@@ -122,7 +122,8 @@ private static function compilePattern(Route $route, string $pattern, bool $isHo

// Match all variables enclosed in "{}" and iterate over them. But we only want to match the innermost variable
// in case of nested "{}", e.g. {foo{bar}}. This in ensured because \w does not match "{" or "}" itself.
preg_match_all('#\{(!)?(\w+)\}#', $pattern, $matches, \PREG_OFFSET_CAPTURE | \PREG_SET_ORDER);
$routeParamsPattern = $needsUtf8 ? '#\{(!)?([\p{L}\d_]+)\}#u' : '#\{(!)?(\w+)\}#';
there is no need to check for unicode support: \pL always works
this means that the regexp can unconditionally be: '#\{(!)?([\w\pL]++)\}#'
note that there are other occurrences of \w in this very file
Yes, I agree. Anyway, I found that I also need to make some fixes to support UTF-8 characters in the regex of compiled routes (for PHP 7.2).
@nicolas-grekas Maybe it's not so bad to split the logic.
Matching characters by Unicode property is not fast, because PCRE has to do a multistage table lookup in order to find a character's property. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE by default
In non-unicode mode, PCRE doesn't use unicode tables.
@nicolas-grekas in non-unicode mode, \pL does not fully handle unicode characters either. https://www.phpliveregex.com/p/BbM
When the u modifier is set (aka when $needsUtf8 is true, aka when the utf8 option is set), PCRE will use Unicode tables. It will use ASCII tables otherwise.
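To make the two modes concrete, here is a small sketch that runs the unconditional pattern suggested earlier in this thread against a UTF-8 placeholder, once as-is and once with the u modifier appended. My understanding, stated as an assumption, is that without u PCRE sees the two bytes of "ä" (0xC3 0xA4) as two separate one-byte characters, and the second one is not a letter, so the placeholder is not recognized.

```php
<?php
// "{bär}", built with a Unicode escape so the source file encoding does not matter.
$placeholder = '{' . "b\u{00E4}r" . '}';

// The unconditional pattern proposed in the review.
$pattern = '#\{(!)?([\w\pL]++)\}#';

// Byte mode: no u modifier, the subject is not decoded as UTF-8.
$byteMode = preg_match($pattern, $placeholder, $byteMatch);

// UTF-8 mode: same pattern with the u modifier appended.
$utf8Mode = preg_match($pattern . 'u', $placeholder, $utf8Match);

var_dump($byteMode, $utf8Mode, $utf8Match[2] ?? null);
```

So \pL compiles in both modes, but it only behaves as a "Unicode letter" test on multi-byte characters when the subject is decoded as UTF-8.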
So we still need to conditionally set the u modifier, right?
That's already done:
$regexp .= 'u';
But it doesn't work. We have to use the u modifier in this regex as well; otherwise the route params won't be parsed and the route will be treated as static instead of dynamic.
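The static-versus-dynamic symptom can be sketched as follows. routeVariables() is a hypothetical helper written for this illustration, not Symfony's actual compilePattern() logic; it only mirrors the preg_match_all call from the diff and appends u when a $needsUtf8 flag is set.

```php
<?php
/**
 * Hypothetical helper: returns the placeholder names found in a route path.
 * An empty result stands for "no variables found", i.e. a route that would
 * be treated as static.
 */
function routeVariables(string $path, bool $needsUtf8): array
{
    // Same placeholder regex as in the diff, with u added only when requested.
    $regex = '#\{(!)?(\w+)\}#' . ($needsUtf8 ? 'u' : '');

    preg_match_all($regex, $path, $matches, PREG_SET_ORDER);

    // With PREG_SET_ORDER each match is an array; index 2 holds the variable name.
    return array_column($matches, 2);
}

$path = '/foo/{' . "b\u{00E4}r" . '}';

$withoutFlag = routeVariables($path, false);
$withFlag    = routeVariables($path, true);
$asciiOnly   = routeVariables('/foo/{bar}', false);

var_dump($withoutFlag, $withFlag, $asciiOnly);
```

Under these assumptions the UTF-8 placeholder is only picked up when the modifier is applied to the placeholder regex itself, which matches the behavior described in this comment.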
Oh, that's the point. But it doesn't work in my test cases, and I am still sure that we need to add u on the currently discussed string.
Anyway, I will wait for the full PR then. Thanks for the time!
See grep '\\w' src/Symfony/Component/Routing/ -r
Change Regexp for routes with UTF-8 params.