[Routing] Fix matching of utf8 params #42159

Closed
wants to merge 2 commits into from
@mazanax mazanax commented Jul 16, 2021

| Q | A |
| --- | --- |
| Branch? | 5.3 |
| Bug fix? | yes |
| New feature? | no |
| Deprecations? | no |
| Tickets | Fix #41909 |
| License | MIT |
| Doc PR | |

Change the regexp that matches route parameters so that routes with UTF-8 parameter names are compiled correctly.

@carsonbot

Hey!

I see that this is your first PR. That is great! Welcome!

Symfony has a contribution guide which I suggest you read.

In short:

  • Always add tests
  • Keep backward compatibility (see https://symfony.com/bc).
  • Bug fixes must be submitted against the lowest maintained branch where they apply (see https://symfony.com/releases)
  • Features and deprecations must be submitted against the 5.4 branch.

Review the GitHub status checks of your pull request and try to solve the reported issues. If some tests are failing, try to see if they are failing because of this change.

When two Symfony core team members approve this change, it will be merged and you will become an official Symfony contributor!
If this PR is merged in a lower version branch, it will be merged up to all maintained branches within a few days.

I am going to sit back now and wait for the reviews.

Cheers!

Carsonbot

@carsonbot carsonbot changed the title [Router] Fix matching of utf8 params [Routing] Fix matching of utf8 params Jul 16, 2021
@@ -122,7 +122,8 @@ private static function compilePattern(Route $route, string $pattern, bool $isHo

 // Match all variables enclosed in "{}" and iterate over them. But we only want to match the innermost variable
 // in case of nested "{}", e.g. {foo{bar}}. This in ensured because \w does not match "{" or "}" itself.
-preg_match_all('#\{(!)?(\w+)\}#', $pattern, $matches, \PREG_OFFSET_CAPTURE | \PREG_SET_ORDER);
+$routeParamsPattern = $needsUtf8 ? '#\{(!)?([\p{L}_]+)\}#u' : '#\{(!)?(\w+)\}#';
Contributor

Numbers?

Author

Fixed, thanks

Contributor

@Foxprodev Foxprodev Jul 16, 2021

I am not 100% sure, but \w with the u flag should be enough. Am I wrong?

Author

\w doesn't support Unicode characters.
Here is an example:

Contributor

There is no unicode flag in your snippet
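
To make the point about the missing flag concrete, here is a minimal sketch (not the snippet from the thread, which was not preserved) comparing \w with and without the u modifier. It assumes PHP 7 or later and the default "C" locale; the variable names are illustrative only.

```php
<?php
// "bär", with "ä" written as an explicit escape so the file encoding does not matter.
// In UTF-8 the "ä" is stored as two bytes (0xC3 0xA4).
$name = "b\u{00E4}r";

// Without the u modifier PCRE works byte by byte. Assuming the default "C" locale,
// the bytes of "ä" are not word characters, so the match fails.
$withoutU = preg_match('/^\w+$/', $name);

// With the u modifier PHP treats pattern and subject as UTF-8 and enables
// Unicode character properties, so \w also matches non-ASCII letters.
$withU = preg_match('/^\w+$/u', $name);

var_dump($withoutU); // int(0)
var_dump($withU);    // int(1)
```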

} catch (ResourceNotFoundException $e) {
}

$this->assertEquals(['_route' => 'foo', 'bär' => 'baz'], $matcher->match('/foo/baz'));
Contributor

assertSame whenever possible

Author

@mazanax mazanax Jul 19, 2021

I'm not sure it's possible to use assertSame here: $matcher->match() doesn't guarantee the order of elements in the returned array, and assertEquals ignores that order.
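
The difference between the two assertions comes down to PHP's == and === on arrays: assertSame compares with ===, which also requires the same key order, while a loose comparison does not. A small sketch in plain PHP, with array contents taken from the test above and illustrative variable names:

```php
<?php
$expected = ['_route' => 'foo', "b\u{00E4}r" => 'baz'];
// Same key/value pairs, but in a different order.
$actual = ["b\u{00E4}r" => 'baz', '_route' => 'foo'];

$looselyEqual = ($expected == $actual);  // true: same pairs, order ignored
$identical = ($expected === $actual);    // false: order differs

// One possible way to still use a strict comparison: sort both arrays by key first.
ksort($expected);
ksort($actual);
$identicalAfterSort = ($expected === $actual); // true

var_dump($looselyEqual, $identical, $identicalAfterSort);
```

Sorting both sides by key before comparing would be one way to keep assertSame if the order really is unspecified.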

@@ -122,7 +122,8 @@ private static function compilePattern(Route $route, string $pattern, bool $isHo

 // Match all variables enclosed in "{}" and iterate over them. But we only want to match the innermost variable
 // in case of nested "{}", e.g. {foo{bar}}. This in ensured because \w does not match "{" or "}" itself.
-preg_match_all('#\{(!)?(\w+)\}#', $pattern, $matches, \PREG_OFFSET_CAPTURE | \PREG_SET_ORDER);
+$routeParamsPattern = $needsUtf8 ? '#\{(!)?([\p{L}\d_]+)\}#u' : '#\{(!)?(\w+)\}#';
Member

@nicolas-grekas nicolas-grekas Jul 19, 2021

There is no need to check for Unicode support: \pL always works.
This means that the regexp can unconditionally be: '#\{(!)?([\w\pL]++)\}#'

Note that there are other occurrences of \w in this very file.

Author

@mazanax mazanax Jul 19, 2021

Yes, I agree. I also found that I need to make some fixes to support UTF-8 characters in the regexes of compiled routes (for PHP 7.2).

Contributor

@nicolas-grekas Maybe it's not so bad to split the logic.

Matching characters by Unicode property is not fast, because PCRE has to do a multistage table lookup in order to find a character's property. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE by default

Member

In non-unicode mode, PCRE doesn't use unicode tables.

Contributor

@Foxprodev Foxprodev Jul 19, 2021

@nicolas-grekas In non-unicode mode, \pL does not fully handle unicode characters either. https://www.phpliveregex.com/p/BbM

Member

When the u modifier is set (aka when $needsUtf8 is true, aka when the utf8 option is set), PCRE will use Unicode tables. It will use ASCII tables otherwise.

Contributor

So we still need to conditionally set the u modifier, right?

Member

That's already done:

Author

@mazanax mazanax Jul 19, 2021

But it doesn't work. We have to use the u modifier in this regex as well, because otherwise the route params won't be parsed and the route will be treated as static instead of dynamic.
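
The effect described here can be reproduced outside Symfony with a small sketch. The helper extractPlaceholders() below is hypothetical (it is not part of the Routing component); it applies the suggested '#\{(!)?([\w\pL]++)\}#' pattern and appends the u modifier only when the caller asks for UTF-8 handling. It assumes PHP 7 or later with the bundled PCRE library.

```php
<?php
/**
 * Hypothetical helper, not Symfony's API: returns the placeholder names
 * found in a route path such as "/foo/{bar}".
 */
function extractPlaceholders(string $path, bool $needsUtf8): array
{
    // Same character class in both modes; only the u modifier is conditional.
    $regex = '#\{(!)?([\w\pL]++)\}#' . ($needsUtf8 ? 'u' : '');

    preg_match_all($regex, $path, $matches, PREG_SET_ORDER);

    // Capture group 2 holds the variable name.
    return array_map(function (array $match): string {
        return $match[2];
    }, $matches);
}

// "{bär}" contains a non-ASCII letter, "{id}" does not.
$path = "/foo/{b\u{00E4}r}/{id}";

$withU = extractPlaceholders($path, true);     // both placeholders are found
$withoutU = extractPlaceholders($path, false); // only "id" is found

var_dump($withU, $withoutU);
```

Without the u modifier the non-ASCII placeholder is not detected at all, which matches the observation that such a route would then look static to the compiler.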

Contributor

@Foxprodev Foxprodev Jul 19, 2021

Oh, that's the point. But it doesn't work in my test cases, and I am still sure that we need to add u to the regex currently under discussion.
Anyway, I will wait for the full PR then. Thanks for your time!

Member

@nicolas-grekas nicolas-grekas left a comment

See grep '\\w' src/Symfony/Component/Routing/ -r

@nicolas-grekas
Member

Closing in favor of #45054
Could you please have a look @mazanax?
Thanks for pushing this forward!
