Skip to content

[WIP][Routing] Added unicode support #3629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

[WIP][Routing] Added unicode support #3629

wants to merge 3 commits into from

Conversation

dlsniper
Copy link
Contributor

Bug fix: yes
Feature addition: not sure
Backwards compatibility break: no
Symfony2 tests pass: Build Status
Fixes the following tickets: #4166
Todo: -

This is PR started when the Routing component didn't had support for unicode routes but the latest changes to it seems to have changed its state.
I've basically just added some missing tests for this feature and noted the change in the changelog as it seems pretty nice to know about.

@stof
Copy link
Member

stof commented Mar 17, 2012

Can you add some test using unicode to avoid regressions ?

@fabpot
Copy link
Member

fabpot commented Mar 18, 2012

what if UTF-8 support in PCRE is not enabled? What if the project does not use UTF-8 as its primary charset?

@ghost
Copy link

ghost commented Mar 18, 2012

@fabpot - It's pretty rare not to see UTF-8 enabled on a system (because it is a dependency of many packages on enterprise linux OSs).

OT: the useage of UTF-8 is growing by leaps and bounds http://w3techs.com/technologies/history_overview/character_encoding

@dlsniper
Copy link
Contributor Author

@fabpot: I could add support for detecting if UTF-8 is enabled or not in PCRE but IMHO this should be a must for any system by now anyway and I can't remember the last time I've encountered such a system without this. I could patch the app/check.php file from symfony-standard if you think that would help out as well.

Tho I realize that not everyone might benefit from this patch, I don't think this will slow down the system in any forseeable way or impact the projects otherwise. I could do some performance tests as well to see where we stand with this.

@Drak thanks for the link, pretty interesting.

@dlsniper
Copy link
Contributor Author

@stof can you tell me where/what other test should I add for this so that I can implement this and make this PR available for merging?

Also, should I add the detection for UTF-8 in PCRE or not? If not I could just add something to the readme so that people know about the requirement.

@dlsniper
Copy link
Contributor Author

Hi,

Do I have a go or no go for this PR?
If I need to change something to it before it can be merged let me know. I plan to add a Changelog entry for it if I'll have a go for it.

Thanks.

@stof
Copy link
Member

stof commented Apr 3, 2012

@dlsniper Please rebase your branch. It conflicts with master because of the move of tests.

@fabpot should the unicode support be detected or assumed ?

@dlsniper
Copy link
Contributor Author

dlsniper commented Apr 4, 2012

@stof thanks for pointing this out. I'll rebase it once I have some feedback for this. I'll also add some changelog entry that should mention the change and open a PR for the docs but first I need to know if I should invest some time in this or not.

@dlsniper
Copy link
Contributor Author

dlsniper commented Apr 7, 2012

Hi,

I've rebased to the latest master, added changelog info and mentioned this in the docs as well.

Best regards.

@dlsniper
Copy link
Contributor Author

Hi,

Anyone any input on this? If everything is ok then can this be merged? If not then can this be closed? Or just let me know what should I do to get this merged if it's missing something.

Thanks.

ping @fabpot @stof @vicb

@vicb
Copy link
Contributor

vicb commented Apr 13, 2012

@dlsniper let's wait for the feedback from @fabpot. If he is ok to merge this, the PR would need to be rebased again, the routing has had a lot of changes during the last week.

@dlsniper
Copy link
Contributor Author

It seems that the latest changes to the routing component made the unicode support on by default so I guess this PR original intent is pointless now.

It can still be merged as it contains just missing tests for this case and, what I believe to be, a nice to have mention for the feature in the changelog.

Let me know if I'm missing something about the support for unicode but as far as I can see both my computer and Travis seem to have the same opinion about the tests.

@dlsniper
Copy link
Contributor Author

ping @fabpot Can you please merge this so in case the unicode support for routes disappears at some point, people will know?

Also, I've run the tests in Routing component separately as the latest changes in master regarding Redis/Memcached break the tests both on my laptop and on Travis.

Thanks!

@vicb
Copy link
Contributor

vicb commented Apr 30, 2012

@dlsniper what do you mean by unicode support ? (i.e. I don't think we can claim unicode support without the u modifier ?).

@dlsniper
Copy link
Contributor Author

@vicb Then we have an interesting case. I can't check with PHP 5.3.2 but in the Ubuntu 12.04 PHP 5.3.10 I can clearly see the tests passing, regardless of input characters. Also I could do this before, using 5.3.6 (Ubuntu 11.10). Also, I've been running the dev environment against the changes in the routing component for about 2 weeks now and I couldn't see any problems with this.

Should I add the u modifier to this PR?

@Tobion
Copy link
Contributor

Tobion commented Apr 30, 2012

I think this unicode thing is somewhat tricky. One could also argue the test you're doing are even wrong. Because UTF-8 chars are not allowed in URLs/URIs. So they need to be percent-encoded to be valid. So generating a URL needs to encode all unallowed chars (e.g. segments with unicode chars).
But this might not be the desired output for some users who want nice looking URLs. But technically it would be correct. Also most browsers will show the nice URL version anyway in the address bar when following an encoded URL. So the argument of nice looking URLs is not really valid.
See http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

@dlsniper
Copy link
Contributor Author

@Tobion If the tests are not right, then please do tell. I'm very interested in having this the proper way as in fact, like I've said above, I'm using this patch in dev env and I don't have any problems with it and in production I'm using the older Routing component but with the same patch, the u modifier.

So far I haven't had any problems on production related to this but if in fact it's just pure luck, then I'll take any advice I can receive,

Also while you might argue that pretty URLs are not something really needed, try explain that to my marketing director, to the SEO guys and so on :)

@vicb meanwhile, I've updated the tests and added the missing modifier.

@Tobion
Copy link
Contributor

Tobion commented Apr 30, 2012

You don't seem to have understood what I was saying. It has nothing to do with pretty URLs in SEO context.
Your PR just shows that we don't have unicode support the proper way.
One of your tests is like this:

$routes = $this->getRoutes('test', new Route('/Жени/{slug1}'));
$this->assertEquals('/app.php/Жени/%D0%96%D0%B5%D0%BD%D0%B8', $this->getGenerator($routes)->generate('test', array('slug1' => 'Жени')));

It shows that only the variables get url encoded but in fact all segments must be url encoded, i.e. static text too. Otherwise the generated URL contains invalid chars and does not conform to the RFC. So it should be

$this->assertEquals('/app.php/%D0%96%D0%B5%D0%BD%D0%B8/%D0%96%D0%B5%D0%BD%D0%B8', $this->getGenerator($routes)->generate('test', array('slug1' => 'Жени')));

@mvrhov
Copy link

mvrhov commented Apr 30, 2012

Hm there was an issue reported by some Russian guy which had the problem of matching the URI with Cyrillic characters.
I cant seem to find it but it might be related.

@dlsniper
Copy link
Contributor Author

@mvrhov Do you talk about #2375 ?
@Tobion I'll look into it, I've worked on this patch on a couple of different times but I have some time right now. The test you've show seems to be missing the optional parameters indeed so there's more work for this. Thanks for pointing it out!

@Tobion
Copy link
Contributor

Tobion commented Apr 30, 2012

The Matcher and Dumper etc should probably work on the encoded value instead of the decoded one. Then the preg_match would also not need the u modifier as there won't be any non-ascii chars in the route.

@Tobion
Copy link
Contributor

Tobion commented Apr 30, 2012

Too bad the concrete test case for #2375 is missing because I don't see how a simple requirement change would solve his problem with Cryillic chars. His problem and the solution does not seem to be related to each other.

@dlsniper
Copy link
Contributor Author

It seems that we are using rawurlencode which does the enconding part: https://github.com/symfony/symfony/blob/master/src/Symfony/Component/Routing/Generator/UrlGenerator.php#L121 I didn't had time to take a look at the matcher but my guess is that it's using rawurldecode :) This is limited unfortunately for default parameters so I'll have a look on how this can be fixed.

Also I'll add some more tests, using Greek, Arabic, Armenian, Chinese, Vietnamese and HIndi. I believe that should do it, no?

@Tobion
Copy link
Contributor

Tobion commented Apr 30, 2012

Hm, it get's even more tricky. What if the delimiter between two variables is a unicode character? This will probably break everything. Both the current implementation but also my idea with using url encoded value in all comparisions. That's because the default regex exclusion in url encoded form won't work, e.g.: [^%D0]

@Tobion
Copy link
Contributor

Tobion commented Apr 30, 2012

So I would reason, that we don't have full unicode support currently. To support all cases requires much research and testing.
For example the PhpMatcherDumper uses 0 === strpos() as an optimization. This specific code should work with unicode because we are only testing the first byte so it doesn't matter that strpos is not multibyte save.

@dlsniper
Copy link
Contributor Author

@Tobion yes, I've seen the non-unicode safe functions being used in the Routing component.

I've renamed this to WIP as there's indeed much work to be done and it seems that I'm just a lucky case in using this on live as I do have a small use case scenario :)

I'm working on adding some tests to this, like mentioned above and see from there how we can work on this.

Thanks for your help so far :)

@mvrhov
Copy link

mvrhov commented May 1, 2012

I was talking about: #2899

@dlsniper
Copy link
Contributor Author

dlsniper commented May 1, 2012

@mvrhov I believe that the current routing system should take care of the issue now but if it doesn't, let me know.
@Tobion I've fixed the issue you've reported in #4166 and I've tested it against a quick dummy project. I'm not sure if the matcher should be changed as as far as I remember doing a rawurldecode was a good thing.

@vicb
Copy link
Contributor

vicb commented May 1, 2012

The compiler is not unicode aware

@dlsniper
Copy link
Contributor Author

dlsniper commented May 1, 2012

@vicb guess I've missed that because the tests are passing and for some reason the expression starts with . right after the starting delimiter. If I were to add the u modifier I'm not sure this will be a valid fix as I'd rather remove the . but since you've did the Routing overhaul earlier last month maybe you can offer some insight about it?

Thanks

@vicb
Copy link
Contributor

vicb commented May 1, 2012

I think adding a 'u' would not be enough: you would need to take into account the case where the separator could be a unicode character:

  • . should probably be changed for \X,
  • accessing $m[0][0][0] is not valid any more, you should put the \X inside a matching subpattern,
  • you should verify all the code handling the separator (to check if it supports unicode),
  • you should also check if unicode characters can be used inside classes, otherwise you have to encode them.

@fabpot
Copy link
Member

fabpot commented May 1, 2012

I've not read the entire discussion but allowing unicode characters for the separator is probably not needed (I don't see any valid use case for that -- we are already quite generous in the range od characters you can use for separators). Let's keep it simple.

@Tobion
Copy link
Contributor

Tobion commented May 1, 2012

@fabpot so if you have a unicode char between two variables, e.g. {var1}€{var2} the default regex for var1 should simply be .+ instead of [^€]+ (which it normally would be if it wasn't a unicode char)? That would make it somewhat inconsistent and we would need to distinguish between unicode and ascii.

@@ -124,7 +124,7 @@ protected function doGenerate($variables, $defaults, $requirements, $tokens, $pa
$optional = false;
}
} elseif ('text' === $token[0]) {
$url = $token[1].$url;
$url = str_replace(array_keys($this->decodedChars), $this->decodedChars, rawurlencode($token[1])).$url;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strtr ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will patch in future commit.

@vicb
Copy link
Contributor

vicb commented May 1, 2012

@Tobion If unicode is not allowed for separators {var1}€{var2} would not be a valid route.

Some questions:

  • How can we chek if unicode is enabled in PCRE ?
  • Should we make unicode support optional ?

@Tobion
Copy link
Contributor

Tobion commented May 1, 2012

Huh? You mean {var1}€{var2} wouldn't be a valid route and raises an exception? That would be very bad.
But {var1}/€/{var2} would be valid because the unicode char is not a delimiter? I think that's very incomprehensible for most people.

@dlsniper
Copy link
Contributor Author

dlsniper commented May 1, 2012

@vicb For PCRE detection I think we can use something like here: http://drupal.org/node/587362#comment-3411540

As for making it optional, I'm not sure, it could be done if we need to make it optional.

@Tobion No, currently you won't receive an error if you try something like {var1}€{var2} but I believe it's just a coincidence that's working.

@vicb
Copy link
Contributor

vicb commented May 1, 2012

@Tobion this is my understanding of @fabpot proposal which would require minimal changes to the current code. (@dlsniper this discussion is not about the current code but the unicode aware code).

@Tobion
Copy link
Contributor

Tobion commented May 5, 2012

@dlsniper I fixed #4166 in #4205. First the fix should be done on 2.0 and secondly, we shouldn't mix unicode support with the fix as they are different things.

@fabpot fabpot closed this Jul 3, 2012
fabpot added a commit that referenced this pull request Aug 25, 2016
…s (nicolas-grekas)

This PR was merged into the 3.2-dev branch.

Discussion
----------

[Routing] Add seamless support for unicode requirements

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | yes
| Tests pass?   | yes
| Fixed tickets | #3629, #5236, #19562
| License       | MIT
| Doc PR        | symfony/symfony-docs#6890

This PR adds unicode support to route matching and generation by automatically adding the `u` modifier to regexps that use either unicode characters or unicode enabled character classes (e.g. `\p...` `\x{...}` `\X`).

As a side note, if one wants to match a single unicode character (vs a single byte), one should use `\PM` or `\X` instead of `.` *or* set the `unicode` parameter to true.

Commits
-------

a829d34 [Routing] Add seamless support for unicode requirements
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants