[WIP][Routing] Added unicode support #3629

dlsniper · 2012-03-17T20:10:39Z

Bug fix: yes
Feature addition: not sure
Backwards compatibility break: no
Symfony2 tests pass:
Fixes the following tickets: #4166
Todo: -

This is PR started when the Routing component didn't had support for unicode routes but the latest changes to it seems to have changed its state.
I've basically just added some missing tests for this feature and noted the change in the changelog as it seems pretty nice to know about.

stof · 2012-03-17T20:33:44Z

Can you add some test using unicode to avoid regressions ?

fabpot · 2012-03-18T02:39:07Z

what if UTF-8 support in PCRE is not enabled? What if the project does not use UTF-8 as its primary charset?

ghost · 2012-03-18T04:00:12Z

@fabpot - It's pretty rare not to see UTF-8 enabled on a system (because it is a dependency of many packages on enterprise linux OSs).

OT: the useage of UTF-8 is growing by leaps and bounds http://w3techs.com/technologies/history_overview/character_encoding

dlsniper · 2012-03-18T05:36:09Z

@fabpot: I could add support for detecting if UTF-8 is enabled or not in PCRE but IMHO this should be a must for any system by now anyway and I can't remember the last time I've encountered such a system without this. I could patch the app/check.php file from symfony-standard if you think that would help out as well.

Tho I realize that not everyone might benefit from this patch, I don't think this will slow down the system in any forseeable way or impact the projects otherwise. I could do some performance tests as well to see where we stand with this.

@Drak thanks for the link, pretty interesting.

dlsniper · 2012-03-18T16:36:03Z

@stof can you tell me where/what other test should I add for this so that I can implement this and make this PR available for merging?

Also, should I add the detection for UTF-8 in PCRE or not? If not I could just add something to the readme so that people know about the requirement.

dlsniper · 2012-03-25T11:53:28Z

Hi,

Do I have a go or no go for this PR?
If I need to change something to it before it can be merged let me know. I plan to add a Changelog entry for it if I'll have a go for it.

Thanks.

stof · 2012-04-03T22:45:51Z

@dlsniper Please rebase your branch. It conflicts with master because of the move of tests.

@fabpot should the unicode support be detected or assumed ?

dlsniper · 2012-04-04T17:22:43Z

@stof thanks for pointing this out. I'll rebase it once I have some feedback for this. I'll also add some changelog entry that should mention the change and open a PR for the docs but first I need to know if I should invest some time in this or not.

dlsniper · 2012-04-07T07:33:58Z

Hi,

I've rebased to the latest master, added changelog info and mentioned this in the docs as well.

Best regards.

dlsniper · 2012-04-13T19:25:03Z

Hi,

Anyone any input on this? If everything is ok then can this be merged? If not then can this be closed? Or just let me know what should I do to get this merged if it's missing something.

Thanks.

ping @fabpot @stof @vicb

vicb · 2012-04-13T19:44:37Z

@dlsniper let's wait for the feedback from @fabpot. If he is ok to merge this, the PR would need to be rebased again, the routing has had a lot of changes during the last week.

dlsniper · 2012-04-14T16:10:05Z

It seems that the latest changes to the routing component made the unicode support on by default so I guess this PR original intent is pointless now.

It can still be merged as it contains just missing tests for this case and, what I believe to be, a nice to have mention for the feature in the changelog.

Let me know if I'm missing something about the support for unicode but as far as I can see both my computer and Travis seem to have the same opinion about the tests.

dlsniper · 2012-04-30T17:20:55Z

ping @fabpot Can you please merge this so in case the unicode support for routes disappears at some point, people will know?

Also, I've run the tests in Routing component separately as the latest changes in master regarding Redis/Memcached break the tests both on my laptop and on Travis.

Thanks!

vicb · 2012-04-30T17:39:07Z

@dlsniper what do you mean by unicode support ? (i.e. I don't think we can claim unicode support without the u modifier ?).

dlsniper · 2012-04-30T17:48:35Z

@vicb Then we have an interesting case. I can't check with PHP 5.3.2 but in the Ubuntu 12.04 PHP 5.3.10 I can clearly see the tests passing, regardless of input characters. Also I could do this before, using 5.3.6 (Ubuntu 11.10). Also, I've been running the dev environment against the changes in the routing component for about 2 weeks now and I couldn't see any problems with this.

Should I add the u modifier to this PR?

Tobion · 2012-04-30T18:10:49Z

I think this unicode thing is somewhat tricky. One could also argue the test you're doing are even wrong. Because UTF-8 chars are not allowed in URLs/URIs. So they need to be percent-encoded to be valid. So generating a URL needs to encode all unallowed chars (e.g. segments with unicode chars).
But this might not be the desired output for some users who want nice looking URLs. But technically it would be correct. Also most browsers will show the nice URL version anyway in the address bar when following an encoded URL. So the argument of nice looking URLs is not really valid.
See http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

dlsniper · 2012-04-30T18:15:27Z

@Tobion If the tests are not right, then please do tell. I'm very interested in having this the proper way as in fact, like I've said above, I'm using this patch in dev env and I don't have any problems with it and in production I'm using the older Routing component but with the same patch, the u modifier.

So far I haven't had any problems on production related to this but if in fact it's just pure luck, then I'll take any advice I can receive,

Also while you might argue that pretty URLs are not something really needed, try explain that to my marketing director, to the SEO guys and so on :)

@vicb meanwhile, I've updated the tests and added the missing modifier.

Tobion · 2012-04-30T20:15:43Z

You don't seem to have understood what I was saying. It has nothing to do with pretty URLs in SEO context.
Your PR just shows that we don't have unicode support the proper way.
One of your tests is like this:

$routes = $this->getRoutes('test', new Route('/Жени/{slug1}'));
$this->assertEquals('/app.php/Жени/%D0%96%D0%B5%D0%BD%D0%B8', $this->getGenerator($routes)->generate('test', array('slug1' => 'Жени')));

It shows that only the variables get url encoded but in fact all segments must be url encoded, i.e. static text too. Otherwise the generated URL contains invalid chars and does not conform to the RFC. So it should be

$this->assertEquals('/app.php/%D0%96%D0%B5%D0%BD%D0%B8/%D0%96%D0%B5%D0%BD%D0%B8', $this->getGenerator($routes)->generate('test', array('slug1' => 'Жени')));

mvrhov · 2012-04-30T20:19:48Z

Hm there was an issue reported by some Russian guy which had the problem of matching the URI with Cyrillic characters.
I cant seem to find it but it might be related.

dlsniper · 2012-04-30T20:23:36Z

@mvrhov Do you talk about #2375 ?
@Tobion I'll look into it, I've worked on this patch on a couple of different times but I have some time right now. The test you've show seems to be missing the optional parameters indeed so there's more work for this. Thanks for pointing it out!

Tobion · 2012-04-30T20:38:40Z

The Matcher and Dumper etc should probably work on the encoded value instead of the decoded one. Then the preg_match would also not need the u modifier as there won't be any non-ascii chars in the route.

Tobion · 2012-04-30T20:41:14Z

Too bad the concrete test case for #2375 is missing because I don't see how a simple requirement change would solve his problem with Cryillic chars. His problem and the solution does not seem to be related to each other.

dlsniper · 2012-04-30T20:45:20Z

It seems that we are using rawurlencode which does the enconding part: https://github.com/symfony/symfony/blob/master/src/Symfony/Component/Routing/Generator/UrlGenerator.php#L121 I didn't had time to take a look at the matcher but my guess is that it's using rawurldecode :) This is limited unfortunately for default parameters so I'll have a look on how this can be fixed.

Also I'll add some more tests, using Greek, Arabic, Armenian, Chinese, Vietnamese and HIndi. I believe that should do it, no?

Tobion · 2012-04-30T21:03:02Z

Hm, it get's even more tricky. What if the delimiter between two variables is a unicode character? This will probably break everything. Both the current implementation but also my idea with using url encoded value in all comparisions. That's because the default regex exclusion in url encoded form won't work, e.g.: [^%D0]

Tobion · 2012-04-30T21:12:30Z

So I would reason, that we don't have full unicode support currently. To support all cases requires much research and testing.
For example the PhpMatcherDumper uses 0 === strpos() as an optimization. This specific code should work with unicode because we are only testing the first byte so it doesn't matter that strpos is not multibyte save.

dlsniper · 2012-04-30T21:16:58Z

@Tobion yes, I've seen the non-unicode safe functions being used in the Routing component.

I've renamed this to WIP as there's indeed much work to be done and it seems that I'm just a lucky case in using this on live as I do have a small use case scenario :)

I'm working on adding some tests to this, like mentioned above and see from there how we can work on this.

Thanks for your help so far :)

mvrhov · 2012-05-01T06:20:59Z

I was talking about: #2899

dlsniper · 2012-05-01T09:40:56Z

@mvrhov I believe that the current routing system should take care of the issue now but if it doesn't, let me know.
@Tobion I've fixed the issue you've reported in #4166 and I've tested it against a quick dummy project. I'm not sure if the matcher should be changed as as far as I remember doing a rawurldecode was a good thing.

vicb · 2012-05-01T09:47:37Z

The compiler is not unicode aware

dlsniper · 2012-05-01T10:00:54Z

@vicb guess I've missed that because the tests are passing and for some reason the expression starts with . right after the starting delimiter. If I were to add the u modifier I'm not sure this will be a valid fix as I'd rather remove the . but since you've did the Routing overhaul earlier last month maybe you can offer some insight about it?

Thanks

vicb · 2012-05-01T10:11:24Z

I think adding a 'u' would not be enough: you would need to take into account the case where the separator could be a unicode character:

. should probably be changed for \X,
accessing $m[0][0][0] is not valid any more, you should put the \X inside a matching subpattern,
you should verify all the code handling the separator (to check if it supports unicode),
you should also check if unicode characters can be used inside classes, otherwise you have to encode them.

fabpot · 2012-05-01T10:37:32Z

I've not read the entire discussion but allowing unicode characters for the separator is probably not needed (I don't see any valid use case for that -- we are already quite generous in the range od characters you can use for separators). Let's keep it simple.

Tobion · 2012-05-01T10:58:26Z

@fabpot so if you have a unicode char between two variables, e.g. {var1}€{var2} the default regex for var1 should simply be .+ instead of [^€]+ (which it normally would be if it wasn't a unicode char)? That would make it somewhat inconsistent and we would need to distinguish between unicode and ascii.

vicb · 2012-05-01T11:10:06Z

src/Symfony/Component/Routing/Generator/UrlGenerator.php

@@ -124,7 +124,7 @@ protected function doGenerate($variables, $defaults, $requirements, $tokens, $pa
                    $optional = false;
                }
            } elseif ('text' === $token[0]) {
-                $url = $token[1].$url;
+                $url = str_replace(array_keys($this->decodedChars), $this->decodedChars, rawurlencode($token[1])).$url;


Thanks, will patch in future commit.

vicb · 2012-05-01T11:34:39Z

@Tobion If unicode is not allowed for separators {var1}€{var2} would not be a valid route.

Some questions:

How can we chek if unicode is enabled in PCRE ?
Should we make unicode support optional ?

Tobion · 2012-05-01T11:39:29Z

Huh? You mean {var1}€{var2} wouldn't be a valid route and raises an exception? That would be very bad.
But {var1}/€/{var2} would be valid because the unicode char is not a delimiter? I think that's very incomprehensible for most people.

dlsniper · 2012-05-01T11:47:06Z

@vicb For PCRE detection I think we can use something like here: http://drupal.org/node/587362#comment-3411540

As for making it optional, I'm not sure, it could be done if we need to make it optional.

@Tobion No, currently you won't receive an error if you try something like {var1}€{var2} but I believe it's just a coincidence that's working.

vicb · 2012-05-01T11:58:23Z

@Tobion this is my understanding of @fabpot proposal which would require minimal changes to the current code. (@dlsniper this discussion is not about the current code but the unicode aware code).

Tobion · 2012-05-05T00:30:58Z

@dlsniper I fixed #4166 in #4205. First the fix should be done on 2.0 and secondly, we shouldn't mix unicode support with the fix as they are different things.

…s (nicolas-grekas) This PR was merged into the 3.2-dev branch. Discussion ---------- [Routing] Add seamless support for unicode requirements | Q | A | ------------- | --- | Branch? | master | Bug fix? | no | New feature? | yes | BC breaks? | no | Deprecations? | yes | Tests pass? | yes | Fixed tickets | #3629, #5236, #19562 | License | MIT | Doc PR | symfony/symfony-docs#6890 This PR adds unicode support to route matching and generation by automatically adding the `u` modifier to regexps that use either unicode characters or unicode enabled character classes (e.g. `\p...` `\x{...}` `\X`). As a side note, if one wants to match a single unicode character (vs a single byte), one should use `\PM` or `\X` instead of `.` *or* set the `unicode` parameter to true. Commits ------- a829d34 [Routing] Add seamless support for unicode requirements

dlsniper mentioned this pull request Apr 7, 2012

Added Unicode information about routes symfony/symfony-docs#1225

Merged

Tobion mentioned this pull request Apr 24, 2012

Routing component (and unit tests) fails on CentOS #4093

Closed

Tobion mentioned this pull request Apr 30, 2012

[Routing] UrlGenerator fails to encode static text #4166

Closed

dlsniper added 3 commits May 1, 2012 11:39

Added unicode modifier to regexp and updated tests

d367257

Added more language tests

3cd07d1

Made URLGenerator Unicode aware

6a4b88a

vicb reviewed May 1, 2012
View reviewed changes

fabpot closed this Jul 3, 2012

Tobion mentioned this pull request Aug 10, 2012

[Routing] add unicode support #5236

Closed

nicolas-grekas mentioned this pull request Aug 12, 2016

[Routing] Add seamless support for unicode requirements #19604

Merged

Uh oh!

[WIP][Routing] Added unicode support #3629

[WIP][Routing] Added unicode support #3629

Uh oh!

Conversation

dlsniper commented Mar 17, 2012

Uh oh!

stof commented Mar 17, 2012

Uh oh!

fabpot commented Mar 18, 2012

Uh oh!

ghost commented Mar 18, 2012

Uh oh!

dlsniper commented Mar 18, 2012

Uh oh!

dlsniper commented Mar 18, 2012

Uh oh!

dlsniper commented Mar 25, 2012

Uh oh!

stof commented Apr 3, 2012

Uh oh!

dlsniper commented Apr 4, 2012

Uh oh!

dlsniper commented Apr 7, 2012

Uh oh!

dlsniper commented Apr 13, 2012

Uh oh!

vicb commented Apr 13, 2012

Uh oh!

dlsniper commented Apr 14, 2012

Uh oh!

dlsniper commented Apr 30, 2012

Uh oh!

vicb commented Apr 30, 2012

Uh oh!

dlsniper commented Apr 30, 2012

Uh oh!

Tobion commented Apr 30, 2012

Uh oh!

dlsniper commented Apr 30, 2012

Uh oh!

Tobion commented Apr 30, 2012

Uh oh!

mvrhov commented Apr 30, 2012

Uh oh!

dlsniper commented Apr 30, 2012

Uh oh!

Tobion commented Apr 30, 2012

Uh oh!

Tobion commented Apr 30, 2012

Uh oh!

dlsniper commented Apr 30, 2012

Uh oh!

Tobion commented Apr 30, 2012

Uh oh!

Tobion commented Apr 30, 2012

Uh oh!

dlsniper commented Apr 30, 2012

Uh oh!

mvrhov commented May 1, 2012

Uh oh!

dlsniper commented May 1, 2012

Uh oh!

vicb commented May 1, 2012

Uh oh!

dlsniper commented May 1, 2012

Uh oh!

vicb commented May 1, 2012

Uh oh!

fabpot commented May 1, 2012

Uh oh!

Tobion commented May 1, 2012

Uh oh!

vicb May 1, 2012

Choose a reason for hiding this comment

Uh oh!

dlsniper May 1, 2012

Choose a reason for hiding this comment

Uh oh!

vicb commented May 1, 2012