-
Notifications
You must be signed in to change notification settings - Fork 294
Do not directly use isolated surrogates in unicode literals #150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…orms besides Jython
Critic review: https://critic.hoppipolla.co.uk/r/1443 This is an external review system which you may optionally use for the code review of your pull request. In order to help critic track your changes, please do not make in-place history rewrites (e.g. via |
I should mention that my branch of pip has the same change proposed here in its vendor lib inclusion of html5lib |
Really I'd rather Jython just changed so that is matched the Python documentation and CPython and PyPy and had its unicode type be 16-/32-bit code-units. :) But… Trying to run the tests with that branch of Jython on your branch of html5lib leads to a bunch of errors. Notably, Otherwise there's a bunch of failures due to lack of an euc_jp codec. The docs say:
…which I take to mean as Python, as a language, comes with them. Hence that's seems to be another bug in Jython. |
Guido van Rossum summarized the issue of UTF-16 support in Jython well, which we used as guidance when Jython 2.5 was under development:
Re euc_jp codec, this is a known bug with respect to a lack of CJK codecs (http://bugs.jython.org/issue1066). Hopefully there has been some more recent work on the corresponding patch. However, I did just re-run using $ ~/jythondev/jython-socket-reboot/dist/bin/nosetests --verbose
...
----------------------------------------------------------------------
Ran 1038 tests in 16.498s
OK Maybe I need to specify additional options? |
Did you init/update the git submodule? See the readme. You should be looking at tens of thousands of tests. FWIW, while the Py2 language reference uses the ambiguous phrasing of "Unicode ordinal" in defining the unicode type (what on earth a "Unicode ordinal" is is a very good question — it's not defined anywhere!), the Py3 language reference states that the str type is a sequence of Unicode codepoints (which would therefore mean lone surrogates are allowed). But this is rehashing what I said in the Jython bug and probably not the place to debate Python semantics. |
@gsnedders Got it, I was able to run 16729 tests, so we are now on the same page. Of these I'm seeing 7 errors. 6 are definitely euc_jp; 1 is failing in the middle of a loop, which I haven't investigated yet. Maybe this is euc_jp too? Since we would like to fix http://bugs.jython.org/issue1066 and support CJK codecs, I'm going to put this PR on hold on our side until that CJK support is complete, which I would be expect in the beta 4. This timing seems to be sound given the work by Yuji Yamano and the fact that this is now getting some attention. Because we will also will be integrating ensurepip during beta 4, there may be an interval where ensurepip will refer to a Jython-specific fork of pip (so beyond the current practice of referring people to https://github.com/jimbaker/pip to try things out), but this should be short in duration. |
The other error is another case of lone-surrogates, I can tell you that much. |
@gsnedders re the other case: will take a look. Thanks again for your help! I have updated http://bugs.jython.org/issue1066 to reflect this CJK codec dependency. |
When you next take a look at this, mind adding |
So where are we with this right now? |
The CJK work is now complete in Jython trunk: Jython now supports being able to wrap java.nio.charset codec support with the standard Python codec API. This includes the tested euc_jp codec in this test suite. There's one remaining problem with isolated surrogates with the tests corresponding to unicodeCharsProblematic.test (a data file used to construct distinct unit tests, a rather clever approach); some sort of workaround will be required. I will update the branch with this test workaround accordingly, along with .gitignore. |
HTMLUnicodeInputStream objects from unicode strings that contain isolated surrogates. Such tests are not meaningful on Jython which does not allow for invalid unicode strings to be decoded in the first place.
All tests pass with python 2.7, python 3.4, and jython 2.7 trunk using html5lib-tests test data. With this latest change, Jython will now only run 17353 of the current 17357 tests, effectively excluding the following test cases in unicodeCharsProblematic.test:
However, it does run the test case in that data file using I did notice that the JSON parsing of namedentities.test is rather slow. Something to fixed separately, by using the Jackson package for the json module implementation. |
I should also mention that https://bitbucket.org/jimbaker/jython-socket-reboot has been deleted, now that it has been merged into https://bitbucket.org/jython/jython (or alternatively, http://hg.python.org/jython) |
For what it's worth, if this gets merged and released within the next month or so It'll make it into pip 1.6. Not sure what the release schedule looks like for html5lib though :) |
@gsnedders Any updates/comments on this pull request? We are really hoping that pip 1.6 will be able to support Jython 2.7, as well as other users of html5lib-python |
Apologies for being a bit slow (I've been all over the place travelling pretty continuously, and I'm now on holiday till EuroPython); there's a couple of comments on Critic now, see the first comment above. @dstufft: the html5lib release schedule is typically someone going, "oh, hey, I need this bugfix that's on master, could you make a release?", and me responding with a release. :) (In theory, I plan on making releases every so often once anything worthwhile happens, but other people often find obscure bugfixes worthwhile, so it gets released when they ask.) |
@gsnedders Looking forward to seeing you at EuroPython! I'll respond to the Critic review, thanks! |
I added some more comments on critic. |
Hmm, test failures are due to trailing whitespace. |
#168 is essentially the same as this with the trailing whitespace fixed. |
#168 fails all its tests, not just the flake8 ones. |
That odd given nonameentername@c4755d3 is the only extra commit on the branch… Literally the whole diff is just that one commit! So, um… Somehow submodule not being updated properly on Travis? Because that looks like it's run with a more recent html5lib-tests revision… |
Oh, Travis CI seems to run the implicit merge-commit in html5lib-python@refs/pull/168/merge. Hence given it's based on master (which currently fails tests because #174 isn't fixed yet) and run more recently it fails. |
(That's annoying because it means that when Travis CI runs a given PR it tests different things; if I were to make it retest #150 it would have all the same failures, but with flake8 finding nothing, le sigh.) |
I was hoping that the recent jython 2.7b4 release somehow worked around this problem, but it is still borken. |
Hmm, I don't know of any resolution here. Are you in the process of merging #168? |
Just closing this because #168 is it's replacement rather than keeping both open. |
(We'll see if anything happens soon. Maybe.) |
Jython does not support isolated surrogates in unicode, including in unicode literals. This has been reported in #2 This bug is critical for Jython due to the fact that html5lib is a vendor lib for pip, and this is blocking pip from running on Jython.
For platforms besides Jython, this pull request allows for these surrogates to be constructed in literals, but through an additional step of indirection. For Jython itself, Jython's normal decode of literals will ensure that such invalid unicode strings cannot be constructed from any source.
To run this on Jython:
Note that in the dev build, you will find executables in dist/bin, such as dist/bin/jython or dist/bin/pip
The jython-socket-reboot branch is nearly complete for merging into Jython; it is a major component of Jython 2.7.0 beta 3. (I'm a core dev of Jython.)