parse: Don't allow creation of invalid UTF8 strings or identifiers #17862
base: master
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##           master   #17862   +/-   ##
========================================
  Coverage   98.38%   98.39%
========================================
  Files         171      171
  Lines       22297    22300      +3
========================================
+ Hits        21938    21941      +3
  Misses        359      359
Code size report:
py/objstr.h (outdated)
@@ -119,4 +119,13 @@ extern const mp_obj_dict_t mp_obj_bytearray_locals_dict;
 extern const mp_obj_dict_t mp_obj_array_locals_dict;
 #endif
 
+#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
+// Throws an exception if string content is not UTF-8
+void utf8_require(const byte *p, size_t len);
I suggest giving this function the mp_ namespace prefix. We haven't always been consistent about this, but I think any new functions should have it.
Good suggestion, done.
tests/unicode/unicode_parser.py.exp (outdated)
@@ -0,0 +1,5 @@
+UnicodeError
Can we make the output match CPython so this test doesn't need a .exp file?
I see that CPython raises SyntaxError though, which is different to MicroPython here... not sure what the best way forward is.
Updated.
Would it be better to make MicroPython raise SyntaxError, to match CPython?
I don't know if that really matters though, this is a pretty rare case, and saving code size (reusing the same UnicodeError) is probably more important.
OTOH, if you had some code like this:
try:
    exec(some_code)
except SyntaxError:
    handle_syntax_error()
you might be surprised that the UnicodeError is raised and escapes the error handling.
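One portable way around that difference, sketched below with the same placeholder names as the example above, is to catch both exception types:

```python
try:
    exec(some_code)
except (SyntaxError, UnicodeError):
    # Handles CPython's SyntaxError and MicroPython's UnicodeError alike.
    handle_syntax_error()
```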
py/parse.c (outdated)
@@ -598,6 +598,9 @@ static mp_parse_node_t make_node_const_object_optimised(parser_t *parser, size_t
 static void push_result_token(parser_t *parser, uint8_t rule_id) {
     mp_parse_node_t pn;
     mp_lexer_t *lex = parser->lexer;
+    if (lex->tok_kind == MP_TOKEN_NAME || lex->tok_kind == MP_TOKEN_STRING) {
+        mp_utf8_require((byte *)lex->vstr.buf, lex->vstr.len);
This is going to impact performance for compilation of all code. I guess there's not really any way around that if we want to validate the utf8 properly.
Agreed on both counts. If the overhead is intolerable for some use case, it can be compile-time disabled just like the existing uses of mp_utf8_require (gated by MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, on by default when unicode is on).
(It's been rattling around in my head to have a build flag for "remove all checks a good static checker would diagnose", e.g., code that passes mypy --strict, is checked for valid UTF-8, etc. But I think there would still end up being a lot of judgement calls, and you also can't quite type-check micropython/circuitpython code because of the lack of exactly matching .pyi stubs…)
It's not necessarily a great comparison but I made a file called "huge.py" from 16 copies of tests/perf_bench/bm_hexiom.py (total size 268928 bytes) then benchmarked compiling it with the old and new mpy-cross on my x86_64 linux desktop system. Statistically, it was a wash.
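For reference, a rough sketch of that benchmark methodology (the binary names, run count, and output path here are placeholders, not the exact procedure used):

```python
import statistics
import subprocess
import time

def bench(mpy_cross, source="huge.py", runs=20):
    # Time repeated compilations of one large source file.
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run([mpy_cross, "-o", "/dev/null", source], check=True)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)

for build in ("./mpy-cross-old", "./mpy-cross-new"):
    mean, stdev = bench(build)
    print(f"{build}: {mean * 1e3:.1f} ms +/- {stdev * 1e3:.1f} ms")
```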
I changed things a bit so SyntaxError can be raised in this case. I agree it seems preferable if the cost isn't high.
That takes e.g. RP2040 from -16 bytes to +16 bytes (a net 32-byte difference to throw the correct exception). Other ports show a similar increase.
.. from non-UTF-8 inputs. In this case, MicroPython raises UnicodeError while CPython uses SyntaxError. By catching either exception, the test does not require an .exp file. Signed-off-by: Jeff Epler <jepler@gmail.com>
All sites immediately threw a UnicodeError, so roll that into the new function utf8_require. unicode.c was designed not to require runtime.h, so move the checking function into objstr.c. Reduce the number of #if sites by making a do-nothing variant that is used instead when !STR_UNICODE or !STR_UNICODE_CHECK. Signed-off-by: Jeff Epler <jepler@gmail.com>
.. even when compiling non-UTF-8 files or byte strings. Closes: micropython#17855 Signed-off-by: Jeff Epler <jepler@gmail.com>
Summary
Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.
Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855). I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.
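For illustration (made-up inputs, not taken from the fuzzer corpus), both the string and identifier cases look like this from Python code:

```python
# The byte \xff can never appear in valid UTF-8, so neither source parses.
for src in (b"s = '\xff'", b"\xffoo = 1"):
    try:
        exec(src)
    except (SyntaxError, UnicodeError) as e:
        # CPython raises SyntaxError; this PR makes MicroPython match.
        print(type(e).__name__)
```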
I re-organized the code to pay for the size cost of the new check in the lexer.
Testing
I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.
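A sketch of the idea (the actual test contents may differ); catching either exception type keeps the printed output identical under CPython and MicroPython, so no .exp file is needed:

```python
# eval() of a byte string that is not valid UTF-8 must fail to parse.
try:
    eval(b"'\xfe\xff'")  # \xfe and \xff never occur in valid UTF-8
except (SyntaxError, UnicodeError):
    print("rejected")
```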
Trade-offs and Alternatives
An alternative would be to check that the whole code buffer is UTF-8 up front, instead of validating individual name and string tokens as they are parsed.