parse: Don't allow creation of invalid UTF8 strings or identifiers #17862

jepler · 2025-08-07T14:29:42Z

Summary

Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.

Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855).

I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.

I re-organized the code to pay for the size cost of the new check in the lexer.

Testing

I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.

Trade-offs and Alternatives

Could check that the whole code buffer is UTF-8 instead.

codecov · 2025-08-07T14:39:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.39%. Comparing base (1588c45) to head (90e366e).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #17862   +/-   ##
=======================================
  Coverage   98.38%   98.39%           
=======================================
  Files         171      171           
  Lines       22297    22300    +3     
=======================================
+ Hits        21938    21941    +3     
  Misses        359      359

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2025-08-07T14:39:41Z

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:   +88 +0.010% standard
      stm32:   +16 +0.004% PYBV10
     mimxrt:   +24 +0.006% TEENSY40
        rp2:   +16 +0.002% RPI_PICO_W
       samd:   +16 +0.006% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +24 +0.005% VIRT_RV32

dlech · 2025-08-07T16:18:50Z

py/objstr.h

@@ -119,4 +119,13 @@ extern const mp_obj_dict_t mp_obj_bytearray_locals_dict;
 extern const mp_obj_dict_t mp_obj_array_locals_dict;
 #endif

+#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
+// Throws an exception if string content is not UTF-8
+void utf8_require(const byte *p, size_t len);


I suggest to give this function the mp_ namespace prefix. We haven't always been consistent about this, but I think any new functions should have it.

good suggestion. done.

dpgeorge · 2025-08-15T02:07:28Z

tests/unicode/unicode_parser.py.exp

@@ -0,0 +1,5 @@
+UnicodeError


Can we make the output match CPython so this test doesn't need a .exp file?

I see that CPython raises SyntaxError though, which is different to MicroPython here... not sure what the best way forward is.

Would it be better to make MicroPython raise SyntaxError, to match CPython?

I don't know if that really matters though, this is a pretty rare case, and saving code size (reusing the same UnicodeError) is probably more important.

OTOH, if you had some code like this:

try: exec(some_code) except SyntaxError: handle_syntax_error()

you might be surprised that the UnicodeError is raised and escapes the error handling.

dpgeorge · 2025-08-15T14:55:22Z

py/parse.c

@@ -598,6 +598,9 @@ static mp_parse_node_t make_node_const_object_optimised(parser_t *parser, size_t
 static void push_result_token(parser_t *parser, uint8_t rule_id) {
    mp_parse_node_t pn;
    mp_lexer_t *lex = parser->lexer;
+    if (lex->tok_kind == MP_TOKEN_NAME || lex->tok_kind == MP_TOKEN_STRING) {
+        mp_utf8_require((byte *)lex->vstr.buf, lex->vstr.len);


This is going to impact performance for compilation of all code. I guess there's not really any way around that if we want to validate the utf8 properly.

Agreed on both counts. If the overhead is intolerable for some use case, it can be compile-time disabled just like the existing uses of mp_utf8_require (gated by MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, on by default when unicode is on).

(It's been rattling around in my head to have a build flag for "remove all checks a good static checker would diagnose", e.g., code that passes mypy --strict, is checked for valid utf-8, etc. but I think there would still end up being a lot of judgement calls and also you can't quite type check micropython/circuitpython code because of lack of exactly matching pyi stubs…)

jepler · 2025-08-15T17:14:07Z

It's not necessarily a great comparison but I made a file called "huge.py" from 16 copies of tests/perf_bench/bm_hexiom.py (total size 268928 bytes) then benchmarked compiling it with the old and new mpy-cross on my x86_64 linux desktop system. Statistically, it was a wash.

Benchmark 1: ./mpy-cross-pr17862 huge.py
  Time (mean ± σ):      66.1 ms ±   0.3 ms    [User: 58.6 ms, System: 7.4 ms]
  Range (min … max):    65.3 ms …  67.0 ms    100 runs
 
Benchmark 2: ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py
  Time (mean ± σ):      66.3 ms ±   0.3 ms    [User: 59.6 ms, System: 6.6 ms]
  Range (min … max):    65.8 ms …  67.7 ms    100 runs
 
Summary
  ./mpy-cross-pr17862 huge.py ran
    1.00 ± 0.01 times faster than ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py

Linux perf stat counted 4 million more instructions (+0.59%) but another way to look at is 15 added instructions per byte of source code. (baseline is around 2554 instructions per byte of source code)

 Performance counter stats for './mpy-cross-pr17862 huge.py' (30 runs):

             66.82 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.09% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               329      page-faults                      #    4.924 K/sec                       ( +-  0.09% )
       282,271,960      cycles                           #    4.224 GHz                         ( +-  0.05% )  (57.31%)
        41,553,651      stalled-cycles-frontend          #   14.72% frontend cycles idle        ( +-  0.33% )  (63.35%)
       690,916,317      instructions                     #    2.45  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.09% )  (63.77%)
       193,477,661      branches                         #    2.896 G/sec                       ( +-  0.08% )  (60.80%)
         1,477,191      branch-misses                    #    0.76% of all branches             ( +-  1.10% )  (54.76%)

         0.0674732 +- 0.0000710 seconds time elapsed  ( +-  0.11% )


 Performance counter stats for './mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py' (30 runs):

             65.99 msec task-clock                       #    0.991 CPUs utilized               ( +-  0.05% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               328      page-faults                      #    4.971 K/sec                       ( +-  0.07% )
       283,490,040      cycles                           #    4.296 GHz                         ( +-  0.05% )  (55.98%)
        42,627,980      stalled-cycles-frontend          #   15.04% frontend cycles idle        ( +-  0.37% )  (62.06%)
       686,832,657      instructions                     #    2.42  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.13% )  (63.52%)
       194,634,958      branches                         #    2.950 G/sec                       ( +-  0.17% )  (62.25%)
         1,374,874      branch-misses                    #    0.71% of all branches             ( +-  1.75% )  (56.18%)

         0.0665979 +- 0.0000371 seconds time elapsed  ( +-  0.06% )

jepler · 2025-08-16T19:18:13Z

I changed things a bit so SyntaxError can be raised in this case. I agree it seems preferable if the cost isn't high.

jepler · 2025-08-16T20:28:34Z

That takes e.g., RP2040 from -16 bytes to +16 bytes (net 32 bytes difference to throw the correct exception). Other ports are in the same range of increase.

.. from non UTF-8 inputs. In this case, MicroPython raises UnicodeError while CPython uses SyntaxError. By catching either exception, the test does not require an .exp file. Signed-off-by: Jeff Epler <jepler@gmail.com>

All sites immediately threw a UnicodeError, so roll that into the new function utf8_require. unicode.c was designed not to require runtime.h, so move the checking function into objstr.c. Reduce the number of #if sites by making a do-nothing variant that is used instead when !STR_UNICODE or !STR_UNICODE_CHECK. Signed-off-by: Jeff Epler <jepler@gmail.com>

.. even when compiling non UTF-8 files or byte strings. Closes: micropython#17855 Signed-off-by: Jeff Epler <jepler@gmail.com>

jepler changed the title ~~parse: Don't allow creation of invalid UTF8 strings~~ parse: Don't allow creation of invalid UTF8 strings or identifiers Aug 7, 2025

jepler force-pushed the issue17855 branch from fd0ca44 to 392d55c Compare August 7, 2025 14:43

dlech reviewed Aug 7, 2025

View reviewed changes

jepler force-pushed the issue17855 branch from 392d55c to 09533f8 Compare August 7, 2025 16:45

dpgeorge added py-core Relates to py/ directory in source unicode Bugs and enhancements related to Unicode/UTF-8 support. labels Aug 8, 2025

dpgeorge reviewed Aug 15, 2025

View reviewed changes

jepler force-pushed the issue17855 branch from 09533f8 to bce1332 Compare August 15, 2025 14:41

dpgeorge reviewed Aug 15, 2025

View reviewed changes

jepler force-pushed the issue17855 branch from bce1332 to 1e841d4 Compare August 16, 2025 19:17

jepler added 3 commits August 17, 2025 08:59

tests: Add test of invalid unicode strings.

90012e7

.. from non UTF-8 inputs. In this case, MicroPython raises UnicodeError while CPython uses SyntaxError. By catching either exception, the test does not require an .exp file. Signed-off-by: Jeff Epler <jepler@gmail.com>

parse: Don't allow creation of invalid UTF8 strings or identifiers.

90e366e

.. even when compiling non UTF-8 files or byte strings. Closes: micropython#17855 Signed-off-by: Jeff Epler <jepler@gmail.com>

jepler force-pushed the issue17855 branch from 1e841d4 to 90e366e Compare August 17, 2025 13:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

parse: Don't allow creation of invalid UTF8 strings or identifiers #17862

parse: Don't allow creation of invalid UTF8 strings or identifiers #17862

jepler commented Aug 7, 2025

Uh oh!

codecov bot commented Aug 7, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Aug 7, 2025 •

edited

Loading

Uh oh!

dlech Aug 7, 2025

Uh oh!

jepler Aug 7, 2025

Uh oh!

dpgeorge Aug 15, 2025

Uh oh!

jepler Aug 15, 2025

Uh oh!

dpgeorge Aug 15, 2025

Uh oh!

dpgeorge Aug 15, 2025

Uh oh!

jepler Aug 15, 2025

Uh oh!

jepler commented Aug 15, 2025 •

edited

Loading

Uh oh!

jepler commented Aug 16, 2025

Uh oh!

jepler commented Aug 16, 2025

Uh oh!

Uh oh!

Uh oh!

parse: Don't allow creation of invalid UTF8 strings or identifiers #17862

Are you sure you want to change the base?

parse: Don't allow creation of invalid UTF8 strings or identifiers #17862

Conversation

jepler commented Aug 7, 2025

Summary

Testing

Trade-offs and Alternatives

Uh oh!

codecov bot commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dlech Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

jepler Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

dpgeorge Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

jepler Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

dpgeorge Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

dpgeorge Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

jepler Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

jepler commented Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jepler commented Aug 16, 2025

Uh oh!

jepler commented Aug 16, 2025

Uh oh!

Uh oh!

codecov bot commented Aug 7, 2025 •

edited

Loading

github-actions bot commented Aug 7, 2025 •

edited

Loading

jepler commented Aug 15, 2025 •

edited

Loading