Skip to content

Conversation

jepler
Copy link
Contributor

@jepler jepler commented Aug 7, 2025

Summary

Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.

Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855).

I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.

I re-organized the code to pay for the size cost of the new check in the lexer.

Testing

I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.

Trade-offs and Alternatives

Could check that the whole code buffer is UTF-8 instead.

Copy link

codecov bot commented Aug 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.39%. Comparing base (1588c45) to head (90e366e).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #17862   +/-   ##
=======================================
  Coverage   98.38%   98.39%           
=======================================
  Files         171      171           
  Lines       22297    22300    +3     
=======================================
+ Hits        21938    21941    +3     
  Misses        359      359           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

github-actions bot commented Aug 7, 2025

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:   +88 +0.010% standard
      stm32:   +16 +0.004% PYBV10
     mimxrt:   +24 +0.006% TEENSY40
        rp2:   +16 +0.002% RPI_PICO_W
       samd:   +16 +0.006% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +24 +0.005% VIRT_RV32

@jepler jepler changed the title parse: Don't allow creation of invalid UTF8 strings parse: Don't allow creation of invalid UTF8 strings or identifiers Aug 7, 2025
py/objstr.h Outdated
@@ -119,4 +119,13 @@ extern const mp_obj_dict_t mp_obj_bytearray_locals_dict;
extern const mp_obj_dict_t mp_obj_array_locals_dict;
#endif

#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
// Throws an exception if string content is not UTF-8
void utf8_require(const byte *p, size_t len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to give this function the mp_ namespace prefix. We haven't always been consistent about this, but I think any new functions should have it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestion. done.

@dpgeorge dpgeorge added py-core Relates to py/ directory in source unicode Bugs and enhancements related to Unicode/UTF-8 support. labels Aug 8, 2025
@@ -0,0 +1,5 @@
UnicodeError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the output match CPython so this test doesn't need a .exp file?

I see that CPython raises SyntaxError though, which is different to MicroPython here... not sure what the best way forward is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to make MicroPython raise SyntaxError, to match CPython?

I don't know if that really matters though, this is a pretty rare case, and saving code size (reusing the same UnicodeError) is probably more important.

OTOH, if you had some code like this:

try:
    exec(some_code)
except SyntaxError:
    handle_syntax_error()

you might be surprised that the UnicodeError is raised and escapes the error handling.

py/parse.c Outdated
@@ -598,6 +598,9 @@ static mp_parse_node_t make_node_const_object_optimised(parser_t *parser, size_t
static void push_result_token(parser_t *parser, uint8_t rule_id) {
mp_parse_node_t pn;
mp_lexer_t *lex = parser->lexer;
if (lex->tok_kind == MP_TOKEN_NAME || lex->tok_kind == MP_TOKEN_STRING) {
mp_utf8_require((byte *)lex->vstr.buf, lex->vstr.len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to impact performance for compilation of all code. I guess there's not really any way around that if we want to validate the utf8 properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on both counts. If the overhead is intolerable for some use case, it can be compile-time disabled just like the existing uses of mp_utf8_require (gated by MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, on by default when unicode is on).

(It's been rattling around in my head to have a build flag for "remove all checks a good static checker would diagnose", e.g., code that passes mypy --strict, is checked for valid utf-8, etc. but I think there would still end up being a lot of judgement calls and also you can't quite type check micropython/circuitpython code because of lack of exactly matching pyi stubs…)

@jepler
Copy link
Contributor Author

jepler commented Aug 15, 2025

It's not necessarily a great comparison but I made a file called "huge.py" from 16 copies of tests/perf_bench/bm_hexiom.py (total size 268928 bytes) then benchmarked compiling it with the old and new mpy-cross on my x86_64 linux desktop system. Statistically, it was a wash.

Benchmark 1: ./mpy-cross-pr17862 huge.py
  Time (mean ± σ):      66.1 ms ±   0.3 ms    [User: 58.6 ms, System: 7.4 ms]
  Range (min … max):    65.3 ms …  67.0 ms    100 runs
 
Benchmark 2: ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py
  Time (mean ± σ):      66.3 ms ±   0.3 ms    [User: 59.6 ms, System: 6.6 ms]
  Range (min … max):    65.8 ms …  67.7 ms    100 runs
 
Summary
  ./mpy-cross-pr17862 huge.py ran
    1.00 ± 0.01 times faster than ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py

Linux perf stat counted 4 million more instructions (+0.59%) but another way to look at is 15 added instructions per byte of source code. (baseline is around 2554 instructions per byte of source code)

 Performance counter stats for './mpy-cross-pr17862 huge.py' (30 runs):

             66.82 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.09% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               329      page-faults                      #    4.924 K/sec                       ( +-  0.09% )
       282,271,960      cycles                           #    4.224 GHz                         ( +-  0.05% )  (57.31%)
        41,553,651      stalled-cycles-frontend          #   14.72% frontend cycles idle        ( +-  0.33% )  (63.35%)
       690,916,317      instructions                     #    2.45  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.09% )  (63.77%)
       193,477,661      branches                         #    2.896 G/sec                       ( +-  0.08% )  (60.80%)
         1,477,191      branch-misses                    #    0.76% of all branches             ( +-  1.10% )  (54.76%)

         0.0674732 +- 0.0000710 seconds time elapsed  ( +-  0.11% )


 Performance counter stats for './mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py' (30 runs):

             65.99 msec task-clock                       #    0.991 CPUs utilized               ( +-  0.05% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               328      page-faults                      #    4.971 K/sec                       ( +-  0.07% )
       283,490,040      cycles                           #    4.296 GHz                         ( +-  0.05% )  (55.98%)
        42,627,980      stalled-cycles-frontend          #   15.04% frontend cycles idle        ( +-  0.37% )  (62.06%)
       686,832,657      instructions                     #    2.42  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.13% )  (63.52%)
       194,634,958      branches                         #    2.950 G/sec                       ( +-  0.17% )  (62.25%)
         1,374,874      branch-misses                    #    0.71% of all branches             ( +-  1.75% )  (56.18%)

         0.0665979 +- 0.0000371 seconds time elapsed  ( +-  0.06% )

@jepler
Copy link
Contributor Author

jepler commented Aug 16, 2025

I changed things a bit so SyntaxError can be raised in this case. I agree it seems preferable if the cost isn't high.

@jepler
Copy link
Contributor Author

jepler commented Aug 16, 2025

That takes e.g., RP2040 from -16 bytes to +16 bytes (net 32 bytes difference to throw the correct exception). Other ports are in the same range of increase.

jepler added 3 commits August 17, 2025 08:59
.. from non UTF-8 inputs. In this case, MicroPython raises
UnicodeError while CPython uses SyntaxError. By catching either
exception, the test does not require an .exp file.

Signed-off-by: Jeff Epler <jepler@gmail.com>
All sites immediately threw a UnicodeError, so roll that into
the new function utf8_require.

unicode.c was designed not to require runtime.h, so move the
checking function into objstr.c.

Reduce the number of #if sites by making a do-nothing variant
that is used instead when !STR_UNICODE or !STR_UNICODE_CHECK.

Signed-off-by: Jeff Epler <jepler@gmail.com>
.. even when compiling non UTF-8 files or byte strings.

Closes: micropython#17855
Signed-off-by: Jeff Epler <jepler@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source unicode Bugs and enhancements related to Unicode/UTF-8 support.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants