extmod/re: Use buffer protocol for data to search through. #8152

andrewleech · 2022-01-09T06:20:35Z

The ure module currently requires either a str or byte object to be used for pattern, replace and string arguments.

When searching through a large amount of data with a need to manipulate said data, it's far more efficient to keep it in a bytearray. Converting / copying to a string to pass into re functions requires duplication of the ram usage.

This PR replaces the mp_obj_str_get_data and similar functions used in modure with buffer protocol functionality, meaning any object that can act as a buffer can be used for regex functions.

The second change included here is to always iterate over the string/buffer for the provided length, rather than exiting on NULL character. This allows for NULL characters to be included in search patterns which improves compatibility with cpython / other regex parsers. I personally needed this while porting a data parsing script from cpython to micropython.

dpgeorge · 2022-01-09T07:46:53Z

When searching through a large amount of data with a need to manipulate said data, it's far more efficient to keep it in a bytearray. Converting / copying to a string to pass into re functions requires duplication of the ram usage

Yes, definitely it should accept bytearray/memoryview/etc. This is what CPython allows.

The second change included here is to always iterate over the string/buffer for the provided length, rather than exiting on NULL character. This allows for NULL characters to be included in search patterns which improves compatibility with cpython / other regex parsers. I personally needed this while porting a data parsing script from cpython to micropython.

That makes sense as well, if it matches CPython.

There should be some tests added for this new behaviour.

andrewleech · 2022-01-09T07:53:26Z

Yep I need to fix a couple of builds too it seems - I'll definitely add the tests too thanks

andrewleech · 2022-01-11T01:30:08Z

py/objstr.c

@@ -2067,6 +2067,8 @@ mp_obj_t mp_obj_new_str_from_vstr(const mp_obj_type_t *type, vstr_t *vstr) {
            vstr->alloc = 0;
            return MP_OBJ_NEW_QSTR(q);
        }
+    } else {
+        type = &mp_type_bytes;


@dpgeorge the following line was failing:

mp_obj_new_str_from_vstr(mp_obj_get_type(where), &vstr_return);

when where is bytearray etc as this function would copy the incoming type to the output.
The solution proposed here (in separate commit) is to sanitise the provided type to be either mp_type_str or mp_type_bytes the same as is done in mp_obj_new_str_of_type() earlier in this file.

If this seems like a risky / unneccesary change this type checking could be done as-needed in modre.c before calling mp_obj_new_str_from_vstr.
However, mp_type_bytes is currently not exposed in nativeglue.c so dynamic ure module currently fails with this approach - we could certainly expose mp_type_bytes in native modules though if this is a preferred direction; indeed mp_type_str is already exposed there.

I think what you did here is a good change, I think all existing callers of this function passed either str or bytes, so functionality shouldn't change. And this scales well to other uses which would have similar semantics.

But actually dynamic ure doesn't enable URE_SUB because dynamic doesn't support mp_obj_new_str_from_vstr()!

Oh... before this one of the unix builds here was failing, I thought it was from a dynamic build of this.

I believe this change is now incorrect and no longer needed after the work done since on standardising str/byte/bytearray functions. As such I've removed it in the rebase, all tests are passing.

andrewleech · 2022-01-11T03:11:11Z

Other than the above question about mp_obj_new_str_from_vstr I think this PR is ready for review thanks @dpgeorge

dpgeorge · 2022-01-22T23:06:46Z

extmod/modure.c


        // Append replacement string to result, substituting any regex groups
-        while (*repl != '\0') {
+        while (repl_len > 0) {


I think it'd be simpler (and maybe smaller code) to do:

const char *repl_top = bufinfo.buf + bufinfo.len; while (repl < repl_top) { ...

Yes I believe you're right, not sure why I didn't do it that way from the start. I was very uncomfortable about the manual length tracking, knew right from the start it was a bad / fragile approach.

dpgeorge · 2022-01-22T23:08:04Z

extmod/modure.c

            if (*repl == '\\') {
                ++repl;
+                repl_len--;
                bool is_g_format = false;
                if (*repl == 'g' && repl[1] == '<') {


need to guard against repl going beyond the end of the array

dpgeorge · 2022-01-22T23:08:22Z

extmod/modure.c

@@ -328,9 +337,11 @@ STATIC mp_obj_t re_sub_helper(size_t n_args, const mp_obj_t *args) {
                    unsigned int match_no = 0;
                    do {
                        match_no = match_no * 10 + (*repl++ - '0');
+                        repl_len--;
                    } while ('0' <= *repl && *repl <= '9');


needs a guard

dpgeorge · 2022-01-22T23:09:24Z

lib/re1.5/compilecode.c

@@ -18,19 +18,21 @@ static void _emit_checked(int at, char *code, int val, bool *err) {
    }
 }

-static const char *_compilecode(const char *re, ByteProg *prog, int sizecode)
+static const char *_compilecode(const char *re, size_t *len, ByteProg *prog, int sizecode)


Does this function need to pass the updated len back to the caller? If not it's more efficient to just have size_t len.

I did need to pass the len back to track how much is consumed when the function is run recursively, but this need goes away with a change to re_top :-)

dpgeorge · 2022-01-22T23:10:05Z

lib/re1.5/compilecode.c

 {
    char *code = sizecode ? NULL : prog->insts;
    bool err = false;
    int start = PC;
    int term = PC;
    int alt_label = 0;

-    for (; *re && *re != ')'; re++) {
+
+    while (*len > 0 && *re != ')') {


As above, might be easier to have re_top variable instead of len, and do re < re_top (maybe...).

dpgeorge · 2022-01-22T23:10:41Z

lib/re1.5/compilecode.c

                }
-                if (!*re) return NULL;
+                if (*len <= 0) return NULL;
                EMIT(PC++, *re);
                if (re[1] == '-' && re[2] != ']') {


This will need to guard that re[2] is not off the end of the array.

andrewleech · 2022-02-01T21:04:14Z

Thanks for the while (repl < repl_top) style suggestions @dpgeorge I've made those changes and it definitely improved / simplified the code.

I'm pretty sure I've added exit / buffer length guards everywhere needed - there were a few unsafe places in the original code too that should now be protected.

codecov-commenter · 2022-08-04T04:28:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (7924b31) to head (dc8e943).
Report is 178 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8152      +/-   ##
==========================================
- Coverage   98.57%   98.54%   -0.04%     
==========================================
  Files         164      169       +5     
  Lines       21349    21844     +495     
==========================================
+ Hits        21045    21526     +481     
- Misses        304      318      +14

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

andrewleech · 2023-06-06T03:16:14Z

@dpgeorge This has been rebased and integrated into the other recent changes in re ( 64193c7 ).

github-actions · 2023-06-06T03:19:05Z

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +344 +0.040% standard
      stm32:  +112 +0.029% PYBV10
     mimxrt:  +112 +0.030% TEENSY40
        rp2:  +160 +0.017% RPI_PICO_W
       samd:  +120 +0.045% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +88 +0.019% VIRT_RV32

dpgeorge · 2023-06-06T04:35:12Z

It's a relatively big change (in terms of code size increase). Would you rather this, or enhanced collections.deque ? 😂

andrewleech · 2023-06-06T04:36:58Z

Yeah I was a bit shocked! Technically there's two changes here; support for \x00 in the search data, and the buffer support. I'm planning to check the relative size of each change.

andrewleech · 2023-06-06T07:05:55Z

Ok so trimming it to just the null char change gives me a local diff of

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +240 +0.030% standard
      stm32:  +104 +0.027% PYBV10
        rp2:  +128 +0.039% PICO

So that's a fairly significant majority of the change size. I might have a think about whether that can be made smaller

andrewleech · 2023-06-06T11:29:06Z

lib/re1.5/compilecode.c

@@ -89,19 +94,21 @@ static const char *_compilecode(const char *re, ByteProg *prog, int sizecode)
        case '(': {
            term = PC;
            int sub = 0;
-            int capture = re[1] != '?' || re[2] != ':';
+            int capture = re_top - re > 2 && (re[1] != '?' || re[2] != ':');


To be fair, some of the extra size in the null-content commit is the extra length safety such as this line, preventing potential buffer overflows in the existing code.

That being said many of the similar / new checks above like if (re >= re_top) return NULL; weren't needed in the original code becase it was guaranteed that there's always the "extra" NULL byte on the end of teh "valid" string data so the code can get away with a re++; *re == char once per NULL check.

Add new board: splitkb.com's Liatris

Signed-off-by: Andrew Leech <andrew@alelec.net>

- Restored missing PC increment in re compiler for character classes. - Fixed poll unregister logic to correctly adjust max_used fd count. 🤖 Generated with Anon Kode Co-Authored-By: Anon Kode <noreply@Anon Kode.com>

dpgeorge added the extmod Relates to extmod/ directory in source label Jan 9, 2022

andrewleech force-pushed the re_buffer branch 4 times, most recently from d7c6000 to c8d8b15 Compare January 11, 2022 01:24

andrewleech commented Jan 11, 2022

View reviewed changes

andrewleech force-pushed the re_buffer branch from c8d8b15 to b610994 Compare January 11, 2022 02:22

dpgeorge reviewed Jan 22, 2022

View reviewed changes

andrewleech force-pushed the re_buffer branch 2 times, most recently from ce90f7b to b315c75 Compare January 31, 2022 04:24

andrewleech force-pushed the re_buffer branch from b315c75 to f1c61a6 Compare February 2, 2022 10:37

andrewleech force-pushed the re_buffer branch from f1c61a6 to 8802467 Compare August 2, 2022 01:54

andrewleech force-pushed the re_buffer branch 3 times, most recently from a693098 to 7a7059f Compare August 24, 2022 10:17

andrewleech force-pushed the re_buffer branch from 7a7059f to 71f7c2c Compare June 6, 2023 03:10

andrewleech commented Jun 6, 2023

View reviewed changes

andrewleech force-pushed the re_buffer branch from 314f531 to d722301 Compare June 9, 2023 06:08

ricksorensen mentioned this pull request Jun 11, 2023

ure returns different results than using re (Regular expression) #11724

Open

tannewt added a commit to tannewt/circuitpython that referenced this pull request Jul 11, 2023

Merge pull request micropython#8152 from splitkb/liatris

9f3f2aa

Add new board: splitkb.com's Liatris

andrewleech force-pushed the re_buffer branch from d722301 to 8e354bf Compare July 31, 2023 05:03

Gadgetoid mentioned this pull request Feb 29, 2024

global: Prune trailing whitespace. #13777

Closed

andrewleech force-pushed the re_buffer branch from 8e354bf to dc8e943 Compare February 25, 2025 00:54

andrewleech and others added 4 commits March 29, 2025 21:15

lib/re1.5: Allow null characters in regex patterns.

cc19b0b

Signed-off-by: Andrew Leech <andrew@alelec.net>

extmod/re: Use buffer protocol for data to search through.

aa00432

Signed-off-by: Andrew Leech <andrew@alelec.net>

lib/re1.5: Minor improvement in code size.

5b57ce0

Signed-off-by: Andrew Leech <andrew@alelec.net>

lib/re1.5/charclass: Use faster character range checks.

d88f264

Signed-off-by: Andrew Leech <andrew@alelec.net>

andrewleech force-pushed the re_buffer branch from dc8e943 to d88f264 Compare March 30, 2025 18:55

Fix(re, select): Fix test failures after code size optimizations

4c710e5

- Restored missing PC increment in re compiler for character classes. - Fixed poll unregister logic to correctly adjust max_used fd count. 🤖 Generated with Anon Kode Co-Authored-By: Anon Kode <noreply@Anon Kode.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extmod/re: Use buffer protocol for data to search through. #8152

extmod/re: Use buffer protocol for data to search through. #8152

andrewleech commented Jan 9, 2022

dpgeorge commented Jan 9, 2022

andrewleech commented Jan 9, 2022

andrewleech Jan 11, 2022

dpgeorge Jan 17, 2022

andrewleech Jan 17, 2022

andrewleech Jun 6, 2023

andrewleech commented Jan 11, 2022

dpgeorge Jan 22, 2022

andrewleech Jan 28, 2022

dpgeorge Jan 22, 2022

dpgeorge Jan 22, 2022

dpgeorge Jan 22, 2022

andrewleech Jan 28, 2022

dpgeorge Jan 22, 2022

dpgeorge Jan 22, 2022

andrewleech commented Feb 1, 2022

codecov-commenter commented Aug 4, 2022 •

edited by codecov bot

Loading

andrewleech commented Jun 6, 2023

github-actions bot commented Jun 6, 2023 •

edited

Loading

dpgeorge commented Jun 6, 2023

andrewleech commented Jun 6, 2023

andrewleech commented Jun 6, 2023

andrewleech Jun 6, 2023

extmod/re: Use buffer protocol for data to search through. #8152

Are you sure you want to change the base?

extmod/re: Use buffer protocol for data to search through. #8152

Conversation

andrewleech commented Jan 9, 2022

dpgeorge commented Jan 9, 2022

andrewleech commented Jan 9, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewleech commented Jan 11, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewleech commented Feb 1, 2022

codecov-commenter commented Aug 4, 2022 • edited by codecov bot Loading

Codecov Report

andrewleech commented Jun 6, 2023

github-actions bot commented Jun 6, 2023 • edited Loading

dpgeorge commented Jun 6, 2023

andrewleech commented Jun 6, 2023

andrewleech commented Jun 6, 2023

Choose a reason for hiding this comment

codecov-commenter commented Aug 4, 2022 •

edited by codecov bot

Loading

github-actions bot commented Jun 6, 2023 •

edited

Loading