Skip to content

extmod/re: Use buffer protocol for data to search through. #8152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

andrewleech
Copy link
Contributor

The ure module currently requires either a str or byte object to be used for pattern, replace and string arguments.

When searching through a large amount of data with a need to manipulate said data, it's far more efficient to keep it in a bytearray. Converting / copying to a string to pass into re functions requires duplication of the ram usage.

This PR replaces the mp_obj_str_get_data and similar functions used in modure with buffer protocol functionality, meaning any object that can act as a buffer can be used for regex functions.

The second change included here is to always iterate over the string/buffer for the provided length, rather than exiting on NULL character. This allows for NULL characters to be included in search patterns which improves compatibility with cpython / other regex parsers. I personally needed this while porting a data parsing script from cpython to micropython.

@dpgeorge dpgeorge added the extmod Relates to extmod/ directory in source label Jan 9, 2022
@dpgeorge
Copy link
Member

dpgeorge commented Jan 9, 2022

When searching through a large amount of data with a need to manipulate said data, it's far more efficient to keep it in a bytearray. Converting / copying to a string to pass into re functions requires duplication of the ram usage

Yes, definitely it should accept bytearray/memoryview/etc. This is what CPython allows.

The second change included here is to always iterate over the string/buffer for the provided length, rather than exiting on NULL character. This allows for NULL characters to be included in search patterns which improves compatibility with cpython / other regex parsers. I personally needed this while porting a data parsing script from cpython to micropython.

That makes sense as well, if it matches CPython.

There should be some tests added for this new behaviour.

@andrewleech
Copy link
Contributor Author

Yep I need to fix a couple of builds too it seems - I'll definitely add the tests too thanks

@andrewleech andrewleech force-pushed the re_buffer branch 4 times, most recently from d7c6000 to c8d8b15 Compare January 11, 2022 01:24
py/objstr.c Outdated
@@ -2067,6 +2067,8 @@ mp_obj_t mp_obj_new_str_from_vstr(const mp_obj_type_t *type, vstr_t *vstr) {
vstr->alloc = 0;
return MP_OBJ_NEW_QSTR(q);
}
} else {
type = &mp_type_bytes;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpgeorge the following line was failing:

mp_obj_new_str_from_vstr(mp_obj_get_type(where), &vstr_return);

when where is bytearray etc as this function would copy the incoming type to the output.
The solution proposed here (in separate commit) is to sanitise the provided type to be either mp_type_str or mp_type_bytes the same as is done in mp_obj_new_str_of_type() earlier in this file.

If this seems like a risky / unneccesary change this type checking could be done as-needed in modre.c before calling mp_obj_new_str_from_vstr.
However, mp_type_bytes is currently not exposed in nativeglue.c so dynamic ure module currently fails with this approach - we could certainly expose mp_type_bytes in native modules though if this is a preferred direction; indeed mp_type_str is already exposed there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you did here is a good change, I think all existing callers of this function passed either str or bytes, so functionality shouldn't change. And this scales well to other uses which would have similar semantics.

But actually dynamic ure doesn't enable URE_SUB because dynamic doesn't support mp_obj_new_str_from_vstr()!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... before this one of the unix builds here was failing, I thought it was from a dynamic build of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this change is now incorrect and no longer needed after the work done since on standardising str/byte/bytearray functions. As such I've removed it in the rebase, all tests are passing.

@andrewleech
Copy link
Contributor Author

Other than the above question about mp_obj_new_str_from_vstr I think this PR is ready for review thanks @dpgeorge

extmod/modure.c Outdated

// Append replacement string to result, substituting any regex groups
while (*repl != '\0') {
while (repl_len > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be simpler (and maybe smaller code) to do:

const char *repl_top = bufinfo.buf + bufinfo.len;

while (repl < repl_top) {
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I believe you're right, not sure why I didn't do it that way from the start. I was very uncomfortable about the manual length tracking, knew right from the start it was a bad / fragile approach.

extmod/modure.c Outdated
if (*repl == '\\') {
++repl;
repl_len--;
bool is_g_format = false;
if (*repl == 'g' && repl[1] == '<') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to guard against repl going beyond the end of the array

extmod/modure.c Outdated
@@ -328,9 +337,11 @@ STATIC mp_obj_t re_sub_helper(size_t n_args, const mp_obj_t *args) {
unsigned int match_no = 0;
do {
match_no = match_no * 10 + (*repl++ - '0');
repl_len--;
} while ('0' <= *repl && *repl <= '9');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a guard

@@ -18,19 +18,21 @@ static void _emit_checked(int at, char *code, int val, bool *err) {
}
}

static const char *_compilecode(const char *re, ByteProg *prog, int sizecode)
static const char *_compilecode(const char *re, size_t *len, ByteProg *prog, int sizecode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function need to pass the updated len back to the caller? If not it's more efficient to just have size_t len.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did need to pass the len back to track how much is consumed when the function is run recursively, but this need goes away with a change to re_top :-)

{
char *code = sizecode ? NULL : prog->insts;
bool err = false;
int start = PC;
int term = PC;
int alt_label = 0;

for (; *re && *re != ')'; re++) {

while (*len > 0 && *re != ')') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, might be easier to have re_top variable instead of len, and do re < re_top (maybe...).

}
if (!*re) return NULL;
if (*len <= 0) return NULL;
EMIT(PC++, *re);
if (re[1] == '-' && re[2] != ']') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to guard that re[2] is not off the end of the array.

@andrewleech andrewleech force-pushed the re_buffer branch 2 times, most recently from ce90f7b to b315c75 Compare January 31, 2022 04:24
@andrewleech
Copy link
Contributor Author

Thanks for the while (repl < repl_top) style suggestions @dpgeorge I've made those changes and it definitely improved / simplified the code.

I'm pretty sure I've added exit / buffer length guards everywhere needed - there were a few unsafe places in the original code too that should now be protected.

@codecov-commenter
Copy link

codecov-commenter commented Aug 4, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (7924b31) to head (dc8e943).
Report is 178 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8152      +/-   ##
==========================================
- Coverage   98.57%   98.54%   -0.04%     
==========================================
  Files         164      169       +5     
  Lines       21349    21844     +495     
==========================================
+ Hits        21045    21526     +481     
- Misses        304      318      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andrewleech andrewleech force-pushed the re_buffer branch 3 times, most recently from a693098 to 7a7059f Compare August 24, 2022 10:17
@andrewleech
Copy link
Contributor Author

@dpgeorge This has been rebased and integrated into the other recent changes in re ( 64193c7 ).

@github-actions
Copy link

github-actions bot commented Jun 6, 2023

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +344 +0.040% standard
      stm32:  +112 +0.029% PYBV10
     mimxrt:  +112 +0.030% TEENSY40
        rp2:  +160 +0.017% RPI_PICO_W
       samd:  +120 +0.045% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +88 +0.019% VIRT_RV32

@dpgeorge
Copy link
Member

dpgeorge commented Jun 6, 2023

It's a relatively big change (in terms of code size increase). Would you rather this, or enhanced collections.deque ? 😂

@andrewleech
Copy link
Contributor Author

Yeah I was a bit shocked! Technically there's two changes here; support for \x00 in the search data, and the buffer support. I'm planning to check the relative size of each change.

@andrewleech
Copy link
Contributor Author

Ok so trimming it to just the null char change gives me a local diff of

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +240 +0.030% standard
      stm32:  +104 +0.027% PYBV10
        rp2:  +128 +0.039% PICO

So that's a fairly significant majority of the change size. I might have a think about whether that can be made smaller

@@ -89,19 +94,21 @@ static const char *_compilecode(const char *re, ByteProg *prog, int sizecode)
case '(': {
term = PC;
int sub = 0;
int capture = re[1] != '?' || re[2] != ':';
int capture = re_top - re > 2 && (re[1] != '?' || re[2] != ':');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair, some of the extra size in the null-content commit is the extra length safety such as this line, preventing potential buffer overflows in the existing code.

That being said many of the similar / new checks above like if (re >= re_top) return NULL; weren't needed in the original code becase it was guaranteed that there's always the "extra" NULL byte on the end of teh "valid" string data so the code can get away with a re++; *re == char once per NULL check.

tannewt added a commit to tannewt/circuitpython that referenced this pull request Jul 11, 2023
Add new board: splitkb.com's Liatris
andrewleech and others added 4 commits March 29, 2025 21:15
Signed-off-by: Andrew Leech <andrew@alelec.net>
Signed-off-by: Andrew Leech <andrew@alelec.net>
Signed-off-by: Andrew Leech <andrew@alelec.net>
Signed-off-by: Andrew Leech <andrew@alelec.net>
- Restored missing PC increment in re compiler for character classes.
- Fixed poll unregister logic to correctly adjust max_used fd count.

🤖 Generated with Anon Kode
Co-Authored-By: Anon Kode <noreply@Anon Kode.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extmod Relates to extmod/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants