-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
extmod/re: Use buffer protocol for data to search through. #8152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Yes, definitely it should accept bytearray/memoryview/etc. This is what CPython allows.
That makes sense as well, if it matches CPython. There should be some tests added for this new behaviour. |
Yep I need to fix a couple of builds too it seems - I'll definitely add the tests too thanks |
d7c6000
to
c8d8b15
Compare
py/objstr.c
Outdated
@@ -2067,6 +2067,8 @@ mp_obj_t mp_obj_new_str_from_vstr(const mp_obj_type_t *type, vstr_t *vstr) { | |||
vstr->alloc = 0; | |||
return MP_OBJ_NEW_QSTR(q); | |||
} | |||
} else { | |||
type = &mp_type_bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dpgeorge the following line was failing:
mp_obj_new_str_from_vstr(mp_obj_get_type(where), &vstr_return);
when where
is bytearray etc as this function would copy the incoming type to the output.
The solution proposed here (in separate commit) is to sanitise the provided type to be either mp_type_str
or mp_type_bytes
the same as is done in mp_obj_new_str_of_type()
earlier in this file.
If this seems like a risky / unneccesary change this type checking could be done as-needed in modre.c
before calling mp_obj_new_str_from_vstr
.
However, mp_type_bytes
is currently not exposed in nativeglue.c so dynamic ure module currently fails with this approach - we could certainly expose mp_type_bytes
in native modules though if this is a preferred direction; indeed mp_type_str
is already exposed there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you did here is a good change, I think all existing callers of this function passed either str or bytes, so functionality shouldn't change. And this scales well to other uses which would have similar semantics.
But actually dynamic ure doesn't enable URE_SUB because dynamic doesn't support mp_obj_new_str_from_vstr()
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... before this one of the unix builds here was failing, I thought it was from a dynamic build of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this change is now incorrect and no longer needed after the work done since on standardising str/byte/bytearray functions. As such I've removed it in the rebase, all tests are passing.
c8d8b15
to
b610994
Compare
Other than the above question about |
extmod/modure.c
Outdated
|
||
// Append replacement string to result, substituting any regex groups | ||
while (*repl != '\0') { | ||
while (repl_len > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be simpler (and maybe smaller code) to do:
const char *repl_top = bufinfo.buf + bufinfo.len;
while (repl < repl_top) {
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I believe you're right, not sure why I didn't do it that way from the start. I was very uncomfortable about the manual length tracking, knew right from the start it was a bad / fragile approach.
extmod/modure.c
Outdated
if (*repl == '\\') { | ||
++repl; | ||
repl_len--; | ||
bool is_g_format = false; | ||
if (*repl == 'g' && repl[1] == '<') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to guard against repl
going beyond the end of the array
extmod/modure.c
Outdated
@@ -328,9 +337,11 @@ STATIC mp_obj_t re_sub_helper(size_t n_args, const mp_obj_t *args) { | |||
unsigned int match_no = 0; | |||
do { | |||
match_no = match_no * 10 + (*repl++ - '0'); | |||
repl_len--; | |||
} while ('0' <= *repl && *repl <= '9'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a guard
lib/re1.5/compilecode.c
Outdated
@@ -18,19 +18,21 @@ static void _emit_checked(int at, char *code, int val, bool *err) { | |||
} | |||
} | |||
|
|||
static const char *_compilecode(const char *re, ByteProg *prog, int sizecode) | |||
static const char *_compilecode(const char *re, size_t *len, ByteProg *prog, int sizecode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this function need to pass the updated len
back to the caller? If not it's more efficient to just have size_t len
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did need to pass the len back to track how much is consumed when the function is run recursively, but this need goes away with a change to re_top
:-)
lib/re1.5/compilecode.c
Outdated
{ | ||
char *code = sizecode ? NULL : prog->insts; | ||
bool err = false; | ||
int start = PC; | ||
int term = PC; | ||
int alt_label = 0; | ||
|
||
for (; *re && *re != ')'; re++) { | ||
|
||
while (*len > 0 && *re != ')') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, might be easier to have re_top
variable instead of len
, and do re < re_top
(maybe...).
lib/re1.5/compilecode.c
Outdated
} | ||
if (!*re) return NULL; | ||
if (*len <= 0) return NULL; | ||
EMIT(PC++, *re); | ||
if (re[1] == '-' && re[2] != ']') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to guard that re[2]
is not off the end of the array.
ce90f7b
to
b315c75
Compare
Thanks for the I'm pretty sure I've added exit / buffer length guards everywhere needed - there were a few unsafe places in the original code too that should now be protected. |
b315c75
to
f1c61a6
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #8152 +/- ##
==========================================
- Coverage 98.57% 98.54% -0.04%
==========================================
Files 164 169 +5
Lines 21349 21844 +495
==========================================
+ Hits 21045 21526 +481
- Misses 304 318 +14 ☔ View full report in Codecov by Sentry. |
a693098
to
7a7059f
Compare
Code size report:
|
It's a relatively big change (in terms of code size increase). Would you rather this, or enhanced |
Yeah I was a bit shocked! Technically there's two changes here; support for |
Ok so trimming it to just the null char change gives me a local diff of
So that's a fairly significant majority of the change size. I might have a think about whether that can be made smaller |
lib/re1.5/compilecode.c
Outdated
@@ -89,19 +94,21 @@ static const char *_compilecode(const char *re, ByteProg *prog, int sizecode) | |||
case '(': { | |||
term = PC; | |||
int sub = 0; | |||
int capture = re[1] != '?' || re[2] != ':'; | |||
int capture = re_top - re > 2 && (re[1] != '?' || re[2] != ':'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be fair, some of the extra size in the null-content commit is the extra length safety such as this line, preventing potential buffer overflows in the existing code.
That being said many of the similar / new checks above like if (re >= re_top) return NULL;
weren't needed in the original code becase it was guaranteed that there's always the "extra" NULL byte on the end of teh "valid" string data so the code can get away with a re++; *re == char
once per NULL check.
Add new board: splitkb.com's Liatris
8e354bf
to
dc8e943
Compare
Signed-off-by: Andrew Leech <andrew@alelec.net>
Signed-off-by: Andrew Leech <andrew@alelec.net>
Signed-off-by: Andrew Leech <andrew@alelec.net>
Signed-off-by: Andrew Leech <andrew@alelec.net>
- Restored missing PC increment in re compiler for character classes. - Fixed poll unregister logic to correctly adjust max_used fd count. 🤖 Generated with Anon Kode Co-Authored-By: Anon Kode <noreply@Anon Kode.com>
The
ure
module currently requires either astr
orbyte
object to be used for pattern, replace and string arguments.When searching through a large amount of data with a need to manipulate said data, it's far more efficient to keep it in a bytearray. Converting / copying to a string to pass into
re
functions requires duplication of the ram usage.This PR replaces the
mp_obj_str_get_data
and similar functions used inmodure
with buffer protocol functionality, meaning any object that can act as a buffer can be used for regex functions.The second change included here is to always iterate over the string/buffer for the provided length, rather than exiting on NULL character. This allows for NULL characters to be included in search patterns which improves compatibility with cpython / other regex parsers. I personally needed this while porting a data parsing script from cpython to micropython.