-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
regex library to implement "re" module #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
See dpgeorge#9 (comment) for some previous discussion. |
I like regex, and it's useful, so I think it's definitely worth trying to get it on the MCU. If we go for a cutdown version, then it should be a strict subset of Python's regex, not a variation. |
as an FYI, the musl c library contains the tre regex code. https://github.com/laurikari/tre As well, it contains some bug fixes that were never fixed in the original code. (according to the wiki at musl) |
Note that tre does not appear to have the following (which Python re does):
Its not a terrible set of limitiations - providing the other aspects match the same of course. |
Note Python does not use PCRE. The default "re" has not been PCRE-based since at least 2.2, and the PCRE implementation of "re" was only included for (AFAIK undocumented) backwards compatibility until 2.3. |
FYI, I started writing FFI bindings for libpcre in https://github.com/micropython/micropython-lib/tree/master/re-pcre , with the idea to have more or less complete re module implementation for unix port. |
Not sure how useful it'll be, but you can look at On Tue, Apr 8, 2014 at 11:20 PM, Paul Sokolovsky
John Lenton (jlenton@gmail.com) ::: http://chipaca.com |
re-pcre now has enough gear to run json module from CPy. |
You mean, json module copied from CPy can run under uPy, by utilising re-pcre, right? |
Yep, the only change required was this: micropython/micropython-lib@a4833ed (CPy stdlib is full of such dirtiness). Ah, disclaimer: can run on trivial input ;-) |
Ok, I had another pass on this. Let's remind ourselves that regex matching can be implemented either as "recursive-descent" executor with backtracking (essentially, a Turing machine), or NFA/DFA (non-deterministic and deterministic finite automaton). The latter offers some performance guarantees, but harder to understand (and thus modify to add more features), and usually needs some bunch of memory to represent compiled regex, the former is easier to understand/modify and add non-regular features (with modern regex'es full of them). Technically good contender for "backtracking" category: SLRE . Compiles to under 5K (x86, -Os). Problem spot: has "active" "license life" with several incarnations:
@dpgeorge, do you envision any probs adding MIT version to codebase? (This is certainly better than situation we had with "stm" port, which contained potentially proprietary code.) |
Example of FA-based contender: http://swtch.com/plan9port/unix/ . Messy code from outer space. Support captures, which was big progress for FA impl as http://swtch.com/~rsc/regexp/regexp1.html says. Doesn't support non-greedy quantifers. Compiles to ~7K. There're bunch of hacks which seem to add non-greedy's and stay unbloated (unlike TLE):
|
Generally, grepping for "regex" for github can lead to few interesting, if practical, finds, e.g. https://github.com/fy0/tinyre "A tiny python style regex engine implement.", with nice Chinese readme and comments. |
No. There is no problem if it's MIT and we acknowledge the source of the code. Now with the extmod/ directory, we can land it there. (Of course, I would love to write my own re engine, but that's not a good use of time :) fy0/tinyre looks quite nice, and the license might even be MIT compatible. |
FYI, I kinda coded a module for slre, but then next step is refactoring slre to have sane interface (re-refactoring, because it's used to have it such, but latest explicitly MIT-licensed version has screwed it up), optimize memory usage (it just allocates big statically-sized buffer for compiled regexp), etc, etc. All that seemed kinda boring, so I made another pass over solutions mentioned above. I looked at tinyre, and it actually compiles to small enough code too, but all the Chinese comments doesn't help, and I saw funny programming practices like infinite loop in case of error. So, previous assessment holds: nobody known what exactly is done there and why. Then I took a more detailed look at https://code.google.com/p/re1/ (I almost skipped it last time). And that's true gem - first of all, it's truly minimal code (but implements all important features like greedy vs non-greedy quantifiers and submatch captures), and then, it separates regexp compiler from execution backends, and then provides few alternative backends (essentially, tiny VMs). Backends size from 400 bytes on. Frontend is implemented using yacc though, so is big, but I already reimplemented parser in adhoc way. There's issue that backtracking (== requiring least explicit runtime state for execution, but potentially lot of stack space) backends don't have protection against known backtracking killers like |
Nice find with p/re1. I also would have reimplemented the parser. |
Yeah, it's really nice, spotting VM executor byte sizes from 220 bytes (thumb2). Of course, when you start to add real-world features like NUL-cleanness and assertions, it diminish those impressive figures, but nonetheless, it should be possible to have regexp impl in 2K. My current stuff is at https://github.com/pfalcon/re1.5 (yeah, already re-dubbed it as "re1.5" ;-) ). Things implemented: exact allocation of internal state, passing input string by length, "^" & "$" assertions. TODO before it'll be basically usable: options (like multiline, which changes semantics of "^/$") and char classes. Bottom line: @dpgeorge, if you wanted to implement regex engine (almost) from scratch, there's still an opportunity ;-D. |
Ha, sounds like you are almost there! Are you trying to making it CPython compliant? |
Depending on what you mean. To start talking about "CPython compliance", would need to implement named groups. And full "CPython compliance" is not achievable for finite-automaton based implementation (and re1's feature is completely replaceable backends, so FA featureset is a common bound). |
Ok, "ure" module based on re1.5 as described above was merged (and actually went thru couple upgrade iterations adding more features). |
Thanks @pfalcon, really great work! |
adding tear pin to pindef
Make it possible to build micropython outside HDMI2USB-litex-firmware repo
check badge orientation before entering menu screen
ftpserver.c: Add sanity checks
Rename MLX module and add qstrings.
Commit of first semi stable version
CPython uses (possibly patched) PCRE library http://www.pcre.org/ (BSD license). Nothing would preclude its usage in MicroPython, but:
So, PCRE would definitely fit "unix" port, to be completely compatible with CPython "re" module. But I'm not sure how usable it will be for MCU ports.
So, what I imagine is be able to support 2 (or more?) regex libraries - one PCRE for complete compatibility, another using other library(ies) for unbloatness, at the expense of reduced feature set. Note that "reduced" subset can live in a separate module (say, "remini").
This ticket would be a good place to discuss that and collect candidates for alternative re lib.
The text was updated successfully, but these errors were encountered: