Skip to content

ZJIT: Compile toregexp #14200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 19, 2025
Merged

Conversation

composerinteralia
Copy link
Contributor

toregexp is fairly similar to concatstrings, so this commit extracts a helper for pushing and popping operands on the native stack.

There's probably opportunity to move some of this into lir (e.g. Alan suggested a push_many that could use STP on ARM to push 2 at a time), but I might save that for another day.

@matzbot matzbot requested a review from a team August 13, 2025 01:40
Copy link
Contributor

@tekknolagi tekknolagi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you

@@ -2823,6 +2842,14 @@ pub fn iseq_to_hir(iseq: *const rb_iseq_t) -> Result<Function, ParseError> {
let insn_id = fun.push_insn(block, Insn::StringConcat { strings, state: exit_id });
state.stack_push(insn_id);
}
YARVINSN_toregexp => {
let opt = get_arg(pc, 0).as_usize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this something more descriptive and also store it in a way that's more conducive to examination than a raw int?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least, if we store it as a raw int---which we'll need to pass into codegen later---print it out more nicely as OR-ed flags?

Copy link
Member

@k0kubun k0kubun Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this something more descriptive

I want this to be consistent with whatever name insns.def uses (currently opt). Since we have to compare the interpreter implementation and the ZJIT implementation for checking compatibility, it'd be harder to maintain if the names have inconsistency.

If you need a description, we should add a comment. If you do need a rename of the variable, please rename that in insns.def first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, fair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want this to be consistent with whatever name insns.def uses

That's the rule I've been following for naming things too. 👍🏻

print it out more nicely as OR-ed flags?

I added some fancier printing (I wasn't printing the options at all before), plus a comment about what opt is. Does that seem sufficient?

This comment has been minimized.

@tekknolagi tekknolagi linked an issue Aug 14, 2025 that may be closed by this pull request
zjit/src/hir.rs Outdated
}

let opt = *opt as u32;
if opt & (ONIG_OPTION_MULTILINE | ONIG_OPTION_IGNORECASE | ONIG_OPTION_EXTEND) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a bunch more of these like ONIG_OPTION_DONT_CAPTURE_GROUP. Maybe something like this?

/// make a pair of stringified and int
macro_rules! strenum {
    ($name:ident) => {
        (stringify!($name), $name)
    };
}

static ALL_FLAGS: &[(&str, u32)] = &[
    strenum!(ONIG_OPTION_NONE),
    strenum!(ONIG_OPTION_IGNORECASE),
    strenum!(ONIG_OPTION_EXTEND),
    strenum!(ONIG_OPTION_MULTILINE),
    strenum!(ONIG_OPTION_SINGLELINE),
    strenum!(ONIG_OPTION_FIND_LONGEST),
];

/// iterate over ALL_FLAGS and call bitflag!
macro_rules! bitflags {
    ($writer:expr, $flag:expr) => {
        let mut sep = "";
        for (name, flag) in ALL_FLAGS {
            if $flag & flag != 0 {
                write!($writer, "{sep}{name}")?;
                sep = "|";
            }
        }
    };
}

impl std::fmt::Display for F {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        bitflags!(f, self.options);
        Ok(())
    }
}

Please pardon the Rust---I'm very much not an expert

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is cleaner I think

macro_rules! bitflags {
    ($writer:expr, $bitfield:expr, $($flag:ident),*) => {
        let mut sep = "";
        $(
            if $bitfield & $flag != 0 {
                write!($writer, "{sep}{}", stringify!($flag))?;
                sep = "|";
            }
        )*
    };
}

impl std::fmt::Display for F {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        bitflags!(f, self.options, ONIG_OPTION_NONE, ONIG_OPTION_IGNORECASE, ONIG_OPTION_EXTEND, ONIG_OPTION_MULTILINE, ONIG_OPTION_SINGLELINE, ONIG_OPTION_FIND_LONGEST);
        Ok(())
    }
}

Copy link
Contributor

@tekknolagi tekknolagi Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but please get all the relevant flags from include/ruby/onigmo.h)

Copy link
Contributor Author

@composerinteralia composerinteralia Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't think all of those onigmo flags match exactly what is in opt here. I modeled this printing after

ruby/re.c

Lines 322 to 331 in a04555c

static char *
option_to_str(char str[OPTBUF_SIZE], int options)
{
char *p = str;
if (options & ONIG_OPTION_MULTILINE) *p++ = 'm';
if (options & ONIG_OPTION_IGNORECASE) *p++ = 'i';
if (options & ONIG_OPTION_EXTEND) *p++ = 'x';
*p = 0;
return str;
}
but I guess there actually are some other options for encodings that don't typically get printed. This might be the more complete list:

ruby/re.c

Lines 4863 to 4872 in a04555c

/* see Regexp.options and Regexp.new */
rb_define_const(rb_cRegexp, "IGNORECASE", INT2FIX(ONIG_OPTION_IGNORECASE));
/* see Regexp.options and Regexp.new */
rb_define_const(rb_cRegexp, "EXTENDED", INT2FIX(ONIG_OPTION_EXTEND));
/* see Regexp.options and Regexp.new */
rb_define_const(rb_cRegexp, "MULTILINE", INT2FIX(ONIG_OPTION_MULTILINE));
/* see Regexp.options and Regexp.new */
rb_define_const(rb_cRegexp, "FIXEDENCODING", INT2FIX(ARG_ENCODING_FIXED));
/* see Regexp.options and Regexp.new */
rb_define_const(rb_cRegexp, "NOENCODING", INT2FIX(ARG_ENCODING_NONE));

So 16 is ARG_ENCODING_FIXED and 32 is ARG_NO_ENCODING. Those don't match up with the ognimo options with the same values. NOENCODING with the n flag, for example:

% ruby --dump=insn -e '/#{}/n'
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,6)>
0000 putobject                              ""                        (   1)[Li]
0002 putnil
0003 dup
0004 objtostring                            <calldata!mid:to_s, argc:0, FCALL|ARGS_SIMPLE>
0006 anytostring
0007 toregexp                               32, 2
0010 leave

% ruby -e 'p Regexp::NOENCODING; p //n.options'
32
32

Copy link
Contributor Author

@composerinteralia composerinteralia Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARG_NO_ENCODING is easy enough to display as n. I'm not sure how best to print ARG_ENCODING_FIXED though since it seems like the option itself doesn't tell us whether it's an e, s, or u.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.......huh. That's a weird subset. I don't think we need to print as one-char; printing as OR-ed long-name is probably fine (like the function call flags). Thank you for investigating. I trust your judgement here

Copy link
Contributor Author

@composerinteralia composerinteralia Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try printing it as the Ruby constant name like IGNORECASE|EXTENDED|MULTILINE|FIXEDENCODING|NOENCODING

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've got something that works and I think covers all the possible options. I didn't end up using the macros you offered above since using the Ruby constant name made them somewhat less useful, but I'm open to suggestions if there's a more elegant way to do this.

`toregexp` is fairly similar to `concatstrings`, so this commit extracts
a helper for pushing and popping operands on the native stack.

There's probably opportunity to move some of this into lir (e.g. Alan
suggested a push_many that could use STP on ARM to push 2 at a time),
but I might save that for another day.
@tekknolagi tekknolagi merged commit fc5ee24 into ruby:master Aug 19, 2025
86 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZJIT: Support toregexp insn
3 participants