InfeRE: Step-by-Step Regex Generation via Chain of Inference
Shuai Zhang, Xiaodong Gu, Yuting Chen, Beijun Shen*
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
{zhangshuai2000, xiaodong.gu, chenyt, bjshen}@sjtu.edu.cn
[Figure: Overview of InfeRE. ① Regexes in the training corpus are decomposed into chains of inference; ② in the learning and generation stage, a chain PLM and a tree PLM take the query as input, their sampled outputs are reverted to plain regexes, and a sampling-and-voting (unification) step selects the final regex.]
[Figure 3: Decomposing a regex into a chain of inference. Query: "lines with letter, 3 or more times and contain a number". Plain regex: (([A-Za-z]){3,})&(.*[0-9].*). Inference chain: step1=[A-Za-z]; step2=repeat_atleast(step1,3); step3=[0-9]; step4=contain(step3); step5=and(step2,step4). The chain serves as the target sequence for the encoder-decoder chain PLM.]
Algorithm 1: Convert Plain Regexes to Chains of Inference

Input: y: the original regex; Op: a dictionary of all operators and their corresponding numbers of operands
Output: C: chain of inference

  NodeStack = (), C = (), B_operand = (), i = 1
  Parse y into a tree T(y)
  Post-order traverse T(y) and add the nodes to NodeStack in turn
  Reverse NodeStack
  while NodeStack.size() != 1 do
      node = NodeStack.pop()
      if node is an operator then
          /* build one step in the chain of inference */
          N_node = Op[node]                      // look up the number of operands of node
          s_step = "step" + i + "=" + node + "("
          for j = 1 to N_node do
              o_j = B_operand.pop()
              s_step = s_step + o_j + ","
          end
          s_step = s_step + ")"
          C.add(s_step)
          B_operand.add("step" + i)
          i = i + 1
      else
          B_operand.add(node)
      end
  end

TABLE I: Rules for parsing plain regex into parenthetical notation for trees [7], [10].

  Operator                  r :=
  startwith(r)              r.*
  endwith(r)                .*r
  contain(r)                .*r.*
  not(r)                    ~r
  optional(r)               r?
  star(r)                   r*
  concat(r1, r2)            r1 r2
  and(r1, r2)               r1 & r2
  or(r1, r2)                r1 | r2
  rep(r, k)                 r{k}
  repeat_atleast(r, k)      r{k,}
  rep_range(r, k1, k2)      r{k1, k2}

  Operand                   :=
  <let>                     [A-Za-z]
  <cap>                     [A-Z]
  <low>                     [a-z]
  <num>                     [0-9]
  <any>                     .
  <spec>                    [-,;+:!@#$%&*=^]
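The rewriting rules in Table I are purely mechanical. The following is a small Python sketch (our illustration, not the authors' implementation) that applies them to a tree represented as nested tuples; the helper names and the extra grouping parentheses are our own choices, and the result is semantically equivalent to the plain regex of Figure 3:

# Sketch of the Table I rules, assuming the parenthetical tree is given as
# nested tuples such as ("and", ("repeat_atleast", "<let>", "3"), ("contain", "<num>")).
OPERANDS = {
    "<let>": "[A-Za-z]", "<cap>": "[A-Z]", "<low>": "[a-z]",
    "<num>": "[0-9]", "<any>": ".", "<spec>": "[-,;+:!@#$%&*=^]",
}

def tree_to_regex(node):
    """Recursively apply the rewriting rules of Table I."""
    if isinstance(node, str):                      # operand leaf or numeric parameter
        return OPERANDS.get(node, node)
    op, *args = node
    r = [tree_to_regex(a) for a in args]
    rules = {                                      # extra parentheses added for safe grouping
        "startwith": lambda: f"({r[0]})(.*)",
        "endwith":   lambda: f"(.*)({r[0]})",
        "contain":   lambda: f"(.*)({r[0]})(.*)",
        "not":       lambda: f"~({r[0]})",
        "optional":  lambda: f"({r[0]})?",
        "star":      lambda: f"({r[0]})*",
        "concat":    lambda: f"({r[0]})({r[1]})",
        "and":       lambda: f"({r[0]})&({r[1]})",
        "or":        lambda: f"({r[0]})|({r[1]})",
        "rep":       lambda: f"({r[0]}){{{r[1]}}}",
        "repeat_atleast": lambda: f"({r[0]}){{{r[1]},}}",
        "rep_range": lambda: f"({r[0]}){{{r[1]},{r[2]}}}",
    }
    return rules[op]()

print(tree_to_regex(("and", ("repeat_atleast", "<let>", "3"), ("contain", "<num>"))))
# (([A-Za-z]){3,})&((.*)([0-9])(.*))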
We first parse the plain regex into the parenthetical notation for trees [7], [10]. The rules involve all operators and operands that appear in the benchmark datasets. Notably, the obtained parenthetical notation for trees can be readily mapped back to plain regexes based on these established rules.

Having obtained the parsing tree, we then traverse the tree in post-order. Whenever we encounter an operator node, we regard the sub-regex corresponding to its sub-tree as a step in the chain of inference. For the i-th node, we represent the sub-regex as step i in the chain of inference. Then, we replace the sub-tree of the current operator with a step-i node. The size of the regex tree decreases continuously as the traversal proceeds. We repeat this process until there is only a single step node left in the tree, which marks the completion of the chain of inference.

Figure 3 illustrates an example. For a pair of (query, regex) in the training set, training a sequence-to-sequence model to generate the chain of inference directly is difficult. Hence, we parse the regex y=(([A-Za-z]){3,})&(.*[0-9].*) into a tree and traverse the tree in post-order. When we encounter the first nonterminal node <letter>, we create a sub-regex step1=[A-Za-z] in the chain of inference to indicate the first inferred matching process. The sub-tree rooted at <letter> is replaced with a special node step1 to indicate the intermediate result after step 1. Next, we replace the sub-tree repeat_atleast(step1,3) with a special node step2 when we encounter the node repeat_atleast. We then replace the third encountered node <num> with a node step3 and append step3=[0-9] to the chain of inference. The traversal continues until we meet the root node, where we generate step5=and(step2,step4) in the chain of inference and replace the sub-tree with step5.
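To make the traversal concrete, here is a minimal Python sketch of the decomposition (ours, not the authors' code), assuming the regex has already been parsed into a nested-tuple tree following Table I; numeric parameters stay inline, while character-class leaves receive their own steps, so the output reproduces the chain in Figure 3:

def tree_to_chain(tree):
    """Post-order traversal: every operator node (and every character-class
    leaf) becomes one step; each sub-tree collapses into a step-i node."""
    chain, counter = [], [0]

    def visit(node):
        if isinstance(node, str):
            if node.isdigit():                 # repetition counts stay inline
                return node
            counter[0] += 1                    # character classes get their own step
            chain.append(f"step{counter[0]}={node}")
            return f"step{counter[0]}"
        op, *children = node
        operands = [visit(c) for c in children]
        counter[0] += 1
        chain.append(f"step{counter[0]}={op}({','.join(operands)})")
        return f"step{counter[0]}"

    visit(tree)
    return chain

# The example of Figure 3: (([A-Za-z]){3,})&(.*[0-9].*)
tree = ("and", ("repeat_atleast", "[A-Za-z]", "3"), ("contain", "[0-9]"))
print(tree_to_chain(tree))
# ['step1=[A-Za-z]', 'step2=repeat_atleast(step1,3)', 'step3=[0-9]',
#  'step4=contain(step3)', 'step5=and(step2,step4)']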
The decomposition step above is reminiscent of tree-based models in code generation such as TRANX [30]. Although both techniques generate code in a tree-based fashion, our architecture and training method are significantly different. TRANX decomposes the tree into three types of actions and calculates their corresponding probabilities. This process deviates from the pre-training task of current PLMs, which makes it challenging to integrate with large-scale language models. In contrast, InfeRE conforms to both the tree generation laws and the pre-training task of language models. Notably, InfeRE does not require any modifications to the underlying model structure or training process, making it highly flexible and adaptable to various corpora and models.
C. Learning and Generation

Having parsed all regexes into chains of inference, we obtain a parallel corpus of (query, chain) pairs. We regard the generation of chains as a sequence-to-sequence learning task and leverage the Transformer model for the generation, where the encoder represents the input query as vectors while the decoder generates the target sequence (i.e., a chain of inferred regexes) based on the encoded hidden vectors.

To leverage large-scale textual knowledge, we use T5 [31], a popular PLM for sequence-to-sequence generation. T5 is first pre-trained on large corpora of texts using self-supervision. The pre-training enables T5 to gain rich generic knowledge, while the fine-tuning phase adapts the model to domain-specific tasks such as regex generation. Specifically, we take the open-source T5 model (https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints) as the backbone of InfeRE and further fine-tune it on the (query, chain) pairs as a text generation task by minimizing the cross-entropy loss.

During the inference phase, we generate the chains of inference autoregressively by feeding the natural language query to the fine-tuned model. Then, we revert the generated chains to regex trees. We initialize the tree with the last step in the chain of inference, e.g., step5=and(step2,step4). For each step-i node, we replace the step-i node in the sequence with its original sub-regex from the chain of inference. This process may introduce new step-i nodes, so we repeat it until there is no step node left and we obtain the corresponding parenthetical notation for trees. Finally, we convert the parenthetical notation for trees into a plain regex using the rules in Table I. The whole procedure is summarized in Algorithm 2.

Algorithm 2: Revert Chains of Inference to Plain Regexes

Input: C: chain of inference
Output: re: plain regex

  re = C.pop()
  while "step-i" occurs in re do
      get the corresponding sub-regex subre_i of step-i from C[i]
      re = re.replace(step-i, subre_i)
  end
  Convert re to a plain regex based on the rules defined in Table I
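For illustration, the reversion step of Algorithm 2 can be sketched in a few lines of Python (ours, not the authors' code); it assumes the chain format shown in Figure 3 and stops once no step-i reference is left, after which the resulting parenthetical notation would be turned into a plain regex with the Table I rules:

import re

def chain_to_tree_notation(chain):
    """Start from the last step and keep substituting step-i references with
    their sub-regexes until none remain (Algorithm 2)."""
    steps = dict(line.split("=", 1) for line in chain)   # e.g. {"step1": "[A-Za-z]", ...}
    result = chain[-1].split("=", 1)[1]                  # initialise with the last step
    step_ref = re.compile(r"step\d+")
    while step_ref.search(result):
        result = step_ref.sub(lambda m: steps[m.group()], result)
    return result

chain = ["step1=[A-Za-z]", "step2=repeat_atleast(step1,3)",
         "step3=[0-9]", "step4=contain(step3)", "step5=and(step2,step4)"]
print(chain_to_tree_notation(chain))
# and(repeat_atleast([A-Za-z],3),contain([0-9]))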
D. Self-Consistency Decoding

Since we formulate regex generation as a chain of inference for text matching, it is natural to suppose that there are multiple chains that lead to the same text matching. Moreover, if multiple chains reach the same matching, the generated chains are more convincing [32]. In other words, multiple wrong chains are unlikely to reach the same answer. Therefore, we can sample a group of inference chains from the language model, revert them to plain regexes, and select the most consistent ones (i.e., chains of inference that lead to the same regexes) as the final results.

Based on this idea, we propose a self-consistency decoding mechanism that selects the most consistent answer by sampling in the output space of the language model [15]. Figure 4 illustrates the process of self-consistency decoding.

First, we employ two language models: a chain PLM, which is trained on (x, s) pairs to generate chains of inference, and a tree PLM, which is trained on (x, y) pairs to generate the parsing trees in parenthetical notation (e.g., and(startwith(<low>), endwith(<vow>))).

At the decoding phase, each language model samples k outputs. The total 2k outputs are reverted to plain regexes and are used to select the final regex using a plurality vote strategy. Since different regexes may represent the same semantics, we take a plurality vote based on semantic consistency (measured with the DFA-EQ metric introduced in Section IV-C). One output represents one vote. Regexes with the same semantics share the votes, and we randomly select one of them as the representative. Finally, InfeRE chooses the regex that receives the most votes as the final output. If there is more than one answer with equal votes, we take the answer first encountered.

Compared with standard decoding methods, self-consistency encourages correctness and diversity at the same time [15].

It is worth noting that our idea is different from previous studies [15], [27] on diversity-driven decoding, i.e., introducing randomness in the input and output space of a single model. We train multiple language models to generate both regexes and chains of inference. The different forms of output can greatly enrich the input and output space of language models and thus enhance the self-consistency decoding mechanism.
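A hedged sketch of the plurality vote is shown below; sampling from the two PLMs and the DFA-based equivalence check are abstracted behind placeholder arguments (candidates, dfa_equivalent), since they are not part of this illustration:

from typing import Callable, List

def self_consistency_vote(candidates: List[str],
                          dfa_equivalent: Callable[[str, str], bool]) -> str:
    """Group the 2k candidate regexes (k from the chain PLM and k from the tree
    PLM, already reverted to plain regexes) into semantic equivalence classes
    and return a representative of the class with the most votes."""
    classes: List[List[str]] = []
    for regex in candidates:
        for cls in classes:
            if dfa_equivalent(regex, cls[0]):   # same semantics -> share the vote
                cls.append(regex)
                break
        else:
            classes.append([regex])             # open a new equivalence class
    best = max(classes, key=len)                # ties: max() keeps the first class seen
    return best[0]                              # any member can serve as the representative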
[Figure 4: for the query "lines starting with a lower-case letter and ending with vowel", the chain PLM samples inference chains and the tree PLM samples tree sequences; each output is reverted to a plain regex such as (([a-z])(.*))&((.*)([AEIOUaeiou])), semantically equivalent candidates pool their votes, and incorrect outputs such as startwith(and(<low>, endwith(<vow>))) are voted out.]
Fig. 4: Illustration of self-consistency decoding.
DFA-EQ [3]. Since regexes in different forms may represent the same semantics, DFA-EQ converts regexes to deterministic finite automata (DFAs) and measures their semantic equivalence by comparing their DFAs. If their DFAs exactly match, the two regexes are semantically equivalent. DFA-EQ@m measures the proportion of cases for which the top-m generated regexes contain at least one regex that is semantically equivalent to the ground truth. In our experiments, we compute the top-m DFA-EQ accuracy with m = 1 and 5.

EM (Exact Match). EM measures the ratio of regexes that exactly match the ground-truth regexes. Different from DFA-EQ, EM compares two regexes as plain text.
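The metric itself is easy to state in code. Below is a small sketch of DFA-EQ@m (ours); dfa_equivalent again stands for a regex-to-DFA equivalence checker and is not a specific library call:

from typing import Callable, List

def dfa_eq_at_m(predictions: List[List[str]],
                references: List[str],
                dfa_equivalent: Callable[[str, str], bool],
                m: int = 5) -> float:
    """Fraction of test cases whose top-m predicted regexes contain at least one
    regex semantically equivalent to the ground truth (m=1 gives DFA-EQ@1)."""
    hits = sum(
        1 for top_m, gold in zip(predictions, references)
        if any(dfa_equivalent(pred, gold) for pred in top_m[:m])
    )
    return hits / len(references)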
D. Datasets

We evaluate InfeRE on two widely used datasets, namely NL-RX-Turk [4] and KB13 [3]. The statistics of the two datasets are presented in Table II.

TABLE II: Statistics of the datasets.

  Dataset       Train    Dev      Test
  NL-RX-Turk    6,500    1,000    2,500
  KB13          618      206      206

NL-RX-Turk is the largest dataset, containing 10,000 regexes with natural language annotations. Due to the heavy workload of manual labeling, Locascio et al. employ a generate-and-paraphrase process to annotate data automatically [4]. They initially generated 10,000 regexes with natural language descriptions stochastically based on a manually crafted grammar. Then, human annotators were asked to paraphrase the synthetic natural language descriptions.

KB13 is a smaller and easier dataset comprising 824 (query, regex) pairs. The dataset was created by crowdsourcing: the authors first asked crowdsource workers to write natural language queries. Then, each query was presented to experienced programmers who wrote the corresponding regexes. On average, the natural language descriptions in NL-RX-Turk are 50% longer than those in KB13. Following previous studies [4]-[6], [11], we split the original dataset into train, development, and test sets.
E. Implementation Details

We build our models on top of T5 using the same configuration as T5-base (d_ff = 3072, d_kv = 64, d_model = 768). Both the encoder and decoder consist of 12 blocks with a dropout probability of 0.1. On both datasets, the batch size is set to 12 per GPU. We optimize the model using the Adam optimizer with a learning rate of 3e-5 and an epsilon of 1e-8. For the self-consistency decoding, following previous studies [15], we set the number k of outputs sampled from each model to 40.

We train the models for 30 and 60 epochs on NL-RX-Turk and KB13, respectively. The checkpoint that obtains the best performance on the development set is selected for testing. All experiments are carried out on a workstation with 2 RTX 2080Ti GPUs running Ubuntu 18.04.4 LTS.
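For concreteness, a minimal sketch of this setup with HuggingFace Transformers and PyTorch is given below; the hyperparameters mirror the ones reported above, but the code is our illustration rather than the authors' released implementation (see their repository for the latter):

import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")   # d_model=768, 12 blocks
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, eps=1e-8)

def training_step(queries, chains):
    """One fine-tuning step on a batch of (query, chain) pairs (batch size 12 per GPU)."""
    inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(chains, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # exclude padding from the cross-entropy loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def sample_outputs(query, k=40):
    """Sample k candidate sequences for self-consistency decoding."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True,
                             num_return_sequences=k, max_new_tokens=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)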
V. RESULTS

A. Effectiveness of InfeRE in Regex Generation (RQ1)

Table III compares the performance of various methods on the two datasets. Overall, InfeRE achieves the best performance among all the approaches.

Compared with state-of-the-art NL2RE methods, InfeRE demonstrates clear strength, with the top-5 DFA-EQ accuracy increased by 16.3% and 14.7% against S2RE on NL-RX-Turk and KB13, respectively. A more significant improvement can be observed in the EM metric, which is increased by 33.7% and 16.4% on the two datasets compared to S2RE.

We notice that InfeRE also surpasses S2RE-T5, a PLM with the same number of parameters as InfeRE. The DFA-EQ@1 and DFA-EQ@5 are improved by 2.3% and 4.2% on NL-RX-Turk. The results indicate that the improvement does not solely come from pre-training, which affirms the effectiveness of the proposed chain of inference and self-consistency decoding mechanisms.

Compared with the tree-based code generation model TRANX, InfeRE gains 13.2%, 14.7%, and 22.1% improvement in the three metrics, respectively, demonstrating the superiority of InfeRE over the traditional tree-based method. One notable phenomenon is that the improvement is more significant on NL-RX-Turk than on KB13. One main reason is that NL-RX-Turk is larger and more complex than KB13, allowing the model to capture richer knowledge from the chains of inference.

We further qualitatively analyze concrete regexes generated by various approaches. Three examples are listed in Table IV.

• In Example 1, the query "either start with string <M0> or end with any number" involves two steps, namely "start with string <M0>" and "end with any number", which are successfully predicted by InfeRE. By contrast, all language models without chain of inference mistakenly put the "or" operator in the middle, despite producing the correct operators and operands. This suggests that the chain of inference captures the real text-matching processes of regexes better than plain regexes and the parenthetical notation for regex trees.

• Example 2 compares the generated regexes for the query "lines with the string <M0> ending in zero or more of a capital letter". The query is clearly relevant to "capital letter". However, Deep-RegexMML and SoftRegex mistakenly omit the "zero or more" information. Both InfeRE and its variant without self-consistency correctly generate the gold regex, because the chain of inference decomposes the regex into multiple easier sub-regexes and therefore helps the language model understand the complete information of the query.

• Example 3 shows the results for "items with a any letter preceding <M0>". The ground truth is (([A-Z])|([a-z]))(<m0>), i.e., concat(or(<low>,<cap>),<m0>).
TABLE III: Performance of various approaches on regex generation (SC=Self-Consistency; CI=Chain of Inference).

                      NL-RX-Turk                                   KB13
  Approach            DFA-EQ@1(%)  DFA-EQ@5(%)  EM(%)      DFA-EQ@1(%)  DFA-EQ@5(%)  EM(%)
  Semantic-Unify      38.6         —            —          65.5         —            —
  Deep-RegexMLE       60.3         76.0         40.7       66.5         75.7         55.8
  Deep-RegexMML       62.4         76.8         39.2       68.2         77.7         56.8
  SemRegex            62.3         —            —          78.2         —            —
  SoftRegex           62.8         72.1         41.5       78.2         79.6         62.1
  S2RE                62.8         —            —          78.2         —            —
  S2RE-T5             67.6         85.7         54.4       82.0         88.8         71.4
  TRANX               58.8         75.6         44.0       73.8         82.0         61.2
  InfeRE (ours)       69.2         89.3         53.4       82.5         91.3         69.4
  - w/o SC            67.8         85.9         55.5       81.6         87.9         72.3
  - w/o SC+CI         67.2         85.5         54.2       82.0         88.8         70.4
TABLE IV: Generated regex examples by various approaches. The middle column shows the final regexes. For ease of analysis, we also present the intermediate trees in parenthetical notation in the right column. All results are generated via greedy decoding except InfeRE with self-consistency decoding.

Example 1. Query: either start with string <M0> or end with any number.
  Gold:              ((.*)([0-9]))|((<m0>)(.*))             or(endwith(<num>),startwith(<m0>))
  Deep-RegexMML:     ((<m0>)|((.*)([0-9])))(.*)             startwith(or(<m0>,endwith(<num>)))
  SoftRegex:         ((<m0>)|(.*)([0-9]))(.*)               startwith(or(<m0>,endwith(<num>)))
  InfeRE-w/o SC+CI:  ((<m0>)|((.*)([0-9])))(.*)             startwith(or(<m0>,endwith(<num>)))
  InfeRE-w/o SC:     ((<m0>)(.*))|((.*)([0-9]))             or(startwith(<m0>),endwith(<num>))
  InfeRE:            ((<m0>)(.*))|((.*)([0-9]))             or(startwith(<m0>),endwith(<num>))

Example 2. Query: lines with the string <M0> ending in zero or more of a capital letter.
  Gold:              (<m0>)(((.*)([<A-Z>]))*)               concat(<m0>,star(endwith(<cap>)))
  Deep-RegexMML:     ((<m0>)(1,))|((.*)([A-Z]))             or(repeat_atleast(<m0>,1),endwith(<cap>))
  SoftRegex:         ((<m0>)(1,))&((.*)([A-Z]))             and(repeat_atleast(<m0>,1),endwith(<cap>))
  InfeRE-w/o SC+CI:  (.*(.*<m0>.*))*                        star(endwith(contain(<m0>)))
  InfeRE-w/o SC:     (<m0>)(((.*)([<A-Z>]))*)               concat(<m0>,star(endwith(<cap>)))
  InfeRE:            (<m0>)(((.*)([<A-Z>]))*)               concat(<m0>,star(endwith(<cap>)))

Example 3. Query: items with a any letter preceding <M0>.
  Gold:              (([A-Z])|([a-z]))(<m0>)                concat(or(<low>,<cap>),<m0>)
  Deep-RegexMML:     (([a-z])&([a-z]))(<m0>)                concat(and(<low>,<low>),<m0>)
  SoftRegex:         ([a-z])(<m0>)                          concat(<low>,<m0>)
  InfeRE-w/o SC+CI:  (([A-Z])&([a-z]))(<m0>)                concat(and(<low>,<cap>),<m0>)
  InfeRE-w/o SC:     (([A-Z])&(([a-z])&(a-zA-Z)))(<m0>)     concat(and(and(<low>,<cap>),<let>),<m0>)
  InfeRE:            (([A-Z])|(([a-z])|(a-zA-Z)))(<m0>)     concat(or(or(<low>,<cap>),<let>),<m0>)
We further evaluate the effectiveness of the chain of inference with different PLMs. Specifically, we train GPT2, BART-small, BART-base, T5-small, and T5-base on both (query, regex) and (query, chain) pairs, respectively, and evaluate their DFA-EQ@1 accuracy on the NL-RX-Turk dataset. The results are presented in Table V.

TABLE V: Effectiveness of chain of inference with different PLMs. We evaluate DFA-EQ@1(%) on the NL-RX-Turk dataset. w/o CI: the model trained on (query, regex) pairs; w/ CI: the model trained on (query, chain) pairs.

  Approach   GPT2   BART-small   BART-base   T5-small   T5-base
  - w/o CI   52.0   55.4         55.9        65.6       67.2
  - w/ CI    55.7   59.6         59.8        65.2       67.8

As the results show, most PLMs perform better when trained with chains of inference than when trained with regexes in the form of parenthetically notated sequences. The improvement from the chain of inference is most significant on GPT2, BART-small, and BART-base, at about 7.1%, 7.6%, and 6.9%, respectively. We conjecture that these models do not fully understand regexes in the format of parenthetically notated sequences, so the chain of inference can greatly increase their performance. Comparatively, the T5 model has already achieved high performance on NL-RX-Turk, which obscures the improvement brought by the chain of inference.

The results corroborate previous findings that medium-sized language models like T5-base can also solve certain multi-step problems through inference [38], as compared to ultra-large language models such as ChatGPT. Unlike chain-of-thought prompting for large language models, we fine-tune T5 on a sufficient number of automatically generated chains of inference. Such task-specific fine-tuning can be seen as an effective way to stimulate the reasoning ability of T5-base on downstream tasks [38].

Answer to RQ2: Chain of inference makes an appreciable contribution to the performance of InfeRE in regex generation. In addition, chain of inference can lead to considerable performance improvements on different PLMs.

We compare InfeRE against InfeRE without self-consistency (-w/o SC) to study the effect of self-consistency decoding. The results are shown in the bottom half of Table III.

As the results show, self-consistency decoding demonstrates its effectiveness. InfeRE significantly outperforms the variant without self-consistency (-w/o SC), with 1.6% and 3.9% improvement on average in terms of DFA-EQ@1 and DFA-EQ@5. The improvement is more significant when the self-consistency decoding and chain of inference mechanisms are used together, e.g., 4.4% in terms of DFA-EQ@5 on NL-RX-Turk.

However, we also observe that InfeRE without self-consistency gains the best EM score. The reason, in addition to the chain of inference, is that self-consistency decoding takes semantic consistency as its metric. Hence it favors the generation of regexes with different syntax but the same semantics, which leads to a decrease in EM accuracy. The performance degradation on the EM metric does not affect practical applications, since we are more concerned with semantically correct regexes in the regex generation task.

Impact of the number of output samples: As self-consistency decoding relies on the number k of output samples, we vary k and evaluate the DFA-EQ accuracy of our approach on the two datasets. Besides InfeRE, which employs both a chain PLM and a tree PLM to perform self-consistency decoding, we also evaluate InfeRE with only a single PLM, i.e., the tree PLM or the chain PLM. The results are shown in Figure 5.

[(a) NL-RX-Turk  (b) KB13]
Fig. 5: Performance of InfeRE under different numbers of sampled outputs.

As the number of sampled outputs increases, all three models tend to perform better. The two variants with only a single PLM achieve 68.6% and 68.3% DFA-EQ accuracy on NL-RX-Turk with self-consistency, respectively, outperforming the results of 67.8% and 67.2% obtained when decoding greedily. InfeRE with multiple PLMs achieves the best performance, 69.2% DFA-EQ accuracy. In particular, when k exceeds 15, InfeRE with multiple PLMs consistently outperforms the two variants with a single PLM. The reason is that self-consistency decoding aggregates the outputs from different PLMs and thus reduces the variance of the final results.

Besides, InfeRE with only the chain PLM achieves stronger performance than that with the tree PLM under almost all numbers of output samples. The difference is more significant when k is relatively small. This indicates that the chain of inference enhances the stability of the model in regex generation, which helps self-consistency decoding sample the correct answer even when output samples are insufficient.

Answer to RQ3: Self-consistency decoding can significantly help InfeRE generate semantically correct regexes, achieving 1.6% and 3.9% improvement on average in terms of the DFA-EQ@1 and DFA-EQ@5 metrics, and the effect of self-consistency decoding becomes much more apparent as the number of output samples increases.
We further evaluate InfeRE with different types of backbone models with two sizes on the NL-RX-Turk dataset, including BART-base, BART-small, T5-small, and T5-base. The results are shown in Table VI.

TABLE VI: Ablation results of backbone PLMs on NL-RX-Turk.

  Model        DFA-EQ@1(%)   DFA-EQ@5(%)   EM(%)
  T5-base      69.2          89.3          55.5
  T5-small     66.8          88.9          51.7
  BART-base    59.4          79.9          43.6
  BART-small   60.9          70.9          42.3

Overall, T5 surpasses BART in the regex generation task. The pre-training tasks of BART include document-level denoising autoencoding; we conjecture that its pre-training objectives and pre-training data make it less suitable for structured expression generation. In particular, T5-base performs the best in terms of all metrics. Clearly, larger language models bring greater performance in regex generation.

Answer to RQ4: The type and size of the backbone PLM strongly affect the performance of InfeRE, and InfeRE with a T5-base backbone performs best.

E. Impact of Data Size (RQ5)

We also study the effect of training data size on the NL-RX-Turk dataset. We vary the size of the data from 100 to the full data, i.e., 6,500, and investigate its effect on the DFA-EQ accuracy of regex generation. Specifically, we compare InfeRE, InfeRE (-w/o SC), InfeRE (-w/o SC+CI), SoftRegex, and Deep-RegexMML. Figure 6 shows the results.

[(a) Size from 100 to 1000  (b) Size from 2000 to 6500]
Fig. 6: Performance under different training set sizes on NL-RX-Turk.

As we can see, InfeRE and its variants consistently outperform the baseline models under different training sizes. Notably, when the training data gets smaller (e.g., <1,000), the improvement becomes much more significant. This indicates that our approach has a strong ability in low-resource generation.

We also observe a more profound advantage of self-consistency decoding in the low-resource setting, as opposed to the chain of inference mechanism, which exhibits a more substantial improvement under conditions of sufficient data. This phenomenon may be attributed to the fact that the low-resource setting tends to result in less robust performance, which can be alleviated through the integrated sampling of the self-consistency decoding mechanism. Conversely, the chain of inference mechanism is much more suitable for enabling a language model to learn structural information in situations where sufficient data is available.

Answer to RQ5: InfeRE outperforms the baseline models under all data sizes. In particular, self-consistency decoding exhibits a more significant advantage in low-resource settings, while the chain of inference contributes more to the performance when sufficient data is available.

F. Threats to Validity

We have identified the following threats to our approach:

1) The format of inference chains: In our experiments, we represent the chain of inference as nodes of parsing trees. Results show that our chain of inference has little effect on the performance when the training data size is extremely limited. Hence it remains to investigate whether other formats of inference chains can lead to better performance.

2) Generalization: We evaluate the efficacy of the chain of inference on two regex datasets. In theory, the chain of inference we propose can be extended to any tree-based generation task. We leave the evaluation on other related tasks for future work.

VI. CONCLUSION

This paper introduces InfeRE, a novel approach to generating regular expressions from natural language descriptions. InfeRE formulates the process of regex generation as a chain of thought and leverages pre-trained language models to generate the step-by-step internal processes. The internal process chains are assembled into the final regexes using a self-consistency decoding mechanism, which combines the sampled outputs to enhance the robustness of the generated regexes. Our experimental results demonstrate that InfeRE achieves remarkable improvements on two benchmark datasets. In the future, we will explore the chain of inference in other forms and apply it to more code intelligence tasks. For example, a program can be regarded as a chain of module interactions; the introduction of the chain of inference has the potential to improve the capability of existing models to generate a large, complete program instead of function-level code.

Source code and dataset to reproduce our work are available at https://github.com/Smallqqqq/InfeRE.

ACKNOWLEDGMENT

This research is supported by the National Natural Science Foundation of China (Grant No. 62232003, 62102244, 62032004 and 62272296) and the CCF-Tencent Open Research Fund (RAGR20220129).

REFERENCES

[1] J. C. Davis, L. G. Michael IV, C. A. Coghlan, F. Servant, and D. Lee, "Why aren't regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 443-454.
[2] B. Luo, Y. Feng, Z. Wang, S. Huang, R. Yan, and D. Zhao, "Marrying up regular expressions with neural networks: A case study for spoken language understanding," arXiv preprint arXiv:1805.05588, 2018.
[3] N. Kushman and R. Barzilay, "Using semantic unification to generate regular expressions from natural language," in Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 826-836.
[4] N. Locascio, K. Narasimhan, E. DeLeon, N. Kushman, and R. Barzilay, "Neural generation of regular expressions from natural language with minimal domain knowledge," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, 2016, pp. 1918-1923.
[5] Z. Zhong, J. Guo, W. Yang, J. Peng, T. Xie, J. Lou, T. Liu, and D. Zhang, "SemRegex: A semantics-based approach for generating regular expressions from natural language specifications," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 1608-1618.
[6] J. Park, S. Ko, M. Cognetta, and Y. Han, "SoftRegex: Generating regex from natural language descriptions using softened regex equivalence," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, 2019, pp. 6424-6430.
[7] X. Ye, Q. Chen, I. Dillig, and G. Durrett, "Benchmarking multimodal regex synthesis with complex structures," arXiv preprint arXiv:2005.00663, 2020.
[8] Q. Chen, X. Wang, X. Ye, G. Durrett, and I. Dillig, "Multi-modal synthesis of regular expressions," in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, 2020, pp. 487-502.
[9] X. Ye, Q. Chen, I. Dillig, and G. Durrett, "Optimal neural program synthesis from multimodal specifications," arXiv preprint arXiv:2010.01678, 2020.
[10] X. Ye, Q. Chen, X. Wang, I. Dillig, and G. Durrett, "Sketch-driven regular expression generation from natural language and examples," Trans. Assoc. Comput. Linguistics, vol. 8, pp. 679-694, 2020.
[11] Y. Li, S. Li, Z. Xu, J. Cao, Z. Chen, Y. Hu, H. Chen, and S. Cheung, "TransRegex: Multi-modal regular expression synthesis by generate-and-repair," in 43rd IEEE/ACM International Conference on Software Engineering (ICSE 2021), Madrid, Spain. IEEE, 2021, pp. 1210-1222.
[12] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," arXiv preprint arXiv:2205.11916, 2022.
[13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou, "Chain of thought prompting elicits reasoning in large language models," arXiv preprint arXiv:2201.11903, 2022.
[14] M. Pourreza and D. Rafiei, "DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction," arXiv preprint arXiv:2304.11015, 2023.
[15] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," arXiv preprint arXiv:2203.11171, 2022.
[16] E. Spishak, W. Dietl, and M. D. Ernst, "A type system for regular expressions," in Proceedings of the 14th Workshop on Formal Techniques for Java-like Programs, 2012, pp. 20-26.
[17] X. Liu, Y. Jiang, and D. Wu, "A lightweight framework for regular expression verification," in 2019 IEEE 19th International Symposium on High Assurance Systems Engineering (HASE). IEEE, 2019, pp. 1-8.
[18] M. Lee, S. So, and H. Oh, "Synthesizing regular expressions from examples for introductory automata assignments," in Proceedings of the 2016 ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 2016, pp. 70-80.
[19] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, "Inference of regular expressions for text extraction from examples," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1217-1230, 2016.
[20] Y. Li, Z. Xu, J. Cao, H. Chen, T. Ge, S.-C. Cheung, and H. Zhao, "FlashRegex: Deducing anti-ReDoS regexes from examples," in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2020, pp. 659-671.
[21] G. J. Bex, F. Neven, T. Schwentick, and S. Vansummeren, "Inference of concise regular expressions and DTDs," ACM Transactions on Database Systems (TODS), vol. 35, no. 2, pp. 1-47, 2010.
[22] G. J. Bex, W. Gelade, F. Neven, and S. Vansummeren, "Learning deterministic regular expressions for the inference of schemas from XML data," ACM Transactions on the Web (TWEB), vol. 4, no. 4, pp. 1-32, 2010.
[23] D. D. Freydenberger and T. Kötzing, "Fast learning of restricted regular expressions and DTDs," in Proceedings of the 16th International Conference on Database Theory, 2013, pp. 45-56.
[24] A. Ranta, "A multilingual natural-language interface to regular expressions," in International Workshop on Finite State Methods in Natural Language Processing, 1998, pp. 79-90.
[25] S. Narang, C. Raffel, K. Lee, A. Roberts, N. Fiedel, and K. Malkan, "WT5?! Training text-to-text models to explain their predictions," arXiv preprint arXiv:2004.14546, 2020.
[26] S. Wiegreffe, J. Hessel, S. Swayamdipta, M. Riedl, and Y. Choi, "Reframing human-AI collaboration for generating free-text explanations," arXiv preprint arXiv:2112.08674, 2021.
[27] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou, "Rationale-augmented ensembles in language models," arXiv preprint arXiv:2207.00747, 2022.
[28] X. Ye and G. Durrett, "The unreliability of explanations in few-shot in-context learning," arXiv preprint arXiv:2205.03401, 2022.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, 2017, pp. 5998-6008.
[30] P. Yin and G. Neubig, "TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation," arXiv preprint arXiv:1810.02720, 2018.
[31] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, pp. 140:1-140:67, 2020.
[32] K. E. Stanovich and R. F. West, "Individual differences in reasoning: Implications for the rationality debate?" Behavioral and Brain Sciences, vol. 23, no. 5, pp. 645-665, 2000.
[33] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229-256, 1992.
[34] H. Jiang, L. Song, Y. Ge, F. Meng, J. Yao, and J. Su, "An AST structure enhanced decoder for code generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 468-476, 2021.
[35] Y. Dong, G. Li, and Z. Jin, "Antecedent predictions are dominant for tree-based code generation," arXiv preprint arXiv:2208.09998, 2022.
[36] N. Beau and B. Crabbé, "The impact of lexical and grammatical processing on generating code from natural language," arXiv preprint arXiv:2202.13972, 2022.
[37] S. Shen, X. Zhu, Y. Dong, Q. Guo, Y. Zhen, and G. Li, "Incorporating domain knowledge through task augmentation for front-end JavaScript code generation," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1533-1543.
[38] N. Ho, L. Schmid, and S.-Y. Yun, "Large language models are reasoning teachers," arXiv preprint arXiv:2212.10071, 2022.