InfeRE: Step-by-Step Regex Generation via Chain of Inference
Shuai Zhang, Xiaodong Gu, Yuting Chen, Beijun Shen*
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
{zhangshuai2000, xiaodong.gu, chenyt, bjshen}@sjtu.edu.cn
[Figure: Overview of InfeRE. ① Regexes in the training corpus are decomposed into chains of inference; ② in the learning and generation stage, a chain PLM and a tree PLM take the query as input, their sampled outputs are reverted to plain regexes, and a sampling-and-voting (unification) step selects the final regex.]
[Figure 3: Decomposing a regex into a chain of inference. Query: "lines with letter, 3 or more times and contain a number". Plain regex: (([A-Za-z]){3,})&(.*[0-9].*). Inference chain: step1=[A-Za-z]; step2=repeat_atleast(step1,3); step3=[0-9]; step4=contain(step3); step5=and(step2,step4). The chain serves as the target sequence for the encoder-decoder chain PLM.]
Algorithm 1: Convert Plain Regexes to Chains of Inference

Input: y: the original regex; Op: a dictionary of all operators and their corresponding numbers of operands
Output: C: chain of inference

  NodeStack = (), C = (), B_operand = (), i = 1
  Parse y into a tree T(y)
  Post-order traverse T(y) and add the nodes to NodeStack in turn
  Reverse NodeStack
  while NodeStack.size() != 1 do
      node = NodeStack.pop()
      if node is an operator then
          /* build one step in the chain of inference */
          N_node = Op[node]                      // look up the number of operands of node
          s_step = "step" + i + "=" + node + "("
          for j = 1 to N_node do
              o_j = B_operand.pop()
              s_step = s_step + o_j + ","
          end
          s_step = s_step + ")"
          C.add(s_step)
          B_operand.add("step" + i)
          i = i + 1
      else
          B_operand.add(node)
      end
  end

TABLE I: Rules for parsing plain regex into parenthetical notation for trees [7], [10].

  Operator                  r :=
  startwith(r)              r.*
  endwith(r)                .*r
  contain(r)                .*r.*
  not(r)                    ~r
  optional(r)               r?
  star(r)                   r*
  concat(r1, r2)            r1 r2
  and(r1, r2)               r1 & r2
  or(r1, r2)                r1 | r2
  rep(r, k)                 r{k}
  repeat_atleast(r, k)      r{k,}
  rep_range(r, k1, k2)      r{k1, k2}

  Operand                   :=
  <let>                     [A-Za-z]
  <cap>                     [A-Z]
  <low>                     [a-z]
  <num>                     [0-9]
  <any>                     .
  <spec>                    [-,;+:!@#$%&*=^]
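The rewriting rules in Table I are purely mechanical. The following is a small Python sketch (our illustration, not the authors' implementation) that applies them to a tree represented as nested tuples; the helper names and the extra grouping parentheses are our own choices, and the result is semantically equivalent to the plain regex of Figure 3:

# Sketch of the Table I rules, assuming the parenthetical tree is given as
# nested tuples such as ("and", ("repeat_atleast", "<let>", "3"), ("contain", "<num>")).
OPERANDS = {
    "<let>": "[A-Za-z]", "<cap>": "[A-Z]", "<low>": "[a-z]",
    "<num>": "[0-9]", "<any>": ".", "<spec>": "[-,;+:!@#$%&*=^]",
}

def tree_to_regex(node):
    """Recursively apply the rewriting rules of Table I."""
    if isinstance(node, str):                      # operand leaf or numeric parameter
        return OPERANDS.get(node, node)
    op, *args = node
    r = [tree_to_regex(a) for a in args]
    rules = {                                      # extra parentheses added for safe grouping
        "startwith": lambda: f"({r[0]})(.*)",
        "endwith":   lambda: f"(.*)({r[0]})",
        "contain":   lambda: f"(.*)({r[0]})(.*)",
        "not":       lambda: f"~({r[0]})",
        "optional":  lambda: f"({r[0]})?",
        "star":      lambda: f"({r[0]})*",
        "concat":    lambda: f"({r[0]})({r[1]})",
        "and":       lambda: f"({r[0]})&({r[1]})",
        "or":        lambda: f"({r[0]})|({r[1]})",
        "rep":       lambda: f"({r[0]}){{{r[1]}}}",
        "repeat_atleast": lambda: f"({r[0]}){{{r[1]},}}",
        "rep_range": lambda: f"({r[0]}){{{r[1]},{r[2]}}}",
    }
    return rules[op]()

print(tree_to_regex(("and", ("repeat_atleast", "<let>", "3"), ("contain", "<num>"))))
# (([A-Za-z]){3,})&((.*)([0-9])(.*))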
We first parse the plain regex into the parenthetical notation for trees [7], [10]. The rules involve all operators and operands that appear in the benchmark datasets. Notably, the obtained parenthetical notation for trees can be readily mapped back to plain regexes based on these established rules.

Having obtained the parsing tree, we then traverse the tree in post-order. Whenever we encounter an operator node, we regard the sub-regex corresponding to its sub-tree as a step in the chain of inference. For the i-th node, we represent the sub-regex as step i in the chain of inference. Then, we replace the sub-tree of the current operator with a step-i node. The size of the regex tree decreases continuously as the traversal proceeds. We repeat this process until there is only a single step node left in the tree, which marks the completion of the chain of inference.

Figure 3 illustrates an example. For a pair of (query, regex) in the training set, training a sequence-to-sequence model to generate the chain of inference directly is difficult. Hence, we parse the regex y=(([A-Za-z]){3,})&(.*[0-9].*) into a tree and traverse the tree in post-order. When we encounter the first nonterminal node <letter>, we create a sub-regex step1=[A-Za-z] in the chain of inference to indicate the first inferred matching process. The sub-tree rooted at <letter> is replaced with a special node step1 to indicate the intermediate result after step 1. Next, we replace the sub-tree repeat_atleast(step1,3) with a special node step2 when we encounter the node repeat_atleast. We then replace the third encountered node <num> with a node step3 and append step3=[0-9] to the chain of inference. The traversal continues until we meet the root node, where we generate step5=and(step2,step4) in the chain of inference and replace the sub-tree with step5.
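To make the traversal concrete, here is a minimal Python sketch of the decomposition (ours, not the authors' code), assuming the regex has already been parsed into a nested-tuple tree following Table I; numeric parameters stay inline, while character-class leaves receive their own steps, so the output reproduces the chain in Figure 3:

def tree_to_chain(tree):
    """Post-order traversal: every operator node (and every character-class
    leaf) becomes one step; each sub-tree collapses into a step-i node."""
    chain, counter = [], [0]

    def visit(node):
        if isinstance(node, str):
            if node.isdigit():                 # repetition counts stay inline
                return node
            counter[0] += 1                    # character classes get their own step
            chain.append(f"step{counter[0]}={node}")
            return f"step{counter[0]}"
        op, *children = node
        operands = [visit(c) for c in children]
        counter[0] += 1
        chain.append(f"step{counter[0]}={op}({','.join(operands)})")
        return f"step{counter[0]}"

    visit(tree)
    return chain

# The example of Figure 3: (([A-Za-z]){3,})&(.*[0-9].*)
tree = ("and", ("repeat_atleast", "[A-Za-z]", "3"), ("contain", "[0-9]"))
print(tree_to_chain(tree))
# ['step1=[A-Za-z]', 'step2=repeat_atleast(step1,3)', 'step3=[0-9]',
#  'step4=contain(step3)', 'step5=and(step2,step4)']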
The decomposition step above is reminiscent of tree-based models in code generation such as TRANX [30]. Although both techniques generate code in a tree-based fashion, our architecture and training method are significantly different. TRANX decomposes the tree into three types of actions and calculates their corresponding probabilities. This process deviates from the pre-training task of current PLMs, which makes it challenging to integrate with large-scale language models. In contrast, InfeRE conforms to both the tree generation laws and the pre-training task of language models. Notably, InfeRE does not require any modifications to the underlying model structure or training process, making it highly flexible and adaptable to various corpora and models.
C. Learning and Generation

Having parsed all regexes into chains of inference, we obtain a parallel corpus of (query, chain) pairs. We regard the generation of chains as a sequence-to-sequence learning task and leverage the Transformer model for the generation, where the encoder represents the input query as vectors while the decoder generates the target sequence (i.e., a chain of inferred regexes) based on the encoded hidden vectors.

To leverage large-scale textual knowledge, we use T5 [31], a popular PLM for sequence-to-sequence generation. T5 is first pre-trained on large corpora of texts using self-supervision. The pre-training enables T5 to gain rich generic knowledge, while the fine-tuning phase adapts the model to domain-specific tasks such as regex generation. Specifically, we take the open-source T5 model (https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints) as the backbone of InfeRE and further fine-tune it on the (query, chain) pairs as a text generation task by minimizing the cross-entropy loss.

During the inference phase, we generate the chains of inference autoregressively by feeding the natural language query to the fine-tuned model. Then, we revert the generated chains to regex trees. We initialize the tree with the last step in the chain of inference, e.g., step5=and(step2,step4). For each step-i node, we replace the step-i node in the sequence with its original sub-regex from the chain of inference. This process may introduce new step-i nodes, so we repeat it until there is no step node left and we obtain the corresponding parenthetical notation for trees. Finally, we convert the parenthetical notation for trees into a plain regex using the rules in Table I. The whole procedure is summarized in Algorithm 2.

Algorithm 2: Revert Chains of Inference to Plain Regexes

Input: C: chain of inference
Output: re: plain regex

  re = C.pop()
  while "step-i" occurs in re do
      get the corresponding sub-regex subre_i of step-i from C[i]
      re = re.replace(step-i, subre_i)
  end
  Convert re to a plain regex based on the rules defined in Table I
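For illustration, the reversion step of Algorithm 2 can be sketched in a few lines of Python (ours, not the authors' code); it assumes the chain format shown in Figure 3 and stops once no step-i reference is left, after which the resulting parenthetical notation would be turned into a plain regex with the Table I rules:

import re

def chain_to_tree_notation(chain):
    """Start from the last step and keep substituting step-i references with
    their sub-regexes until none remain (Algorithm 2)."""
    steps = dict(line.split("=", 1) for line in chain)   # e.g. {"step1": "[A-Za-z]", ...}
    result = chain[-1].split("=", 1)[1]                  # initialise with the last step
    step_ref = re.compile(r"step\d+")
    while step_ref.search(result):
        result = step_ref.sub(lambda m: steps[m.group()], result)
    return result

chain = ["step1=[A-Za-z]", "step2=repeat_atleast(step1,3)",
         "step3=[0-9]", "step4=contain(step3)", "step5=and(step2,step4)"]
print(chain_to_tree_notation(chain))
# and(repeat_atleast([A-Za-z],3),contain([0-9]))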
D. Self-Consistency Decoding

Since we formulate regex generation as a chain of inference for text matching, it is natural to suppose that there are multiple chains that lead to the same text matching. Moreover, if multiple chains reach the same matching, the generated chains are more convincing [32]. In other words, multiple wrong chains are unlikely to reach the same answer. Therefore, we can sample a group of inference chains from the language model, revert them to plain regexes, and select the most consistent ones (i.e., chains of inference that lead to the same regexes) as the final results.

Based on this idea, we propose a self-consistency decoding mechanism that selects the most consistent answer by sampling in the output space of the language model [15]. Figure 4 illustrates the process of self-consistency decoding.

First, we employ two language models: a chain PLM, which is trained on (x, s) pairs to generate chains of inference, and a tree PLM, which is trained on (x, y) pairs to generate the parsing trees in parenthetical notation (e.g., and(startwith(<low>), endwith(<vow>))).

At the decoding phase, each language model samples k outputs. The total 2k outputs are reverted to plain regexes and are used to select the final regex using a plurality vote strategy. Since different regexes may represent the same semantics, we take a plurality vote based on semantic consistency (measured with the DFA-EQ metric introduced in Section IV-C). One output represents one vote. Regexes with the same semantics share the votes, and we randomly select one of them as the representative. Finally, InfeRE chooses the regex that receives the most votes as the final output. If there is more than one answer with equal votes, we take the answer first encountered.

Compared with standard decoding methods, self-consistency encourages correctness and diversity at the same time [15].

It is worth noting that our idea is different from previous studies [15], [27] on diversity-driven decoding, i.e., introducing randomness in the input and output space of a single model. We train multiple language models to generate both regexes and chains of inference. The different forms of output can greatly enrich the input and output space of language models and thus enhance the self-consistency decoding mechanism.
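A hedged sketch of the plurality vote is shown below; sampling from the two PLMs and the DFA-based equivalence check are abstracted behind placeholder arguments (candidates, dfa_equivalent), since they are not part of this illustration:

from typing import Callable, List

def self_consistency_vote(candidates: List[str],
                          dfa_equivalent: Callable[[str, str], bool]) -> str:
    """Group the 2k candidate regexes (k from the chain PLM and k from the tree
    PLM, already reverted to plain regexes) into semantic equivalence classes
    and return a representative of the class with the most votes."""
    classes: List[List[str]] = []
    for regex in candidates:
        for cls in classes:
            if dfa_equivalent(regex, cls[0]):   # same semantics -> share the vote
                cls.append(regex)
                break
        else:
            classes.append([regex])             # open a new equivalence class
    best = max(classes, key=len)                # ties: max() keeps the first class seen
    return best[0]                              # any member can serve as the representative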
[Figure 4: for the query "lines starting with a lower-case letter and ending with vowel", the chain PLM samples inference chains and the tree PLM samples tree sequences; each output is reverted to a plain regex such as (([a-z])(.*))&((.*)([AEIOUaeiou])), semantically equivalent candidates pool their votes, and incorrect outputs such as startwith(and(<low>, endwith(<vow>))) are voted out.]
Fig. 4: Illustration of self-consistency decoding.
DFA-EQ [3]. Since regexes in different forms may represent the same semantics, DFA-EQ converts regexes to deterministic finite automata (DFAs) and measures their semantic equivalence by comparing their DFAs. If their DFAs exactly match, the two regexes are semantically equivalent. DFA-EQ@m measures the proportion of cases for which the top-m generated regexes contain at least one regex that is semantically equivalent to the ground truth. In our experiments, we compute the top-m DFA-EQ accuracy with m = 1 and 5.

EM (Exact Match). EM measures the ratio of regexes that exactly match the ground-truth regexes. Different from DFA-EQ, EM compares two regexes as plain text.
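The metric itself is easy to state in code. Below is a small sketch of DFA-EQ@m (ours); dfa_equivalent again stands for a regex-to-DFA equivalence checker and is not a specific library call:

from typing import Callable, List

def dfa_eq_at_m(predictions: List[List[str]],
                references: List[str],
                dfa_equivalent: Callable[[str, str], bool],
                m: int = 5) -> float:
    """Fraction of test cases whose top-m predicted regexes contain at least one
    regex semantically equivalent to the ground truth (m=1 gives DFA-EQ@1)."""
    hits = sum(
        1 for top_m, gold in zip(predictions, references)
        if any(dfa_equivalent(pred, gold) for pred in top_m[:m])
    )
    return hits / len(references)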
D. Datasets

We evaluate InfeRE on two widely used datasets, namely NL-RX-Turk [4] and KB13 [3]. The statistics of the two datasets are presented in Table II.

TABLE II: Statistics of the datasets.

  Dataset       Train    Dev      Test
  NL-RX-Turk    6,500    1,000    2,500
  KB13          618      206      206

NL-RX-Turk is the largest dataset, containing 10,000 regexes with natural language annotations. Due to the heavy workload of manual labeling, Locascio et al. employ a generate-and-paraphrase process to annotate data automatically [4]. They initially generated 10,000 regexes with natural language descriptions stochastically based on a manually crafted grammar. Then, human annotators were asked to paraphrase the synthetic natural language descriptions.

KB13 is a smaller and easier dataset comprising 824 (query, regex) pairs. The dataset was created by crowdsourcing: the authors first asked crowdsource workers to write natural language queries. Then, each query was presented to experienced programmers who wrote the corresponding regexes. On average, the natural language descriptions in NL-RX-Turk are 50% longer than those in KB13. Following previous studies [4]-[6], [11], we split the original dataset into train, development, and test sets.
E. Implementation Details

We build our models on top of T5 using the same configuration as T5-base (d_ff = 3072, d_kv = 64, d_model = 768). Both the encoder and decoder consist of 12 blocks with a dropout probability of 0.1. On both datasets, the batch size is set to 12 per GPU. We optimize the model using the Adam optimizer with a learning rate of 3e-5 and an epsilon of 1e-8. For the self-consistency decoding, following previous studies [15], we set the number k of outputs sampled from each model to 40.

We train the models for 30 and 60 epochs on NL-RX-Turk and KB13, respectively. The checkpoint that obtains the best performance on the development set is selected for testing. All experiments are carried out on a workstation with 2 RTX 2080Ti GPUs running Ubuntu 18.04.4 LTS.
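For concreteness, a minimal sketch of this setup with HuggingFace Transformers and PyTorch is given below; the hyperparameters mirror the ones reported above, but the code is our illustration rather than the authors' released implementation (see their repository for the latter):

import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")   # d_model=768, 12 blocks
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, eps=1e-8)

def training_step(queries, chains):
    """One fine-tuning step on a batch of (query, chain) pairs (batch size 12 per GPU)."""
    inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(chains, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # exclude padding from the cross-entropy loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def sample_outputs(query, k=40):
    """Sample k candidate sequences for self-consistency decoding."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True,
                             num_return_sequences=k, max_new_tokens=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)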
V. RESULTS

A. Effectiveness of InfeRE in Regex Generation (RQ1)

Table III compares the performance of various methods on the two datasets. Overall, InfeRE achieves the best performance among all the approaches.

Compared with state-of-the-art NL2RE methods, InfeRE demonstrates clear strength, with the top-5 DFA-EQ accuracy increased by 16.3% and 14.7% against S2RE on NL-RX-Turk and KB13, respectively. A more significant improvement can be observed in the EM metric, which is increased by 33.7% and 16.4% on the two datasets compared to S2RE.

We notice that InfeRE also surpasses S2RE-T5, a PLM with the same number of parameters as InfeRE. The DFA-EQ@1 and DFA-EQ@5 are improved by 2.3% and 4.2% on NL-RX-Turk. The results indicate that the improvement does not solely come from pre-training, which affirms the effectiveness of the proposed chain of inference and self-consistency decoding mechanisms.

Compared with the tree-based code generation model TRANX, InfeRE gains 13.2%, 14.7%, and 22.1% improvement in the three metrics, respectively, demonstrating the superiority of InfeRE over the traditional tree-based method. One notable phenomenon is that the improvement is more significant on NL-RX-Turk than on KB13. One main reason is that NL-RX-Turk is larger and more complex than KB13, allowing the model to capture richer knowledge from the chains of inference.

We further qualitatively analyze concrete regexes generated by various approaches. Three examples are listed in Table IV.

• In Example 1, the query "either start with string <M0> or end with any number" involves two steps, namely "start with string <M0>" and "end with any number", which are successfully predicted by InfeRE. By contrast, all language models without chain of inference mistakenly put the "or" operator in the middle, despite producing the correct operators and operands. This suggests that the chain of inference captures the real text-matching processes of regexes better than plain regexes and the parenthetical notation for regex trees.

• Example 2 compares the generated regexes for the query "lines with the string <M0> ending in zero or more of a capital letter". The query is clearly relevant to "capital letter". However, Deep-RegexMML and SoftRegex mistakenly omit the "zero or more" information. Both InfeRE and its variant without self-consistency correctly generate the gold regex, because the chain of inference decomposes the regex into multiple easier sub-regexes and therefore helps the language model understand the complete information of the query.

• Example 3 shows the results for "items with a any letter preceding <M0>". The ground truth is (([A-Z])|([a-z]))(<m0>), i.e., concat(or(<low>,<cap>),<m0>).
TABLE III: Performance of various approaches on regex generation (SC=Self-Consistency; CI=Chain of Inference).

                      NL-RX-Turk                                   KB13
  Approach            DFA-EQ@1(%)  DFA-EQ@5(%)  EM(%)      DFA-EQ@1(%)  DFA-EQ@5(%)  EM(%)
  Semantic-Unify      38.6         —            —          65.5         —            —
  Deep-RegexMLE       60.3         76.0         40.7       66.5         75.7         55.8
  Deep-RegexMML       62.4         76.8         39.2       68.2         77.7         56.8
  SemRegex            62.3         —            —          78.2         —            —
  SoftRegex           62.8         72.1         41.5       78.2         79.6         62.1
  S2RE                62.8         —            —          78.2         —            —
  S2RE-T5             67.6         85.7         54.4       82.0         88.8         71.4
  TRANX               58.8         75.6         44.0       73.8         82.0         61.2
  InfeRE (ours)       69.2         89.3         53.4       82.5         91.3         69.4
  - w/o SC            67.8         85.9         55.5       81.6         87.9         72.3
  - w/o SC+CI         67.2         85.5         54.2       82.0         88.8         70.4
TABLE IV: Generated regex examples by various approaches. The middle column shows the final regexes. For ease of analysis, we also present the intermediate trees in parenthetical notation in the right column. All results are generated via greedy decoding except InfeRE with self-consistency decoding.

Example 1. Query: either start with string <M0> or end with any number.
  Gold:              ((.*)([0-9]))|((<m0>)(.*))             or(endwith(<num>),startwith(<m0>))
  Deep-RegexMML:     ((<m0>)|((.*)([0-9])))(.*)             startwith(or(<m0>,endwith(<num>)))
  SoftRegex:         ((<m0>)|(.*)([0-9]))(.*)               startwith(or(<m0>,endwith(<num>)))
  InfeRE-w/o SC+CI:  ((<m0>)|((.*)([0-9])))(.*)             startwith(or(<m0>,endwith(<num>)))
  InfeRE-w/o SC:     ((<m0>)(.*))|((.*)([0-9]))             or(startwith(<m0>),endwith(<num>))
  InfeRE:            ((<m0>)(.*))|((.*)([0-9]))             or(startwith(<m0>),endwith(<num>))

Example 2. Query: lines with the string <M0> ending in zero or more of a capital letter.
  Gold:              (<m0>)(((.*)([<A-Z>]))*)               concat(<m0>,star(endwith(<cap>)))
  Deep-RegexMML:     ((<m0>)(1,))|((.*)([A-Z]))             or(repeat_atleast(<m0>,1),endwith(<cap>))
  SoftRegex:         ((<m0>)(1,))&((.*)([A-Z]))             and(repeat_atleast(<m0>,1),endwith(<cap>))
  InfeRE-w/o SC+CI:  (.*(.*<m0>.*))*                        star(endwith(contain(<m0>)))
  InfeRE-w/o SC:     (<m0>)(((.*)([<A-Z>]))*)               concat(<m0>,star(endwith(<cap>)))
  InfeRE:            (<m0>)(((.*)([<A-Z>]))*)               concat(<m0>,star(endwith(<cap>)))

Example 3. Query: items with a any letter preceding <M0>.
  Gold:              (([A-Z])|([a-z]))(<m0>)                concat(or(<low>,<cap>),<m0>)
  Deep-RegexMML:     (([a-z])&([a-z]))(<m0>)                concat(and(<low>,<low>),<m0>)
  SoftRegex:         ([a-z])(<m0>)                          concat(<low>,<m0>)
  InfeRE-w/o SC+CI:  (([A-Z])&([a-z]))(<m0>)                concat(and(<low>,<cap>),<m0>)
  InfeRE-w/o SC:     (([A-Z])&(([a-z])&(a-zA-Z)))(<m0>)     concat(and(and(<low>,<cap>),<let>),<m0>)
  InfeRE:            (([A-Z])|(([a-z])|(a-zA-Z)))(<m0>)     concat(or(or(<low>,<cap>),<let>),<m0>)
We further evaluate the effectiveness of the chain of inference with different PLMs. Specifically, we train GPT2, BART-small, BART-base, T5-small, and T5-base on both (query, regex) and (query, chain) pairs, respectively, and evaluate their DFA-EQ@1 accuracy on the NL-RX-Turk dataset. The results are presented in Table V.

TABLE V: Effectiveness of chain of inference with different PLMs. We evaluate DFA-EQ@1(%) on the NL-RX-Turk dataset. w/o CI: the model trained on (query, regex) pairs; w/ CI: the model trained on (query, chain) pairs.

  Approach   GPT2   BART-small   BART-base   T5-small   T5-base
  - w/o CI   52.0   55.4         55.9        65.6       67.2
  - w/ CI    55.7   59.6         59.8        65.2       67.8

As the results show, most PLMs perform better when trained with chains of inference than when trained with regexes in the form of parenthetically notated sequences. The improvement from the chain of inference is most significant on GPT2, BART-small, and BART-base, at about 7.1%, 7.6%, and 6.9%, respectively. We conjecture that these models do not fully understand regexes in the format of parenthetically notated sequences, so the chain of inference can greatly increase their performance. Comparatively, the T5 model has already achieved high performance on NL-RX-Turk, which obscures the improvement brought by the chain of inference.

The results corroborate previous findings that medium-sized language models like T5-base can also solve certain multi-step problems through inference [38], as compared to ultra-large language models such as ChatGPT. Unlike chain-of-thought prompting for large language models, we fine-tune T5 on a sufficient number of automatically generated chains of inference. Such task-specific fine-tuning can be seen as an effective way to stimulate the reasoning ability of T5-base on downstream tasks [38].

Answer to RQ2: Chain of inference makes an appreciable contribution to the performance of InfeRE in regex generation. In addition, chain of inference can lead to considerable performance improvements on different PLMs.

We compare InfeRE against InfeRE without self-consistency (-w/o SC) to study the effect of self-consistency decoding. The results are shown in the bottom half of Table III.

As the results show, self-consistency decoding demonstrates its effectiveness. InfeRE significantly outperforms the variant without self-consistency (-w/o SC), with 1.6% and 3.9% improvement on average in terms of DFA-EQ@1 and DFA-EQ@5. The improvement is more significant when the self-consistency decoding and chain of inference mechanisms are used together, e.g., 4.4% in terms of DFA-EQ@5 on NL-RX-Turk.

However, we also observe that InfeRE without self-consistency gains the best EM score. The reason, in addition to the chain of inference, is that self-consistency decoding takes semantic consistency as its metric. Hence it favors the generation of regexes with different syntax but the same semantics, which leads to a decrease in EM accuracy. The performance degradation on the EM metric does not affect practical applications, since we are more concerned with semantically correct regexes in the regex generation task.

Impact of the number of output samples: As self-consistency decoding relies on the number k of output samples, we vary k and evaluate the DFA-EQ accuracy of our approach on the two datasets. Besides InfeRE, which employs both a chain PLM and a tree PLM to perform self-consistency decoding, we also evaluate InfeRE with only a single PLM, i.e., the tree PLM or the chain PLM. The results are shown in Figure 5.

[(a) NL-RX-Turk  (b) KB13]
Fig. 5: Performance of InfeRE under different numbers of sampled outputs.

As the number of sampled outputs increases, all three models tend to perform better. The two variants with only a single PLM achieve 68.6% and 68.3% DFA-EQ accuracy on NL-RX-Turk with self-consistency, respectively, outperforming the results of 67.8% and 67.2% obtained when decoding greedily. InfeRE with multiple PLMs achieves the best performance, 69.2% DFA-EQ accuracy. In particular, when k exceeds 15, InfeRE with multiple PLMs consistently outperforms the two variants with a single PLM. The reason is that self-consistency decoding aggregates the outputs from different PLMs and thus reduces the variance of the final results.

Besides, InfeRE with only the chain PLM achieves stronger performance than that with the tree PLM under almost all numbers of output samples. The difference is more significant when k is relatively small. This indicates that the chain of inference enhances the stability of the model in regex generation, which helps self-consistency decoding sample the correct answer even when output samples are insufficient.

Answer to RQ3: Self-consistency decoding can significantly help InfeRE generate semantically correct regexes, achieving 1.6% and 3.9% improvement on average in terms of the DFA-EQ@1 and DFA-EQ@5 metrics, and the effect of self-consistency decoding becomes much more apparent as the number of output samples increases.
We further evaluate InfeRE with different types of backbone models with two sizes on the NL-RX-Turk dataset, including BART-base, BART-small, T5-small, and T5-base. The results are shown in Table VI.

TABLE VI: Ablation results of backbone PLMs on NL-RX-Turk.

  Model        DFA-EQ@1(%)   DFA-EQ@5(%)   EM(%)
  T5-base      69.2          89.3          55.5
  T5-small     66.8          88.9          51.7
  BART-base    59.4          79.9          43.6
  BART-small   60.9          70.9          42.3

Overall, T5 surpasses BART in the regex generation task. The pre-training tasks of BART include document-level denoising autoencoding; we conjecture that its pre-training objectives and pre-training data make it less suitable for structured expression generation. In particular, T5-base performs the best in terms of all metrics. Clearly, larger language models bring greater performance in regex generation.

Answer to RQ4: The type and size of the backbone PLM strongly affect the performance of InfeRE, and InfeRE with a T5-base backbone performs best.

E. Impact of Data Size (RQ5)

We also study the effect of training data size on the NL-RX-Turk dataset. We vary the size of the data from 100 to the full data, i.e., 6,500, and investigate its effect on the DFA-EQ accuracy of regex generation. Specifically, we compare InfeRE, InfeRE (-w/o SC), InfeRE (-w/o SC+CI), SoftRegex, and Deep-RegexMML. Figure 6 shows the results.

[(a) Size from 100 to 1000  (b) Size from 2000 to 6500]
Fig. 6: Performance under different training set sizes on NL-RX-Turk.

As we can see, InfeRE and its variants consistently outperform the baseline models under different training sizes. Notably, when the training data gets smaller (e.g., <1,000), the improvement becomes much more significant. This indicates that our approach has a strong ability in low-resource generation.

We also observe a more profound advantage of self-consistency decoding in the low-resource setting, as opposed to the chain of inference mechanism, which exhibits a more substantial improvement under conditions of sufficient data. This phenomenon may be attributed to the fact that the low-resource setting tends to result in less robust performance, which can be alleviated through the integrated sampling of the self-consistency decoding mechanism. Conversely, the chain of inference mechanism is much more suitable for enabling a language model to learn structural information in situations where sufficient data is available.

Answer to RQ5: InfeRE outperforms the baseline models under all data sizes. In particular, self-consistency decoding exhibits a more significant advantage in low-resource settings, while the chain of inference contributes more to the performance when sufficient data is available.

F. Threats to Validity

We have identified the following threats to our approach:

1) The format of inference chains: In our experiments, we represent the chain of inference as nodes of parsing trees. Results show that our chain of inference has little effect on the performance when the training data size is extremely limited. Hence it remains to investigate whether other formats of inference chains can lead to better performance.

2) Generalization: We evaluate the efficacy of the chain of inference on two regex datasets. In theory, the chain of inference we propose can be extended to any tree-based generation task. We leave the evaluation on other related tasks for future work.

VI. CONCLUSION

This paper introduces InfeRE, a novel approach to generating regular expressions from natural language descriptions. InfeRE formulates the process of regex generation as a chain of thought and leverages pre-trained language models to generate the step-by-step internal processes. The internal process chains are assembled into the final regexes using a self-consistency decoding mechanism, which combines the sampled outputs to enhance the robustness of the generated regexes. Our experimental results demonstrate that InfeRE achieves remarkable improvements on two benchmark datasets. In the future, we will explore the chain of inference in other forms and apply it to more code intelligence tasks. For example, a program can be regarded as a chain of module interactions; the introduction of the chain of inference has the potential to improve the capability of existing models to generate a large, complete program instead of function-level code.

Source code and dataset to reproduce our work are available at https://github.com/Smallqqqq/InfeRE.

ACKNOWLEDGMENT

This research is supported by the National Natural Science Foundation of China (Grant No. 62232003, 62102244, 62032004 and 62272296) and the CCF-Tencent Open Research Fund (RAGR20220129).

REFERENCES

[1] J. C. Davis, L. G. Michael IV, C. A. Coghlan, F. Servant, and D. Lee, "Why aren't regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 443-454.
[2] B. Luo, Y. Feng, Z. Wang, S. Huang, R. Yan, and D. Zhao, "Marrying up regular expressions with neural networks: A case study for spoken language understanding," arXiv preprint arXiv:1805.05588, 2018.
[3] N. Kushman and R. Barzilay, "Using semantic unification to generate regular expressions from natural language," in Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 826-836.
[4] N. Locascio, K. Narasimhan, E. DeLeon, N. Kushman, and R. Barzilay, "Neural generation of regular expressions from natural language with minimal domain knowledge," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, 2016, pp. 1918-1923.
[5] Z. Zhong, J. Guo, W. Yang, J. Peng, T. Xie, J. Lou, T. Liu, and D. Zhang, "SemRegex: A semantics-based approach for generating regular expressions from natural language specifications," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 1608-1618.
[6] J. Park, S. Ko, M. Cognetta, and Y. Han, "SoftRegex: Generating regex from natural language descriptions using softened regex equivalence," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, 2019, pp. 6424-6430.
[7] X. Ye, Q. Chen, I. Dillig, and G. Durrett, "Benchmarking multimodal regex synthesis with complex structures," arXiv preprint arXiv:2005.00663, 2020.
[8] Q. Chen, X. Wang, X. Ye, G. Durrett, and I. Dillig, "Multi-modal synthesis of regular expressions," in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, 2020, pp. 487-502.
[9] X. Ye, Q. Chen, I. Dillig, and G. Durrett, "Optimal neural program synthesis from multimodal specifications," arXiv preprint arXiv:2010.01678, 2020.
[10] X. Ye, Q. Chen, X. Wang, I. Dillig, and G. Durrett, "Sketch-driven regular expression generation from natural language and examples," Trans. Assoc. Comput. Linguistics, vol. 8, pp. 679-694, 2020.
[11] Y. Li, S. Li, Z. Xu, J. Cao, Z. Chen, Y. Hu, H. Chen, and S. Cheung, "TransRegex: Multi-modal regular expression synthesis by generate-and-repair," in 43rd IEEE/ACM International Conference on Software Engineering (ICSE 2021), Madrid, Spain. IEEE, 2021, pp. 1210-1222.
[12] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," arXiv preprint arXiv:2205.11916, 2022.
[13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou, "Chain of thought prompting elicits reasoning in large language models," arXiv preprint arXiv:2201.11903, 2022.
[14] M. Pourreza and D. Rafiei, "DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction," arXiv preprint arXiv:2304.11015, 2023.
[15] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," arXiv preprint arXiv:2203.11171, 2022.
[16] E. Spishak, W. Dietl, and M. D. Ernst, "A type system for regular expressions," in Proceedings of the 14th Workshop on Formal Techniques for Java-like Programs, 2012, pp. 20-26.
[17] X. Liu, Y. Jiang, and D. Wu, "A lightweight framework for regular expression verification," in 2019 IEEE 19th International Symposium on High Assurance Systems Engineering (HASE). IEEE, 2019, pp. 1-8.
[18] M. Lee, S. So, and H. Oh, "Synthesizing regular expressions from examples for introductory automata assignments," in Proceedings of the 2016 ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 2016, pp. 70-80.
[19] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, "Inference of regular expressions for text extraction from examples," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1217-1230, 2016.
[20] Y. Li, Z. Xu, J. Cao, H. Chen, T. Ge, S.-C. Cheung, and H. Zhao, "FlashRegex: Deducing anti-ReDoS regexes from examples," in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2020, pp. 659-671.
[21] G. J. Bex, F. Neven, T. Schwentick, and S. Vansummeren, "Inference of concise regular expressions and DTDs," ACM Transactions on Database Systems (TODS), vol. 35, no. 2, pp. 1-47, 2010.
[22] G. J. Bex, W. Gelade, F. Neven, and S. Vansummeren, "Learning deterministic regular expressions for the inference of schemas from XML data," ACM Transactions on the Web (TWEB), vol. 4, no. 4, pp. 1-32, 2010.
[23] D. D. Freydenberger and T. Kötzing, "Fast learning of restricted regular expressions and DTDs," in Proceedings of the 16th International Conference on Database Theory, 2013, pp. 45-56.
[24] A. Ranta, "A multilingual natural-language interface to regular expressions," in International Workshop on Finite State Methods in Natural Language Processing, 1998, pp. 79-90.
[25] S. Narang, C. Raffel, K. Lee, A. Roberts, N. Fiedel, and K. Malkan, "WT5?! Training text-to-text models to explain their predictions," arXiv preprint arXiv:2004.14546, 2020.
[26] S. Wiegreffe, J. Hessel, S. Swayamdipta, M. Riedl, and Y. Choi, "Reframing human-AI collaboration for generating free-text explanations," arXiv preprint arXiv:2112.08674, 2021.
[27] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou, "Rationale-augmented ensembles in language models," arXiv preprint arXiv:2207.00747, 2022.
[28] X. Ye and G. Durrett, "The unreliability of explanations in few-shot in-context learning," arXiv preprint arXiv:2205.03401, 2022.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, 2017, pp. 5998-6008.
[30] P. Yin and G. Neubig, "TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation," arXiv preprint arXiv:1810.02720, 2018.
[31] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, pp. 140:1-140:67, 2020.
[32] K. E. Stanovich and R. F. West, "Individual differences in reasoning: Implications for the rationality debate?" Behavioral and Brain Sciences, vol. 23, no. 5, pp. 645-665, 2000.
[33] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229-256, 1992.
[34] H. Jiang, L. Song, Y. Ge, F. Meng, J. Yao, and J. Su, "An AST structure enhanced decoder for code generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 468-476, 2021.
[35] Y. Dong, G. Li, and Z. Jin, "Antecedent predictions are dominant for tree-based code generation," arXiv preprint arXiv:2208.09998, 2022.
[36] N. Beau and B. Crabbé, "The impact of lexical and grammatical processing on generating code from natural language," arXiv preprint arXiv:2202.13972, 2022.
[37] S. Shen, X. Zhu, Y. Dong, Q. Guo, Y. Zhen, and G. Li, "Incorporating domain knowledge through task augmentation for front-end JavaScript code generation," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1533-1543.
[38] N. Ho, L. Schmid, and S.-Y. Yun, "Large language models are reasoning teachers," arXiv preprint arXiv:2212.10071, 2022.