
Faithful Logical Reasoning via Symbolic Chain-of-Thought

Jundong Xu¹, Hao Fei¹*, Liangming Pan², Qian Liu³, Mong-Li Lee¹, Wynne Hsu¹
¹National University of Singapore, Singapore
²University of California, Santa Barbara, USA
³University of Auckland, New Zealand
jundong.xu@u.nus.edu; haofei37@nus.edu.sg; liangmingpan@ucsb.edu;
liu.qian@auckland.ac.nz; dcsleeml@nus.edu.sg; whsu@comp.nus.edu.sg

arXiv:2405.18357v2 [cs.CL] 11 Jun 2024
* Corresponding author

Abstract

While the recent Chain-of-Thought (CoT) technique enhances the reasoning ability of large language models (LLMs) with the theory of mind, it might still struggle in handling logical reasoning that relies much on symbolic expressions and rigid deducing rules. To strengthen the logical reasoning capability of LLMs, we propose a novel Symbolic Chain-of-Thought, namely SymbCoT, a fully LLM-based framework that integrates symbolic expressions and logic rules with CoT prompting. Technically, building upon an LLM, SymbCoT 1) first translates the natural language context into the symbolic format, and then 2) derives a step-by-step plan to solve the problem with symbolic logical rules, 3) followed by a verifier to check the translation and reasoning chain. Via thorough evaluations on 5 standard datasets with both First-Order Logic and Constraint Optimization symbolic expressions, SymbCoT shows striking improvements over the CoT method consistently, meanwhile refreshing the current state-of-the-art performances. We further demonstrate that our system advances in more faithful, flexible, and explainable logical reasoning. To our knowledge, this is the first to combine symbolic expressions and rules into CoT for logical reasoning with LLMs. Code is open at https://github.com/Aiden0526/SymbCoT.

[Figure 1: An illustrative example of logical reasoning via Chain-of-Thought and our proposed Symbolic CoT (SymbCoT).

Question
Premises: There was one six-way tie on the leaderboard and one person in the six-way tie was from Belgium. Descampe is from Belgium and is on the leaderboard of the 1992 du Maurier Classic. Is the following statement true, false, or uncertain? Descampe is in the six-way tie in the leaderboard of the 1992 du Maurier Classic.

Chain-Of-Thought (overgeneralization):
The context states that there was one six-way tie on the leaderboard and one person in the six-way tie was from Belgium. Descampe is from Belgium and is on the leaderboard of the 1992 du Maurier Classic. Therefore, it can be inferred that Descampe is in the six-way tie in the leaderboard of the 1992 du Maurier Classic.

Symbolic Chain-Of-Thought:
Translated Symbolic Premises:
• ∃x (Tie(x, sixWay) ∧ From(x, Belgium))
• From(Descampe, Belgium) ∧ Leaderboard(Descampe, 1992DuMaurierClassic)
Reasoning Based on Symbolic Representation and Rules:
Since there is at least one person from Belgium in the six-way tie and Descampe is from Belgium, it is possible that Descampe is the person in the six-way tie. (Existential Instantiation, Conjunction Elimination)
Evaluating the link using First-Order Logic from the step ∃x (Tie(x, sixWay) ∧ From(x, Belgium)) to Tie(Descampe, sixWay) requires a premise directly stating Descampe's involvement in the six-way tie, such as Tie(Descampe, sixWay). However, such information is not given. Thus, the truth of the statement "Tie(Descampe, sixWay)" remains unknown.]

1 Introduction

Achieving human-like logical reasoning capabilities is crucial for realizing AGI, which plays a pivotal role in enabling intelligent systems to engage in problem-solving, decision-making, and critical thinking. Recently, LLMs (Patel et al., 2023; Hahn et al., 2022) have demonstrated unprecedented capabilities in semantic understanding, casting a beacon of hope toward achieving AGI. Further enhancing LLMs to achieve human-level reasoning abilities, particularly in logical reasoning, is of paramount importance. Logical reasoning (Cummins et al., 1991) stands out as a quintessential form of reasoning that, unlike other types, is crucial and challenging. It epitomizes a cognitive process characterized by rigorous evidence evaluation, argument construction, and logical deduction (Markovits and Vachon, 1989). The latest trend is integrating LLMs with symbolic solvers to enhance their performance (Olausson et al., 2023; Pan et al., 2023). Unfortunately, these efforts have been limited to using LLMs merely as text-to-symbolic translators, with the core reasoning still reliant on traditional external reasoners (Robinson, 1965). Such an approach, first, does not intrinsically strengthen LLMs' capability in logical reasoning. Besides, over-reliance on external symbolic solvers often results in inflexibility, information omission, and unexplainability.

On another note, the concept of CoT (Wei et al., 2022) has been introduced to mimic human thinking processes by encouraging LLMs to explicitly consider intermediate steps during problem-solving and to provide rationales for decisions, thereby enhancing the reliability of the reasoning process. CoT has been successfully integrated into a wide array of tasks (Zhang et al., 2023b; Fei et al., 2023; Inaba et al., 2023), significantly improving LLMs' reasoning capabilities, sometimes even matching human performance in certain scenarios (Dunivin, 2024). There is growing interest in applying CoT for logical reasoning (Huang and Chang, 2023), and in developing advanced strategies such as self-consistency (Wang et al., 2023b) and Tree-of-Thought (Yao et al., 2023) for enhancement. However, applying basic CoT directly to logical reasoning is inherently limited, due to the abstractive nature of language expression. Logical reasoning demands rigorous logical calculations, heavily relying on both symbolic expressions and rigid deducing rules to represent the internal structure of problems. Plain texts often fall short of supporting such precise logic, especially in scenarios that demand strict logical representation. For instance, as shown in Fig. 1, when tackling a logical reasoning problem, utilizing symbolic representations like First-Order Logic (FOL) is more representative and precise than fully natural language rationales in CoT, enabling strict logical reasoning through clear inference rules.

To address these challenges, we introduce a novel Symbolic CoT (namely SymbCoT) for logical reasoning. Unlike existing state-of-the-art (SoTA) LLM-based symbolic reasoning systems (Olausson et al., 2023; Pan et al., 2023), SymbCoT is entirely facilitated by LLMs without relying on any external reasoners/tools, i.e., encompassing both the initial translation and subsequent reasoning phases. Fig. 2 provides a high-level illustration of the overall system workflow. Technically, SymbCoT comprises four main modules: Translator, Planner, Solver, and Verifier. Notably, SymbCoT is characterized by the following three core aspects:

1) SymbCoT integrates symbolic expressions into CoT to describe intermediate reasoning processes, facilitating more precise logical calculations. However, relying solely on symbolic representation still has its limitations, as it often fails to capture certain content, such as implicit intentions or crucial contextual information embedded within questions. Yet LLMs excel at interpreting such nuanced information and contexts. Thus, we consider a combination of symbolic and natural language expressions to leverage the mutual strengths of both: freely expressed implicit intents and contextual information in natural language, and rigorous expression in symbolic forms.

2) Unlike the straightforward prompting of "thinking step by step" in vanilla CoT, SymbCoT considers a plan-then-solve architecture. This involves decomposing the original complex problem into a series of smaller, more manageable sub-problems, which are then addressed one by one. This way, the entire reasoning process becomes more trackable, enabling a clearer and more structured approach to problem-solving.

3) Furthermore, we devise a retrospective verification mechanism. At both the translation and subsequent problem-solving stages, we retrospectively validate the correctness of each step's outcome, by tracing back to the original given condition. This verification process ensures the accuracy and reliability of the operations performed during the reasoning process.

In experiments, we test SymbCoT with symbolic expressions of FOL and Constraint Optimization (CO) on five logical reasoning datasets using both GPT-3.5 and GPT-4. Results demonstrate that SymbCoT significantly enhances the reasoning capabilities of vanilla CoT, clearly outperforming current SoTA solutions. We further demonstrate that the more complex the logical reasoning task, the more pronounced the improvement of SymbCoT over vanilla CoT, with the verification mechanism further ensuring the faithfulness of the reasoning process. Our in-depth analysis reveals that fully LLM-based logical reasoning can offer better symbolic syntax robustness, human-readable explanations, and fuller utilization of information.

In summary, our technical contributions are:

• proposing a fully LLM-based logical reasoning framework based on CoT, demonstrating that LLMs can achieve robust logical reasoning capabilities without external reasoning tools. Compared to existing SoTA solutions relying on external resolvers, SymbCoT offers better robustness against translation errors and more human-understandable explanations.

• innovatively integrating the strengths of symbolic forms and natural language expressions, enabling precise reasoning calculations while fully interpreting implicit information and capturing rich contexts.

• introducing a plan-then-solve architecture for CoT reasoning, along with a retrospective verification mechanism, enhancing the faithfulness of the reasoning process.

[Figure 2: Overview of the workflow in our proposed symbolic CoT framework. A single LLM implements four sequential modules: Step-1 Translator (premises & question → symbolic premises), Step-2 Planner (→ step-by-step plan), Step-3 Solver (→ selected premises, derived insights, conclusion), Step-4 Verifier (→ checked premises, verified insights, verified answer) → final answer.]

2 Related work

Recent achievements in reasoning research powered by LLMs have shown promising results (Huang and Chang, 2023; Dunivin, 2024), bringing LLMs closer to human-level reasoning capabilities due to their profound semantic understanding (Wu et al., 2023; Fei et al., 2022). Among these, the CoT series methodology (Wei et al., 2022) has garnered increasing attention for its emulation of human discrete chain reasoning. By considering more intermediate steps and the rationales behind decision-making, CoT has significantly enhanced overall reasoning performance on many downstream applications (Fei et al., 2023, 2024). Subsequent technologies have introduced more advanced reasoning frameworks, incorporating mechanisms such as self-consistency and non-linear, multidimensional topological structures, e.g., Tree-of-Thought (Yao et al., 2023), Graph-of-Thought (Besta et al., 2023; Zheng et al., 2024), and other variants (Zhang et al., 2023b; Wang et al., 2023a).

However, research has also highlighted limitations within CoT due to its reliance on natural language rationales, which may not always be advantageous in certain scenarios. Studies have found that representing CoT's intermediate steps in a structured manner, reflecting the task's intrinsic structure, can bolster reasoning capabilities for specific tasks (Hu et al., 2023; Imani et al., 2023). For instance, using pseudo-code to describe intermediate reasoning processes has been shown to enhance outcomes in code generation tasks (Li et al., 2023), while adopting mathematical equations for CoT's steps has proven beneficial in solving mathematical problems (Imani et al., 2023). Focusing on logical reasoning, it becomes evident that solely using natural language formats for intermediate reasoning steps inevitably leads to significant information loss, especially when tackling complex logical reasoning jobs. This paper, therefore, proposes a symbolic-oriented CoT approach tailored for logical reasoning.

Logical reasoning (Cummins et al., 1991), a paramount aspect of the reasoning domain, demands models that can precisely grasp and manipulate complex logical structures. Previous works have explored rule-based (Robinson, 1965) and neural-based (Amayuelas et al., 2022; Gerasimova et al., 2023) solving methods for interpreting symbolic representations. The latest trend involves integrating LLMs into the symbolic reasoning process (Ye et al., 2023; Gaur and Saunshi, 2023). For example, Logic-LM (Pan et al., 2023) and LINC (Olausson et al., 2023) consider using LLMs as translators to convert natural language into symbolic syntax such as FOL, which is then processed by external reasoning tools to enhance reasoning performance. These approaches maintain that LLMs cannot parse symbolic expressions as reliably as external rule-based reasoners.

Nonetheless, merely utilizing LLMs as translators does not inherently enhance their logical reasoning capabilities. This work pioneers the development of the first symbolic CoT specifically designed for logical reasoning, fully utilizing LLMs. In contrast to approaches like Logic-LM and LINC, our method demonstrates several advancements: First, external reasoners require strict formatting, where any translation error by LLMs can lead to
failure in parsing and reasoning. Our reasoning steps, facilitated by the LLM, exhibit greater robustness against syntax errors. Second, the entire reasoning process is conducted by the LLM, providing rationales that ensure a more human-friendly explanation throughout. Third, we propose a blend of symbolic forms and natural language expressions within the logical reasoning process, achieving precise reasoning calculations while fully interpreting implicit information inherent in natural language. Finally, we introduce a plan-then-solve CoT reasoning architecture and a verification mechanism, ensuring the faithfulness of the reasoning process.

3 SymbCoT for Symbolic Reasoning

3.1 Task Definition

The logical reasoning task is formally defined as follows: given a set of premises P = {p1, p2, ..., pn}, where each pi represents a logical statement, we aim to derive a conclusion regarding a given statement S. The objective is to determine whether S is true (T), false (F), or unknown (U) based on the logical inferences drawn from the premises.

▶ Example:
<Premises> (P)
A hawk never lands. Some birds are hawks.
<Statement> (S)
All birds land.
<Answer>
False.

3.2 Modules

As shown in Fig. 2, our SymbCoT system is fully supported by LLMs and comprises four distinct modules: Translator, Planner, Solver, and Verifier, whose roles are elaborated as follows.¹

¹ We use few-shot demonstrations for each module.

Translator converts the premises and a question statement from natural language to a symbolic format. This process prepares the input in a way that aligns with the structured requirements of subsequent reasoning processes, ensuring that the reasoning problems are represented in a format conducive to logical analysis.

Planner breaks down the raw problem into smaller sub-problems and develops a detailed, step-by-step plan that connects the given premises to the question statement in both natural language and symbolic form. This blueprint guides the structured reasoning process by outlining a logical sequence of steps.

Solver is tasked with deriving the answer through a sequential logical inference process, given the premises and question statement along with a meticulously crafted plan. The Solver module emphasizes the importance of basing logical deductions on symbolic inference rules. For instance, the LLM is explicitly instructed to apply logical deduction techniques adhering to FOL, such as Modus Tollens (Enderton, 1972), e.g., if ¬B and (A → B), then ¬A. This ensures that each step in the reasoning process is grounded in established logical frameworks, to draw accurate and reliable conclusions.

Verifier serves two functions in our framework. First, it validates the correctness of symbolic translations by prompting the LLM to ascertain their semantic equivalence to the corresponding natural language. If discrepancies are identified, the language model refines the symbolic translation based on the original natural language to ensure semantic consistency. Second, it verifies the output of the Solver containing step-by-step logical reasoning with the final answer. During this verification process, the Verifier scrutinizes each logical deduction made, to ensure they strictly comply with the principles of formal logic. If invalid logic is found, the Verifier refines the reasoning steps according to the correct logic to arrive at a new answer. These ensure both the fidelity and reliability of symbolic translations and logical reasoning soundness.

3.3 Reasoning Steps

We now walk through the SymbCoT reasoning steps with detailed examples, each corresponding to the process exemplified in Fig. 2.

Step 1: Translating natural language context into symbolic format. Given the original set of premises P = {p1, p2, ..., pn} and the question statement S, the Translator first interprets them into the corresponding symbolic format, denoted as P′ = {p′1, p′2, ..., p′n} and S′. We concatenate the natural language and symbolic expressions, P and P′ as Pc, and S and S′ as Sc, i.e., hybrid-format expressions for the future steps of reasoning.

▶ Input:
Please parse the context and question into First-Order Logic formulas.
<Premises> (P)
If a cartoon character is yellow, it is from the Simpsons.
If a cartoon character is from the Simpsons, then it is loved by children.
(... More premises ...)
<Statement> (S)
Ben is ugly or yellow.
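As an aside, the hybrid Pc/Sc concatenation described in Step 1 is straightforward to express in code. The sketch below is illustrative only (the helper name is ours); the `:::` separator follows the format used in the examples throughout this section:

```python
# Illustrative helper for Step 1's hybrid context: pair each natural-language
# premise with its symbolic (FOL) translation using the ":::" separator shown
# in the paper's examples. Not part of SymbCoT itself, which does this via prompting.

def to_hybrid(nl_items, fol_items):
    """Build Pc (or Sc) by joining natural-language and FOL forms pairwise."""
    if len(nl_items) != len(fol_items):
        raise ValueError("each premise needs exactly one symbolic translation")
    return [f"{nl} ::: {fol}" for nl, fol in zip(nl_items, fol_items)]

pc = to_hybrid(
    ["If a cartoon character is yellow, it is from the Simpsons."],
    ["∀x (Yellow(x) → Simpsons(x))"],
)
# pc[0] pairs the sentence with its formula, separated by ":::"
```

The same helper would produce Sc from the question statement and its translation, giving the Planner, Solver, and Verifier both forms of every premise at once.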
▶ Output:
<Premises> (P′)
∀x (Yellow(x) → Simpsons(x))
∀x (Simpsons(x) → Loved(x))
(... More premises ...)
<Statement> (S′)
(Yellow(ben) ∨ Ugly(ben))

Step 2: Deriving a plan based on the natural and symbolic context. At this stage, the Planner module, utilizing Pc and Sc, constructs a plan comprising intermediate steps I = {i1, i2, ..., im}. These steps form logical inferences sequentially bridging the premises Pc towards the question statement Sc.

▶ Input:
Please derive a step-by-step plan using the First-Order Logic rules for determining the conclusion based on the context.
<Premises> (Pc)
If a cartoon character is yellow, it is from the Simpsons ::: ∀x (Yellow(x) → Simpsons(x))
If a cartoon character is from the Simpsons, then it is loved by children ::: ∀x (Simpsons(x) → Loved(x))
(... More premises ...)
<Statement> (Sc)
Ben is ugly or yellow ::: (Yellow(ben) ∨ Ugly(ben))

▶ Output:
<Plan> (I)
1: Identify the relevant premise of Ben.
2: Identify the relevant premises of yellow and ugly.
(... More plan ...)

Step 3: Solving the problem given the context and plan. Here, the Solver operates on the premises Pc, the question statement Sc, and the plan I, by systematically applying logic rules. It iteratively selects pertinent premises and infers new insights through a sequence of reasoning steps, represented as D = {d1, d2, ..., dl}. Here, each dk signifies an insight inferred from the relevant premises during the iteration process, contributing to the elucidation of Sc. This iterative methodology leads to a conclusive determination C, assigning the statement Sc as true (T), false (F), or uncertain (U).

▶ Input:
Please solve the question based on First-Order Logic rules such as Modus Ponens...
<Premises> (Pc)
If a cartoon character is yellow, it is from the Simpsons ::: ∀x (Yellow(x) → Simpsons(x))
If a cartoon character is from the Simpsons, then it is loved by children ::: ∀x (Simpsons(x) → Loved(x))
(... More premises ...)
<Statement> (Sc)
Ben is ugly or yellow ::: (Yellow(ben) ∨ Ugly(ben))
<Plan> (I)
1: Identify the relevant premise of Ben.
2: Identify the relevant premises of yellow and ugly.
(... More plan ...)

▶ Output:
<Reasoning Steps> (D)
Step 1: Yellow(ben) → Simpsons(ben) by Modus Ponens from premise 1.
Step 2: Simpsons(ben) → Loved(ben) by Modus Ponens from premise 2.
(... More steps ...)
<Conclusion> (C)
Thus, we can conclude that (Yellow(ben) ∨ Ugly(ben)) is false by contradiction.

Step 4: Verifying the translation and solving process. Now, based on the concatenated inputs Pc and Sc from Step 1, and the reasoning chain D and derived conclusion C from Step 3, the Verifier conducts verification in two ways. First, it checks whether the symbolic language is correctly translated from the original natural language context, represented as Vtrans = Verify(P → P′, S → S′). If any incorrect translation is found, the Verifier undertakes refinements to produce a revised translation of the premises and statement, denoted as P′refined and S′refined. Second, it checks whether the inference D adheres to valid logical rules. If any logical fallacy is detected, the Verifier conducts a refined derivation Drefined using the corrected logic rules. Consequently, these refinements lead to the generation of a refined answer, represented as Crefined. Upon successful verification as Vvalidity(D/Drefined), the Verifier produces a verified answer Cverified.

▶ Input:
Please verify: 1) the symbolic context is consistent with the natural language; 2) the solving steps are logically valid.
<Premises> (Pc)
If a cartoon character is yellow, it is from the Simpsons ::: ∀x (Yellow(x) → Simpsons(x))
If a cartoon character is from the Simpsons, then it is loved by children ::: ∀x (Simpsons(x) → Loved(x))
(... More premises ...)
<Statement> (Sc)
Ben is ugly or yellow ::: (Yellow(ben) ∨ Ugly(ben))
<Reasoning Steps> (D)
Step 1: Yellow(ben) → Simpsons(ben) by Modus Ponens from premise 1.
Step 2: Simpsons(ben) → Loved(ben) by Modus Ponens from premise 2.
(... More steps ...)
<Conclusion> (C)
Thus, we can conclude that (Yellow(ben) ∨ Ugly(ben)) is false by contradiction.

▶ Output:
<Translation Verification> (P/P′refined) and (S/S′refined)
1. Verify context consistency: "If a cartoon character is yellow, it is from the Simpsons" is semantically consistent with "∀x (Yellow(x) → Simpsons(x))".
(... More translation verification ...)
<Logic Verification> (D/Drefined)
2. Verify the solving logic step by step: Step 1 correctly implements Modus Ponens using premise 1; Step 2 correctly implements Modus Ponens using premise 2;
(... More logic verification ...)
<Conclusion Verification> (Cverified/Crefined)
Thus, the solving process is logically valid. The answer is verified to be false.
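The four steps above can be condensed into a single driver that chains one LLM through the Translator, Planner, Solver, and Verifier roles. The sketch below is structural only: the prompt wording paraphrases the examples in this section rather than reproducing the paper's exact prompts, and `llm` stands for any text-in/text-out completion callable, not a specific API:

```python
# Structural sketch of the SymbCoT flow (Translator -> Planner -> Solver -> Verifier).
# `llm` is any callable mapping a prompt string to a response string; the prompt
# texts paraphrase the paper's examples and are assumptions, not the exact prompts.

def symbcot(llm, premises: str, statement: str) -> str:
    # Step 1 (Translator): natural language -> symbolic (e.g., FOL) format.
    symbolic = llm("Please parse the context and question into First-Order Logic "
                   f"formulas.\n<Premises> {premises}\n<Statement> {statement}")
    # Hybrid context Pc/Sc: keep the natural language alongside its symbolic form.
    context = f"<Premises> {premises}\n<Symbolic> {symbolic}\n<Statement> {statement}"
    # Step 2 (Planner): decompose the problem into a step-by-step plan.
    plan = llm("Please derive a step-by-step plan using First-Order Logic rules "
               f"for determining the conclusion.\n{context}")
    # Step 3 (Solver): apply inference rules (e.g., Modus Ponens) along the plan.
    conclusion = llm("Please solve the question based on First-Order Logic rules "
                     f"such as Modus Ponens.\n{context}\n<Plan> {plan}")
    # Step 4 (Verifier): check translation consistency and logical validity,
    # refining the answer if a mistranslation or fallacy is found.
    return llm("Please verify: 1) the symbolic context is consistent with the "
               "natural language; 2) the solving steps are logically valid.\n"
               f"{context}\n<Conclusion> {conclusion}")
```

With a real chat-model client, `llm` would wrap a single completion call, and the few-shot demonstrations mentioned in footnote 1 would be prepended to each module's prompt.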
            ProntoQA  ProofWriter  FOLIO    Avg
• GPT-3.5-turbo
Naive        47.40     35.50      45.09    42.66
CoT          67.80     49.17      57.35    58.11
Logic-LM     61.00     58.33      62.74    60.69
SymbCoT      75.80     59.03      57.84    64.22
            (+8.00)   (+0.70)    (-4.90)  (+3.53)
• GPT-4
Naive        77.40     52.67      69.11    66.39
CoT          98.79     68.11      70.58    79.16
CoT-SC         -       69.33      68.14      -
ToT            -       70.33      69.12      -
CR             -       71.67      69.11      -
DetermLR       -       79.17      75.45      -
Logic-LM     83.20     79.66      78.92    80.59
SymbCoT      99.60     82.50      83.33    88.47
            (+0.81)   (+2.84)    (+4.41)  (+7.88)

Table 1: Performance on symbolic reasoning with First-Order Logic representation. The second-best score is underlined and the bold one is the best; the values in brackets are the differences between the best and second-best scores.

            LogicalDeduction  AR-LSAT    Avg
Naive            71.33         33.33    52.33
CoT              75.25         35.06    55.14
CoT-SC           74.67           -        -
ToT              76.83           -        -
CR               78.33           -        -
DetermLR         85.00           -        -
Logic-LM         87.63         43.04    65.34
SymbCoT          93.00         43.91    68.46
                (+5.37)       (+0.87)  (+3.12)

Table 2: Results (using GPT-4) on symbolic reasoning with Constraint Optimization representation.

[Figure 3: Ablation study on ProntoQA, ProofWriter, FOLIO, LogicalDeduction, AR-LSAT, and their average (accuracy %), comparing Ours, w/o Verifier, w/o Planner&Solver&Verifier, and w/o Translator&Planner&Solver&Verifier. Since the Solver is dependent on the Planner, they have to be ablated simultaneously.]

4 Experiments

4.1 Settings

Model. Following the prior work Logic-LM (Pan et al., 2023), we assess the baselines and our method using GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023)², ensuring a fair comparison with identical model parameters.

² Specifically, we use gpt-3.5-turbo-0613 and gpt-4-0613.

Dataset. Five standard datasets are employed, including PrOntoQA (Saparov and He, 2023), ProofWriter (Tafjord et al., 2021), FOLIO (Han et al., 2022), LogicalDeduction (Srivastava et al., 2022), and AR-LSAT (Zhong et al., 2022). Each of them takes different symbolic representations and introduces its own set of challenges in the topic of logical reasoning. The primary metric for evaluation is accuracy, measuring the multiple-choice correctness of the questions.

Symbolic Structure. In the datasets PrOntoQA, ProofWriter, and FOLIO, we use FOL as the symbolic structure. To test the generalizability of our framework among different symbolic structures, we further consider the CO symbolic expression in the datasets LogicalDeduction and AR-LSAT.

Baseline. We compare with a range of established baselines. Those based on GPT-3.5 are: 1) Naive Prompting; 2) CoT (Wei et al., 2022); 3) Logic-LM (Pan et al., 2023). On GPT-4, apart from the above baselines, we further include more systems: 4) CoT-SC (Wang et al., 2023b); 5) ToT (Yao et al., 2023); 6) Cumulative Reasoning (CR; Zhang et al., 2023a); 7) DetermLR (Sun et al., 2023).

The Appendix extends more details of the symbolic structures (§B), the datasets (§C), and the baselines (§D), as well as the full prompt configurations (§I).

4.2 Main Result

Table 1 shows that our method significantly outperforms the Naive, CoT, and Logic-LM baselines, with gains of 21.56%, 6.11%, and 3.53% on GPT-3.5, and 22.08%, 9.31%, and 7.88% on GPT-4, respectively. We notice the only exception is on the FOLIO dataset with GPT-3.5, failing to surpass Logic-LM. The underperformance points to challenges in non-linear reasoning, reflecting the inherent challenge for LLMs. But our approach notably surpasses all baselines across both datasets with GPT-4, especially outperforming Logic-LM by an average of 7.88%, which demonstrates significant improvements in complex reasoning tasks. In Table 2, we show the results with the CO symbolic expression on two datasets. As seen, our approach surpasses both CoT and Logic-LM by 13.32% and 3.12%, respectively, again demonstrating its general versatility in different symbolic reasoning expressions.

4.3 Model Ablation

To ascertain the individual impact of each module within our framework, we perform an ablation
study. The patterns from Fig. 3 reveal that the contributions to the overall efficacy of our method vary across modules on GPT-4. Notably, the Planner and Solver components are identified as the most influential, enhancing performance by an average of 10.4%, followed by the Translator module, which facilitates a secondary improvement of 6.3%. The finding highlights the efficacy of our proposed plan-then-solve design for conquering the raw questions by dividing them into smaller ones. Additionally, the use of symbolic representation and rules shows significant reasoning enhancement.

5 Analysis and Discussion

We now delve into our system further and try to explore why it advances.

5.1 Performance on Complex Reasoning

[Figure 4: The effect of reasoning depth with GPT-4 on ProofWriter. The red dual-headed arrow indicates our improvements over vanilla CoT.]

In our direct comparison of overall performance, we have demonstrated that our approach surpasses the baseline, particularly noting a significant enhancement in the performance of the CoT. Now, we delve deeper into analyzing the performance of different methods across varying levels of reasoning depth. Intuitively, a greater depth indicates more complex problems. Observing Fig. 4, we notice that as the depth increases, the improvement over CoT becomes more pronounced, suggesting that our advantage lies in tackling more challenging issues. Moreover, even at a reasoning depth of 5, our method continues to achieve the best performance.

5.2 Robustness to Symbolic Syntax Errors

[Figure 5: Execution rate of Logic-LM and ours.]

In Fig. 5, we conduct a comparative analysis of our fully LLM-based reasoner against methods that rely on external resolvers, such as Logic-LM, specifically focusing on the success rate of executing symbolic expression syntax.³ Notably, our method achieves a remarkable execution success rate of up to 100%. This represents a significant improvement over Logic-LM by an average of 17.7 percentage points. Our approach notably enhances the execution rate on AR-LSAT, boosting the success rate by 67.4% over Logic-LM, where LLMs are more prone to translation syntax errors. Remarkably, our method consistently executes with 100% success, showcasing remarkable robustness against syntax errors.

³ For instance, if there are 100 questions and the method successfully executes 80 of them, then the execution rate is 80%.

5.3 Benefit of Hybrid Expression of Symbolic and Natural Language

[Figure 6: The left pie shows the error proportion from the external solver due to 1) Information Loss (IL), 2) Information Error (IE), and others. The bar chart consists of two parts: the left bar shows the false rate from the external solver made by IL/IE, adding up to 100%; the right bar shows the reduced false rates via our method.]

LLMs' translations from natural to symbolic language sometimes omit crucial information or introduce inaccuracies, leading to flawed symbolic representations. Our analysis, shown in Fig. 6, examines errors in cases wrongfully categorized as 'unknown' by external solvers on FOLIO. We identify that 55.6% of these errors were due to information loss (IL, 40.7%), where essential details are missed, and information error (IE, 14.8%), where translations are incorrect. Implementing our methodology reduces these errors by 73.3%, with significant declines in IL and IE of 53.8% and 19.5%, respectively. This demonstrates the effectiveness of our LLM-based symbolic reasoning approach, which cross-references both symbolic and natural language data to rectify translation errors and bolster logical reasoning.

5.4 Reasoning Faithfulness

[Figure 7: The proportion of faithful, unfaithful, and false answers. Faithful/unfaithful denotes whether the predicted correct answer is derived from valid and reasonable logical reasoning.]

Often, LLMs may deliver correct answers through flawed reasoning, essentially reaching the right conclusion by luck. Thus, we further assess the faithfulness of reasoning in the CoT, our SymbCoT, and SymbCoT without a Verifier on the FOLIO dataset in Fig. 7. We define an instance as 'faithful' if both the answer and the process are correct and logical; 'unfaithful' if the answer is correct but the process is not; and 'false' if the answer itself is incorrect. To verify the logical validity of the reasoning process when the answer is correct, we employed manual evaluation. This assessment is carried out by five computer science graduate students with adequate training, and the logical propriety of a process for a given example was determined based on the majority's opinion. We can see that within the CoT, 6% of correct answers resulted from flawed reasoning, achieved serendipitously rather than through correct logical steps. In SymbCoT without a Verifier, the rate of such unfaithful reasoning dropped to 2%. Integrating

5.5 Impact of Using Different LLMs

[Figure 8: The improvement from GPT-3.5 to GPT-4.]

Our comparison of GPT-3.5 and GPT-4 on three FOL datasets (Fig. 8) shows the most performance boost (24.3%) with our method upon upgrading models. This underscores the synergy between our approach and more advanced models. In contrast, methods like Logic-LM, which rely on external rule-based solvers, exhibit the least improvements with stronger models due to their dependence on LLMs for translation rather than reasoning. Thus, although translation quality may improve, reasoning capabilities remain unchanged as they are capped by the external solvers. Our fully LLM-based method shows significant improvements compared to external solver-based approaches as model capabilities increase, indicating a promising advancement path in line with LLMs' evolution.

5.6 Error Analysis

The error analysis reveals that for an LLM to accurately complete a reasoning task, it requires two key capabilities: 1) the basic planning ability to identify the correct reasoning path, and 2) upon identifying the reasoning path, ensuring that each step of the reasoning is based on valid logic. The development of SymbCoT primarily enhances the second capability. However, the first capability, planning, is equally crucial in problem-solving. Consider a scenario where the model fails to identify the correct path; in such instances, there wouldn't even be an opportunity for logical errors to occur, which our method aims to address. Yet, our approach does not enhance the model's planning capabilities, as these are inherently reliant on the LLM itself. Therefore, the LLM needs to meet this reasoning baseline for SymbCoT to be effective.

5.7 Case Study

Finally, we showcase two instances to demonstrate in detail the advantages of our method.

CoT makes logical fallacies. Our comparison with CoT reveals its susceptibility to logical fallacies, making invalid deductions like affirming the consequent, while our method effectively prevents such
a Verifier, we eliminated unfaithful reasoning en- errors.
tirely, showcasing that our approach ensures credi- Premise 1: There was one six-way tie on the leaderboard
ble, symbolic-based reasoning and reduces reliance and one person in the six-way tie was from Belgium.
Premise 2: Descampe is from Belgium and is on the leader-
on chance. This highlights the effectiveness of our board of the 1992 du Maurier Classic.
methodology in enhancing the faithfulness of the Question Statement: Descampe is in the six-way tie in the
reasoning process, with the Verifier serving to ver- leaderboard of the 1992 du Maurier Classic. (Ground Truth:
Unknown)
ify and refine reasoning steps.
GPT-4 Incorrect CoT: The context states that there was one Acknowledgments
six-way tie on the leaderboard and one person in the six-way
tie was from Belgium. Descampe is from Belgium and is on This work is supported by the Ministry of Edu-
the leaderboard of the 1992 du Maurier Classic. Therefore, cation, Singapore, under its MOE AcRF TIER 3
it can be inferred that Descampe is in the six-way tie in the
leaderboard of the 1992 du Maurier Classic. Grant (MOEMOET32022-0001), and also CCF-
Ours: From the step from ∃x(Tie(x, sixWay) ∧ Baidu Open Fund.
From(x, Belgium)) to Tie(Descampe, sixWay) requires a
premise directly stating Descampe’s involvement in the six- Limitations
way tie, which is absent such as Tie(Descampe, sixWay).
However, this information is not given. Thus, the truth of the Our paper has the following potential limitations.
statement "Tie(Descampe, sixWay)" remains unknown. Firstly, we only evaluate two symbolic structures
External solver fails due to information loss. in our framework. Despite substantiating the signif-
Using external solvers like Prover9 for FOL entails icant potential of First-Order Logic and Constraint
translation risks, potentially missing key predicates Optimization in augmenting the logical reasoning
and reducing accuracy. Our LLM-based approach capabilities of LLMs, it is imperative to conduct fur-
mitigates this by leveraging its natural language ther assessments on additional symbolic languages
comprehension to better capture the context’s intri- to ensure a comprehensive evaluation. Secondly,
cacies. the implementation of our framework is associated
with considerable expenses. This financial impli-
Premise 1: ∀x(Book(x) → Contains(x, knowledge)) :::
Books contain tons of knowledge. cation is attributed to the methodology of incorpo-
Premise 2: ∀x∀y(Person(x) ∧ Reads(x, y) → rating symbolic rules, which inherently involves an
Gains(x, knowledge)) ::: When a person reads a extended reasoning chain and, consequently, the
book, that person gains knowledge.
Premise 3: ∀x(Person(x) ∧ Gains(x, knowledge) → generation of an increased number of tokens by the
Smarter(x)) ::: If a person gains knowledge, they become model. This escalation in token generation necessi-
smarter.
Premise 4: Reads(harry, walden) ::: Harry read the book
tates additional expenditures related to API usage
“Walden” by Henry Thoreau. or the allocation of computational resources.
Question Statement: Smarter(harry) ::: Harry is smarter
than before. (Ground Truth: True) Ethics Statement
Prover9: Uncertain
Our work on the Symbolic Chain-of-Thought
Ours: Since Walden is a book, and according to the premise
∀x(Book(x) → Contains(x, knowledge)), it follows that (SymbCoT) framework adheres to ethical guide-
Book(walden) → Contains(walden, knowledge). And lines emphasizing transparency, fairness, and re-
Harry reads Walden, which leads to gaining knowledge, as sponsible AI development. We recognize the ethi-
described by the premise ∀x∀y(Person(x) ∧ Reads(x, y) →
Gains(x, knowledge)), thus Reads(harry, walden) → cal implications of enhancing LLMs and have con-
Gains(harry, knowledge). Furthermore, anyone who gains ducted our research with a commitment to mini-
knowledge is considered smarter, as per the premise
∀x(Person(x) ∧ Gains(x, knowledge) → Smarter(x)), im-
mizing biases, ensuring data privacy, and promot-
plying that Gains(harry, knowledge) → Smarter(harry). ing the explainability of AI decisions. Our evalua-
Thus, Smarter(Harry) is true. tions utilized publicly available or ethically sourced
datasets, and we have made efforts to address and
6 Conclusion
mitigate potential biases within these datasets to
This study presents the Symbolic Chain-of- ensure fairness and objectivity in our findings.
Thought (SymbCoT) framework, which innova- The broader impact of SymbCoT, aimed at im-
tively integrates symbolic expressions and logical proving logical reasoning in LLMs, has the poten-
rules with CoT prompting to improve the logical tial to contribute positively to various fields, in-
reasoning abilities of LLMs. Specifically, based cluding natural language processing, knowledge
on the backbone LLM, SymbCoT translates the representation, and automated reasoning. By intro-
natural language context into the symbolic format ducing a method that combines symbolic expres-
and then derives a step-by-step plan to solve the sions with CoT prompting, we foster more faithful,
logical reasoning problem with symbolic logical flexible, and explainable logical reasoning capa-
rules, followed by a verifier to check the translation bilities. We encourage the responsible use of our
and reasoning chain. Experimental results on 5 findings and technologies, and we commit to on-
standard benchmarks with both FOL and CO sym- going evaluation of our work’s societal and ethical
bolic expressions show that SymbCoT strikingly implications.
enhances the vanilla CoT on logical reasoning.
References

Alfonso Amayuelas, Shuai Zhang, Susie Xi Rao, and Ce Zhang. 2022. Neural methods for logical reasoning over knowledge graphs. In Proceedings of the International Conference on Learning Representations.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2023. Graph of thoughts: Solving elaborate problems with large language models. CoRR, abs/2308.09687.

D.D. Cummins, T. Lubart, O. Alksnis, et al. 1991. Conditional reasoning and causation. Memory & Cognition, 19:274–282.

Zackary Okun Dunivin. 2024. Scalable qualitative coding with llms: Chain-of-thought reasoning matches human performance in some hermeneutic tasks. CoRR, abs/2401.15170.

Herbert B. Enderton. 1972. A mathematical introduction to logic. Academic Press.

Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. 2023. Reasoning implicit sentiment with chain-of-thought prompting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1171–1182.

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong Li Lee, and Wynne Hsu. 2024. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the International Conference on Machine Learning, ICML.

Hao Fei, Shengqiong Wu, Yafeng Ren, and Meishan Zhang. 2022. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning, ICML, pages 6373–6391.

Vedant Gaur and Nikunj Saunshi. 2023. Reasoning in large language models through symbolic math word problems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 5889–5903.

Olga Gerasimova, Nikita Severin, and Ilya Makarov. 2023. Comparative analysis of logic reasoning and graph neural networks for ontology-mediated query answering with a covering axiom. IEEE Access, 11:88074–88086.

Christopher Hahn, Frederik Schmitt, Julia J. Tillman, Niklas Metzger, Julian Siber, and Bernd Finkbeiner. 2022. Formal specifications from natural language. CoRR, abs/2206.01962.

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, David Peng, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Shafiq R. Joty, Alexander R. Fabbri, Wojciech Kryscinski, Xi Victoria Lin, Caiming Xiong, and Dragomir Radev. 2022. FOLIO: Natural language reasoning with first-order logic. CoRR, abs/2209.00840.

Yi Hu, Haotong Yang, Zhouchen Lin, and Muhan Zhang. 2023. Code prompting: A neural symbolic method for complex reasoning in large language models. CoRR, abs/2305.18507.

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1049–1065.

Shima Imani, Liang Du, and Harsh Shrivastava. 2023. MathPrompter: Mathematical reasoning using large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 37–42.

Tatsuro Inaba, Hirokazu Kiyomaru, Fei Cheng, and Sadao Kurohashi. 2023. MultiTool-CoT: GPT-3 can use multiple external tools with chain of thought prompting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1522–1532.

Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2023. Structured chain-of-thought prompting for code generation.

Henry Markovits and Robert Vachon. 1989. Reasoning with contrary-to-fact propositions. Journal of Experimental Child Psychology, 47:398–412.

Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5153–5176.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 27730–27744.

Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3806–3824.

Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, and Chris Callison-Burch. 2023. Bidirectional language models are also few-shot learners. In Proceedings of the International Conference on Learning Representations.

John Alan Robinson. 1965. A machine-oriented logic based on the resolution principle. J. ACM, 12(1):23–41.

Abulhair Saparov and He He. 2023. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In Proceedings of the International Conference on Learning Representations.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakas, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. CoRR, abs/2206.04615.

Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. 2023. From indeterminacy to determinacy: Augmenting logical reasoning capabilities with large language models. CoRR, abs/2310.18659.

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 3621–3634.

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023a. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2609–2634.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the International Conference on Learning Representations.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 24824–24837.

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023. Next-gpt: Any-to-any multimodal llm. CoRR, abs/2309.05519.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. CoRR, abs/2305.10601.

Jiacheng Ye, Chengzu Li, Lingpeng Kong, and Tao Yu. 2023. Generating data for symbolic language with large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 8418–8443.

Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. 2023a. Cumulative reasoning with large language models. CoRR, abs/2308.04371.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023b. Automatic chain of thought prompting in large language models. In Proceedings of the International Conference on Learning Representations.

Li Zheng, Hao Fei, Fei Li, Bobo Li, Lizi Liao, Donghong Ji, and Chong Teng. 2024. Reverse multi-choice dialogue commonsense inference with graph-of-thought. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19688–19696.

Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou, and Nan Duan. 2022. Analytical reasoning of text. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 2306–2319.
A Future Direction
Combining SymbCoT with an external solver presents a promising avenue for enhancing our reasoning
system. SymbCoT’s main limitation is sometimes failing to identify the correct reasoning path, while
external solvers often struggle with information loss or translation errors. As shown in Section 5.3,
SymbCoT excels at addressing information loss and correcting errors, while the exhaustive, broad-
spectrum search nature of the external solver is more effective in identifying reasoning paths. Therefore,
these methods possess the potential for synergy.
Future work will focus on developing a framework that integrates SymbCoT with an external solver,
leveraging their complementary strengths. We aim to optimize their interaction to improve overall
performance, conducting experiments to validate and refine this hybrid approach.

B Symbolic Representation
B.1 First-Order Logic
First-Order Logic, also known as predicate logic or first-order predicate calculus, is a formal system used
in mathematics, philosophy, linguistics, and computer science to express statements about objects and their
relationships. It extends propositional logic by introducing quantified variables that can represent objects
in a domain of discourse and predicates that can express properties of objects and relations between them.
The syntax of First-Order Logic (FOL) provides a formal structure for constructing expressions that
can represent statements about objects, their properties, and the relationships between them. It defines the
rules for forming well-formed formulas (WFFs), which are expressions in FOL that can be evaluated as
true or false. Below, we introduce the key components of FOL syntax:
Syntax. The syntax of FOL consists of a set of symbols divided into several categories:

• Variables: Symbols used to stand for objects in the domain of discourse (e.g., x, y, z).

• Constant symbols: Names that refer to specific objects in the domain (e.g., a, b, c).

• Predicate symbols: Symbols that represent properties of objects or relations between objects, with an
associated arity (e.g., P, Q, R).

• Function symbols: Symbols representing functions from objects to objects, also with arity (e.g.,
f, g, h).

• Logical connectives: Symbols used to combine or modify statements, including ∧, ∨, ¬, →, ↔.

• Quantifiers: Symbols expressing quantification over objects, including the universal quantifier (∀)
and the existential quantifier (∃).

• Parentheses and commas: Used for grouping and separating arguments.

Terms Terms are the simplest elements in FOL that refer to objects in the domain. Examples include:

• A variable: x

• A constant: a

• A function applied to terms: f(a, x)

Atomic Formulas (or Atoms) Atomic formulas are formed by applying a predicate symbol to the appropriate number of terms. An example of an atomic formula is:

P(x, a)

where P is a predicate symbol, x is a variable, and a is a constant.


Complex Formulas Complex formulas are constructed from atomic formulas using logical connectives and quantifiers. For example:

∀x(P(x) → Q(f(x)))

This complex formula states that for all x, if P(x) is true, then Q(f(x)) is also true, where Q is another predicate symbol and f is a function symbol.
Quantification Quantifiers are used to express statements about all objects or the existence of objects in the domain. Examples include:

• Universal quantification: ∀x P(x), stating that P holds for all x.

• Existential quantification: ∃x P(x), stating that there exists an x for which P holds.

Well-formed Formulas (WFFs) Expressions that adhere to the syntactic rules of FOL are considered
well-formed. These expressions include terms, atomic formulas, and complex formulas constructed
according to the rules of formation. The syntax ensures that the expressions are meaningful within the
logic.
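To make the syntax concrete, the term and formula categories above can be mirrored as a small abstract-syntax-tree sketch. This is purely illustrative: the class names are our own choices, not part of any standard library, and only ¬, →, and ∀ are shown (the other connectives and ∃ are analogous).

```python
from dataclasses import dataclass

# Terms: variables, constants, and function applications.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    name: str

@dataclass(frozen=True)
class Func:
    name: str
    args: tuple  # tuple of terms

# Formulas: atoms, two connectives, and universal quantification.
@dataclass(frozen=True)
class Atom:
    pred: str
    args: tuple  # tuple of terms

@dataclass(frozen=True)
class Not:
    body: object

@dataclass(frozen=True)
class Implies:
    left: object
    right: object

@dataclass(frozen=True)
class ForAll:
    var: Var
    body: object

# The WFF ∀x (P(x) → Q(f(x))) as a tree:
x = Var("x")
wff = ForAll(x, Implies(Atom("P", (x,)), Atom("Q", (Func("f", (x,)),))))
print(wff)
```

Representing formulas as immutable trees like this makes it straightforward to write recursive checks over the structure, for example that every expression is well-formed.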

B.2 Constraint Optimization


Constraint optimization is a branch of optimization theory that focuses on finding the most favorable
solution from a set of possible options, subject to a series of constraints. It is widely used in various fields
such as operations research, artificial intelligence, engineering, and economics to solve problems that
involve limitations or restrictions on the possible solutions.
Definition. Constraint optimization involves determining the optimal solution(s) to a problem defined
by:

• A set of variables, each of which can take on values from a given domain.

• A set of constraints that specify allowable combinations of values for subsets of variables.

• An objective function that needs to be maximized or minimized.

Components The key components of a constraint optimization problem include:

• Variables: The elements that represent the aspects of the problem being optimized (e.g., x, y, z).

• Domains: The possible values that each variable can take (e.g., D(x), D(y), D(z)).

• Constraints: Conditions or relations that must be satisfied by the variables (e.g., x + y ≤ z).

• Objective Function: A mathematical expression that defines the goal of the optimization, which can be either maximization or minimization (e.g., f(x, y, z) = x² + y² + z²).
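The four components above can be wired together in a minimal brute-force sketch. The tiny domains, the example constraint x + y ≤ z, and maximizing f(x, y, z) = x² + y² + z² are our own illustrative choices; real constraint-optimization problems would use a dedicated solver rather than exhaustive search.

```python
from itertools import product

# Variables and their domains (illustrative choice: each ranges over 0..3).
domains = {"x": range(4), "y": range(4), "z": range(4)}

def feasible(x, y, z):
    # Constraint: x + y <= z.
    return x + y <= z

def objective(x, y, z):
    # Objective function to maximize: f(x, y, z) = x^2 + y^2 + z^2.
    return x**2 + y**2 + z**2

# Exhaustively enumerate assignments, keep feasible ones, take the best.
best = max(
    (p for p in product(domains["x"], domains["y"], domains["z"]) if feasible(*p)),
    key=lambda p: objective(*p),
)
print(best, objective(*best))  # an optimum such as (0, 3, 3) with value 18
```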

C Dataset Statistics
In alignment with the established norms of prior research, notably, the practices adopted by Logic-LM
(Pan et al., 2023), our study employs identical records for the sake of ensuring consistency and facilitating
comparative analysis. Table 3 shows the statistics of each dataset.

D Baseline Specification
Here are the details of each baseline:
Naive Prompting. This approach involves instructing the LLMs to directly answer questions with the
aid of a few in-context demonstrations, prioritizing straightforward response generation without explicit
reasoning processes;
Dataset            Symbolic Structure   Test Size   Opts Num
PrOntoQA           FOL                  500         2
ProofWriter        FOL                  600         3
FOLIO              FOL                  204         3
LogicalDeduction   CSP                  300         3, 5, 7
AR-LSAT            CSP                  230         5

Table 3: Dataset Statistics

Figure 9: The proportion of predicting True, False, and Unknown on FOLIO.

Chain-of-Thought (CoT). CoT encourages the LLM to articulate a step-by-step reasoning process
leading to the final answer, supported by in-context demonstrations (Wei et al., 2022);
Logic-LM. Distinguished by its utilization of the LLM to translate natural language contexts and questions into a symbolic representation, this baseline leverages external solvers to deduce answers, offering a hybrid approach to problem-solving (Pan et al., 2023);
CoT-SC. An advanced version of CoT, which employs self-consistency via an ensemble strategy of majority voting among multiple CoTs (Wang et al., 2023b);
Tree-of-Thought (ToT). This approach enables LLMs to explore multiple reasoning pathways through a decision-making tree structure, incorporating backtracking via a search algorithm to identify the globally optimal reasoning path (Yao et al., 2023);
Cumulative Reasoning (CR). CR utilizes LLMs in a cumulative and iterative manner, mimicking the human thought process by building upon previous reasoning steps to gradually approach the solution (Zhang et al., 2023a);
DetermLR. This method iteratively constructs a logical reasoning process that transitions from indeterminate premises to a determinate conclusion, and integrates external modules such as memory storage to support this process (Sun et al., 2023).
To ensure efficient resource management, we adopt the experimental data from Logic-LM (Pan et al., 2023) and DetermLR (Sun et al., 2023) for comparative analysis, as the datasets and records used in our experiments are exactly the same. The primary metric for evaluation is accuracy, reflecting the multiple-choice format of the questions across the datasets.

E Prediction Behavior

We present the prediction behavior of Chain-of-Thought (CoT), Logic-LM, and our method by showing the percentage of predictions for each option in Fig. 9. Please note that for Logic-LM, to expose the behavior of the external solver (Prover9), we turn off its backup strategy, which falls back to the CoT result when records containing syntax errors cannot be executed by the external solver. We found that Logic-LM, using an external solver, tends to predict "Unknown" more often. This could be attributed to the loss of information during the translation process, which results in insufficient data for accurately deducing the correct
answers. In such instances, these tools opt to predict "unknown" for answers that are actually "True" or "False", as a consequence of this information loss.

Figure 10: The Precision, Recall, and F1 score for CoT, Logic-LM and Ours on FOLIO.

F Evaluation Metrics

In Fig. 10, we present a comparative analysis of performance metrics (precision, recall, and F1 score) across three models: CoT (Chain-of-Thought), Logic-LM, and our proposed method. Our method surpasses the others on the F1 score, by 6% over CoT and 3% over Logic-LM.
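As a quick sanity check on these numbers, recall that the F1 score is the harmonic mean of precision and recall. A macro-averaged F1 over three classes is not exactly the harmonic mean of macro precision and macro recall, so the computation below only approximately recovers the reported scores.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs as reported in Fig. 10:
print(round(f1(0.89, 0.81), 2))  # ours: 0.85
print(round(f1(0.95, 0.72), 2))  # Logic-LM: 0.82
print(round(f1(0.81, 0.75), 2))  # CoT: 0.78, close to the reported 79%
```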

F.1 F1 Score Analysis


We start with the F1 score, which balances precision and recall, providing a measure of a model's accuracy that accounts for both false positives and false negatives. Our model achieves the highest F1 score, 85%, indicating a strong balance between precision and recall. This suggests that our approach effectively mitigates the trade-off between these two metrics, potentially offering a more balanced and reliable model for various applications.
In comparison, Logic-LM posts an F1 score of 82%. While slightly lower than ours, this score still represents a commendable balance between precision and recall, hinting at the effectiveness of incorporating logical reasoning within the model's framework. However, it slightly lags behind our model, possibly due to differences in handling information or reasoning processes.
The CoT model scores the lowest F1, at 79%. This indicates a larger discrepancy between precision and recall compared to the other models. The CoT approach, while innovative, may face challenges in optimally balancing false positives and false negatives, which is reflected in its overall performance metric.

F.2 Recall Analysis


Looking at recall, our model achieves the highest score, 81%, highlighting its capability to identify a high percentage of all relevant instances. This performance is crucial in scenarios where missing a positive case can have significant implications, such as medical diagnosis or fraud detection.
Logic-LM has a recall of 72%, the lowest among the three models. This suggests that while Logic-LM may be precise, it potentially misses a larger portion of relevant instances, a limitation that could impact its utility in applications where high recall is critical.
CoT exhibits a recall of 75%, positioning it between our method and Logic-LM. This indicates a moderate ability to capture relevant instances but also room for improvement, especially when compared to our method.

F.3 Precision Analysis


Considering precision, our model demonstrates a score of 89%. This high precision indicates that the majority of the instances it identifies as positive are indeed positive, which is essential for reducing the cost of false positives in many applications.
Logic-LM achieves the highest precision, at 95%, underscoring its strength in accurately predicting positive instances when it does make a prediction. This high precision is particularly valuable in contexts where the cost of false positives is high, even though it may sacrifice some recall to achieve it.
CoT has a precision of 81%, the lowest among the three. While still respectable, this suggests that CoT may generate more false positives than Logic-LM and ours, affecting its efficiency and reliability in certain contexts.
F.4 Conclusion on Evaluation Metrics
In summary, the analysis reveals that our methodology offers a balanced performance with the highest
F1 score, suggesting it effectively balances precision and recall. Logic-LM stands out for its precision,
making it suitable for applications where accuracy is paramount, at the expense of recall. CoT, while
showing promising results, trails in both precision and recall, indicating potential areas for improvement.
Each model has its strengths and potential applications, with our approach showing a well-rounded
performance that could be particularly beneficial in scenarios requiring a balance between identifying
relevant instances and maintaining high accuracy.

G Output Average Length


Table 4 reports the average number of output tokens for each module on each dataset.

Model ProntoQA ProofWriter FOLIO LogicalDeduction AR-LSAT


Translator 324 290 145 176 323
Planner 198 202 313 251 627
Solver 214 216 330 282 654
Verifier 384 353 334 472 726

Table 4: Output length across different datasets and modules

H Output examples
To better understand how SymbCoT solves logical questions, we select a typical solving example below that showcases the step-by-step logical deduction.
Input:
Predicates:
• Rabbit(x) ::: x is a rabbit.
• Cute(x) ::: x is cute.
• Turtle(x) ::: x is a turtle.
• Animal(x) ::: x is an animal.
• Squirrel(x) ::: x is a squirrel.
• Skittish(x) ::: x is skittish.
• Still(x) ::: x is still.
Premises:
• ∀x(Rabbit(x) → Cute(x)) ::: All rabbits are cute.

• ∃x Turtle(x) ::: Some turtles exist.

• ∀x(Animal(x) → (Rabbit(x) ⊕ Squirrel(x))) ::: An animal is either a rabbit or a squirrel.

• ∀x(Skittish(x) → ¬Still(x)) ::: If something is skittish, then it is not still.

• ∀x(Squirrel(x) → Skittish(x)) ::: All squirrels are skittish.

• Still(rock) ::: Rock is still.

Conclusion:
• (Cute(rock) ∧ Still(rock)) → (Turtle(rock) ∧ Skittish(rock)) ::: If Rock is cute and still, then Rock is a turtle and
skittish.
Output:

• Step 1: Identify the Goal


Our goal is to determine the truth of the statement "(Cute(rock) ∧ Still(rock)) → (Turtle(rock) ∧
Skittish(rock))". This requires verifying if the conditions of Rock being cute and still necessarily lead to Rock being
a turtle and skittish.
• Step 2: Utilize Premise 6 (Still(rock))
Premise: Still(rock)
Inference Rule: Existential Instantiation
Logical Process: We instantiate Rock and confirm it is still.

• Step 3: Apply Premise 4 (Skittish(x) → ¬Still(x))


Premise: ∀x(Skittish(x) → ¬Still(x))
Inference Rule: Universal Instantiation and Modus Tollens
Logical Process: If an entity x is skittish, then it is not still. Since Rock is still, it cannot be skittish.

• Step 4: Utilize Premise 5 (Squirrel(x) → Skittish(x))


Premise: ∀x(Squirrel(x) → Skittish(x))
Inference Rule: Universal Instantiation and Modus Tollens
Logical Process: If an entity x is a squirrel, then it is skittish. Since Rock is not skittish, it cannot be a squirrel.

• Step 5: Apply Premise 3 (Animal(x) → (Rabbit(x) ⊕ Squirrel(x)))


Premise: ∀x(Animal(x) → (Rabbit(x) ⊕ Squirrel(x)))
Inference Rule: Universal Instantiation and Modus Tollens
Logical Process: If an entity x is an animal, then it is either a rabbit or a squirrel. Since Rock is not a squirrel, if it is an
animal, it must be a rabbit.
• Step 6: Combine the Steps
Logical Process: Combining the above steps, we have established that Rock is still, not skittish, not a squirrel, and if it is
an animal, it must be a rabbit. Therefore, Rock cannot be a turtle and skittish at the same time.
• Step 7: Conclude
Conclusion: Since we have found that Rock cannot be a turtle and skittish at the same time, the statement "(Cute(rock)
∧ Still(rock)) → (Turtle(rock) ∧ Skittish(rock))" is False. Throughout the process, we used logical
operators such as ∧ (conjunction), → (implication), and ∃ (existential), and inference rules like Modus Ponens, Universal
Instantiation, and Modus Tollens. This logical deduction confirms the falsity of the question statement.
Final answer: {false}
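The core of Steps 2–4 is a double application of Modus Tollens after instantiating the quantified premises at the constant rock. This can be checked mechanically; the short Python sketch below is illustrative only (not part of the SymbCoT pipeline) and enumerates all truth assignments of the instantiated predicates.

```python
from itertools import product

# Instantiate the quantified premises at the constant `rock` and check that
# every truth assignment satisfying them makes Rock neither skittish nor a
# squirrel, mirroring the two Modus Tollens steps above.
def premises(still, skittish, squirrel):
    return (
        still                            # Premise 6: Still(rock)
        and (not skittish or not still)  # Premise 4: Skittish(rock) -> not Still(rock)
        and (not squirrel or skittish)   # Premise 5: Squirrel(rock) -> Skittish(rock)
    )

models = [m for m in product([True, False], repeat=3) if premises(*m)]
assert models == [(True, False, False)]  # only model: still, not skittish, not a squirrel
print("In every model of the premises, Rock is not skittish and not a squirrel.")
```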

I Full Prompting
Below is the detailed prompt of our method across all 5 datasets:

I.1 Ours
I.1.1 ProntoQA
Step-1: Translating natural language context into symbolic
Task Description:
You are given a problem description and a question. The task is to:

1. Define all the predicates in the problem.

2. Parse the problem into logic rules based on the defined predicates.

3. Write all the facts mentioned in the problem.

4. Parse the question into the logic form.

Problem:
• Each jompus is fruity. Every jompus is a wumpus. Every wumpus is not transparent. Wumpuses are tumpuses. Tumpuses
are mean. Tumpuses are vumpuses. Every vumpus is cold. Each vumpus is a yumpus. Yumpuses are orange. Yumpuses
are numpuses. Numpuses are dull. Each numpus is a dumpus. Every dumpus is not shy. Impuses are shy. Dumpuses are
rompuses. Each rompus is liquid. Rompuses are zumpuses. Alex is a tumpus.

Question:
• True or false: Alex is not shy.

Predicates:
• Jompus(x): Is x a jompus?

• Fruity(x): Is x fruity?
• (... More predicates ...)

Facts:
• Tumpuses(Alex): Alex is a tumpus.

• (... More facts ...)


Rules:
• Jompus(x) ⇒ Fruity(x): Each jompus is fruity.

• (... More rules ...)


Query:
• Shy(Alex, False) ::: Alex is not shy
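The parsed rules form a Horn chain from Tumpuses(Alex) to the query. As a minimal illustrative sketch (a hypothetical helper, not the actual SymbCoT prompting pipeline; the flag `NotShy` stands in for Shy(x, False)), forward chaining to a fixpoint derives the answer:

```python
# Forward chaining over a subset of the parsed ProntoQA rules: repeatedly
# apply rules of the form P(x) => Q(x) until no new facts appear.
rules = [
    ("Tumpus", "Vumpus"),   # Tumpuses are vumpuses.
    ("Vumpus", "Yumpus"),   # Each vumpus is a yumpus.
    ("Yumpus", "Numpus"),   # Yumpuses are numpuses.
    ("Numpus", "Dumpus"),   # Each numpus is a dumpus.
    ("Dumpus", "NotShy"),   # Every dumpus is not shy.
]
facts = {"Tumpus"}          # Tumpuses(Alex): Alex is a tumpus.

changed = True
while changed:              # iterate to a fixpoint
    changed = False
    for body, head in rules:
        if body in facts and head not in facts:
            facts.add(head)
            changed = True

print("Shy(Alex, False) holds:", "NotShy" in facts)  # prints: Shy(Alex, False) holds: True
```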

Step-2: Deriving plan based on the natural and symbolic context


Task:
Can you derive a step-by-step plan that can use the premises and First-Order Logic inference rules to infer new knowledge if
needed, in order to prove the statement? Start with identifying the goal and break down the necessary logical inference step
by step.
First-Order Logic Inference Rules:
The First-Order Logic inference rules include but are not limited to: Modus Ponens, Modus Tollens, (... More FOL rules ...).
Predicates:
• Jompus(x): Is x a jompus?

• Fruity(x): Is x fruity?

• (... More predicates ...)

Facts:
• Tumpuses(Alex): Alex is a tumpus.

• (... More facts ...)


Rules:
• Jompus(x) ⇒ Fruity(x): Each jompus is fruity.

• (... More rules ...)


Query:
• Shy(Alex, False) ::: Alex is not shy

Plan:
To determine whether the statement "Shy(Alex, False)" is true, false, or unknown based on the given premises, we will use a
systematic approach applying First-Order Logic inference rules. Here’s the step-by-step plan:

• 1. Identify the Goal: Our goal is to establish whether Alex is not shy (Shy(Alex, False))

• 2. Utilize Facts about Alex: We have Tumpuses(Alex, True) which states Alex belongs to Tumpuses.

• (... More plan ...)

• By following the plan, you can determine whether "Shy(Alex, False)" is true or false.

Step-3: Solving the Problem given the context and plan


Task:
Based on the premises, question and plan, execute each step by selecting relevant premises and making inference based
on First-order Logic inference rules to solve the question. Ensure to follow the plan and indicate what First-Order Logic
inference rules you used.
First-Order Logic Inference Rules:
The First-Order Logic inference rules include but are not limited to: Modus Ponens, Modus Tollens, (... More FOL rules ...)
Predicates:
• Jompus(x): Is x a jompus?

• Fruity(x): Is x fruity?

• (... More predicates ...)


Facts:
• Tumpuses(Alex): Alex is a tumpus.

• (... More facts ...)


Rules:
• Jompus(x) ⇒ Fruity(x): Each jompus is fruity.

• (... More rules ...)


Query:
• Shy(Alex, False) ::: Alex is not shy

Plan:
To determine whether the statement "Shy(Alex, False)" is true, false, or unknown based on the given premises, we will use a
systematic approach applying First-Order Logic inference rules. Here’s the step-by-step plan:

• 1. Identify the Goal: Our goal is to establish whether Alex is not shy (Shy(Alex, False))

• 2. Utilize Facts about Alex: We have Tumpuses(Alex, True) which states Alex belongs to Tumpuses.

• (... More plan ...)

• By following the plan, you can determine whether "Shy(Alex, False)" is true or false.

Execution:
Let’s execute the plan step by step, applying First-Order Logic inference rules.

• Identify the Goal: Goal: Determine if Shy(Alex, False) is true.

• Utilize Facts about Alex: - Fact: Tumpuses(Alex, True). - Inference Rule: Given Fact. - Logical Process:
Tumpuses(Alex, True) is given.

• (... More solving steps ...)

• Thus, "Shy(Alex, False)" is true based on the logical deductions.

Step-4: Verifying the translation and solving process


Task:
The task is to verify whether the original execution correctly determines the value of the given conclusion based on the
context. You should abide by the First-Order Logic rules when checking the original execution and the conclusion.
When verifying the logical process and the final answer, make sure you fully consider the following aspects before
verification:
1. Whether the logical process abides by the First-Order Logic inference rules.
2. Assumption and Consistency Verification. Please check whether the relevant facts or rules used in a logical step are indeed
from the context or inferred from the previous steps. If you think a logical step contains an assumption, you need to clearly
state what assumption it is and verify whether the assumption comes from the context or was already inferred from the
previous steps. Please note that an assumption that does not exist in the context and was not inferred from a previous
step will be considered an invalid assumption. Please note that if an assumption is semantically equivalent to context or
information inferred from the context, it should be seen as a valid assumption. For example, "X sees X" can be interpreted as
"X sees itself" since they are semantically equivalent.
3. Whether the translation of the First-Order Logic for context and question is semantically consistent with the original
natural language.
4. Refinement of Logical Process: If the original execution violates any of the above steps, refine the logical process. Use the
relevant premises and information correctly derived from previous steps to create a more accurate logical deduction.
5. If applicable, please provide a detailed analysis of each step in the logical process and a refined answer at the end, using
the format true/false. The answer should only be true or false but NOT unknown.
6. Make sure the conclusion aligns with the information inferred from the logical steps.
Predicates:
• Jompus(x): Is x a jompus?
• Fruity(x): Is x fruity?
• (... More predicates ...)
Facts:
• Tumpuses(Alex): Alex is a tumpus.
• (... More facts ...)
Rules:
• Jompus(x) ⇒ Fruity(x): Each jompus is fruity.

• (... More rules ...)


Query:
• Shy(Alex, False) ::: Alex is not shy

Original Execution:
Let’s execute the plan step by step, applying First-Order Logic inference rules.

• Identify the Goal: Goal: Determine if Shy(Alex, False) is true.

• Utilize Facts about Alex: - Fact: Tumpuses(Alex, True). - Inference Rule: Given Fact. - Logical Process:
Tumpuses(Alex, True) is given.

• (... More solving steps ...)

• Thus, "Shy(Alex, False)" is true based on the logical deductions.

Verification:
To verify the logical validity of the original execution concerning the statement "Shy(Alex, False)" based on the given
premises, let’s assess each step against the requirements of First-Order Logic, assumptions, consistency, and semantic
equivalence.
Context verification:
• The context and query correctly correspond to the information in the original context and are semantically consistent.

Logical Process Verification:


• 1. Fact: Tumpuses(Alex, True) is directly from the context with no invalid assumption.

• (... More logical verification ...)

• Therefore, after verifying the translation between the original context and symbolic format, and the logical process, the
original conclusion "Shy(Alex, False) is true" is valid and remains unchanged.

I.1.2 ProofWriter
Step-1: Translating natural language context into symbolic
Task Description:
You are given a problem description and a question. The task is to:
1. Define all the predicates in the problem.

2. Parse the problem into logic rules based on the defined predicates.

3. Write all the facts mentioned in the problem.

4. Parse the question into the logic form.

Problem:
• Anne is quiet. Erin is furry. Erin is green. Fiona is furry. Fiona is quiet. Fiona is red. Fiona is rough. Fiona is white. Harry
is furry. Harry is quiet. Harry is white. Young people are furry. If Anne is quiet then Anne is red. Young, green people are
rough. If someone is green then they are white. If someone is furry and quiet then they are white. If someone is young and
white then they are rough. All red people are young.

Question:
• Is the following statement true, false, or unknown? Anne is white.

Translation:
Predicates:
• Quiet(x): Is x quiet?

• Furry(x): Is x furry?

• (... More predicates ...)

Facts:
• Quite(Anne, True): Anne is quiet.

• (... More facts ...)


Rules:
• Young($x, True) ⇒ Furry($x, True): Young people are furry.

• (... More rules ...)


Conditional Rules:
• Quite(Anne, True) ⇒ Red($x, True): If Anne is quiet then Anne is red.

• (... More conditional rules ...)


Rules with Compound Predicates:
• Young($x, True) ∨ Green($x, True) → Rough($x, True): Young, green people are rough.

• (... More conditional rules ...)


Query:
• White(Anne, True) ::: Anne is white.

Step-2: Deriving plan based on the natural and symbolic context


Task:
Can you derive a step-by-step plan that can use the premises and First-Order Logic inference rules to infer new knowledge if
needed, in order to prove the statement? Start with identifying the goal and break down the necessary logical inference step
by step.
First-Order Logic Inference Rules:
The First-Order Logic inference rules include but are not limited to: Modus Ponens, Modus Tollens, (... More FOL rules ...).
Predicates:
• Quiet(x): Is x quiet?

• Furry(x): Is x furry?

• (... More predicates ...)

Facts:
• Quite(Anne, True): Anne is quiet.

• (... More facts ...)


Rules:
• Young($x, True) ⇒ Furry($x, True): Young people are furry.

• (... More rules ...)


Conditional Rules:
• Quite(Anne, True) ⇒ Red($x, True): If Anne is quiet then Anne is red.

• (... More conditional rules ...)


Rules with Compound Predicates:
• Young($x, True) ∨ Green($x, True) → Rough($x, True): Young, green people are rough.

• (... More conditional rules ...)


Query:
• White(Anne, True) ::: Anne is white.
Plan:
Plan to determine the truth value of "White(Anne, True)":
• 1. Identify the Goal: Our goal is to establish whether White(Anne, True)

• 2. Define relevant predicates and facts: We need to consider predicates related to being quiet, red, green, white and
conditions related to youth and roughness, since these are directly or indirectly related to the query about Anne being white.

• (... More plan ...)

• This plan systematically breaks down the steps necessary to assess the truth value of the statement concerning Anne’s color
based on the predicates, facts, and rules provided, applying First-Order Logic inference rules throughout the process.
Step-3: Solving the Problem given the context and plan
Task:
Based on the premises, question and plan, execute each step by selecting relevant premises and making inferences based
on First-Order Logic inference rules to solve the question. Ensure to follow the plan and indicate what First-Order Logic
inference rules you used.
First-Order Logic Inference Rules:
The First-Order Logic inference rules include but are not limited to: Modus Ponens, Modus Tollens, (... More FOL rules ...)
Predicates:
• Quiet(x): Is x quiet?

• Furry(x): Is x furry?

• (... More predicates ...)

Facts:
• Quite(Anne, True): Anne is quiet.

• (... More facts ...)


Rules:
• Young($x, True) ⇒ Furry($x, True): Young people are furry.

• (... More rules ...)


Conditional Rules:
• Quite(Anne, True) ⇒ Red($x, True): If Anne is quiet then Anne is red.

• (... More conditional rules ...)


Rules with Compound Predicates:
• Young($x, True) ∨ Green($x, True) → Rough($x, True): Young, green people are rough.

• (... More conditional rules ...)


Query:
• White(Anne, True) ::: Anne is white.
Plan:
Plan to determine the truth value of "White(Anne, True)":
• 1. Identify the Goal: Our goal is to establish whether White(Anne, True)

• 2. Define relevant predicates and facts: We need to consider predicates related to being quiet, red, green, white and
conditions related to youth and roughness, since these are directly or indirectly related to the query about Anne being white.

• (... More plan ...)

• This plan systematically breaks down the steps necessary to assess the truth value of the statement concerning Anne’s color
based on the predicates, facts, and rules provided, applying First-Order Logic inference rules throughout the process.

Execution:
Let’s execute the plan step by step, applying First-Order Logic inference rules.

• Identify the Goal: Goal: Determine if White(Anne, True) is true.

• Review given facts about Anne:


- Fact: Quite(Anne, True).
- Inference Rule: Given Fact.
• Analyze conditional rules involving Anne: Quite(Anne, True) ⇒ Red($x, True)
Inference Rule: Modus Ponens (Given Anne is quiet, we can infer Anne is red.)

• (... More solving steps ...)

• Given the information and the rules, we cannot directly infer "White(Anne, True)" based on the First-Order Logic
inference rules provided. Thus, White(Anne, True) is unknown.
Step-4: Verifying the translation and solving process
Task:
The task is to verify whether the original execution correctly determines the value of the given conclusion based on the
context. You should abide by the First-Order Logic rules when checking the original execution and the conclusion.
When verifying the logical process and the final answer, make sure you fully consider the following aspects before verification:
1. Whether the logical process abides by the First-Order Logic inference rules.
2. Assumption and Consistency Verification. Please check whether the relevant facts or rules used in a logical step are indeed
from the context or inferred from the previous steps. If you think a logical step contains an assumption, you need to clearly
state what assumption it is and verify whether the assumption comes from the context or was already inferred from the
previous steps. Please note that an assumption that does not exist in the context and was not inferred from a previous
step will be considered an invalid assumption. Please note that if an assumption is semantically equivalent to context or
information inferred from the context, it should be seen as a valid assumption. For example, "X sees X" can be interpreted as
"X sees itself" since they are semantically equivalent.
3. Whether the translation of First-Order Logic is semantically consistent with the natural language. Make sure that for
compound predicates that are separated by a comma, you should treat the comma as "or". For instance, "All red, rough things
are big." should be interpreted as "both red things and rough things are big". Thus, it should be translated to
"Red($x, True) ∨ Rough($x, True) → Big($x, True)", because being red or rough makes the conditional statement true.
4. Refinement of Logical Process: If the original execution violates any of the above steps, refine the logical process. Use the
relevant premises and information correctly derived from previous steps to create a more accurate logical deduction.
5. Handling of Conditional Statements in Conclusions: For conclusions containing conditional statements (e.g., "If..."),
remember that in First-Order Logic, a conditional statement is considered false only when the antecedent is true and the
consequent is false. In all other cases, the conditional statement is considered true.
6. If applicable, please provide a detailed analysis of each step in the logical process and a refined answer at the end, using
the format true/false/unknown. The answer should be one of these three options.
7. Make sure the conclusion aligns with the information inferred from the logical steps.
Predicates:
• Quiet(x): Is x quiet?
• Furry(x): Is x furry?
• (... More predicates ...)
Facts:
• Quite(Anne, True): Anne is quiet.
• (... More facts ...)
Rules:
• Young($x, True) ⇒ Furry($x, True): Young people are furry.
• (... More rules ...)
Conditional Rules:
• Quite(Anne, True) ⇒ Red($x, True): If Anne is quiet then Anne is red.
• (... More conditional rules ...)
Rules with Compound Predicates:
• Young($x, True) ∨ Green($x, True) → Rough($x, True): Young, green people are rough.
• (... More conditional rules ...)
Query:
• White(Anne, True) ::: Anne is white.
Original Execution:
Let’s execute the plan step by step, applying First-Order Logic inference rules.

• Identify the Goal: Goal: Determine if White(Anne, True) is true.


• Review given facts about Anne:
- Fact: Quite(Anne, True).
- Inference Rule: Given Fact.
• Analyze conditional rules involving Anne: Quite(Anne, True) ⇒ Red($x, True)
Inference Rule: Modus Ponens (Given Anne is quiet, we can infer Anne is red.)
• (... More solving steps ...)
• Given the information and the rules, we cannot directly infer "White(Anne, True)" based on the First-Order Logic
inference rules provided. Thus, White(Anne, True) is unknown.
Verification:
To verify the logical validity of the original execution concerning the statement "White(Anne, True)" based on the given
premises, let’s assess each step against the requirements of First-Order Logic, assumptions, consistency, and semantic
equivalence.
Context verification:
• The context and query correctly correspond to the information in the original context and are semantically consistent.

Logical Process Verification:


• 1. Fact: Quite(Anne, True) is directly from the context with no invalid assumption.

• 2. The rule "Quite(Anne, True) → Red($x, True)" also contains a typo and should be "Quiet(Anne, True) → Red($x,
True)." This rule incorrectly uses $x instead of specifically stating "Anne," which might lead to confusion. The correct
interpretation for the context should directly indicate that if Anne is quiet, then Anne is red, implying "Quiet(Anne, True)
→ Red(Anne, True)."
• (... More logical verification ...)

• The conclusion that "White(Anne, True)" remains unknown is consistent with the logical assessment of the provided
context, considering the application of First-Order Logic inference rules and the available premises.

I.1.3 FOLIO
Step-1: Translating natural language context into symbolic
Task Description:
Given a problem description and a question, the task is to parse the problem and the question into First-Order Logic formulas.
The grammar of the First-Order Logic formula is defined as follows:

1. Logical conjunction of expr1 and expr2: expr1 ∧ expr2

2. Logical disjunction of expr1 and expr2: expr1 ∨ expr2

3. Logical exclusive disjunction of expr1 and expr2: expr1 ⊕ expr2

4. Logical negation of expr1: ¬expr1

5. expr1 implies expr2: expr1 → expr2

6. expr1 if and only if expr2: expr1 ↔ expr2

7. Logical universal quantification: ∀x

8. Logical existential quantification: ∃x
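The propositional connectives in this grammar have standard truth-functional readings, which can be written down directly; the Python sketch below is purely illustrative (quantifiers are omitted, since they require a domain of individuals):

```python
# Truth-functional readings of the connectives defined above.
conj = lambda p, q: p and q       # expr1 ∧ expr2
disj = lambda p, q: p or q        # expr1 ∨ expr2
xor  = lambda p, q: p != q        # expr1 ⊕ expr2 (exclusive disjunction)
neg  = lambda p: not p            # ¬expr1
impl = lambda p, q: (not p) or q  # expr1 → expr2
iff  = lambda p, q: p == q        # expr1 ↔ expr2

# A conditional is false only when its antecedent is true and its
# consequent is false; otherwise it is true (vacuous truth included).
print(impl(False, True))  # True
print(xor(True, True))    # False
```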

Problem:
• Miroslav Venhoda was a Czech choral conductor who specialized in the performance of Renaissance and Baroque music.
Any choral conductor is a musician. Some musicians love music. Miroslav Venhoda published a book in 1946 called
Method of Studying Gregorian Chant.

Question:
• Is the following statement true, false, or uncertain? Miroslav Venhoda loved music.

Translation:
Predicates:
• Czech(x): x is a Czech person.

• ChoralConductor(x): x is a choral conductor.

• Musician(x): x is a musician.

• Love(x, y): x loves y.

• Author(x, y): x is the author of y.

• Book(x): x is a book.

• Publish(x, y): x is published in year y.

• Specialize(x, y): x specializes in y.

Premises:
• Czech(miroslav) ∧ ChoralConductor(miroslav) ∧ Specialize(miroslav, renaissance) ∧ Specialize(miroslav, baroque):
Miroslav Venhoda was a Czech choral conductor who specialized in the performance of Renaissance and Baroque music.

• ∀x(ChoralConductor(x) → Musician(x)): Any choral conductor is a musician.

• ∃x(Musician(x) ∧ Love(x, music)): Some musicians love music.

• Book(methodOfStudyingGregorianChant) ∧ Author(miroslav, methodOfStudyingGregorianChant) ∧


Publish(methodOfStudyingGregorianChant, year1946): Miroslav Venhoda published a book in 1946 called
Method of Studying Gregorian Chant.

Query:
• Love(miroslav, music) ::: Miroslav Venhoda loved music.
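Why this query comes out as "unknown" can be made concrete by model enumeration: over a small domain, the premises admit models that disagree on Love(miroslav, music). The Python sketch below is illustrative only, not part of the SymbCoT pipeline; the second individual `w` is a hypothetical witness for the existential premise.

```python
from itertools import product

# Search all interpretations of ChoralConductor, Musician, and Love(_, music)
# over the two-element domain {miroslav, w}, keep those satisfying the
# premises, and record the truth value the query takes in each model.
domain = ["miroslav", "w"]
query_values = set()

for bits in product([True, False], repeat=6):
    cc   = dict(zip(domain, bits[0:2]))   # ChoralConductor(x)
    mus  = dict(zip(domain, bits[2:4]))   # Musician(x)
    love = dict(zip(domain, bits[4:6]))   # Love(x, music)
    if not cc["miroslav"]:                           # Premise 1 (relevant conjunct)
        continue
    if not all(mus[x] for x in domain if cc[x]):     # Premise 2: CC(x) -> Musician(x)
        continue
    if not any(mus[x] and love[x] for x in domain):  # Premise 3: some musician loves music
        continue
    query_values.add(love["miroslav"])

# Both truth values occur, so the premises do not settle the query.
print(query_values)
```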

Step-2: Deriving plan based on the natural and symbolic context


Task:
Can you derive a step-by-step plan that can use the premises and First-Order Logic inference rules to infer new knowledge if
needed, in order to prove the statement? Start with identifying the goal and break down the necessary logical inference step
by step.
First-Order Logic Inference Rules:
The First-Order Logic inference rules include but are not limited to: Modus Ponens, Modus Tollens, (... More FOL rules ...).
Predicates:
• Czech(x): x is a Czech person.

• ChoralConductor(x): x is a choral conductor.

• Musician(x): x is a musician.

• Love(x, y): x loves y.

• Author(x, y): x is the author of y.

• Book(x): x is a book.

• Publish(x, y): x is published in year y.

• Specialize(x, y): x specializes in y.

Premises:
• Czech(miroslav) ∧ ChoralConductor(miroslav) ∧ Specialize(miroslav, renaissance) ∧ Specialize(miroslav, baroque):
Miroslav Venhoda was a Czech choral conductor who specialized in the performance of Renaissance and Baroque music.

• ∀x(ChoralConductor(x) → Musician(x)): Any choral conductor is a musician.

• ∃x(Musician(x) ∧ Love(x, music)): Some musicians love music.

• Book(methodOfStudyingGregorianChant) ∧ Author(miroslav, methodOfStudyingGregorianChant) ∧


Publish(methodOfStudyingGregorianChant, year1946): Miroslav Venhoda published a book in 1946 called
Method of Studying Gregorian Chant.

Query:
• Love(miroslav, music) ::: Miroslav Venhoda loved music.
Plan:
To determine whether the conclusion "Love(miroslav, music) ::: Miroslav Venhoda loved music" is true, false, or unknown
based on the given premises, we can follow these steps:

• 1. Identify the Goal: Our goal is to assess whether it can be logically concluded that Miroslav Venhoda loved music based
on the available premises.

• 2. Define relevant predicates and facts: Premise 1 introduces Miroslav as a Czech choral conductor specializing in
Renaissance and Baroque music. This establishes his professional background and interests but does not directly address
his personal feelings toward music. Premise 2 establishes that all choral conductors are musicians. This is a classification
premise that links the profession of being a choral conductor with being a musician.

• (... More plan ...)

• This plan outlines a systematic approach to assessing the provided conclusion against the premises using First-Order Logic
inference rules.
Step-3: Solving the Problem given the context and plan
Task:
Based on the premises, question, and plan, execute each step by selecting relevant premises and making inferences based
on First-Order Logic inference rules to solve the question. Ensure to follow the plan and indicate what First-Order Logic
inference rules you used.
First-Order Logic Inference Rules:
The First-Order Logic inference rules include but are not limited to: Modus Ponens, Modus Tollens, (... More FOL rules ...)
Predicates:
• Czech(x): x is a Czech person.

• ChoralConductor(x): x is a choral conductor.

• Musician(x): x is a musician.

• Love(x, y): x loves y.

• Author(x, y): x is the author of y.

• Book(x): x is a book.

• Publish(x, y): x is published in year y.

• Specialize(x, y): x specializes in y.

Premises:
• Czech(miroslav) ∧ ChoralConductor(miroslav) ∧ Specialize(miroslav, renaissance) ∧ Specialize(miroslav, baroque):
Miroslav Venhoda was a Czech choral conductor who specialized in the performance of Renaissance and Baroque music.

• ∀x(ChoralConductor(x) → Musician(x)): Any choral conductor is a musician.

• ∃x(Musician(x) ∧ Love(x, music)): Some musicians love music.

• Book(methodOfStudyingGregorianChant) ∧ Author(miroslav, methodOfStudyingGregorianChant) ∧


Publish(methodOfStudyingGregorianChant, year1946): Miroslav Venhoda published a book in 1946 called
Method of Studying Gregorian Chant.

Query:
• Love(miroslav, music) ::: Miroslav Venhoda loved music.
Plan:
To determine whether the conclusion "Love(miroslav, music) ::: Miroslav Venhoda loved music" is true, false, or unknown
based on the given premises, we can follow these steps:

• 1. Identify the Goal: Our goal is to assess whether it can be logically concluded that Miroslav Venhoda loved music based
on the available premises.

• 2. Define relevant predicates and facts: Premise 1 introduces Miroslav as a Czech choral conductor specializing in
Renaissance and Baroque music. This establishes his professional background and interests but does not directly address
his personal feelings toward music. Premise 2 establishes that all choral conductors are musicians. This is a classification
premise that links the profession of being a choral conductor with being a musician.

• (... More plan ...)

• This plan outlines a systematic approach to assessing the provided conclusion against the premises using First-Order Logic
inference rules.
Execution:
To solve the question about whether "Miroslav Venhoda loved music" based on the given premises, let’s proceed systemati-
cally:

• Identify the Goal: Goal: Determine if Love(miroslav, music) is true.

• Apply Universal Instantiation to Premise 2: From Premise 2, since Miroslav is a choral conductor, we can infer he is also
a musician based on the rule that all choral conductors are musicians. Using Universal Instantiation and Modus Ponens,
we deduce Musician(miroslav).
• (... More solving steps ...)

• The conclusion that "Miroslav Venhoda loved music" cannot be definitively proven true solely based on the provided
premises and the strict application of First-Order Logic rules. Therefore, the answer is unknown.
Step-4: Verifying the translation and solving process
Task:
The task is to verify whether the original execution correctly determines the value of the given conclusion based on the
context. You should abide by the First-Order Logic rules when checking the original execution and the conclusion.
When verifying the logical process and the final answer, make sure you fully consider the following aspects before verification:
1. Whether the logical process abides by the First-Order Logic inference rules.
2. Assumption and Consistency Verification. Please check whether the relevant facts or rules used in a logical step are indeed
from the context or inferred from the previous steps. If you think a logical step contains an assumption, you need to clearly
state what assumption it is and verify whether the assumption comes from the context or was already inferred from the
previous steps. Please note that an assumption that does not exist in the context and was not inferred from a previous
step will be considered an invalid assumption. Please note that if an assumption is semantically equivalent to context or
information inferred from the context, it should be seen as a valid assumption. For example, "X sees X" can be interpreted as
"X sees itself" since they are semantically equivalent.
3. Whether the translation of First-Order Logic is semantically consistent with the natural language. Make sure that for
compound predicates that are separated by a comma, you should treat the comma as "or". For instance, "All red, rough things
are big." should be interpreted as "both red things and rough things are big". Thus, it should be translated to
"Red($x, True) ∨ Rough($x, True) → Big($x, True)", because being red or rough makes the conditional statement true.
4. Refinement of Logical Process: If the original execution violates any of the above steps, refine the logical process. Use the
relevant premises and information correctly derived from previous steps to create a more accurate logical deduction.
5. Handling of Conditional Statements in Conclusions: For conclusions containing conditional statements (e.g., "If..."),
remember that in First-Order Logic, a conditional statement is considered false only when the antecedent is true and the
consequent is false. In all other cases, the conditional statement is considered true.
6. If applicable, please provide a detailed analysis of each step in the logical process and a refined answer at the end, using
the format true/false/unknown. The answer should be one of these three options.
7. Make sure the conclusion aligns with the information inferred from the logical steps.
Predicates:
• Czech(x): x is a Czech person.

• ChoralConductor(x): x is a choral conductor.

• Musician(x): x is a musician.

• Love(x, y): x loves y.

• Author(x, y): x is the author of y.

• Book(x): x is a book.

• Publish(x, y): x is published in year y.

• Specialize(x, y): x specializes in y.

Premises:
• Czech(miroslav) ∧ ChoralConductor(miroslav) ∧ Specialize(miroslav, renaissance) ∧ Specialize(miroslav, baroque):
Miroslav Venhoda was a Czech choral conductor who specialized in the performance of Renaissance and Baroque music.

• ∀x(ChoralConductor(x) → Musician(x)): Any choral conductor is a musician.

• ∃x(Musician(x) ∧ Love(x, music)): Some musicians love music.

• Book(methodOfStudyingGregorianChant) ∧ Author(miroslav, methodOfStudyingGregorianChant) ∧


Publish(methodOfStudyingGregorianChant, year1946): Miroslav Venhoda published a book in 1946 called
Method of Studying Gregorian Chant.

Query:
• Love(miroslav, music) ::: Miroslav Venhoda loved music.
Original Execution:
To solve the question about whether "Miroslav Venhoda loved music" based on the given premises, let’s proceed systemati-
cally:

• Identify the Goal: Goal: Determine if Love(miroslav, music) is true.

• Apply Universal Instantiation to Premise 2: From Premise 2, since Miroslav is a choral conductor, we can infer he is also
a musician based on the rule that all choral conductors are musicians. Using Universal Instantiation and Modus Ponens,
we deduce Musician(miroslav).
• (... More solving steps ...)

• The conclusion that "Miroslav Venhoda loved music" cannot be definitively proven true solely based on the provided
premises and the strict application of First-Order Logic rules. Therefore, the answer is unknown.
Verification:
Verification of the original execution involves evaluating the logical process used to determine whether Miroslav Venhoda
loved music, based on the provided context and premises.
Context verification:
• The context and query correctly correspond to the information in the original context and are semantically consistent.

Logical Process Verification:


• 1. Correctly identify the goal.

• 2. Correctly use Universal Instantiation and Modus Ponens to infer Musician(miroslav).

• (... More logical verification ...)

• The original execution’s conclusion that the statement "Love(miroslav, music)" is unknown is correct based on the
available premises. The verification process confirms that the original logical process and conclusion are accurate,
adhering to the principles of First-Order Logic given the information provided.

I.1.4 LogicalDeduction

Step-1: Translating natural language context into symbolic


Task Description:
You are given a problem description. The task is to parse the problem as a constraint satisfaction problem,
defining the domain, variables, and constraints.
Problem:
• The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically
consistent within each paragraph. In an antique car show, there are three vehicles: a station wagon, a convertible, and a
minivan. The station wagon is the oldest. The minivan is newer than the convertible.

Question:
• Which of the following is true?

Options:
• A) The station wagon is the second-newest.

• B) The convertible is the second-newest.


• C) The minivan is the second-newest.
Translation:
Domain:
• 1: oldest
• 3: newest
Variables:
• station_wagon ∈ {1, 2, 3}

• convertible ∈ {1, 2, 3}

• minivan ∈ {1, 2, 3}

Constraints:
• station_wagon == 1: The station wagon is the oldest.

• minivan > convertible: The minivan is newer than the convertible.


• AllDifferentConstraint([station_wagon, convertible, minivan]): All vehicles have different values.

Query:

• A)station_wagon == 2: The station wagon is the second-newest.

• B)convertible == 2: The convertible is the second-newest.

• C)minivan == 2: The minivan is the second-newest.
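The translated CSP is small enough to check by exhaustive search. As a sanity check on the symbolic translation (SymbCoT itself solves the problem with the LLM, not an external solver), the domain can be enumerated with Python's standard library:

```python
from itertools import permutations

# Ages 1 (oldest) to 3 (newest); iterating over permutations
# enforces the AllDifferent constraint automatically.
solutions = [
    dict(zip(["station_wagon", "convertible", "minivan"], p))
    for p in permutations([1, 2, 3])
    if p[0] == 1      # station_wagon == 1: the station wagon is the oldest
    and p[2] > p[1]   # minivan > convertible: the minivan is newer
]

assert solutions == [{"station_wagon": 1, "convertible": 2, "minivan": 3}]
# convertible == 2, so option B ("the convertible is the second-newest") holds.
```

The single satisfying assignment confirms the answer B derived in the steps that follow.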


Step-2: Deriving plan based on the natural and symbolic context
Task:
Can you derive a step-by-step plan that can use the domain, variables and constraints relevant inference rules to choose the
correct option satisfying all the constraints?
Domain:
• 1: oldest
• 3: newest
Variables:
• station_wagon ∈ {1, 2, 3}

• convertible ∈ {1, 2, 3}

• minivan ∈ {1, 2, 3}

Constraints:
• station_wagon == 1: The station wagon is the oldest.

• minivan > convertible: The minivan is newer than the convertible.


• AllDifferentConstraint([station_wagon, convertible, minivan]): All vehicles have different values.

Query:

• A)station_wagon == 2: The station wagon is the second-newest.

• B)convertible == 2: The convertible is the second-newest.

• C)minivan == 2: The minivan is the second-newest.

Plan: To approach the constraint optimization problem described, involving variables like station_wagon, convertible, and
minivan with their respective constraints, you can follow a systematic plan.

• 1. Identify Variables and their domain: List all the variables (station_wagon, convertible, minivan) and note their possible
values (1, 2, 3), where 1 represents the oldest and 3 represents the newest.

• 2. Apply the constraints to narrow down the possibility: The station_wagon is the oldest.
The minivan is newer than the convertible.
All vehicles (station_wagon, convertible, minivan) must have different ages.

• (... More plan ...)

• This plan provides a systematic approach to solving the problem without directly computing the solution. It emphasizes
understanding constraints, logical deduction, and iterative refinement to navigate toward a solution that satisfies all given
conditions.

Step-3: Solving the Problem given the context and plan


Task:
The task is a constraint optimization problem, and you need to select the most appropriate option that is true from the given
options. The domain, variables, constraints, and relevant plan will be given to you to solve the question.
There will be only one answer, so choose the one you think is the most likely.
Here is a workflow you need to follow to solve the constraint optimization problem:
1.Domain and Variables: Identify the possible values (domain) and variables involved.
2.Apply Constraints: Read and apply the specific constraints to the variables.
3.List Possibilities: Enumerate all potential combinations or orders that could satisfy the constraints.
4.Determine Order/Values: Establish the correct order or specific values for each variable as per the constraints.
5.Demonstrate Satisfaction of Constraints: Provide a specific example that shows all constraints are met.
6.Evaluate Queries: Analyze each query against the established order or values.
7.Identify Correct Answer: Choose the option that aligns with the determined order/values.
8.Present Final Answer: Clearly state the solution that satisfies the constraints and query.
Domain:
• 1: oldest
• 3: newest
Variables:
• station_wagon ∈ {1, 2, 3}

• convertible ∈ {1, 2, 3}

• minivan ∈ {1, 2, 3}

Constraints:
• station_wagon == 1: The station wagon is the oldest.

• minivan > convertible: The minivan is newer than the convertible.


• AllDifferentConstraint([station_wagon, convertible, minivan]): All vehicles have different values.

Query:

• A)station_wagon == 2: The station wagon is the second-newest.

• B)convertible == 2: The convertible is the second-newest.

• C)minivan == 2: The minivan is the second-newest.

Plan: To approach the constraint optimization problem described, involving variables like station_wagon, convertible, and
minivan with their respective constraints, you can follow a systematic plan.

• 1. Identify Variables and their domain: List all the variables (station_wagon, convertible, minivan) and note their possible
values (1, 2, 3), where 1 represents the oldest and 3 represents the newest.

• 2. Apply the constraints to narrow down the possibility: The station_wagon is the oldest.
The minivan is newer than the convertible.
All vehicles (station_wagon, convertible, minivan) must have different ages.

• (... More plan ...)

• This plan provides a systematic approach to solving the problem without directly computing the solution. It emphasizes
understanding constraints, logical deduction, and iterative refinement to navigate toward a solution that satisfies all given
conditions.
Execution:
Let’s solve the given constraint optimization problem following the outlined plan:

• Identify variables and their domains: Variables: station_wagon, convertible, minivan. Domain for each: 1 (oldest), 2, 3
(newest)
• station_wagon == 1: Station wagon is the oldest.
minivan > convertible: Minivan is newer than the convertible.
All vehicles have different ages.

• Apply Direct Constraints: Assign station_wagon the value 1 since it’s the oldest.

• (... More solving steps ...)

• Based on the configurations that satisfy all constraints, the only query that holds true is convertible == 2, making it the
second-newest vehicle. Therefore, the final answer is B.

Step-4: Verifying the translation and solving process


Task:
Please verify the solution to the following constraint optimization problem using a detailed method. The problem consists of
a set of variables, each with a specific domain, and a set of constraints. Your task is to determine if the proposed solution in
the "Original execution" correctly satisfies all the constraints, and if not, find a valid solution that does.
The context will contain the following:
- Domain: Specify the domain for the problem, for example, ’1: highest to 5: lowest’ or ’1: oldest to 5: newest’.
- Variables: List all the variables in the problem with their possible values in the domain.
- Constraints: Describe each constraint in detail, making sure to interpret them correctly according to the domain. For example,
’X > Y’ means ’X finishes below Y’ if the domain is ’1: highest to 5: lowest’.
- Query: State the query that needs to be answered based on the variables and constraints.
Solution Verification Steps:
1. Review the domain, variables, constraints, and query between symbolic format and natural language for consistency with
explanation. If there is an inconsistency between the symbolic format and the natural language description of constraints,
make sure you only change the symbolic format but DON’T change the natural language as natural language is the ground
truth standard.
2. If the proposed solution violates any constraints, attempt to find a valid order that satisfies all constraints. Explain your
reasoning for the new order.
3. Once a valid order is found, evaluate it against the query.
4. Present the final answer based on your evaluation.
Your task is to apply this method to verify the problem, and if necessary, re-solve it to find a valid solution.
Note: Make sure to interpret the constraints accurately based on the domain, and carefully evaluate each constraint against
the proposed solution. If the solution is incorrect, methodically find a valid order that satisfies all constraints, and then answer
the query based on this valid order.
Domain:
• 1: oldest
• 3: newest
Variables:
• station_wagon ∈ {1, 2, 3}

• convertible ∈ {1, 2, 3}

• minivan ∈ {1, 2, 3}

Constraints:
• station_wagon == 1: The station wagon is the oldest.
• minivan > convertible: The minivan is newer than the convertible.
• AllDifferentConstraint([station_wagon, convertible, minivan]): All vehicles have different values.

Query:

• A)station_wagon == 2: The station wagon is the second-newest.

• B)convertible == 2: The convertible is the second-newest.

• C)minivan == 2: The minivan is the second-newest.

Original Execution:
Let’s solve the given constraint optimization problem following the outlined plan:

• Identify variables and their domains: Variables: station_wagon, convertible, minivan. Domain for each: 1 (oldest), 2, 3
(newest)
• station_wagon == 1: Station wagon is the oldest.
minivan > convertible: Minivan is newer than the convertible.
All vehicles have different ages.
• Apply Direct Constraints: Assign station_wagon the value 1 since it’s the oldest.
• (... More solving steps ...)
• Based on the configurations that satisfy all constraints, the only query that holds true is convertible == 2, making it the
second-newest vehicle. Therefore, the final answer is B.
Verification:
Verification of the translation and the Original Execution:
Context verification:
• The context and query correctly correspond to the information in the original context and are semantically consistent.
Logical Process Verification:
• Given the constraints and the domain’s interpretation, let’s verify the original execution:
• 1. Station Wagon’s Age: As the oldest, its correct assignment is 1. This matches both the symbolic and natural language
descriptions.
• 2. Convertible and Minivan Relationship: The constraint "minivan > convertible" suggests the minivan is newer (higher
number) than the convertible. Given the domain and the direct constraint on the station wagon, the convertible cannot
be the oldest (1), and the minivan cannot be the oldest due to its relationship with the convertible. This necessitates a
careful review of the possible configurations.
• (... More logical verification ...)
• The original solution’s interpretation of the constraints and the final answer is consistent with the domain’s direction and
the natural language description of the constraints. The logic applied matches the expected outcomes, and no changes
are necessary to the symbolic format or the conclusion based on the provided context and constraints. Therefore, the
original answer B remains unchanged.
I.1.5 AR-LSAT
Step-1: Translating natural language context into symbolic
Task Description:
Task Description: You are given a problem description. The task is to parse the problem as a constraint satisfaction problem,
defining the domain, variables, and constraints.
Problem:
• Four boys—Fred, Juan, Marc, and Paul—and three girls—Nita, Rachel, and Trisha—will be assigned to a row of five
adjacent lockers, numbered consecutively 1 through 5, arranged along a straight wall. The following conditions govern the
assignment of lockers to the seven children: Each locker must be assigned to either one or two children, and each child
must be assigned to exactly one locker. Each shared locker must be assigned to one girl and one boy. Juan must share a
locker, but Rachel cannot share a locker. Nita’s locker cannot be adjacent to Trisha’s locker. Fred must be assigned to
locker 3.
Question:
• If the first three lockers are assigned to girls, which one of the following must be true?
Options:
• A) Juan is assigned to locker 1.
• B) Nita is assigned to locker 3.
• C) Trisha is assigned to locker 1.
• D) Juan is assigned to the same locker as Trisha.
• E) Paul is assigned to the same locker as Trisha.
Translation:
Domain
• Lockers: 1 to 5
Variables
• Fred_locker ∈ {1, 2, 3, 4, 5}
• Juan_locker ∈ {1, 2, 3, 4, 5}
• Marc_locker ∈ {1, 2, 3, 4, 5}
• Paul_locker ∈ {1, 2, 3, 4, 5}
• Nita_locker ∈ {1, 2, 3, 4, 5}
• Rachel_locker ∈ {1, 2, 3, 4, 5}
• Trisha_locker ∈ {1, 2, 3, 4, 5}
Constraints
1. Each Locker One or Two Children: Each locker can have either one or two children assigned to it.
2. Each Child One Locker: Each child is assigned to exactly one locker.
3. Shared Locker Boy and Girl: If a locker is shared, it must be shared between one boy and one girl.
4. Juan Must Share: Juan_locker must be the same as one girl’s locker.
5. Rachel Cannot Share: Rachel_locker must be unique and not shared with any other child.
6. Nita Not Adjacent to Trisha: |Nita_locker − Trisha_locker| ≠ 1
7. Fred Assigned to Locker 3: Fred_locker = 3
8. First Three Lockers to Girls: At least one girl must be assigned to each of the lockers 1, 2, and 3.
Queries for Options (If the first three lockers are assigned to girls, which one of the following must be true?)
• A) Juan is assigned to locker 1. Query: Juan_locker = 1 and Juan_locker is shared with a girl
• B) Nita is assigned to locker 3. Query: Nita_locker = 3
• C) Trisha is assigned to locker 1. Query: Trisha_locker = 1
• D) Juan is assigned to the same locker as Trisha. Query: Juan_locker = Trisha_locker
• E) Paul is assigned to the same locker as Trisha. Query: Paul_locker = Trisha_locker
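The translated constraints above can also be checked mechanically. The brute-force sketch below is illustrative only and external to the SymbCoT pipeline; the boy/girl lists and dictionary encoding are assumptions made for the sketch. It enumerates all locker assignments, keeps the ones satisfying every constraint plus the question's added condition, and tests which option holds in all of them:

```python
from itertools import product

boys = ["Fred", "Juan", "Marc", "Paul"]
girls = ["Nita", "Rachel", "Trisha"]
children = boys + girls

def valid(a):
    occ = {l: [c for c in children if a[c] == l] for l in range(1, 6)}
    if any(not 1 <= len(v) <= 2 for v in occ.values()):
        return False  # each locker is assigned one or two children
    if any(len(v) == 2 and sum(c in boys for c in v) != 1 for v in occ.values()):
        return False  # a shared locker pairs one boy with one girl
    if len(occ[a["Juan"]]) != 2 or len(occ[a["Rachel"]]) != 1:
        return False  # Juan must share; Rachel cannot share
    if abs(a["Nita"] - a["Trisha"]) == 1 or a["Fred"] != 3:
        return False  # Nita not adjacent to Trisha; Fred in locker 3
    # the question's extra condition: a girl in each of lockers 1, 2, 3
    return all(any(g in occ[l] for g in girls) for l in (1, 2, 3))

models = [m for m in (dict(zip(children, p))
                      for p in product(range(1, 6), repeat=7)) if valid(m)]
assert models and all(m["Juan"] == 1 for m in models)  # option A must be true
```

Every satisfying assignment places Juan in locker 1, matching the answer A derived in the solving and verification steps below.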
Step-2: Deriving plan based on the natural and symbolic context
Task:
Can you derive a step-by-step plan that can use the domain, variables and constraints relevant inference rules to choose the
correct option satisfying all the constraints?
Domain

• Lockers: 1 to 5

Variables

• Fred_locker ∈ {1, 2, 3, 4, 5}

• Juan_locker ∈ {1, 2, 3, 4, 5}

• Marc_locker ∈ {1, 2, 3, 4, 5}

• Paul_locker ∈ {1, 2, 3, 4, 5}

• Nita_locker ∈ {1, 2, 3, 4, 5}

• Rachel_locker ∈ {1, 2, 3, 4, 5}

• Trisha_locker ∈ {1, 2, 3, 4, 5}

Constraints

1. Each Locker One or Two Children: Each locker can have either one or two children assigned to it.

2. Each Child One Locker: Each child is assigned to exactly one locker.

3. Shared Locker Boy and Girl: If a locker is shared, it must be shared between one boy and one girl.

4. Juan Must Share: Juan_locker must be the same as one girl’s locker.

5. Rachel Cannot Share: Rachel_locker must be unique and not shared with any other child.

6. Nita Not Adjacent to Trisha: |Nita_locker − Trisha_locker| ≠ 1

7. Fred Assigned to Locker 3: Fred_locker = 3

8. First Three Lockers to Girls: At least one girl must be assigned to each of the lockers 1, 2, and 3.

Queries for Options (If the first three lockers are assigned to girls, which one of the following must be true?)

• A) Juan is assigned to locker 1. Query: Juan_locker = 1 and Juan_locker is shared with a girl

• B) Nita is assigned to locker 3. Query: Nita_locker = 3

• C) Trisha is assigned to locker 1. Query: Trisha_locker = 1

• D) Juan is assigned to the same locker as Trisha. Query: Juan_locker = Trisha_locker

• E) Paul is assigned to the same locker as Trisha. Query: Paul_locker = Trisha_locker

Plan: To approach this constraint optimization problem regarding locker assignments for children, follow a systematic plan
that accounts for all given constraints and variables. This plan will guide you through understanding the problem, applying
logical reasoning, and deducing possible configurations that satisfy all conditions.

• 1. Understand the Domain and Variables: Acknowledge the domain of lockers numbered 1 to 5. Note each child’s possible
locker assignments (Fred, Juan, Marc, Paul, Nita, Rachel, Trisha) within this domain.

• 2. Comprehend Constraints:
Each locker can have either one or two children assigned, but no more.
Each child is assigned exactly one locker.
(... More constraints ...)
• (... More plan ...)

• This plan emphasizes a logical, step-by-step approach to deducing valid locker assignments for each child, ensuring
compliance with all constraints. By methodically analyzing each query in the context of these constraints, you can identify
which statements about locker assignments must be true.
Step-3: Solving the Problem given the context and plan
Task:
The task is a constraint optimization problem, and you need to select the most appropriate option that is true from the given
options. The domain, variables, constraints, and relevant plan will be given to you to solve the question.
There will be only one answer, so choose the one you think is the most likely.
Here is a workflow you need to follow to solve the constraint optimization problem:
1.Domain and Variables: Identify the possible values (domain) and variables involved.
2.Apply Constraints: Read and apply the specific constraints to the variables.
3.List Possibilities: Enumerate all potential combinations or orders that could satisfy the constraints.
4.Determine Order/Values: Establish the correct order or specific values for each variable as per the constraints.
5.Demonstrate Satisfaction of Constraints: Provide a specific example that shows all constraints are met.
6.Evaluate Queries: Analyze each query against the established order or values.
7.Identify Correct Answer: Choose the option that aligns with the determined order/values.
8.Present Final Answer: Clearly state the solution that satisfies the constraints and query.
Domain

• Lockers: 1 to 5

Variables

• Fred_locker ∈ {1, 2, 3, 4, 5}

• Juan_locker ∈ {1, 2, 3, 4, 5}

• Marc_locker ∈ {1, 2, 3, 4, 5}

• Paul_locker ∈ {1, 2, 3, 4, 5}

• Nita_locker ∈ {1, 2, 3, 4, 5}

• Rachel_locker ∈ {1, 2, 3, 4, 5}

• Trisha_locker ∈ {1, 2, 3, 4, 5}

Constraints

1. Each Locker One or Two Children: Each locker can have either one or two children assigned to it.

2. Each Child One Locker: Each child is assigned to exactly one locker.

3. Shared Locker Boy and Girl: If a locker is shared, it must be shared between one boy and one girl.

4. Juan Must Share: Juan_locker must be the same as one girl’s locker.

5. Rachel Cannot Share: Rachel_locker must be unique and not shared with any other child.

6. Nita Not Adjacent to Trisha: |Nita_locker − Trisha_locker| ≠ 1

7. Fred Assigned to Locker 3: Fred_locker = 3

8. First Three Lockers to Girls: At least one girl must be assigned to each of the lockers 1, 2, and 3.

Queries for Options (If the first three lockers are assigned to girls, which one of the following must be true?)

• A) Juan is assigned to locker 1. Query: Juan_locker = 1 and Juan_locker is shared with a girl

• B) Nita is assigned to locker 3. Query: Nita_locker = 3

• C) Trisha is assigned to locker 1. Query: Trisha_locker = 1

• D) Juan is assigned to the same locker as Trisha. Query: Juan_locker = Trisha_locker

• E) Paul is assigned to the same locker as Trisha. Query: Paul_locker = Trisha_locker

Plan: To approach this constraint optimization problem regarding locker assignments for children, follow a systematic plan
that accounts for all given constraints and variables. This plan will guide you through understanding the problem, applying
logical reasoning, and deducing possible configurations that satisfy all conditions.

• 1. Understand the Domain and Variables: Acknowledge the domain of lockers numbered 1 to 5. Note each child’s possible
locker assignments (Fred, Juan, Marc, Paul, Nita, Rachel, Trisha) within this domain.
• 2. Comprehend Constraints:
Each locker can have either one or two children assigned, but no more.
Each child is assigned exactly one locker.
(... More constraints ...)
• 3. Apply all the constraints to each option.

• (... More plan ...)

• This plan emphasizes a logical, step-by-step approach to deducing valid locker assignments for each child, ensuring
compliance with all constraints. By methodically analyzing each query in the context of these constraints, you can identify
which statements about locker assignments must be true.

Execution:
Let’s solve the given constraint optimization problem following the outlined plan:
Option A: Juan is assigned to locker 1.

• Negation of Condition for A: Juan is not assigned to locker 1.

• Test Against Constraints: If Juan is not in locker 1, he could be in locker 2, 3, 4, or 5. Since locker 3 is taken by Fred (a
boy), Juan cannot share with him due to constraint 3. If Juan shares locker 4 or 5 with a girl, this would still satisfy all
constraints. Therefore, violating the condition in Option A can still lead to a valid order satisfying all the constraints.
Thus, Option A does not have to be true.

Option B: Nita is assigned to locker 3.

• Negation of Condition for B: Nita is not assigned to locker 3.

• Test Against Constraints: Since Fred is assigned to locker 3, Nita cannot be assigned to locker 3. This does not violate
any constraints, as Nita can be assigned to any other locker that is not adjacent to Trisha’s. Therefore, Option B cannot
be true because it directly conflicts with constraint 7.
(... More solving steps ...)
Conclusion:
• Upon evaluation, we notice that only option A must be true to satisfy all the constraints. Therefore, the final answer is A.

Step-4: Verifying the translation and solving process


Task:
Please verify the solution to the following constraint optimization problem using a detailed method. The problem consists of
a set of variables, each with a specific domain, and a set of constraints. Your task is to determine if the proposed solution in
the "Original execution" correctly satisfies all the constraints, and if not, find a valid solution that does.

• Context will contain the following:


– Domain: Specify the domain for the problem, for example, ‘1: highest to 5: lowest’ or ‘1: oldest to 5: newest’.
– Variables: List all the variables in the problem with their possible values in the domain.
– Constraints: Describe each constraint in detail, making sure to interpret them correctly according to the domain.
For example, ‘X > Y’ means ‘X finishes below Y’ if the domain is ‘1: highest to 5: lowest’.
– Query: State the query that needs to be answered based on the variables and constraints.

• Your task is to apply this method to verify the problem, and if necessary, re-solve it to find a valid solution.

• Here are the requirements:


– Please pay attention to the domain, and make sure the value for each variable is consistent with the meaning of the
domain and the natural language. For instance, if the Domain: 1 (cheapest) to 7 (most expensive), and a constraint
says X is the second-cheapest, then the value of X should be 2 but NOT 6 because 6 means the second-most
expensive which violates the meaning of natural language. In this case, the smaller value infers the cheaper the
thing is. So when you do comparison to check whether a constraint is satisfied or not, make sure you correctly
interpret the direction of being cheaper is smaller value but NOT higher value. In this case, if a constraint says X
is cheaper than Y. If X is 3 and Y is 2, this constraint is violated because the domain indicates that smaller value
means cheaper price and 3 being larger than 2 means X is more expensive than Y, which violates the constraint.
– Please pay attention to the meaning of the domain, and make sure you interpret the direction of the value based on
the domain correctly to particular cases.
– IMPORTANT: Review the domain, variables, constraints, and query for consistency. If there is inconsistency
between the symbolic format and the natural language description of constraints, make sure you only change the
symbolic format but DON’T change the natural language as natural language is the ground truth standard.
– Make sure you keep iterating until you find the valid order that can accurately answer the question.
– Make sure you fully understand the question before verification.

Domain

• Lockers: 1 to 5
Variables
• Fred_locker ∈ {1, 2, 3, 4, 5}

• Juan_locker ∈ {1, 2, 3, 4, 5}

• Marc_locker ∈ {1, 2, 3, 4, 5}

• Paul_locker ∈ {1, 2, 3, 4, 5}

• Nita_locker ∈ {1, 2, 3, 4, 5}

• Rachel_locker ∈ {1, 2, 3, 4, 5}

• Trisha_locker ∈ {1, 2, 3, 4, 5}

Constraints
1. Each Locker One or Two Children: Each locker can have either one or two children assigned to it.
2. Each Child One Locker: Each child is assigned to exactly one locker.
3. Shared Locker Boy and Girl: If a locker is shared, it must be shared between one boy and one girl.
4. Juan Must Share: Juan_locker must be the same as one girl’s locker.
5. Rachel Cannot Share: Rachel_locker must be unique and not shared with any other child.

6. Nita Not Adjacent to Trisha: |Nita_locker − Trisha_locker| ≠ 1

7. Fred Assigned to Locker 3: Fred_locker = 3


8. First Three Lockers to Girls: At least one girl must be assigned to each of the lockers 1, 2, and 3.

Queries for Options (If the first three lockers are assigned to girls, which one of the following must be true?)

• A) Juan is assigned to locker 1. Query: Juan_locker = 1 and Juan_locker is shared with a girl
• B) Nita is assigned to locker 3. Query: Nita_locker = 3
• C) Trisha is assigned to locker 1. Query: Trisha_locker = 1
• D) Juan is assigned to the same locker as Trisha. Query: Juan_locker = Trisha_locker
• E) Paul is assigned to the same locker as Trisha. Query: Paul_locker = Trisha_locker

Original Execution:
Let’s solve the given constraint optimization problem following the outlined plan:
Option A: Juan is assigned to locker 1.
• Negation of Condition for A: Juan is not assigned to locker 1.
• Test Against Constraints: If Juan is not in locker 1, he could be in locker 2, 3, 4, or 5. Since locker 3 is taken by Fred (a
boy), Juan cannot share with him due to constraint 3. If Juan shares locker 4 or 5 with a girl, this would still satisfy all
constraints. Therefore, violating the condition in Option A can still lead to a valid order satisfying all the constraints.
Thus, Option A does not have to be true.
Option B: Nita is assigned to locker 3.
• Negation of Condition for B: Nita is not assigned to locker 3.
• Test Against Constraints: Since Fred is assigned to locker 3, Nita cannot be assigned to locker 3. This does not violate
any constraints, as Nita can be assigned to any other locker that is not adjacent to Trisha’s. Therefore, Option B cannot
be true because it directly conflicts with constraint 7.
(... More solving steps ...)
Conclusion:
• Upon evaluation, we notice that only option A must be true to satisfy all the constraints. Therefore, the final answer is A.
Verification:
Verification of the translation and the Original Execution:
Context verification:
• The context and query correctly correspond to the information in the original context and are semantically consistent.
Logical Process Verification:
• Given the constraints and the domain’s interpretation, let’s verify the original execution and other options:

• Option A: Verification of Constraints: If Juan is assigned to locker 1, he must share with a girl due to constraint 4. Since
locker 3 is taken by Fred (a boy), and Rachel cannot share a locker (constraint 5), Rachel must be in locker 2. This leaves
locker 1 for Juan to share with a girl other than Rachel. This satisfies all constraints, making Option A a possibility.

• Option B: Nita is assigned to locker 3. Verification of Constraints: Since Fred is assigned to locker 3 (constraint 7), Nita
cannot be assigned to locker 3. This directly conflicts with constraint 7, making Option B impossible.

• (... More logical verification ...)

• Given that the first three lockers are assigned to girls, Fred is in locker 3, and Rachel cannot share a locker, Rachel must
be in locker 1 or 2. If Rachel is in locker 1, then Trisha cannot be in locker 1, which means Trisha must be in locker 2.
This leaves locker 1 available for Juan to share with a girl, which must be the case since Juan must share a locker with a
girl, and locker 3 is occupied by Fred, a boy. Therefore, Juan must be assigned to locker 1 to satisfy the constraints,
making Option A the one that must be true. Upon verification, we found that Option A indeed is the only possible answer.
Thus, the answer A should remain unchanged.

I.2 Chain-of-Thought (CoT)


Below is the prompt for CoT across each dataset.
I.2.1 ProntoQA
Given a problem statement as context, the task is to answer a logical reasoning question.
Context:
• Jompuses are not shy.
• Jompuses are yumpuses.

• Each yumpus is aggressive.

• Each yumpus is a dumpus.

• Dumpuses are not wooden.


• Dumpuses are wumpuses.

• Wumpuses are red.

• Every wumpus is an impus.

• Each impus is opaque.


• Impuses are tumpuses.
• Numpuses are sour.
• Tumpuses are not sour.
• Tumpuses are vumpuses.

• Vumpuses are earthy.

• Every vumpus is a zumpus.

• Zumpuses are small.


• Zumpuses are rompuses.

• Max is a yumpus.

Question: Is the following statement true or false? Max is sour.


Options:
A) True
B) False
Reasoning:
• Max is a yumpus.

• Each yumpus is a dumpus. So Max is a dumpus.

• Dumpuses are wumpuses. So Max is a wumpus.


• Every wumpus is an impus. So Max is an impus.

• Impuses are tumpuses. So Max is a tumpus.

• Tumpuses are not sour. So Max is not sour.

Conclusion:
• The correct option is: B)
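The hop-by-hop reasoning above is a linear walk down a subtype chain. A compact rendering (the dictionary encoding is an assumption for illustration; the subtype and attribute facts are taken from the context):

```python
subtype = {
    "yumpus": "dumpus", "dumpus": "wumpus", "wumpus": "impus",
    "impus": "tumpus", "tumpus": "vumpus", "vumpus": "zumpus",
}
attribute = {"numpus": "sour", "tumpus": "not sour"}

# Max is a yumpus; follow the chain and collect attribute verdicts.
categories, cat = ["yumpus"], "yumpus"
while cat in subtype:
    cat = subtype[cat]
    categories.append(cat)

verdicts = [attribute[c] for c in categories if c in attribute]
assert verdicts == ["not sour"]  # "Max is sour" is false, so option B
```

The chain reaches "tumpus" but never "numpus", so the only verdict is "not sour", matching the conclusion B.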

I.2.2 ProofWriter

Given a problem statement as contexts, the task is to answer a logical reasoning question.
Context:
• The cow is blue.
• The cow is round.
• The cow likes the lion.
• The cow visits the tiger.

• The lion is cold.


• The lion is nice.
• The lion likes the squirrel.

• The squirrel is round.

• The squirrel sees the lion.

• The squirrel visits the cow.

• The tiger likes the cow.

• The tiger likes the squirrel.

• If something is cold then it visits the tiger.

• If something visits the tiger then it is nice.

• If something sees the tiger and it is young then it is blue.

• If something is nice then it sees the tiger.

• If something likes the squirrel and it likes the cow then it visits the tiger.

• If something is nice and it sees the tiger then it is young.

• If the cow is cold and the cow visits the lion then the lion sees the squirrel.

Question: Based on the above information, is the following statement true, false, or unknown? The tiger is not young.
Options:
A) True
B) False
C) Unknown
Reasoning:
• The tiger likes the cow.

• The tiger likes the squirrel.

• If something likes the squirrel and it likes the cow, then it visits the tiger. So the tiger visits the tiger.

• If something visits the tiger then it is nice. So the tiger is nice.

• If something is nice and it sees the tiger then it is young. So the tiger is young.

Conclusion:
• The correct option is: B)
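The rule applications in this reasoning chain follow the standard forward-chaining pattern. The sketch below encodes only the facts and rules used in the chain above (a simplifying assumption; the full context has more of both) and derives the same conclusion:

```python
facts = {("likes", "tiger", "cow"), ("likes", "tiger", "squirrel")}
rules = [  # (premises, conclusion); "X" is a variable over entities
    ([("likes", "X", "squirrel"), ("likes", "X", "cow")], ("visits", "X", "tiger")),
    ([("visits", "X", "tiger")], ("nice", "X")),
    ([("nice", "X")], ("sees", "X", "tiger")),
    ([("nice", "X"), ("sees", "X", "tiger")], ("young", "X")),
]

def ground(atom, entity):
    return tuple(entity if t == "X" else t for t in atom)

changed = True
while changed:  # naive forward chaining to a fixed point
    changed = False
    for premises, conclusion in rules:
        for entity in ("tiger",):
            if (all(ground(p, entity) in facts for p in premises)
                    and ground(conclusion, entity) not in facts):
                facts.add(ground(conclusion, entity))
                changed = True

# "The tiger is young" is derived, so "the tiger is not young" is B) False.
assert ("young", "tiger") in facts
```

Each pass through the loop mirrors one bullet of the reasoning: visits(tiger, tiger), then nice(tiger), then sees(tiger, tiger), then young(tiger).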
I.2.3 FOLIO
Given a problem statement as contexts, the task is to answer a logical reasoning question.
Context:
• The Blake McFall Company Building is a commercial warehouse listed on the National Register of Historic Places.

• The Blake McFall Company Building was added to the National Register of Historic Places in 1990.

• The Emmet Building is a five-story building in Portland, Oregon.

• The Emmet Building was built in 1915.

• The Emmet Building is another name for the Blake McFall Company Building.

• John works at the Emmet Building.

Question: Based on the above information, is the following statement true, false, or uncertain? The Blake McFall Company
Building is located in Portland, Oregon.
Options:
A) True
B) False
C) Uncertain
Reasoning:
• The Blake McFall Company Building is another name for the Emmet Building.

• The Emmet Building is located in Portland, Oregon.

• Therefore, the Blake McFall Company Building is located in Portland, Oregon.

Conclusion:
• The correct option is: A)
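The inference hinges on substituting one name for another, since the two names denote the same building. A toy rendering of that equality reasoning (the dictionaries are an assumed encoding, not part of the prompt):

```python
# Two names, one entity: resolve aliases before looking up properties.
aliases = {"Blake McFall Company Building": "Emmet Building"}
location = {"Emmet Building": "Portland, Oregon"}

building = "Blake McFall Company Building"
canonical = aliases.get(building, building)
assert location[canonical] == "Portland, Oregon"  # option A) True
```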

I.2.4 LogicalDeduction
Given a problem statement as contexts, the task is to answer a logical reasoning question.
Context:
• On a branch, there are five birds: a quail, an owl, a raven, a falcon, and a robin.
• The owl is the leftmost.
• The robin is to the left of the raven.
• The quail is the rightmost.
• The raven is the third from the left.
Question: Which of the following is true?
Options:
A) The quail is the rightmost.
B) The owl is the rightmost.
C) The raven is the rightmost.
D) The falcon is the rightmost.
E) The robin is the rightmost.
Reasoning:
• The owl is the leftmost. This means the owl is not the rightmost.
• The robin is to the left of the raven. This means neither the robin nor the raven is the rightmost.
• The quail is the rightmost. This directly answers the question, so option A) The quail is the rightmost is true.
• The raven is the third from the left. This means the raven is also not the rightmost.
Conclusion:
• The correct option is: A)
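The ordering deduced above can be verified by brute force over all orderings of the five birds. The following sketch (an illustration, not part of the prompt) checks the four stated constraints:

```python
from itertools import permutations

birds = ["quail", "owl", "raven", "falcon", "robin"]

def satisfies(order):
    # order[0] is leftmost, order[4] is rightmost.
    return (
        order[0] == "owl"                                # the owl is the leftmost
        and order.index("robin") < order.index("raven")  # robin left of raven
        and order[4] == "quail"                          # the quail is the rightmost
        and order[2] == "raven"                          # raven third from the left
    )

solutions = [p for p in permutations(birds) if satisfies(p)]
# Exactly one ordering satisfies all constraints, with the quail rightmost (option A).
print(solutions)  # [('owl', 'robin', 'raven', 'falcon', 'quail')]
```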
I.2.5 AR-LSAT
Given a problem statement as contexts, the task is to answer a logical reasoning question.
Context:
• During a single week, from Monday through Friday, tours will be conducted of a company’s three divisions—Operations,
Production, and Sales.
• Exactly five tours will be conducted that week, one each day.

• The schedule of tours for the week must conform to the following restrictions:
– Each division is toured at least once.
– The Operations division is not toured on Monday.
– The Production division is not toured on Wednesday.
– The Sales division is toured on two consecutive days, and on no other days.
– If the Operations division is toured on Thursday, then the Production division is toured on Friday.

Question: Which one of the following CANNOT be true of the week’s tour schedule?
Options:
A) The division that is toured on Monday is also toured on Tuesday.

B) The division that is toured on Monday is also toured on Friday.

C) The division that is toured on Tuesday is also toured on Thursday.

D) The division that is toured on Wednesday is also toured on Friday.

E) The division that is toured on Thursday is also toured on Friday.

Reasoning:
• If the same division were toured on both Tuesday and Thursday, it could not be Sales, since Sales is toured on exactly two consecutive days and Tuesday and Thursday are not consecutive.

• If Operations were toured on both Tuesday and Thursday, then Production would have to be toured on Friday, leaving only Monday and Wednesday for the two consecutive Sales tours; those days are not consecutive.

• If Production were toured on both Tuesday and Thursday, the two consecutive Sales tours would have to fall on Monday, Wednesday, or Friday, and no two of those days are consecutive.

• Therefore, the division toured on Tuesday cannot also be toured on Thursday.

Conclusion:
• The correct option is: C)
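The claim that option C cannot be true can be verified by enumerating all 3^5 possible schedules against the stated restrictions. A brief Python sketch (not part of the original prompt):

```python
from itertools import product

DIVS = "OPS"  # Operations, Production, Sales

def valid(sched):
    # Each division is toured at least once.
    if set(sched) != set(DIVS):
        return False
    # Operations not on Monday; Production not on Wednesday.
    if sched[0] == "O" or sched[2] == "P":
        return False
    # Sales is toured on exactly two consecutive days, and on no other days.
    s_days = [i for i, d in enumerate(sched) if d == "S"]
    if len(s_days) != 2 or s_days[1] - s_days[0] != 1:
        return False
    # If Operations is toured on Thursday, Production is toured on Friday.
    if sched[3] == "O" and sched[4] != "P":
        return False
    return True

# Index 0 = Monday, ..., index 4 = Friday.
schedules = [s for s in product(DIVS, repeat=5) if valid(s)]

# Option C: Tuesday's division equals Thursday's division.
print(any(s[1] == s[3] for s in schedules))  # False
```

No valid schedule tours the same division on Tuesday and Thursday, confirming that option C cannot be true.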