
Logical Reasoning in Large Language Models: A Survey

Hanmeng Liu1†, Zhizhang Fu1†, Mengru Ding1, Ruoxi Ning1, Chaoli Zhang2, Xiaozhang Liu3 and Yue Zhang1∗
1Westlake University, 2Zhejiang Normal University, 3Hainan University
{liuhanmeng, zhangyue}@westlake.edu.cn, {fuzhizhang.fzz, dingmengru2021}@gmail.com, [email protected], [email protected], [email protected]
arXiv:2502.09100v1 [cs.AI] 13 Feb 2025

∗ Corresponding author. † Equal contribution.

Abstract

With the emergence of advanced reasoning models like OpenAI o3 and DeepSeek-R1, large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, their ability to perform rigorous logical reasoning remains an open question. This survey synthesizes recent advancements in logical reasoning within LLMs, a critical area of AI research. It outlines the scope of logical reasoning in LLMs, its theoretical foundations, and the benchmarks used to evaluate reasoning proficiency. We analyze existing capabilities across different reasoning paradigms — deductive, inductive, abductive, and analogical — and assess strategies to enhance reasoning performance, including data-centric tuning, reinforcement learning, decoding strategies, and neuro-symbolic approaches. The review concludes with future directions, emphasizing the need for further exploration to strengthen logical reasoning in AI systems.

1 Introduction

Logical reasoning is a fundamental challenge for artificial intelligence (AI) and natural language processing (NLP) [Newell and Simon, 1956; McCarthy and Hayes, 1981; McCarthy, 1959]. While early formal logic-based reasoning approaches faced limitations in scalability and adaptability [Pereira, 1982; Cann, 1993], data-driven models became the dominant method from the 1980s onward [McCarthy, 1989]. Recently, pre-trained Large Language Models (LLMs) and their emergent logical reasoning abilities have attracted increasing attention [Liu et al., 2023b; Xu et al., 2023].

Logical reasoning integrates LLMs with inference structuring, enabling multistep deduction and abstraction and improving interpretability and reliability [Shi et al., 2021; Stacey et al., 2022; Rajaraman et al., 2023]. It also strengthens generalization, helping models handle novel scenarios beyond their training data [Haruta et al., 2020]. As LLMs become integral to domains like legal analysis and scientific discovery, ensuring the correctness and verifiability of their reasoning is increasingly vital. As a result, post-training LLMs for reasoning has garnered a surge of interest in both industry and research [OpenAI, 2024; DeepSeek-AI, 2025; Muennighoff et al., 2025].

Despite growing research, existing surveys [Plaat et al., 2024; Sun et al., 2023; Yu et al., 2024] often conflate logical reasoning with general-purpose heuristic strategies like Chain-of-Thought (CoT) [Xia et al., 2024]; a literature review dedicated to LLMs and formal symbolic logic has been lacking. This survey provides a comprehensive review of logical reasoning in large language models (LLMs), with a focus on formal and symbolic logic-based reasoning rather than general heuristic approaches. We begin by defining logical reasoning in AI, distinguishing it from general-purpose reasoning, and categorizing key paradigms, including deductive, inductive, abductive, and analogical reasoning. Additionally, we analyze existing benchmarks and evaluation methodologies, identifying gaps in assessing symbolic inference, consistency, and robustness. We further explore state-of-the-art techniques for enhancing logical reasoning, such as instruction fine-tuning, logic-informed pre-training, reinforcement learning, inference-time decoding strategies, and hybrid neuro-symbolic methods. We examine recent advances in neuro-symbolic integration, along with applications of theorem provers, logic solvers, and formal verification frameworks in LLMs. Finally, we highlight open challenges in scalability, reasoning consistency, explainability, and efficiency, proposing future directions for multi-modal reasoning, hybrid architectures, and improved evaluation frameworks. The structure of the subsequent sections is illustrated in Figure 1.
- Types & history (§2)
- Tasks & Benchmarks (§3)
  - Natural Language Inference (§3.1): ConTRoL [Liu et al., 2021a], FOLIO [Han et al., 2024a], LogicNLI [Tian et al., 2021], RuleTaker [Clark et al., 2021], LogicBench [Parmar et al., 2023]
  - Reading Comprehension (§3.2): LogiQA [Liu et al., 2023a], ReClor [Yu et al., 2020], AR-LSAT [Wang et al., 2022], CLUTRR [Sinha et al., 2019], GSM [Cobbe et al., 2021; Li et al., 2024a], LINGOLY [Bean et al., 2024]
  - Benchmarks and test suites (§3.3): GLoRE [Liu et al., 2023d], LogiGLUE [Luo et al., 2024], LogiTorch [Helwe et al., 2022]
- Evaluation & Analysis (§4)
  - Deductive Reasoning (§4.1): [Saparov et al., 2023], [Yuan et al., 2023], [Ryb et al., 2022]
  - Inductive Reasoning (§4.2): [Yang et al., 2024b], [Bowen et al., 2024], [Sullivan, 2024]
  - Abductive Reasoning (§4.3): True Detective [Del and Fishel, 2023], [Nguyen et al., 2023]
  - Analogical Reasoning (§4.4): ANALOGICAL [Wijesiriwardene et al., 2023], [Petersen and van der Plas, 2023], [Qin et al., 2024]
  - Overall Analysis & Metrics (§4.5): [Liu et al., 2023b], [Xu et al., 2023], [Liu et al., 2024c], [Gandarela et al., 2024], [Thatikonda et al., 2025]
- Enhancement Methods (§5)
  - Data-Centric Approaches (§5.1)
    - Expert-Curated Datasets: FOLIO [Han et al., 2024a], P-FOLIO [Han et al., 2024b], LeanDojo [Yang et al., 2023], Symbol-LLM [Xu et al., 2024a]
    - Synthetic Datasets: RuleTaker [Clark et al., 2021], FLD×2 [Morishita et al., 2024]
    - LLM-distilled Datasets: LogiCoT [Liu et al., 2023c], LogicPro [Jiang et al., 2024], PODA [Wang et al., 2024b]
  - Model-Centric Approaches (§5.2)
    - Instruction Fine-Tuning: LogiCoT [Liu et al., 2023c], LogiPT [Feng et al., 2024], PGL [Wang et al., 2024a], Symbol-LLM [Xu et al., 2024a], TPCL [Wang et al., 2024b]
    - Reinforcement Learning: [Jiao et al., 2024], [Xi et al., 2024], Marco-o1 [Zhao et al., 2024], DeepSeek-R1-Zero [DeepSeek-AI, 2025], DeepSeek-R1 [DeepSeek-AI, 2025]
    - Inference-Time Decoding: GoT [Lei et al., 2023], Chain of Logic [Servantez et al., 2024], Selection-Inference [Creswell et al., 2023], [Malon et al., 2024], Maieutic Prompting [Jung et al., 2022], Logic-of-Thought [Liu et al., 2024a], DetermLR [Sun et al., 2024], NeuroLogic [Lu et al., 2021], Formal-LLM [Li et al., 2024b]
  - External Knowledge Utilization (§5.3): [Zayyad and Adi, 2024], LeanDojo [Yang et al., 2023], LQOT [Liu et al., 2024b], [Ouyang et al., 2023], KnowRA [Mai et al., 2025]
  - Neuro-Symbolic Approaches (§5.4): LINC [Olausson et al., 2023], LOGICLLAMA [Yang et al., 2024a], CLOVER [Ryu et al., 2024], LOGIC-LM [Pan et al., 2023], Logic Agent [Liu et al., 2024a], LLM-TRes [Toroghi et al., 2024], SymbCoT [Xu et al., 2024c], Aristotle [Xu et al., 2024b]
- Discussion (§6)

Figure 1: The structure of this survey.

2 Logic in Artificial Intelligence

Logical reasoning is a cornerstone of artificial intelligence (AI), enabling machines to simulate human thought processes and solve complex problems. At its core, logical reasoning applies structured rules to derive conclusions from premises, providing a rigorous framework for decision-making and inference [Sun et al., 2023].

2.1 History of Logic Reasoning Research

Logical reasoning can be traced back to ancient Greece, where Aristotle's syllogisms laid the foundation for classical logic. During the Middle Ages, scholars refined these
theories, and in the 17th century, Leibniz's universal language and calculus ratiocinator bridged logic with mathematics, foreshadowing modern computational logic. The 19th century saw George Boole's Boolean algebra, which transformed logic into a mathematical framework, laying the foundation for digital computing.

The 20th century ushered in modern logic, with Russell and Whitehead's Principia Mathematica formalizing complex logical systems. By mid-century, AI pioneers like John McCarthy leveraged logic for knowledge representation and automated theorem proving, leading to logic programming and knowledge bases. The 1970s introduced non-monotonic logic, enabling AI to handle commonsense reasoning. The 1980s saw logical reasoning integrate with knowledge representation, advancing expert systems for real-world applications. The 1990s saw the rise of knowledge graphs, structuring vast knowledge for complex reasoning tasks.

In the 21st century, neuro-symbolic approaches have merged deep learning with logical inference, resulting in tools like DeepLogic [Cingillioglu and Russo, 2019] and SATNet [Wang et al., 2019]. Logical reasoning remains a cornerstone of AI research, evolving from philosophy to modern computing. As AI advances, logical reasoning continues to shape intelligent systems, ensuring structured, interpretable, and robust decision-making.

2.2 Types of Logical Reasoning

Logical reasoning can be broadly categorized into four main types, each serving distinct purposes and applications:

Deductive Reasoning. This type of reasoning derives specific conclusions from general principles or premises. It operates under the rule that if all premises are true and the reasoning is valid, the conclusion must also be true. For example, given the premises "All apples are red" and "This fruit is an apple," one can deduce that "This fruit is red." Deductive reasoning is fundamental in fields such as mathematics and formal logic, where certainty and rigor are paramount.

Inductive Reasoning. Unlike deductive reasoning, inductive reasoning draws general conclusions based on specific observations or evidence. While the conclusions are often considered probable, they are not guaranteed to be true. For instance, observing that all swans seen so far are white might lead to the inductive conclusion that "All swans are white." Inductive reasoning is widely used in scientific discovery and data-driven decision-making, where patterns and trends are inferred from empirical data.

Abductive Reasoning. This form of reasoning seeks the most plausible explanation or cause for a set of observations, often in the presence of incomplete information. Abductive reasoning is particularly useful in diagnostic tasks and real-world problem-solving. For example, seeing wet spots on the street might lead one to infer that "It has recently rained." While abductive conclusions are not certain, they provide a practical basis for hypothesis generation and decision-making under uncertainty.

Analogical Reasoning. Analogical reasoning involves drawing comparisons between similar situations or domains to make inferences or solve problems. By identifying parallels between different scenarios, this type of reasoning enables creative problem-solving and knowledge transfer. For example, understanding that planets orbit the sun in elliptical paths might lead one to analogically reason that other celestial bodies, such as comets, exhibit similar orbital characteristics. Analogical reasoning is particularly valuable in fields like education, design, and innovation.
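To ground the deductive case in something executable, the following is a minimal forward-chaining sketch over the apple syllogism above; the fact and rule encodings are illustrative choices for this sketch, not a standard library API.

```python
facts = {("apple", "fruit1")}          # "This fruit is an apple"
rules = [({"apple"}, "red")]           # "All apples are red"

def forward_chain(facts, rules):
    """Apply rules until a fixed point: no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        entities = {e for _, e in derived}
        for premises, conclusion in rules:
            for entity in entities:
                # If every premise predicate holds for this entity,
                # the conclusion predicate must hold as well.
                if all((p, entity) in derived for p in premises):
                    if (conclusion, entity) not in derived:
                        derived.add((conclusion, entity))
                        changed = True
    return derived

print(("red", "fruit1") in forward_chain(facts, rules))  # True: "This fruit is red"
```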
Dataset      Language  Question Type           Size      Source
LogiQA       Zh/En     Multichoice             15,937    Exam-based
ReClor       En        Multichoice             6,138     Exam-based
AR-LSAT      En        Multichoice             2,064     Exam-based
CLUTRR       En        Question answering      6,016     Rule-based
GSM          En        Math word problems      19K       Exam-based
LINGOLY      En        Question answering      1,133     Expert-designed
ConTRoL      En        Ternary classification  8,325     Exam-based
FOLIO        En        Binary classification   1,351     Expert-designed
LogicNLI     En        Ternary classification  30K       Exam-based
ProofWriter  En        Binary classification   -         Exam-based
LogicBench   En        Binary classification   1,270     Rule-based
GLoRE        Zh/En     Miscellaneous           17 tasks  Miscellaneous
LogiGLUE     En        Miscellaneous           24 tasks  Miscellaneous
LogiTorch    En        Miscellaneous           16 tasks  Miscellaneous
BIG-Bench    En        Miscellaneous           7 tasks   Miscellaneous

Table 1: Main datasets and benchmarks for logical reasoning tasks.
3.2 Machine Reading Comprehension (MRC)
3 Tasks and Benchmarks

Logical reasoning datasets and benchmarks are essential for evaluating the reasoning capabilities of large language models (LLMs). These datasets can be categorized into three types based on their data sources:

Rule-based Datasets [Tafjord et al., 2021; Sinha et al., 2019] are automatically generated using logical rules, enabling large-scale data collection. However, ensuring diversity is crucial to avoid repetitive patterns and to comprehensively evaluate reasoning capabilities.

Expert-Designed Datasets [Han et al., 2024a] are constructed by domain experts, ensuring high precision and accuracy. Although typically smaller than crowd-sourced corpora, their meticulous design makes them indispensable for in-depth logical reasoning evaluation.

Exam-Based Datasets [Liu et al., 2021b; Yu et al., 2020; Wang et al., 2022] originate from standardized test questions (e.g., the Chinese National Civil Service Exam, LSAT, GRE), offering high-quality, expert-crafted logic problems at scale. These datasets are widely used to evaluate reasoning in real-world scenarios.

Table 1 summarizes important datasets for logical reasoning, which typically cover tasks such as Natural Language Inference (NLI) (§3.1) and Machine Reading Comprehension (MRC) (§3.2).

3.1 Natural Language Inference (NLI)

NLI evaluates whether a hypothesis logically follows from a premise, directly assessing a model's reasoning ability. Labels typically fall into binary (Entailment, Non-entailment) or ternary (Entailment, Contradiction, Neutral) classifications. Some datasets use True and False labels instead.

ConTRoL [Liu et al., 2021a] is derived from recruitment exams (e.g., bank entry, U.S. police selection), containing 8,325 entries with Correct, Incorrect, and Can't Say labels, corresponding to Entailment, Contradiction, and Neutral.

FOLIO [Han et al., 2024a] is an expert-constructed dataset for First-Order Logic (FOL) reasoning, consisting of 1,351 entries labeled as True or False, making it a rigorous benchmark for formal logical inference.

LogicNLI [Tian et al., 2021] contains 30K entries generated using logical rules, with Entailment, Contradiction, and Neutral labels. It isolates FOL-based inference from commonsense reasoning, enabling precise evaluation of reasoning accuracy and generalization.

ProofWriter [Tafjord et al., 2021] extends RuleTaker [Clark et al., 2021] by introducing the closed-world assumption (CWA) and the open-world assumption (OWA) to handle negation and open-world reasoning. It includes Birds-Electricity (handcrafted domain theories) and ParaRules (crowdsourced paraphrased rules) for systematic evaluation of generalization across linguistic variations and real-world knowledge domains.

LogicBench [Parmar et al., 2023] is a GPT-3-generated dataset covering 25 types of reasoning, including propositional logic, FOL, and non-monotonic logic. It consists of 1,270 test entries labeled as Yes or No.

3.2 Machine Reading Comprehension (MRC)

MRC evaluates logical reasoning by requiring models to answer questions based on a given passage. Tasks are commonly formatted as multiple-choice, span extraction, or free response, with multiple-choice QA being particularly effective due to its standardization.

LogiQA [Liu et al., 2023a] is sourced from the Chinese Civil Service Exam, containing 15,937 entries in Chinese and English. It targets complex logical reasoning and is widely used for evaluating LLMs.

ReClor [Yu et al., 2020], derived from the GMAT, features 6,138 English entries with four-option multiple-choice questions.

AR-LSAT [Wang et al., 2022] is based on the LSAT, containing 2,064 entries spanning ordering games, grouping games, and allocation games, each with five options.

CLUTRR [Sinha et al., 2019] focuses on inductive reasoning, requiring models to infer kinship relationships in short narratives. It contains 6,016 entries, combining entity extraction and logical inference.

GSM evaluates mathematical reasoning capabilities, comprising two datasets: GSM8K [Cobbe et al., 2021] (8.5K grade school math problems) and GSM-PLUS [Li et al., 2024a] (10,552 problems), which is augmented with mathematical perturbations for robustness evaluation.

LINGOLY [Bean et al., 2024] uses Linguistic Olympiad puzzles to evaluate in-context pattern identification and generalization in low-resource or extinct languages. It contains 1,133 problems across 6 formats and 5 difficulty levels, covering over 90 languages.

3.3 Benchmark Suites

Benchmark suites standardize evaluation and facilitate model comparison in logical reasoning research.

GLoRE [Liu et al., 2023d] is a few-shot and zero-shot testing platform, including 17 test-only datasets to assess generalization in low-data scenarios.

LogiGLUE [Luo et al., 2024] consists of 24 logical reasoning tasks, standardizing datasets into a sequence-to-sequence format for uniform input processing. It provides both test and training sets, enabling extensive model training and targeted evaluations.
LogiTorch [Helwe et al., 2022] is a PyTorch-based library for natural language logical reasoning, offering 16 datasets, model architectures, and an accessible API for quick evaluation.

BIG-bench [Srivastava et al., 2022] is a collaborative benchmark with 7 tasks dedicated to logical reasoning, such as Logic Grid Puzzle and Logical Fallacy Detection.

Figure 2: Example tests of logical reasoning in NLP tasks: (a) a multiple-choice reading comprehension example from the LogiQA dataset; (b) an NLI example from the ConTRoL dataset.

4 Evaluations

The rapid development of pre-trained language models (PLMs) necessitates rigorous evaluation of their logical reasoning capabilities. This section examines four reasoning paradigms—deductive, inductive, abductive, and analogical—while analyzing evaluation approaches and metrics.

4.1 Deductive Reasoning

Deductive reasoning, deriving specific conclusions from general premises, is crucial for automated theorem proving. Despite LLMs performing well on tasks like compositional proofs, standard benchmarks, and encoding entailment relationships, they struggle with extended reasoning, hypothetical sub-proofs without examples, generalization, and sensitivity to syntactic variations [Saparov et al., 2023; Yuan et al., 2023; Ryb et al., 2022].

4.2 Inductive Reasoning

Inductive reasoning, which generalizes from specific instances to broader rules, is essential for tasks like hypothesis generation and pattern recognition. While Yang et al. [2024b] find that pre-trained models can serve as effective "reasoners," Bowen et al. [2024] show that even advanced LLMs struggle with simple inductive tasks in their symbolic settings. Similarly, Sullivan [2024] demonstrates that Transformer models, even after fine-tuning, fail to learn fundamental logical principles, indicating limited inductive reasoning capabilities.

4.3 Abductive Reasoning

Abductive reasoning, which seeks the most plausible explanations for observed phenomena, is crucial in fields like law and medicine. Del and Fishel [2023] highlight the challenges LLMs face in generating plausible hypotheses from incomplete information. In the legal domain, Nguyen et al. [2023] show that despite strong performance, models struggle with abductive reasoning, underscoring the complexity of this paradigm.

4.4 Analogical Reasoning

Analogical reasoning, which infers unknown information by comparing it with known information, is vital for tasks requiring creativity and knowledge transfer. Wijesiriwardene et al. [2023] introduce ANALOGICAL, a benchmark for long-text analogical reasoning. They find that as analogy complexity increases, LLMs struggle to recognize analogical pairs. Petersen and van der Plas [2023] show that models can learn analogical reasoning with minimal data, approaching human performance. However, Qin et al. [2024] question whether LLMs truly rely on analogical reasoning, discovering that random examples in prompts often achieve performance comparable to relevant examples.

4.5 Overall Analysis and Metrics

Liu et al. [2023b] evaluate GPT-4 and ChatGPT on benchmarks like LogiQA and ReClor, showing that while GPT-4 outperforms ChatGPT, both struggle with out-of-distribution tasks. Xu et al. [2023] introduce the NeuLR dataset and propose a framework evaluating LLMs across six dimensions: correctness, rigor, self-awareness, proactivity, guidance, and absence of hallucinations.

Metrics for Evaluating Logical Reasoning. Traditional metrics like accuracy and F1 score are insufficient for assessing logical reasoning. Recent studies have introduced nuanced metrics such as consistency (invariance to logically equivalent inputs), generalization (performance on out-of-distribution data), and explainability (clarity of reasoning steps). Thatikonda et al. [2025] find that combining BERTScore with traditional metrics improves alignment with human judgments. Liu et al. [2024c] propose a framework for measuring logical consistency, showing that BERTScore
aligns better with human rankings than LLM-based evaluators like GPT-4. Gandarela et al. [2024] emphasize the need for metrics that account for the expressivity of logical theories, particularly in inductive reasoning.
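To illustrate the consistency metric mentioned above, the sketch below scores a model by its agreement on pairs of logically equivalent inputs; `predict` is a placeholder for whatever model is under evaluation, and the example pair is invented for illustration.

```python
# Sketch of a consistency metric: the fraction of logically equivalent
# input pairs on which a model returns the same label. `predict` is a
# stand-in for the model under evaluation (an assumption of this sketch).

def consistency_score(predict, equivalent_pairs):
    agree = sum(predict(a) == predict(b) for a, b in equivalent_pairs)
    return agree / len(equivalent_pairs)

# Example: a statement paired with its contrapositive rephrasing, which
# a logically consistent model should label identically.
pairs = [
    ("If it rains, the street is wet. It rains. Is the street wet?",
     "If the street is not wet, it does not rain. It rains. Is the street wet?"),
]
# consistency_score(my_model.predict, pairs) returns a value in [0, 1].
```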

5 Enhancement Methods

Enhancing LLMs' logical reasoning remains crucial. This section focuses on four core strategies: Data-Centric Approaches (§5.1), Model-Centric Approaches (§5.2), External Knowledge Utilization (§5.3), and Neuro-Symbolic Approaches (§5.4).

5.1 Data-Centric Approaches

Data-centric approaches enhance LLMs' reasoning capabilities by utilizing meticulously curated training datasets. Formally, this can be expressed as:

$$D^{\ast} = \arg\max_{D} R(M_{D}) \tag{1}$$

where:
• D: training datasets.
• M_D: the model trained on D.
• R: a performance evaluator (e.g., LLM-as-a-judge, rule-based metrics).
This formulation highlights the central role of dataset optimization in data-centric approaches. In practice, data-centric methods typically involve three types of datasets: expert-curated datasets, synthetic datasets, and LLM-distilled datasets.

Expert-Curated Datasets. The FOLIO series [Han et al., 2024a; Han et al., 2024b] establishes formal verification through FOL annotations, with P-FOLIO extending the complexity of reasoning chains for enhanced training. LeanDojo [Yang et al., 2023] provides 98k+ human-proven theorem pairs for mathematical reasoning. Additionally, Symbol-LLM [Xu et al., 2024a] systematically organizes 34 symbolic reasoning tasks to capture inter-symbol relationships across 20 distinct symbolic families.

Synthetic Datasets. Rule-based synthetic data remains fundamental for data generation. RuleTaker [Clark et al., 2021] formalizes this through a three-phase pipeline: behavior formalization, example synthesis, and linguistic-equivalent generation. Similarly, Morishita et al. [2024] develop Formal Logic Deduction Diverse (FLD×2), a synthetic dataset based on symbolic theory and previous empirical insights.
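As an illustration of this rule-based recipe (formalize behavior, synthesize examples, render them linguistically), here is a minimal generator in the spirit of RuleTaker; the entities, attributes, and templates are invented for this sketch and are not taken from the original pipeline.

```python
import random

# Minimal RuleTaker-style synthetic data: sample a fact and an
# implication rule, derive the gold label by one step of forward
# chaining under the closed-world assumption, then render to English.

ENTITIES = ["Anne", "Bob"]
ATTRS = ["red", "kind", "big", "green"]

def make_example(rng):
    entity = rng.choice(ENTITIES)
    fact = rng.choice(ATTRS)
    premise, conclusion = rng.sample(ATTRS, 2)
    # Everything derivable from the fact and the rule; anything else
    # is false under the closed-world assumption.
    closure = {fact} | ({conclusion} if premise == fact else set())
    query = rng.choice(ATTRS)
    text = (f"{entity} is {fact}. "
            f"If someone is {premise} then they are {conclusion}.")
    question = f"{entity} is {query}."
    return text, question, query in closure

rng = random.Random(0)
for _ in range(3):
    print(make_example(rng))
```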
LLM-Distilled Datasets. Researchers employ advanced models such as GPT-4 to distill intermediate reasoning steps. LogiCoT [Liu et al., 2023c] augments existing datasets with GPT-4-generated reasoning chains, while LogicPro [Jiang et al., 2024] combines algorithmic problems with code solutions to create variable-guided reasoning data. Advancing this line, Wang et al. [2024b] propose PODA, which generates contrastive analyses of correct/incorrect options through premise-oriented augmentation, enabling reasoning-path differentiation via contrastive learning.

5.2 Model-Centric Approaches

Model-centric approaches enhance LLMs' reasoning capabilities by optimizing model parameters and decoding strategies. The formal objective is:

$$(\theta^{\ast}, S^{\ast}) = \arg\max_{\theta, S} R(M_{\theta}, S) \tag{2}$$

where:
• θ: learnable model parameters.
• M_θ: the model with parameters θ.
• S: decoding strategy (e.g., chain-of-thought prompting, verification-based decoding).
• R: reasoning performance metric.

This formulation highlights the joint optimization of model parameters θ and decoding strategy S. Practical implementations can be categorized as:
• Instruction Fine-Tuning: optimizing θ.
• Reinforcement Learning: optimizing θ.
• Inference-Time Decoding: optimizing S.

Model-centric approaches focus on directly improving the model's reasoning capabilities by optimizing its internal mechanisms and decoding strategies, making them complementary to data-centric approaches.
Instruction Fine-Tuning
Instruction Fine-Tuning (IFT) adapts LLMs through supervised learning on task-specific instructions. For example, Liu et al. [2023c] design multi-grained instructions spanning diverse levels of abstraction and complexity. Similarly, Feng et al. [2024] fine-tune models to mimic logical solvers by replicating formal deductive reasoning processes. In addition, Xu et al. [2024a] implement two-stage symbolic fine-tuning through Injection (injecting symbolic knowledge) and Infusion (balancing symbolic and NL reasoning).

To overcome IFT's over-fitting limitations, Wang et al. [2024b] combine IFT with contrastive learning between factual/counterfactual reasoning paths. Further, Wang et al. [2024a] augment Llama models with a Program-Guided Learning framework and logic-specific architecture adjustments. Recently, Muennighoff et al. [2025] propose s1, achieving test-time scaling through IFT on 1,000 meticulously crafted long-CoT samples. Combined with a budget-forcing technique, it significantly enhances the reasoning capability of a Qwen2.5-32B-Instruct model, allowing it to extrapolate beyond its performance without test-time intervention.
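A minimal sketch of one supervised IFT update, written with Hugging Face Transformers, is shown below; the base model ("gpt2") and the single instruction-response pair are placeholders, and production pipelines train on thousands of curated samples while masking instruction tokens out of the loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One supervised update on an (instruction, reasoned response) pair.
name = "gpt2"  # stand-in for any causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

prompt = ("Premises: All apples are red. This fruit is an apple.\n"
          "Question: Is this fruit red?\n")
target = ("Step 1: The fruit is an apple. Step 2: All apples are red. "
          "Therefore, the fruit is red. Answer: yes")

batch = tok(prompt + target, return_tensors="pt")
# Standard causal-LM cross-entropy; labels are the inputs themselves.
# (Real IFT usually sets the prompt positions' labels to -100.)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()
```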
Reinforcement Learning
Reinforcement learning (RL) has become pivotal in optimizing large language models (LLMs), particularly since the breakthrough of Reinforcement Learning from Human Feedback (RLHF). Jiao et al. [2024] leverage RL for planning-based reasoning optimization, while Xi et al. [2024] develop R3, achieving the benefits of process supervision through outcome-only supervision.

The success of large-scale RL in OpenAI-o1 [OpenAI, 2024] has inspired numerous studies. RL algorithms train o1-style models to enhance Chain-of-Thought (CoT) reasoning, addressing issues like formulaic outputs and limited long-form reasoning. For instance, Zhao et al. [2024] integrate CoT instruction fine-tuning with Monte Carlo Tree Search (MCTS) decoding for multi-path reasoning exploration. In contrast, Zhang et al. [2024] employ MCTS to generate code-reasoning data for instruction fine-tuning (IFT) and Direct Preference Optimization (DPO).

A significant breakthrough comes from DeepSeek-R1 [DeepSeek-AI, 2025], which pioneers a novel RL strategy to enhance logical reasoning. DeepSeek-R1-Zero, trained purely through RL without IFT, demonstrates impressive reasoning capabilities but faces challenges in readability and language consistency. To address this, DeepSeek-R1 introduces minimal long-CoT IFT data as a cold start before RL, achieving a balance between usability and reasoning performance. By iteratively synthesizing high-quality reasoning data through RL, DeepSeek-R1 overcomes limitations imposed by human annotators, addressing issues such as mechanistic responses, repetitive patterns, and insufficient long-chain reasoning. This approach represents a potential paradigm shift in logical reasoning optimization, pushing the boundaries of what LLMs can achieve in structured reasoning tasks.
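DeepSeek-R1-style training leans on simple rule-based rewards (answer correctness plus an output-format check) rather than learned reward models. A sketch of such a reward function follows; the <think>/<answer> tag format here is an illustrative assumption.

```python
import re

# Sketch of an R1-style rule-based reward: one term for well-formed
# output (visible reasoning followed by a final answer) and one for
# answer correctness. The tag format is an assumption of this sketch.

def reward(completion: str, gold_answer: str) -> float:
    fmt = re.fullmatch(r"(?s)<think>.*</think>\s*<answer>(.*)</answer>\s*",
                       completion)
    if fmt is None:
        return 0.0                      # malformed output earns nothing
    format_bonus = 0.2
    answer = fmt.group(1).strip()
    correctness = 1.0 if answer == gold_answer.strip() else 0.0
    return format_bonus + correctness   # one of {0.0, 0.2, 1.2}

print(reward("<think>All apples are red; this is an apple.</think>"
             "<answer>yes</answer>", "yes"))  # 1.2
```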
Inference-Time Decoding
We categorize logical reasoning enhancement methods at inference time into inference-time scaling and constrained decoding.

Inference-time scaling employs computational augmentation without parameter updates. One common approach is decoding with structured outputs and modular workflows. GoT [Lei et al., 2023] creates structured reasoning nodes to improve complex multi-step logical reasoning. Similarly, Chain of Logic [Servantez et al., 2024] introduces a Decomposition-Recomposition structure for legal reasoning. In other contexts, researchers design more complex modular workflows for better performance [Creswell et al., 2023; Malon et al., 2024].

Another inference-time scaling approach involves stimulating autonomous reasoning, guiding LLMs to iteratively refine their answers. Maieutic Prompting [Jung et al., 2022] eliminates contradictions through recursive reasoning. Similarly, Logic-of-Thoughts [Liu et al., 2024a] and DetermLR [Sun et al., 2024] progressively approach the answer in an iterative style.

Constrained decoding methods, on the other hand, focus on improving the controllability and reliability of reasoning processes. NeuroLogic [Lu et al., 2021] enforces predicate logic constraints, while Formal-LLM [Li et al., 2024b] integrates automata for constraining plan generation.
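A minimal sketch of the inference-time scaling idea: sample several candidate reasoning paths and keep the one a verifier scores highest. Both `generate` and `verify` are placeholders here, standing in for an LLM sampling call and, for example, a logic checker or majority vote.

```python
# Inference-time scaling without parameter updates: sample several
# reasoning paths and keep the best-scoring one. `generate` and
# `verify` are placeholders (assumptions of this sketch).

def best_of_n(question, generate, verify, n=8):
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=verify)

# Usage (hypothetical): best_of_n(q, llm.sample, proof_checker.score)
```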
5.3 External Knowledge Utilization

LLMs often generate incorrect answers due to hallucinations when performing complex tasks such as logical reasoning, making it necessary to incorporate external knowledge to assist in producing accurate responses. Formally, the optimal integration of external knowledge can be formulated as a joint optimization problem:

$$(M^{\ast}, K^{\ast}) = \arg\max_{M, K} R(M, K) \tag{3}$$

where:
• M: the neural model, including both the model's parameters and its decoding strategies (generally, the model's parameters remain unchanged).
• K: the knowledge integration strategy, including knowledge source curation, structured knowledge representation, retrieval-augmented mechanisms, etc.
• R: a reasoning performance evaluator (e.g., factual accuracy, logical consistency).

Zayyad and Adi [2024] and Yang et al. [2023] extract data from Lean, a mathematical proof tool, to aid theorem proving. In contrast, "Logic-Query-of-Thoughts" (LQOT) [Liu et al., 2024b] decomposes complex logical problems into easier sub-questions before integrating knowledge graphs.

In reading comprehension, Ouyang et al. [2023] construct supergraphs to address complex contextual reasoning, while KnowRA [Mai et al., 2025] autonomously determines whether to accept external knowledge to assist document-level relation extraction.
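A common instantiation of this objective keeps the model M frozen and realizes K as retrieval plus prompt construction; the schematic below makes that split explicit, with `retrieve` and `llm` as placeholder calls.

```python
# Schematic retrieval-augmented reasoning: the knowledge strategy K is
# realized as "retrieve top-k facts and prepend them to the prompt".
# `retrieve` and `llm` are placeholders for a knowledge-base search
# (e.g., over a knowledge graph) and a frozen language model.

def answer_with_knowledge(question, retrieve, llm, k=3):
    facts = retrieve(question, top_k=k)          # knowledge integration K
    context = "\n".join(f"Fact: {f}" for f in facts)
    prompt = f"{context}\nQuestion: {question}\nAnswer step by step."
    return llm(prompt)                           # frozen model M
```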
5.4 Neuro-Symbolic Approaches

Neural-symbolic hybrid methods represent a burgeoning research area that aims to combine the powerful representational capabilities of deep learning with the precision and interpretability of symbolic reasoning.

Formally, a neural-symbolic hybrid system aims to optimize both the neural model M and the symbolic solver P (where P represents the symbolic reasoning process) to maximize logical reasoning performance. The overall objective can be expressed as:

$$(M^{\ast}, P^{\ast}) = \arg\max_{M, P} R(P(M(x)))$$

where:
• M: the neural model, which includes both the model's parameters and its decoding strategies. It maps the input x (e.g., natural language) into a symbolic representation z within a formal language L: z = M(x), z ∈ L.
• P: the symbolic solver, which operates on the symbolic representation z produced by M to generate the final output y = P(z).
• R: the reasoning performance metric, which evaluates the ability to perform logical reasoning tasks.

The optimization process involves two key directions:
• Improving M: refining the model's parameters and decoding strategies to produce symbolic representations that are both accurate and compatible with P.
• Enhancing P: improving the symbolic solver's capability to process those representations.

By jointly optimizing M and P, neural-symbolic hybrid systems aim to leverage the strengths of both neural networks and symbolic reasoning to achieve superior logical reasoning capabilities. It is worth noting that in earlier neural-symbolic pipelines, P is often implemented as a fixed external logical reasoning engine and is thus generally not optimized. In more advanced practice, however, LLMs are increasingly used to perform the role of P, enabling diverse optimization.

Fundamentally, these methods translate problems into symbolic representations with LLMs and have external symbolic solvers solve them. For example, in LINC [Olausson et al., 2023], LLMs convert natural language (NL) into first-order logic (FOL) expressions and utilize an external theorem prover for symbolic deductive inference.

Further efforts focus on improving NL-to-symbolic translation. One prevailing approach directly optimizes translation through training [Yang et al., 2024a] or decoding strategies [Ryu et al., 2024], while another relies on verification or correction mechanisms [Yang et al., 2024a; Pan et al., 2023].
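As a toy instance of this translate-then-solve pipeline, the sketch below implements the solver P as a truth-table entailment check over propositional formulas; the hand-written formulas stand in for an LLM's NL-to-logic translation, and real systems such as LINC target first-order logic with full theorem provers.

```python
from itertools import product

# Toy symbolic solver P: premises and a conclusion are propositional
# formulas over named atoms (written by hand here in place of an LLM's
# NL-to-logic translation). Entailment holds when no truth assignment
# satisfies all premises while falsifying the conclusion.

def entails(premises, conclusion, atoms):
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(p(v) for p in premises) and not conclusion(v):
            return False          # found a countermodel
    return True

# "It rains -> the street is wet" and "it rains" entail "the street is wet".
premises = [lambda v: (not v["rain"]) or v["wet"], lambda v: v["rain"]]
conclusion = lambda v: v["wet"]
print(entails(premises, conclusion, ["rain", "wet"]))  # True
```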
Building upon these, recent advancements address traditional pipeline limitations by fully integrating LLMs into the reasoning process. Logic Agent (LA) [Liu et al., 2024a] replaces external solvers with rule-guided LLM inference chains, while LLM-TRes [Toroghi et al., 2024] implements self-contained verifiable reasoning without external symbolic solvers. SymbCoT [Xu et al., 2024c] coordinates translation, planning, solving, and verification entirely through LLMs. Xu et al. [2024b] propose Aristotle, which further systematizes the symbolic reasoning pipeline through three LLM-driven components: a Logical Decomposer, a Logical Search Router, and a Logical Resolver.

6 Discussion

The integration of logical reasoning into large language models (LLMs) remains a critical challenge, marked by persistent gaps between heuristic performance and formal logical rigor. Below, we analyze three unresolved tensions dominating the field and outline future directions.

Robustness vs. Generalization. LLMs exhibit inconsistent performance in structured reasoning tasks such as deductive inference and abductive hypothesis generation. While models fine-tuned on datasets like FOLIO [Han et al., 2024a] excel in controlled settings, they struggle with adversarial perturbations or semantically equivalent rephrasings. This inconsistency arises from their reliance on surface-level statistical correlations rather than causal relationships, coupled with limited out-of-distribution generalization. A key question persists: can LLMs achieve human-like robustness without sacrificing cross-domain adaptability? Current methods prioritize narrow task performance, leaving real-world applicability uncertain.

Interpretability vs. Performance. A central tension lies in balancing neural scalability with symbolic precision. Neuro-symbolic approaches like Logic-LM [Pan et al., 2023] and Symbol-LLM [Xu et al., 2024a] embed formal logic solvers into neural architectures, improving interpretability through step-by-step proofs. However, these methods face scalability bottlenecks with large knowledge bases or complex rule dependencies. Conversely, data-driven methods (e.g., instruction tuning on LogicBench [Parmar et al., 2024]) achieve broader task coverage but fail to generalize beyond syntactic patterns. How can we reconcile transparent reasoning with black-box model performance? Hybrid architectures offer promise but introduce computational overhead, limiting practical deployment.

Evaluation Rigor. Existing benchmarks like LogiQA [Liu et al., 2021b] and ReClor [Yu et al., 2020] conflate reasoning ability with pattern recognition through multiple-choice formats. While efforts like NeuLR [Xu et al., 2023] curate "neutral" content to isolate reasoning from domain knowledge, they lack scope for holistic evaluation. Current metrics (e.g., accuracy, BLEU) fail to assess consistency (invariance to logically equivalent inputs) or soundness (adherence to formal proof structures). What defines a gold standard for logical reasoning evaluation? Benchmarks must prioritize systematic testing of core principles (e.g., transitivity, contraposition) over task-specific performance.

Future Directions. Addressing these challenges requires hybrid architectures that dynamically integrate neural and symbolic components, such as differentiable theorem provers, to balance scalability and precision. Equally important is the development of evaluation frameworks that stress-test models on perturbed logical statements (e.g., negated premises, swapped quantifiers) to isolate reasoning from memorization. Multimodal reasoning, which grounds inference in diverse modalities (text, images, code), presents untapped potential for enhancing robustness and interpretability. Finally, interdisciplinary collaboration—leveraging insights from formal logic, cognitive science, and machine learning—will be essential to design systems that reason with and about uncertainty. Until LLMs reliably disentangle logic from lexicon, their deployment in high-stakes domains will remain precarious. Bridging this gap demands rigorous benchmarks, scalable hybrid methods, and a redefinition of evaluation paradigms.

7 Conclusion

This survey synthesizes the rapid advancements and persistent challenges in logical reasoning for large language models (LLMs). While LLMs demonstrate impressive heuristic reasoning, rigorous logical inference—spanning deductive, inductive, abductive, and analogical paradigms—remains inconsistent due to limitations in robustness, generalization, and interpretability. We analyzed strategies to enhance reasoning, including neuro-symbolic integration, data-centric tuning, reinforcement learning, test-time scaling, and other improved decoding methods, and highlighted benchmarks like FOLIO and LogiQA for systematic evaluation. Future progress hinges on hybrid architectures that unify neural and symbolic reasoning, robust evaluation frameworks, and scalable methods for cross-domain and multimodal inference. Addressing these challenges will advance LLMs toward the reliable, interpretable reasoning critical for real-world applications.
References

[Bean et al., 2024] Andrew M Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A Chi, et al. LINGOLY: A benchmark of olympiad-level linguistic reasoning puzzles in low-resource and extinct languages. arXiv preprint arXiv:2406.06196, 2024.
[Bowen et al., 2024] Chen Bowen, Rune Sætre, and Yusuke Miyao. A comprehensive evaluation of inductive reasoning capabilities and problem solving in large language models. In Proc. of ACL Findings, pages 323–339, 2024.
[Cann, 1993] Ronnie Cann. Formal semantics: an introduction. Cambridge University Press, United States, 1993.
[Cingillioglu and Russo, 2019] Nuri Cingillioglu and Alessandra Russo. DeepLogic: Towards end-to-end differentiable logical reasoning, 2019.
[Clark et al., 2021] Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners over language. In Proc. of IJCAI, 2021.
[Cobbe et al., 2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[Creswell et al., 2023] Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In Proc. of ICLR, 2023.
[DeepSeek-AI, 2025] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Technical report, 2025.
[Del and Fishel, 2023] Maksym Del and Mark Fishel. True detective: A deep abductive reasoning benchmark undoable for GPT-3 and challenging for GPT-4. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 314–322, 2023.
[Feng et al., 2024] Jiazhan Feng, Ruochen Xu, Junheng Hao, Hiteshi Sharma, Yelong Shen, et al. Language models can be deductive solvers. In Proc. of ACL Findings, pages 4026–4042, 2024.
[Gandarela et al., 2024] João Pedro Gandarela, Danilo S Carvalho, and André Freitas. Inductive learning of logical theories with LLMs: A complexity-graded analysis. arXiv preprint arXiv:2408.16779, 2024.
[Han et al., 2024a] Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, et al. FOLIO: Natural language reasoning with first-order logic. In Proc. of EMNLP, pages 22017–22031, 2024.
[Han et al., 2024b] Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, et al. P-FOLIO: Evaluating and improving logical reasoning with abundant human-written reasoning chains. In Proc. of EMNLP Findings, pages 16553–16565, 2024.
[Haruta et al., 2020] Izumi Haruta, Koji Mineshima, and Daisuke Bekki. Logical inferences with comparatives and generalized quantifiers. In Proc. of ACL, pages 263–270, 2020.
[Helwe et al., 2022] Chadi Helwe, Chloé Clavel, and Fabian Suchanek. LogiTorch: A PyTorch-based library for logical reasoning on natural language. In Proc. of EMNLP, 2022.
[Jiang et al., 2024] Jin Jiang, Yuchen Yan, Yang Liu, Yonggang Jin, Shuai Peng, et al. LogicPro: Improving complex logical reasoning via program-guided learning. arXiv preprint arXiv:2409.12929, 2024.
[Jiao et al., 2024] Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, and Shafiq Joty. Learning planning-based reasoning by trajectories collection and process reward synthesizing. In Proc. of EMNLP, pages 334–350, 2024.
[Jung et al., 2022] Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, et al. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proc. of EMNLP, pages 1266–1279, 2022.
[Lei et al., 2023] Bin Lei, Chunhua Liao, Caiwen Ding, et al. Boosting logical reasoning in large language models through a new framework: The graph of thought. arXiv preprint arXiv:2308.08614, 2023.
[Li et al., 2024a] Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. GSM-Plus: A comprehensive benchmark for evaluating the robustness of LLMs as mathematical problem solvers. In Proc. of ACL, pages 2961–2984, 2024.
[Li et al., 2024b] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, and Yongfeng Zhang. Formal-LLM: Integrating formal language and natural language for controllable LLM-based agents. arXiv preprint arXiv:2402.00798, 2024.
[Liu et al., 2021a] Hanmeng Liu, Leyang Cui, Jian Liu, and Yue Zhang. Natural language inference in context - investigating contextual reasoning over long texts. Proc. of AAAI, pages 13388–13396, 2021.
[Liu et al., 2021b] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. LogiQA: a challenge dataset for machine reading comprehension with logical reasoning. 2021.
[Liu et al., 2023a] Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, et al. LogiQA 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2947–2962, 2023.
[Liu et al., 2023b] Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. Evaluating the logical reasoning ability of ChatGPT and GPT-4, 2023.
[Liu et al., 2023c] Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. LogiCoT: Logical chain-of-thought instruction tuning. In Proc. of EMNLP Findings, pages 2908–2921, 2023.
[Liu et al., 2023d] Hanmeng Liu, Zhiyang Teng, Ruoxi Ning, Jian Liu, Qiji Zhou, and Yue Zhang. GLoRE: Evaluating logical reasoning of large language models, 2023.
[Liu et al., 2024a] Hanmeng Liu, Zhiyang Teng, Chaoli Zhang, and Yue Zhang. Logic agent: Enhancing validity with logic rule invocation, 2024.
[Liu et al., 2024b] Lihui Liu, Zihao Wang, Ruizhong Qiu, Yikun Ban, Eunice Chan, et al. Logic query of thoughts: Guiding large language models to answer complex logic queries with knowledge graphs, 2024.
[Liu et al., 2024c] Yinhong Liu, Zhijiang Guo, Tianya Liang, Ehsan Shareghi, Ivan Vulić, and Nigel Collier. Aligning with logic: Measuring, evaluating and improving logical consistency in large language models. arXiv preprint arXiv:2410.02205, 2024.
[Lu et al., 2021] Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. NeuroLogic decoding: (un)supervised neural text generation with predicate logic constraints. In Proc. of NAACL, pages 4288–4299, 2021.
[Luo et al., 2024] Man Luo, Shrinidhi Kumbhar, Ming Shen, Mihir Parmar, Neeraj Varshney, et al. Towards LogiGLUE: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models, 2024.
[Mai et al., 2025] Chengcheng Mai, Yuxiang Wang, Ziyu Gong, Hanxiang Wang, and Yihua Huang. KnowRA: Knowledge retrieval augmented method for document-level relation extraction with comprehensive reasoning abilities, 2025.
[Malon et al., 2024] Christopher Malon, Martin Min, Xiaodan Zhu, et al. Exploring the role of reasoning structures for constructing proofs in multi-step natural language reasoning with large language models. In Proc. of EMNLP, pages 15299–15312, 2024.
[McCarthy and Hayes, 1981] J. McCarthy and P.J. Hayes. Some philosophical problems from the standpoint of artificial intelligence. In Readings in Artificial Intelligence, pages 431–450. 1981.
[McCarthy, 1959] John McCarthy. Programs with common sense. In Proceedings of the Teddington Conference on the Mechanization of Thought Processes, 1959.
[McCarthy, 1989] John McCarthy. Artificial intelligence, logic and formalizing common sense. Philosophical Logic and Artificial Intelligence, pages 161–190, 1989.
[Morishita et al., 2024] Terufumi Morishita, Gaku Morio, Atsuki Yamaguchi, and Yasuhiro Sogawa. Enhancing reasoning capabilities of LLMs via principled synthetic logic corpus. In Proc. of NeurIPS, pages 73572–73604, 2024.
[Muennighoff et al., 2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, et al. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
[Newell and Simon, 1956] A. Newell and H. Simon. The logic theory machine–a complex information processing system. IRE Transactions on Information Theory, 1956.
[Nguyen et al., 2023] Ha-Thanh Nguyen, Randy Goebel, Francesca Toni, Kostas Stathis, and Ken Satoh. How well do SOTA legal reasoning models support abductive reasoning?, 2023.
[Olausson et al., 2023] Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, et al. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. In Proc. of EMNLP, pages 5153–5176, 2023.
[OpenAI, 2024] OpenAI. Learning to reason with LLMs. Technical report, 2024.
[Ouyang et al., 2023] Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. Fact-driven logical reasoning for machine reading comprehension, 2023.
[Pan et al., 2023] Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Proc. of EMNLP Findings, pages 3806–3824, 2023.
[Parmar et al., 2023] Mihir Parmar, Neeraj Varshney, Nisarg Patel, Santosh Mashetty, Man Luo, et al. LogicBench: A benchmark for evaluation of logical reasoning, 2023.
[Parmar et al., 2024] Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, et al. LogicBench: Towards systematic evaluation of logical reasoning ability of large language models. In Proc. of ACL, pages 13679–13707, 2024.
[Pereira, 1982] Fernando Carlos Neves Pereira. Logic for natural language analysis. 1982.
[Petersen and van der Plas, 2023] Molly Petersen and Lonneke van der Plas. Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance. In Proc. of EMNLP, pages 16414–16425, 2023.
[Plaat et al., 2024] Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511, 2024.
[Qin et al., 2024] Chengwei Qin, Wenhan Xia, Tan Wang, Fangkai Jiao, Yuchen Hu, et al. Relevant or random: Can LLMs truly perform analogical reasoning?, 2024.
[Rajaraman et al., 2023] Kanagasabai Rajaraman, Saravanan Rajamanickam, and Wei Shi. Investigating transformer-guided chaining for interpretable natural logic reasoning. In Proc. of ACL Findings, pages 9240–9253, 2023.
[Ryb et al., 2022] Samuel Ryb, Mario Giulianelli, Arabella Sinclair, and Raquel Fernández. AnaLog: Testing analytical and deductive logic learnability in language models. In Proceedings of the 11th Joint Conference on Lexical and Computational Semantics, pages 55–68, 2022.
[Ryu et al., 2024] Hyun Ryu, Gyeongman Kim, Hyemin S Lee, and Eunho Yang. Divide and translate: Compositional first-order logic translation and verification for complex logical reasoning. arXiv preprint arXiv:2410.08047, 2024.
[Saparov et al., 2023] Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Mehran Kazemi, et al. Testing the general deductive reasoning capacity of large language models using OOD examples. In Proc. of NeurIPS, pages 3083–3105, 2023.
[Servantez et al., 2024] Sergio Servantez, Joe Barrow, Kristian Hammond, and Rajiv Jain. Chain of logic: Rule-based reasoning with large language models. In Proc. of ACL Findings, pages 2721–2733, 2024.
[Shi et al., 2021] Jihao Shi, Xiao Ding, Li Du, Ting Liu, and Bing Qin. Neural natural logic inference for interpretable question answering. In Proc. of EMNLP, pages 3673–3684, 2021.
[Sinha et al., 2019] Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. CLUTRR: A diagnostic benchmark for inductive reasoning from text. Empirical Methods of Natural Language Processing (EMNLP), 2019.
[Srivastava et al., 2022] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[Stacey et al., 2022] Joe Stacey, Pasquale Minervini, Haim Dubossarsky, and Marek Rei. Logical reasoning with span-level predictions for interpretable and robust NLI models. In Proc. of EMNLP, pages 3809–3823, 2022.
[Sullivan, 2024] Michael Sullivan. It is not true that transformers are inductive learners: Probing NLI models with external negation. In Proc. of EACL, pages 1924–1945, 2024.
[Sun et al., 2023] Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, et al. A survey of reasoning with foundation models. arXiv preprint arXiv:2312.11562, 2023.
[Sun et al., 2024] Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, et al. DetermLR: Augmenting LLM-based logical reasoning from indeterminacy to determinacy. In Proc. of ACL, pages 9828–9862, 2024.
[Tafjord et al., 2021] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Proc. of ACL Findings, pages 3621–3634, 2021.
[Thatikonda et al., 2025] Ramya Keerthy Thatikonda, Wray Buntine, and Ehsan Shareghi. Assessing the alignment of FOL closeness metrics with human judgement. arXiv preprint arXiv:2501.08613, 2025.
[Tian et al., 2021] Jidong Tian, Yitian Li, Wenqing Chen, Liqiang Xiao, Hao He, and Yaohui Jin. Diagnosing the first-order logical reasoning ability through LogicNLI. In Proc. of EMNLP, pages 3738–3747, 2021.
[Toroghi et al., 2024] Armin Toroghi, Willis Guo, Ali Pesaranghader, and Scott Sanner. Verifiable, debuggable, and repairable commonsense logical reasoning via LLM-based theory resolution. In Proc. of EMNLP, pages 6634–6652, 2024.
[Wang et al., 2019] Po-Wei Wang, Priya L. Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver, 2019.
[Wang et al., 2022] Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, et al. From LSAT: The progress and challenges of complex reasoning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022.
[Wang et al., 2024a] Chen Wang, Xudong Li, Haoran Liu, Xinyue Wu, and Wanting He. Efficient logical reasoning in large language models through program-guided learning. Authorea Preprints, 2024.
[Wang et al., 2024b] Chenxu Wang, Ping Jian, and Zhen Yang. Thought-path contrastive learning via premise-oriented data augmentation for logical reading comprehension. arXiv preprint arXiv:2409.14495, 2024.
[Wijesiriwardene et al., 2023] Thilini Wijesiriwardene, Ruwan Wickramarachchi, Bimal Gajera, Shreeyash Gowaikar, Chandan Gupta, et al. ANALOGICAL - a novel benchmark for long text analogy evaluation in large language models. In Proc. of ACL Findings, pages 3534–3549, 2023.
[Xi et al., 2024] Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, et al. Training large language models for reasoning through reverse curriculum reinforcement learning. In Proc. of ICML, 2024.
[Xia et al., 2024] Yu Xia, Rui Wang, Xu Liu, Mingyan Li, Tong Yu, et al. Beyond chain-of-thought: A survey of chain-of-x paradigms for LLMs. arXiv preprint arXiv:2404.15676, 2024.
[Xu et al., 2023] Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. Are large language models really good logical reasoners? A comprehensive evaluation and beyond, 2023.
[Xu et al., 2024a] Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, et al. Symbol-LLM: Towards foundational symbol-centric interface for large language models. In Proc. of ACL, pages 13091–13116, 2024.
[Xu et al., 2024b] Jundong Xu, Hao Fei, Meng Luo, Qian Liu, Liangming Pan, et al. Aristotle: Mastering logical reasoning with a logic-complete decompose-search-resolve framework. arXiv preprint arXiv:2412.16953, 2024.
[Xu et al., 2024c] Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. In Proc. of ACL, pages 13326–13365, 2024.
[Yang et al., 2023] Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, et al. LeanDojo: theorem proving with retrieval-augmented language models. In Proc. of ICONIP, 2023.
[Yang et al., 2024a] Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, and Faramarz Fekri. Harnessing the power of large language models for natural language to first-order logic translation. In Proc. of ACL, pages 6942–6959, 2024.
[Yang et al., 2024b] Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, et al. Language models as inductive reasoners. In Proc. of EACL, pages 209–225, 2024.
[Yu et al., 2020] Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. ReClor: A reading comprehension dataset requiring logical reasoning. In Proc. of ICLR, 2020.
[Yu et al., 2024] Fei Yu, Hongbo Zhang, Prayag Tiwari, and Benyou Wang. Natural language reasoning, a survey. ACM Computing Surveys, pages 1–39, 2024.
[Yuan et al., 2023] Zhangdie Yuan, Songbo Hu, Ivan Vulić, Anna Korhonen, and Zaiqiao Meng. Can pretrained language models (yet) reason deductively? In Proc. of EACL, pages 1447–1462, 2023.
[Zayyad and Adi, 2024] Majd Zayyad and Yossi Adi. Formal language knowledge corpus for retrieval augmented generation, 2024.
[Zhang et al., 2024] Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, et al. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024.
[Zhao et al., 2024] Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, et al. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024.
