Deductive Reasoning vs Inductive Reasoning in LLMs
Ruirui Li2 , Shiyang Li2 , Zheng Li2 , Yifan Gao2 , Xian Li2 , Bing Yin2 , Yizhou Sun1
1 University of California, Los Angeles 2 Amazon
[Figure 1 appears here: four prompting setups on the same arithmetic example. Deductive setting (mapping function provided): (a) Zero-shot, (b) Few-shot IO w/ Mapping Function. Inductive setting (mapping function not provided): (c) Few-shot IO w/o Mapping Function, (d) SolverLearner, which returns a Python solver() function (e.g., one that parses the operands in base 8 and returns their sum in base 8) instead of a direct answer.]
Figure 1: We have designed a set of comparative experiments that utilize a consistent task across different contexts, each
emphasizing either deductive (i.e., methods (a) and (b)) or inductive reasoning (i.e., methods (c) and (d)). As we move from left
to right across the figure, the methods gradually transition their primary focus from deductive reasoning to inductive reasoning.
Specifically, method (a) is designed to demonstrate the LLMs’ deductive reasoning in its pure form. Conversely, method (c)
utilizes Input-Output (IO) prompting strategies, which are prevalent for probing the inductive reasoning skills of LLMs. However,
we can observe that method (c) cannot fully disentangle inductive reasoning from deductive reasoning, as its learning process
directly moves from observations to specific instances, blurring the lines between the two. To exclusively focus on and examine
inductive reasoning, we introduce a novel framework called SolverLearner, positioned at the far right of the spectrum.
either deductive (i.e., methods (a) and (b)) or inductive reasoning (i.e., methods (c) and (d)), as depicted in Fig. 1. For instance, in an arithmetic task, the proficiency of an LLM in deductive reasoning depends on its ability to apply a given input-output mapping function to solve problems when this function is explicitly provided. Conversely, an LLM's skill in inductive reasoning is measured by its ability to infer these input-output mapping functions (i.e., 𝑦 = 𝑓𝑤(𝑥)), which map input data points (𝑥) to their corresponding output values (𝑦), based solely on in-context examples. The base system often serves as the input-output mapping function in an arithmetic task. In line with the aforementioned setup, we employ four methods to probe the reasoning capacity of LLMs. As we move from left to right across Fig. 1, the methods gradually shift their primary focus from deductive reasoning to inductive reasoning. Method (a), at the far left of the figure, aims to explore the deductive reasoning capabilities of LLMs in their pure form, where no in-context examples are provided (zero-shot setting). While exploring deductive reasoning in its pure form is relatively straightforward in zero-shot settings, untangling inductive reasoning poses more significant challenges. Recent studies have investigated the inductive reasoning abilities of LLMs (Yang et al., 2022; Gendron et al., 2023; Xu et al., 2023b), but they have primarily used Input-Output (IO) prompting (Mirchandani et al., 2023), which involves providing models with a few 〈input, output〉 demonstrations without providing the underlying mapping function. The models are then evaluated based on their ability to handle unseen examples, as illustrated in method (c). These studies often find LLMs facing difficulties with inductive reasoning. Our research suggests that the use of IO prompting might not effectively separate LLMs' deductive reasoning skills from their inductive reasoning abilities. This is because the approach moves directly from observations to specific instances, obscuring the inductive reasoning steps. Consequently, underperformance on inductive reasoning tasks may be attributed to poor deductive reasoning capabilities, i.e., the ability of LLMs to execute tasks, rather than being solely indicative of their inductive reasoning capability.

To disentangle inductive reasoning from deductive reasoning, we propose a novel model, referred to as SolverLearner. Given our primary focus on inductive reasoning, SolverLearner follows a two-step process to segregate the learning of input-output mapping functions from the application of these functions for inference. Specifically, functions are applied through external interpreters, such as code interpreters, to avoid incorporating LLM-based deductive reasoning.

We evaluate the performance of several LLMs across various tasks. LLMs consistently demonstrate remarkable inductive reasoning capabilities through SolverLearner, achieving near-perfect performance with an ACC of 1 in most cases. Surprisingly, despite their strong inductive reasoning abilities, LLMs tend to exhibit weaker deductive capabilities, particularly in terms of "counterfactual" reasoning. This finding, though unexpected, aligns with previous research (Wu et al., 2023). In a zero-shot scenario, the ability of an LLM to correctly execute tasks by applying principles (i.e., deductive reasoning) heavily relies on the frequency with which the model was exposed to those tasks during its pre-training phase.

2 Task Definition

Our research is focused on a relatively unexplored question: which presents a greater challenge to LLMs, deductive reasoning or inductive reasoning? To explore this, we designed a set of comparative experiments that apply a uniform task across various contexts, each emphasizing either deductive or inductive reasoning. The primary distinction between the deductive and inductive settings is whether we explicitly present input-output mappings to the models. Informally, we can describe these mappings as a function 𝑓𝑤 : 𝑋 → 𝑌, where an input 𝑥 ∈ 𝑋 is transformed into an output 𝑦 ∈ 𝑌. We distinguish between the deductive and inductive settings as follows:

• Deductive setting: we provide the models with direct input-output mappings (i.e., 𝑓𝑤).

• Inductive setting: we offer the models a few examples (i.e., (𝑥, 𝑦) pairs) while intentionally leaving out the input-output mappings (i.e., 𝑓𝑤).

For example, consider arithmetic tasks, where the base system is the input-output mapping function. The two approaches on the left side of Fig. 1 (i.e., methods (a) and (b)) follow the deductive setting, illustrating the case where the arithmetic base is explicitly provided. In contrast, the two methods on the right side of Fig. 1 (i.e., methods (c) and (d)) adhere to the inductive setting, depicting the scenario characterized by the absence of a specified arithmetic base, while a few input-output examples are provided for guidance.

3 Our Framework for Inductive Reasoning: SolverLearner

While recent studies have explored the inductive reasoning abilities of LLMs (Yang et al., 2022; Gendron et al., 2023; Xu et al., 2023b), they have primarily relied on Input-Output (IO) prompting (Mirchandani et al., 2023). This method involves providing models with a few 〈input, output〉 demonstrations and then evaluating their performance on unseen examples, as depicted in method (c) in Fig. 1. Our research suggests that the use of IO prompting and directly evaluating the final instance performance might not effectively separate LLMs' deductive reasoning skills from their inductive reasoning abilities. This is because the approach moves directly from observations to specific instances, obscuring the inductive reasoning steps. To better disentangle inductive reasoning, we propose a novel framework, SolverLearner. This framework enables LLMs to learn the function (i.e., 𝑦 = 𝑓𝑤(𝑥)) that maps input data points (𝑥) to their corresponding output values (𝑦), using only in-context examples. By focusing on inductive reasoning and setting aside LLM-based deductive reasoning, we can isolate and investigate the inductive reasoning of LLMs in its pure form via SolverLearner. SolverLearner includes two stages, as illustrated in Fig. 2:

• Function Proposal: In this initial phase, the LLM proposes a function that could be used to map input data points (𝑥) to their corresponding output values (𝑦). This corresponds to the inductive reasoning process.

• Function Execution: In the second phase, the proposed function is applied through external code interpreters to solve the test queries for evaluation purposes. This phase ensures that the LLM is fully prevented from engaging in deductive reasoning.

3.1 Framework

In this subsection, we take the arithmetic task as a case study to demonstrate the entire process.

Function Proposal: Given the in-context examples, the primary goal of the LLM is to learn a function that can map input data points (𝑥) to their corresponding output values (𝑦). This process of learning the mapping between inputs and outputs is akin to inductive reasoning, while employing the learned function to address unseen queries aligns with deductive reasoning. In order to separate inductive reasoning from deductive reasoning, the execution of the learned function should be completely detached from the LLM. To achieve this separation, external tools such as code interpreters serve as an efficient way to execute these functions independently. By encapsulating the learned function within Python code, we can effectively detach the duty of deductive reasoning from LLMs, assigning it solely to these external executors.
[Figure 2 appears here: the 8 shot examples (e.g., "The result for 71+44 is 135.", "The result for 42+70 is 132.", "The result for 50+45 is 115.", "The result for 61+55 is 136.", "The result for 63+22 is 105.", "The result for 72+62 is 154.", "The result for 57+27 is 106.", "The result for 52+76 is 150.") are given to the LLM, which proposes a Python function (① Function Proposal); an external code interpreter then runs the function on the test queries (e.g., 14+57, 44+45, 61+23, 22+77) (② Function Execution).]
Figure 2: An overview of our framework SolverLearner for inductive reasoning. SolverLearner follows a two-step process to
segregate the learning of input-output mapping functions from the application of these functions for inference. Specifically,
functions are applied through external code interpreters to avoid incorporating LLM-based deductive reasoning.
For instance, in the function proposal stage for an arithmetic task, we use the following prompt:

"You are an expert mathematician and programmer. You are asked to add two numbers, the base of which is unknown. Below are some provided examples:
The result for 76+76 is 174.
Please identify the underlying pattern to determine the base being used and implement a solver() function to achieve the goal.
def solver(n1: str, n2: str) -> str:
    # Let's write a Python program step by step
    # Each input is a number represented as a string.
    # The function computes the sum of these numbers and returns it as a string."

Function Execution: In the second phase, functions are executed through external code interpreters to solve the test cases for evaluation purposes. These code interpreters act as "oracle" deductive reasoners, fully preventing the LLM from being involved in deductive reasoning. This ensures that the final results reflect only the inductive reasoning capability of the LLM. To further decouple the LLM's influence in this phase, test cases are generated using a template without involving the LLM. More details can be found in Appendix A.1.3.
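For illustration, a minimal sketch of this execution step is shown below. The helper names (extract_solver, evaluate_solver) and the hard-coded response are ours for exposition and are not the exact released implementation; the sketch simply extracts the program between the START_CODE and END_CODE markers requested in our prompts (Appendix A.2), runs it with the Python interpreter, and scores it on template-generated test cases.

import re

def extract_solver(llm_response: str):
    # Pull the code block between START_CODE and END_CODE and return its solver().
    match = re.search(r"START_CODE(.*?)END_CODE", llm_response, re.DOTALL)
    namespace = {}
    exec(match.group(1), namespace)   # run the proposed program in isolation (sandbox in practice)
    return namespace["solver"]        # the prompt asks the LLM to define solver()

def evaluate_solver(solver, test_pairs):
    # Accuracy of the proposed function on (query, gold answer) pairs built from a template.
    hits = sum(solver(a, b) == gold for (a, b), gold in test_pairs)
    return hits / len(test_pairs)

# Illustrative LLM response containing the base-8 solver shown in Fig. 1(d).
response = """START_CODE
def solver(n1: str, n2: str) -> str:
    num1 = int(n1, 8)
    num2 = int(n2, 8)
    return oct(num1 + num2)[2:]
END_CODE"""

solver = extract_solver(response)
test_pairs = [(("76", "76"), "174"), (("57", "27"), "106")]   # base-8 ground truth
print(evaluate_solver(solver, test_pairs))                    # -> 1.0

Because the proposed function is executed outside the LLM, any error at this stage reflects a faulty induced rule rather than faulty deduction.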
4 Tasks

In this section, we provide a brief overview of the tasks under consideration. Our focus is on investigating the reasoning abilities of LLMs in both deductive and inductive reasoning scenarios. To ensure a robust evaluation, we carefully select tasks that lend themselves well to comparison. Firstly, to prevent LLMs from reciting tasks seen frequently during pre-training, which could artificially inflate performance in deductive reasoning, a significant portion of the tasks falls into the category of "counterfactual reasoning" tasks. Secondly, in the context of inductive reasoning, where only a few in-context examples are available without the mapping function, our objective is to learn the function that maps inputs to outputs based on this restricted dataset. To achieve this, we choose tasks that are well constrained, ensuring the existence of a single, unique function capable of fitting this limited data. Detailed descriptions of each task and the prompts used can be found in Appendix A.1 and A.2.

Arithmetic  In this study, we focus on the two-digit addition task previously explored in the work of Wu et al. (2023). We investigate multiple numerical bases, specifically base-8, 9, 10, 11, and 16, where base 10 corresponds to the case most commonly observed during pretraining. In the context of deductive reasoning, the base is explicitly provided without any accompanying in-context examples, and the LLM is expected to perform the addition by relying on its inherent deductive reasoning abilities. Conversely, in the context of inductive reasoning, instead of explicitly providing the base information to LLMs, we provide them solely with few-shot examples and require them to induce the base from these examples and subsequently generate a function to solve the arithmetic problems.
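To give intuition for why this task is well suited to inductive evaluation, the sketch below enumerates candidate bases, checks which are consistent with the in-context examples, and builds the corresponding solver; for examples of the kind used in our prompts, a single base survives, so the target function is uniquely determined. The helper names (consistent_bases, make_solver) are hypothetical and this is illustrative reference code, not the models' output.

def consistent_bases(examples, candidates=range(2, 17)):
    # Return every base under which all (a, b, result) example strings hold.
    bases = []
    for base in candidates:
        try:
            if all(int(a, base) + int(b, base) == int(r, base) for a, b, r in examples):
                bases.append(base)
        except ValueError:   # a digit is not valid in this base
            continue
    return bases

def make_solver(base):
    # Build the addition function for the inferred base, mirroring solver(n1, n2).
    def solver(n1: str, n2: str) -> str:
        total = int(n1, base) + int(n2, base)
        digits = "0123456789abcdef"
        out = ""
        while total:
            out = digits[total % base] + out
            total //= base
        return out or "0"
    return solver

examples = [("76", "76", "174"), ("57", "27", "106")]
print(consistent_bases(examples))    # [8]  -- only one base fits, so the task is well constrained
print(make_solver(8)("36", "33"))    # 71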
Basic Syntactic Reasoning  In this setting, we concentrate on tasks related to syntactic recognition previously explored by Wu et al. (2023). Our objective is to evaluate LLMs using artificially constructed English sentences that vary from the conventional subject-verb-object (SVO) word order. For deductive reasoning, we directly provide the new word order to LLMs without any contextual examples, challenging them to identify the subject, verb, and object within this artificial language. In contrast, for inductive reasoning, we do not give explicit instructions on the changes in word order. Instead, we introduce sentence pairs where one sentence follows the standard word order and the other follows a modified sequence. Through this setting, LLMs are expected to learn the specific changes made to the word order and then apply this learned rule to identify the subject, verb, and object within new sentences.
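As a point of reference, the rule to be induced here is simply a permutation of the subject-verb-object roles. A minimal illustration, assuming an OSV order and the dictionary format requested in Table 2 (the function name is ours), is:

def tag_sentence(sentence: str, learned_order: str = "OSV") -> dict:
    # Map each word of a three-word sentence to its role under a learned word order.
    words = sentence.split()
    roles = {"S": "subject", "V": "verb", "O": "object"}
    return {roles[r]: w for r, w in zip(learned_order, words)}

# "shirts sue hates" in an OSV language corresponds to "sue hates shirts" in English.
print(tag_sentence("shirts sue hates"))
# {'object': 'shirts', 'subject': 'sue', 'verb': 'hates'}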
Spatial Reasoning  In this task, we delve into the spatial reasoning setting previously investigated by Wu et al. (2023). Our specific focus is on modifying the direction-unit vector mapping and determining the object coordinates in this revised system. We explore multiple systems, starting with the case commonly observed during pretraining, where up corresponds to north, down to south, left to west, and right to east. This is compared to coordinate systems with swapped, rotated, and randomly permuted axes. For deductive reasoning, we directly provide the direction-unit vector mapping without any contextual examples, requiring LLMs to compute the object coordinates within these systems. Conversely, in the context of inductive reasoning, instead of directly explaining the changes made to the direction-unit vector mapping to LLMs, we present LLMs with a few example shots and challenge them to infer the changes made to the mapping. They are then expected to apply this learned function to determine the object coordinates in the system.
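For the default (unmodified) system, the kind of solver() the model is expected to produce can be sketched as follows, following the coordinate convention used in the prompts of Table 3, where each object sits at the midpoint of the wall its direction points to. This sketch is illustrative and covers only the default mapping; the counterfactual systems swap, rotate, or permute the direction-to-vector entries.

def solver(room: dict) -> list:
    # Place each object at the wall midpoint indicated by its cardinal direction.
    w, h = room["width"], room["height"]
    # Default direction-to-unit-vector mapping: north=(0,1), south=(0,-1), east=(1,0), west=(-1,0).
    unit = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
    placed = []
    for obj in room["objects"]:
        dx, dy = unit[obj["direction"]]
        x = w // 2 + dx * (w // 2)   # e.g., east -> x = 500, west -> x = 0
        y = h // 2 + dy * (h // 2)   # e.g., north -> y = 500, south -> y = 0
        placed.append({"name": obj["name"], "x": x, "y": y})
    return placed

room = {"name": "laundry room", "width": 500, "height": 500,
        "objects": [{"name": "dryer", "direction": "east"},
                    {"name": "sink", "direction": "west"},
                    {"name": "washing machine", "direction": "south"}]}
print(solver(room))
# [{'name': 'dryer', 'x': 500, 'y': 250}, {'name': 'sink', 'x': 0, 'y': 250},
#  {'name': 'washing machine', 'x': 250, 'y': 0}]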
Cipher Decryption  Under this scenario, we explore an innovative task that we have created, concentrating on the decryption of strings encrypted using specific cipher systems. We have incorporated three particular cipher systems for this exploration: the Alphabetically Sorting Cipher, the Caesar Cipher, and the Morse Cipher. For deductive reasoning, we directly inform LLMs about the cipher system being used, yet we do not offer any contextual examples. The objective for LLMs is to decode strings according to these cipher systems. Conversely, in the inductive reasoning scenario, our task involves providing LLMs with several examples, each consisting of an encrypted string and its decrypted version. The main challenge for the models in this scenario is first to identify which cipher system was used and then to apply that cipher system to decrypt an unseen string.
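As an example of the functions SolverLearner is expected to induce here, the Alphabetically Sorting Cipher from our prompts (Table 4) maps a string to its characters in alphabetical order, so a correct solver() is a one-liner. The code below is an illustrative reference, not taken verbatim from a model's output.

def solver(sequence: str) -> str:
    # Alphabetically Sorting Cipher: the decoded string is the input's letters in sorted order.
    return "".join(sorted(sequence))

print(solver("family"))   # afilmy
print(solver("school"))   # chloos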
5 Results

For each task, we evaluate our proposed SolverLearner for pure LLM inductive reasoning, as well as the other settings, using two different models, gpt-3.5-turbo-1106 and gpt-4-1106-preview, which are denoted as GPT-3.5 and GPT-4 respectively. Since both models are closed-source, we do not provide specific information about their size, architecture, and pre-training particulars. Our experiments primarily focus on investigating the reasoning abilities of LLMs in both deductive and inductive reasoning scenarios. Therefore, we structure our evaluation across two distinct settings to highlight each type of reasoning. The formal definition of each setting is provided in Sec. 2. For the deductive setting, two methods are proposed for investigation:

• Zero-shot evaluates the deductive reasoning ability of the LLMs in its pure form. It tests the LLM's ability to conclude information about specific individuals based solely on instructions, without relying on examples.

• 8-IO w/ Mapping Function (MF) follows the deductive setting but further enhances LLM reasoning by incorporating in-context examples. It aligns with the most commonly used prompting methods for enabling LLM reasoning. With the inclusion of in-context examples, this approach can be seen as leveraging inductive reasoning to augment deductive reasoning.

For the inductive setting, we propose two methods for evaluation:

• 8-IO w/o Mapping Function (MF) aligns with traditional input-output (IO) prompting methods widely used to investigate the inductive reasoning capability of LLMs. However, as this method proceeds directly from a set of observations to specific target instances, it remains intertwined with LLM-based deductive reasoning.

• 8-shot SolverLearner corresponds to our proposed framework for inductive reasoning, capable of evaluating the inductive reasoning ability of the LLMs in its pure form. It segregates the learning of input-output mapping functions from the application of these functions for inference, thereby preventing the blending of LLM-based deductive reasoning into the process.

Besides using 8-shot examples, our study also includes experiments with 16-shot examples to assess how changes in the number of in-context examples impact the results. Experimental results are given in Appendix A.3. Generally, the results indicate
[Figure 3 appears here: per-task bar charts (Arithmetic, Basic Syntax, Spatial, Cipher Decryption) with ACC on the y-axis, comparing 0-shot and 8-IO w/ MF.]
Figure 3: Comparison of the deductive reasoning abilities of LLMs across various tasks. Different methods are illustrated
through color-coded bars: blue bars indicate the results achieved using Zero-shot, while orange bars show the performance of
8-IO w/ Mapping Function (MF).
Figure 4: Comparison of the inductive reasoning abilities of LLMs across various tasks. Different methods are illustrated
through color-coded bars: blue bars indicate the results achieved using our proposed SolverLearner, while orange bars show the
performance of 8-IO w/o Mapping Function (MF).
Figure 5: Comparison of the inductive reasoning abilities versus deductive reasoning abilities of LLMs across various
tasks. Different methods are illustrated through color-coded bars: blue bars indicate the results achieved using our proposed
SolverLearner for inductive reasoning, while orange bars show the performance of Zero-shot for deductive reasoning.
that an increase in the number of in-context examples yields only slight improvements across both deductive and inductive reasoning scenarios. Furthermore, we conduct an ablation study concerning our proposed SolverLearner in Appendix A.5 for deeper insights into its functionality.

5.1 Main Results

The results for all tasks are presented in Fig. 3 through Fig. 5. Specifically, Fig. 3 concentrates on comparing performances in the deductive setting, while Fig. 4 examines comparisons in the inductive setting. Additionally, Fig. 5 focuses on contrasting the models' capabilities across the deductive and inductive settings. For further reference, the prompts used for all tasks are included in Appendix A.2, and the full numerical results can be found in Appendix A.3.

LLMs exhibit poor deductive reasoning capabilities, particularly in "counterfactual" tasks. We include two methods in Fig. 3, Zero-shot and 8-IO w/ Mapping Function (MF), to illustrate the deductive reasoning capability of LLMs. Our observations reveal that LLMs exhibit relatively weaker deductive capabilities, especially in "counterfactual" tasks, while showing prowess in standard tasks like base-10 arithmetic. This aligns with findings reported in (Wu et al., 2023). Integration of in-context examples notably enhances LLMs' performance in various scenarios, suggesting that their improvement stems from the acquisition of knowledge through inductive reasoning over these examples. This further confirms the exceptional inductive reasoning abilities of LLMs. This combined evidence suggests that LLMs face challenges in precisely following instructions and executing commands, especially when those instructions relate to scenarios rarely encountered during their pre-training phase.

LLMs demonstrate remarkable inductive reasoning capabilities through SolverLearner. We include two methods in Fig. 4, SolverLearner (Ours) and 8-IO w/o Mapping Function (MF), to illustrate the inductive reasoning capability of LLMs. While 8-IO w/o Mapping Function (MF) struggles with inductive reasoning, SolverLearner consistently achieves perfect performance with an accuracy of 1 across all the cases with GPT-4 and succeeds in most cases when used with GPT-3.5. This discrepancy arises because the utilization of IO prompting to directly reach conclusions on target instances may not effectively distinguish between LLMs' deductive and inductive reasoning skills. By completely disentangling the inductive reasoning of LLMs, our proposed SolverLearner shows the remarkable inductive reasoning capabilities inherent in LLMs. It is also noteworthy that the efficacy of LLMs' inductive reasoning capability heavily depends on the foundational model, with GPT-4 consistently outperforming GPT-3.5.

Deductive reasoning presents a greater challenge than inductive reasoning for LLMs. To compare the challenge posed by deductive reasoning with that posed by inductive reasoning, we include two methods in Fig. 5, SolverLearner and Zero-shot, which demonstrate pure inductive and pure deductive reasoning abilities, respectively. The entire reasoning process involves two steps: first, obtaining the input-output function (𝑓𝑤), which corresponds to inductive reasoning, and second, applying the function for inference, which corresponds to deductive reasoning. Once both steps are successfully completed, perfect performance is observed, as indicated by the dotted line in the figure. Zero-shot can be seen as replacing the first step with an oracle, leaving the deductive reasoning capability of LLMs to be studied, while SolverLearner can be seen as replacing the second step with an oracle, leaving the inductive reasoning capability of LLMs to be studied. By comparing the gaps of SolverLearner and Zero-shot towards perfect reasoning, we observe that in most cases LLMs can complete the inductive step perfectly, while they rarely achieve perfect performance on the deductive step. This indicates that, in LLM reasoning, deductive reasoning presents the greater challenge. Note that we avoid phrasing this as a direct comparison of inductive and deductive reasoning capabilities. Instead, we examine whether the gaps mainly come from the inductive or the deductive step, considering that LLMs could not achieve perfect counterfactual reasoning.

5.2 More Results over Additional LLMs

To validate the generalizability of our conclusion, we have included results over an additional LLM, claude-3-sonnet-20240229-v1:0, which is denoted as Claude3. Due to space limitations, the full numerical results are provided in Appendix A.4.

5.3 Ablation Study

We conducted several experiments to gain a deeper understanding of our framework, detailed in the ablation studies in Appendix A.5. These experiments include investigating the effects of programs executed by a Python interpreter versus natural language executed by an LLM, and examining the impact of the number of in-context learning examples.
6 Related Works

6.1 In-Context Learning

GPT-3 (Brown et al., 2020) has demonstrated its effectiveness in learning from a few demonstration examples and solving previously unseen tasks without requiring updates to its model parameters (Wei et al., 2022a). This remarkable capability is commonly referred to as the "in-context learning ability" of language models. It implies that LLMs can leverage their existing knowledge and generalize from a few demonstration examples to solve new, related tasks (Dong et al., 2022; Liu et al., 2021; Rubin et al., 2021; Gonen et al., 2022). Notable works include chain-of-thought (CoT) prompting (Wei et al., 2022b), which elicits reasoning with intermediate steps in few-shot exemplars. Built upon the CoT framework, several works expand CoT by organizing and processing thoughts using more complex structures, such as trees (Yao et al., 2023) and graphs (Besta et al., 2023), or by breaking a problem into sub-problems and then solving each one independently (Zhou et al., 2022). While these studies have effectively improved the reasoning capability of LLMs, they have failed to clearly distinguish between inductive and deductive reasoning, let alone investigate which represents a more critical limitation for LLM reasoning capabilities: deductive reasoning or inductive reasoning.

6.2 Exploring LLMs' Reasoning Skills

Despite the impressive achievements of LLMs in various reasoning tasks, the underlying mechanisms of their reasoning capabilities remain a subject of debate. The question of whether LLMs genuinely reason in a manner akin to human cognitive processes or merely simulate aspects of reasoning without true comprehension is still open (Huang and Chang, 2022). For instance, Kojima et al. have suggested that LLMs exhibit commendable zero-shot reasoning abilities, implying that these models can draw logical conclusions in scenarios they have not been explicitly trained on (Kojima et al., 2022). However, some researchers cast doubt on the reasoning capability of LLMs. While approaches like the chain-of-thought method may mimic human-like thought processes, it remains uncertain whether LLMs are genuinely engaging in reasoning or simply following patterns learned during training (Wei et al., 2022b; Valmeekam et al., 2022). Additionally, there is a debate regarding whether LLMs are symbolic reasoners (Tang et al., 2023) or possess strong abstract reasoning capabilities (Gendron et al., 2023). In light of these seemingly contradictory conclusions, our research aims to delve deeper into the reasoning capabilities of LLMs. We intend to dissect the nuances of inductive and deductive reasoning within the context of LLMs, identifying which form of reasoning presents a more significant challenge to their reasoning abilities.

6.3 Equipping LLMs with External Tools

Large Language Models (LLMs) have made significant progress in utilizing tools through frameworks like CREATOR (Qian et al., 2023) and LATM (Cai et al., 2023), which allow LLMs to create tools using documentation and code. Logic-LM (Pan et al., 2023) integrates LLMs with symbolic solvers to improve logical problem-solving. However, these approaches focus exclusively on deductive reasoning, aiming to enable LLMs to derive correct answers for specific questions without incorporating the capacity for inductive reasoning to infer the underlying mapping function shared by few-shot examples. In contrast, our primary objective is not to propose a new framework for using tools to enhance the problem-solving capabilities of LLMs. Instead, we aim to differentiate between deductive and inductive reasoning within LLMs and explore which presents a greater challenge to their reasoning abilities.

7 Conclusion

This study aims to explore a less-investigated aspect of LLMs: within LLM reasoning, which presents a greater challenge, deductive or inductive reasoning? To delve into the inductive reasoning capacities of LLMs, we introduce a novel framework called SolverLearner. By concentrating on inductive reasoning while setting aside LLM-based deductive reasoning, SolverLearner can scrutinize the pure form of inductive reasoning in LLMs. Our findings unveil remarkable inductive reasoning prowess in LLMs through SolverLearner, achieving near-perfect performance with an ACC of 1 in most cases. Surprisingly, despite their strong inductive reasoning abilities, LLMs often exhibit weaker deductive capabilities, particularly in tasks involving "counterfactual" scenarios.
Limitations

LLMs cannot perform inductive reasoning over all the tasks. In our inductive learning setting, LLMs are provided with only a limited number of contextual examples. The goal is to infer the function that accurately maps inputs to outputs based solely on this constrained dataset. To solve this problem, it is essential that a unique function can be identified from these examples. For instance, a linear function can be precisely determined from just two data points, as it admits a single solution. However, attempting to deduce a quadratic curve from two points is an insurmountable challenge, because infinitely many such functions pass through those two points. Additionally, LLMs might struggle to discern the correct mapping function when the search space of the problem expands excessively. Consider the case of arithmetic tasks: without limiting the search space to finding a suitable base that aligns with the observations, the task becomes overwhelmingly complex, because the search space could encompass any conceivable rule that accommodates the observations.

The effectiveness of LLMs' inductive reasoning capability is heavily reliant on the foundational model. While GPT-4 consistently showcases impressive inductive reasoning abilities through SolverLearner and achieves perfect performance with an ACC of 1 across all the tasks, GPT-3.5 struggles to learn the correct input-output mapping function in several cases. This observation suggests that the inductive reasoning potential of LLMs is significantly constrained by the underlying model.

Chain of Thought (CoT) has not been incorporated into the comparison. Chain of Thought (CoT) is a significant prompting technique designed for use with LLMs. Rather than providing a direct answer, CoT elicits reasoning with intermediate steps in few-shot exemplars. This method was not incorporated into our comparison as it is viewed as a technique to improve the deductive reasoning capabilities of LLMs. Although CoT has proven to be effective across various tasks, numerous studies highlight a significant performance gap that CoT still needs to bridge to achieve flawless execution.

Ethical Considerations

The authors foresee no ethical concerns with the research presented in this paper.

References

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. 2023. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv preprint arXiv:2305.17126.

Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark. 2021. Explaining answers with entailment trees. arXiv preprint arXiv:2104.08661.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Gaël Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. 2023. Large language models are not abstract reasoners. arXiv preprint arXiv:2305.19555.

Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. 2022. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037.

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, et al. 2022. Folio: Natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840.

Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804.

Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, and Andy Zeng. 2023. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721.
OpenAI. 2023. Gpt-4 technical report. ArXiv, abs/2303.08774.

Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295.

Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. 2023. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L Hamilton. 2019. Clutrr: A diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177.

Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. 2023. Large language models are in-context semantic reasoners rather than symbolic reasoners. arXiv preprint arXiv:2305.14825.

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can't plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2023. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.

Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. 2023a. Are large language models really good logical reasoners? a comprehensive evaluation from deductive, inductive and abductive views. arXiv preprint arXiv:2306.09841.

Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, and Elias B Khalil. 2023b. Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations. arXiv preprint arXiv:2305.18354.

Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, Xiaodong Liu, Jianfeng Gao, and Furu Wei. 2022. Language models as inductive reasoners. arXiv preprint arXiv:2212.10923.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. Reclor: A reading comprehension dataset requiring logical reasoning. arXiv preprint arXiv:2002.04326.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
A Appendix

A.1 Full Setups

SolverLearner is a prompting-based reasoning approach, and we only need to perform inference with LLMs.

A.1.1 Settings for Each Task

Arithmetic  The arithmetic dataset introduced in Wu et al.'s paper (Wu et al., 2023) comprises 1,000 randomly selected addition expressions, each involving two-digit numbers. These expressions are drawn from bases 8, 9, 10, 11, and 16, with separate sampling for each base. Importantly, all the expressions have been carefully chosen to yield distinct results when evaluated in their respective bases, thereby distinguishing them from one another during the process of rule learning.

Basic Syntactic Reasoning  In accordance with the methodology outlined in Wu et al.'s work (Wu et al., 2023), we have generated a set of 100 simple three-word sentences (e.g., "bob likes bananas") with five different word order variations (e.g., "bananas bob likes" in OSV format). Subsequently, we tasked LLMs with learning how to manipulate sentence order. It is noteworthy that we took great care in selecting words to ensure that each word in a sentence can only fulfill one specific role, such as subject, object, or verb. For instance, we ensured that sentences like "bob likes anna" were excluded, as both "bob" and "anna" could potentially serve as both subjects and objects, violating this constraint.

Spatial Reasoning  The spatial reasoning dataset introduced in Wu et al.'s paper (Wu et al., 2023) consists of 100 rooms that were randomly selected, and each room contains three distinct objects. The spatial directions within these rooms are represented using unit vectors. For instance, north is represented as (0, 1), south as (0, -1), east as (1, 0), and west as (-1, 0), with a y-axis pointing upward serving as the default orientation. In our study, we have modified the mapping between directions and unit vectors and tasked LLMs with learning this new direction-to-unit-vector relationship. We explore two direction-swapped scenarios (north-south and east-west), three rotated scenarios (by 90°, 180°, and 270°), and a randomly permuted scenario. The primary metric we report is instance-level accuracy, which necessitates that all three objects within a room be correctly positioned in order to be considered accurate.
Cipher Decryption  We've generated a collection of 100 pairs of strings (e.g., "Mrxuqhb -> Journey" for the Caesar Cipher) for each of the three cipher systems, including the Alphabetically Sorting Cipher, the Caesar Cipher, and the Morse Cipher. Each pair comprises an encrypted string (e.g., "Mrxuqhb") and its corresponding decrypted version (e.g., "Journey"). By providing LLMs with several examples, each containing an encrypted string alongside its corresponding decrypted counterpart, the primary task is to accurately determine the cipher system employed in an open-world context.
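The example pair above is consistent with the classic Caesar shift of three positions, so a correct solver() for this cipher system can be sketched as follows (illustrative reference code, not a model's actual output):

def solver(sequence: str) -> str:
    # Caesar Cipher: shift every letter back by 3 positions, wrapping around the alphabet.
    decoded = []
    for ch in sequence:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            decoded.append(chr((ord(ch) - base - 3) % 26 + base))
        else:
            decoded.append(ch)
    return "".join(decoded)

print(solver("Mrxuqhb"))   # Journey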
A.1.2 Few-shot Example Generation

The preparation of examples for few-shot learning follows a straightforward process. We divide all the data into a training set and a test set, and the few-shot examples are drawn from the training set. These few-shot examples are automatically prepared by associating queries with their corresponding ground-truth answers using a pre-defined template.
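A minimal sketch of this template-based preparation for the arithmetic task is shown below (the helper name and template string are ours for illustration; the released script may differ):

def build_few_shot_block(training_pairs, template="The result for {a}+{b} is {r}."):
    # Render (a, b, result) training pairs into the in-context example block.
    return "\n".join(template.format(a=a, b=b, r=r) for a, b, r in training_pairs)

train = [("71", "44", "135"), ("42", "70", "132"), ("57", "27", "106")]
print(build_few_shot_block(train))
# The result for 71+44 is 135.
# The result for 42+70 is 132.
# The result for 57+27 is 106.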
A.1.3 Test Case Generation

In the function execution phase, the test cases are generated using a template without involving the LLM. In particular, the test cases are drawn from the test data files, which contain all the queries along with their correct answers (e.g., "76+76 = 174"). When the LLM is used for generating code, we specify a function interface, such as def solver(n1: str, n2: str) -> str. Then, using the query examples provided, like "76+76 = 174", we create test cases by applying this function interface to the query (e.g., solver("76", "76")), thereby eliminating any reliance on the LLM for this process. This method ensures that our test case generation is 100% correct.

A.2 Full Prompts

We provide the prompts that we used to query the LLMs for all tasks in Tables 1 to 4. We do not use the system message field for any model.

A.3 Full Results

We show the full numerical results in Tables 5 to 8. In addition to using 8-shot examples, these results also include experiments with 16-shot examples to assess how changes in the number of in-context examples impact the results.

A.4 More Results on Additional LLMs

To validate the generalizability of our conclusion, we have included an additional LLM, claude-3-sonnet-20240229-v1:0, which is denoted as Claude3. We show the full numerical results in Tables 9 to 12.
Mode Prompt
Zero-shot You are a mathematician. Assuming that all numbers are in base-8 where the digits are "01234567",
what is 36+33? End the response with the result in "\boxed{result}".
Few-shot IO w/ MF You are a mathematician. You are asked to add two numbers. Assuming that all numbers are in
base-8 where the digits are "01234567". Below are some provided examples:
The result for 76+76 is 174.
Please identify the base being used and determine what is 36+33? End the response with the result
in "\boxed{result}".
Few-shot IO w/o MF You are a mathematician. You are asked to add two numbers, the base of which is unknown. Below
are some provided examples:
The result for 76+76 is 174.
Please identify the base being used and determine what is 36+33? End the response with the result
in "\boxed{result}".
SolverLearner You are an expert mathematician and programmer. You are asked to add two numbers, the base of
which is unknown. Below are some provided examples:
The result for 76+76 is 174.
Please identify the underlying pattern to determine the base being used and implement a solver()
function to achieve the goal.
def solver(n1: str, n2: str) -> str:
# Let’s write a Python program step by step
# Each input is a number represented as a string.
# The function computes the sum of these numbers and returns it as a string.
After defining the solver() function, create test cases based on the input examples and print the results.
An example of a test case could be "print(solver("76", "76"))". Place the function solver() as well as
the test cases between "START_CODE" and "END_CODE".
Mode Prompt
Zero-shot You are an expert in linguistics. Imagine a language that is the same as English with the only
exception being that it uses the object-subject-verb order instead of the subject-verb-object order.
Please identity the subject, verb, and object in the following sentences from this invented language:
shirts sue hates.
Encode the identified subject, verb, and object in the form of a dictionary with the following structure:
{’subject’: ?, ’verb’: ?, ’object’: ?}.
Few-shot IO w/ MF As a linguistics expert, your objective is to analyze sentences in a constructed language that shares
English vocabulary but uses the object-subject-verb order instead of the subject-verb-object order.
Presented below are examples of valid sentences in this constructed language, accompanied by their
corresponding English translations.
A sentence in this invented language: phones mary finds. Its equivalent sentence in English reads:
mary finds phones.
Following the examples, please analyze the subject, verb, and object in the following sentences from
this invented language:
shirts sue hates.
Encode the identified subject, verb, and object in the form of a dictionary with the following structure:
{’subject’: ?, ’verb’: ?, ’object’: ?}.
Few-shot IO w/o MF As a linguistics expert, your objective is to analyze sentences in a constructed language that shares
English vocabulary but follows a unique grammatical structure. Presented below are examples of valid
sentences in this constructed language, accompanied by their corresponding English translations.
A sentence in this invented language: phones mary finds. Its equivalent sentence in English reads:
mary finds phones.
Following the examples, please analyze the subject, verb, and object in the following sentences from
this invented language:
shirts sue hates.
Encode the identified subject, verb, and object in the form of a dictionary with the following structure:
{’subject’: ?, ’verb’: ?, ’object’: ?}.
SolverLearner As a linguistics expert, your objective is to analyze sentences in a constructed language that shares
English vocabulary but follows a unique grammatical structure.Presented below are examples of valid
sentences in this constructed language, accompanied by their corresponding English translations.
A sentence in this invented language: phones mary finds. Its equivalent sentence in English reads:
mary finds phones.
Please summarize the pattern concerning the order of subject, verb and object in this invented
linguistic system. Place the pattern between START_PATTERN and END_PATTERN.
Table 3: Prompts for the Spatial Reasoning Task.
Mode Prompt
Zero-shot You are in the middle of a room. You can assume that the room’s width and height are both 500
units. The layout of the room in the following format:
’name’: ’bedroom’, ’width’: 500, ’height’: 500, ’directions’: ’north’: [0, 1], ’south’: [0, -1], ’east’:
[1, 0], ’west’: [-1, 0], ’objects’: [’name’: ’chair’, ’direction’: ’east’, ’name’: ’wardrobe’, ’direction’:
’north’, ’name’: ’desk’, ’direction’: ’south’]
Please provide the coordinates of objects whose positions are described using cardinal directions,
under a conventional 2D coordinate system using the following format:
[’name’: ’chair’, ’x’: ’?’, ’y’: ’?’, ’name’: ’wardrobe’, ’x’: ’?’, ’y’: ’?’, ’name’: ’desk’, ’x’: ’?’, ’y’:
’?’]
Few-shot IO w/ MF You are an expert programmer. You are in the middle of a room. You can assume that the room’s
width and height are both 500 units. The layout of the room in the following format:
’name’: ’laundry room’, ’width’: 500, ’height’: 500, ’directions’: ’north’: [0, 1], ’south’: [0, -1],
’east’: [1, 0], ’west’: [-1, 0], ’objects’: [’name’: ’dryer’, ’direction’: ’east’, ’name’: ’sink’, ’direction’:
’west’, ’name’: ’washing machine’, ’direction’: ’south’]
Please provide the coordinates of objects whose positions are described using cardinal directions,
under a conventional 2D coordinate system. For example, the coordinates of objects in the above
example is:
[’name’: ’dryer’, ’x’: 500, ’y’: 250, ’name’: ’sink’, ’x’: 0, ’y’: 250, ’name’: ’washing machine’, ’x’:
250, ’y’: 0]
Following the examples, please give the coordinates of objects in the following room using the same
format:
’name’: ’bedroom’, ’width’: 500, ’height’: 500, ’directions’: ’north’: [0, 1], ’south’: [0, -1], ’east’:
[1, 0], ’west’: [-1, 0], ’objects’: [’name’: ’chair’, ’direction’: ’east’, ’name’: ’wardrobe’, ’direction’:
’north’, ’name’: ’desk’, ’direction’: ’south’]
Few-shot IO w/o MF You are in the middle of a room. You can assume that the room’s width and height are both 500
units. The layout of the room in the following format:
’name’: ’laundry room’, ’width’: 500, ’height’: 500, ’objects’: [’name’: ’dryer’, ’direction’: ’east’,
’name’: ’sink’, ’direction’: ’west’, ’name’: ’washing machine’, ’direction’: ’south’]
Please provide the coordinates of objects whose positions are described using cardinal directions,
under a conventional 2D coordinate system. For example, the coordinates of objects in the above
example is:
[’name’: ’dryer’, ’x’: 500, ’y’: 250, ’name’: ’sink’, ’x’: 0, ’y’: 250, ’name’: ’washing machine’, ’x’:
250, ’y’: 0]
Following the examples, please give the coordinates of objects in the following room using the same
format:
’name’: ’bedroom’, ’width’: 500, ’height’: 500, ’objects’: [’name’: ’chair’, ’direction’: ’east’, ’name’:
’wardrobe’, ’direction’: ’north’, ’name’: ’desk’, ’direction’: ’south’]
SolverLearner You are an expert programmer. You are in the middle of a room. You can assume that the room’s
width and height are both 500 units. The layout of the room in the following format: ’name’: ’laundry
room’, ’width’: 500, ’height’: 500, ’objects’: [’name’: ’dryer’, ’direction’: ’east’, ’name’: ’sink’,
’direction’: ’west’, ’name’: ’washing machine’, ’direction’: ’south’]
Please provide the coordinates of objects whose positions are described using cardinal directions,
under a conventional 2D coordinate system. For example, the coordinates of objects in the above
example is:
[’name’: ’dryer’, ’x’: 500, ’y’: 250, ’name’: ’sink’, ’x’: 0, ’y’: 250, ’name’: ’washing machine’, ’x’:
250, ’y’: 0]
Please summarize the pattern and implement a solver() function to achieve the goal.
def solver():
# Let’s write a Python program step by step
# the input is the layout of the room
# the output the coordinates of objects
After defining the solver() function. Place the function solver() between "START_CODE" and
"END_CODE".
Table 4: Prompts for the Cipher Decryption Task.
Mode Prompt
Zero-shot As an expert cryptographer and programmer, your task involves reordering the character sequence
according to the alphabetical order to decrypt secret messages. Please decode the following sequence:
spring
Please answer the question by placing the decoded sequence between "START_DECODING" and
"END_DECODING".
Few-shot IO w/ MF As an expert cryptographer and programmer, your task involves reordering the character sequence
according to the alphabetical order to decrypt secret messages. For example, given the sequence
"family," you must translate it into "afilmy." Below are further examples that demonstrate the
translation:
school -> chloos
Following the examples, please decode the following sequence:
spring
Please answer the question by placing the decoded sequence between "START_DECODING" and
"END_DECODING".
Few-shot IO w/o MF As an expert cryptographer and programmer, your task involves deciphering secret messages. For
example, given the sequence "family," you must translate it into "afilmy." Below are further examples
that demonstrate the translation:
school -> chloos
Following the examples, please decode the following sequence:
spring
Please answer the question by placing the decoded sequence between "START_DECODING" and
"END_DECODING".
SolverLearner As an expert cryptographer and programmer, your task involves deciphering secret messages. For
example, given the sequence "family," you must translate it into "afilmy." Below are further examples
that demonstrate the translation:
school -> chloos
Please deduce the encryption system and develop a solver() function for the decryption.
def solver():
# Let’s write a Python program step by step
# the input is the coded sequence
# the output is the decoded sequence
After defining the solver() function. Place the function solver() between "START_CODE" and
"END_CODE".
Base
8 9 10 11 16
Method
Zero-shot 0.330 0.117 1 0.066 0.294
8-IO w/ MF 0.376 0.089 1 0.089 0.849
8-IO w/o MF 0.120 0.027 0.905 0.057 0.587
GPT-3.5
16-IO w/ MF 0.428 0.088 1 0.098 0.912
16-IO w/o MF 0.108 0.025 0.924 0.063 0.575
8-shot SolverLearner 0.571 0.462 1 0.095 1
Zero-shot 0.600 0.697 0.999 0.551 0.819
8-IO w/ MF 0.576 0.717 0.860 0.540 0.862
8-IO w/o MF 0.255 0.268 0.545 0.264 0.431
GPT-4
16-IO w/ MF 0.543 0.720 0.817 0.534 0.840
16-IO w/o MF 0.257 0.245 0.505 0.237 0.435
8-shot SolverLearner 1 1 1 1 1
Table 6: Full Main Results for Basic Syntactic Reasoning.
Word Order
OSV OVS SOV VOS VSO
Method
Zero-shot 0.560 0.298 0.190 0.226 0.560
8-IO w/ MF 1 0.643 0.583 0.976 0.988
8-IO w/o MF 1 0.452 0.929 0.988 1
GPT-3.5
16-IO w/ MF 1 0.738 0.762 0.988 0.952
16-IO w/o MF 1 0.190 0.964 1 1
8-shot SolverLearner 0.988 1 1 1 1
Zero-shot 1 1 1 1 1
8-IO w/ MF 1 1 1 1 1
8-IO w/o MF 1 1 1 1 1
GPT-4
16-IO w/ MF 1 1 1 1 1
16-IO w/o MF 1 0.988 1 1 1
8-shot SolverLearner 1 1 1 1 1
Coordinates
Default S-NS S-WE R90 R180 R270 Random
Method
Zero-shot 0.273 0.702 0.143 0.012 0.310 0.060 0.024
8-IO w/ MF 0.952 0.845 0.869 0.25 0.976 0.060 0.095
8-IO w/o MF 0.369 0.726 0.310 0.083 0.690 0.107 0.071
GPT-3.5
16-IO w/ MF 0.929 0.893 0.857 0.274 0.952 0.071 0.131
16-IO w/o MF 0.452 0.667 0.452 0.083 0.798 0.131 0.083
8-shot SolverLearner 1 1 0 0 1 0 0
Zero-shot 0.119 0.060 0.083 0.024 0.048 0.012 0.036
8-IO w/ MF 1 1 0.964 0.643 0.952 0.679 0.190
8-IO w/o MF 1 0.976 0.929 0.560 0.976 0.429 0.333
GPT-4
16-IO w/ MF 1 1 0.952 0.690 0.929 0.667 0.214
16-IO w/o MF 1 0.976 0.964 0.607 0.976 0.405 0.369
8-shot SolverLearner 1 1 1 1 1 1 1
Encryption System
Alphabetically Sorting Cipher Caesar Cipher Morse Cipher
Method
Zero-shot 0.560 0.036 0.512
8-IO w/ MF 0.595 0.024 0.464
8-IO w/o MF 0.560 0 0.452
GPT-3.5
16-IO w/ MF 0.619 0.024 0.536
16-IO w/o MF 0.512 0.012 0.440
8-shot SolverLearner 1 0 1
Zero-shot 0.726 0 1
8-IO w/ MF 0.774 0.060 1
8-IO w/o MF 0.75 0.583 1
GPT-4
16-IO w/ MF 0.798 0.179 1
16-IO w/o MF 0.738 0.583 1
8-shot SolverLearner 1 1 1
Table 9: Results over Claude3 for Arithmetic Task.
Base
8 9 10 11 16
Method
Zero-shot 0.710 0.185 0.996 0.334 0.868
8-IO w/ MF 0.783 0.385 0.995 0.473 0.913
8-IO w/o MF 0.269 0.083 0.659 0.105 0.752
8-shot SolverLearner 0 0 1 0.095 1
Word Order
OSV OVS SOV VOS VSO
Method
Zero-shot 1 1 1 1 0.988
8-IO w/ MF 1 1 1 1 1
8-IO w/o MF 1 0.976 1 1 1
8-shot SolverLearner 1 1 1 1 1
Coordinates
Default R90 R180 R270 S-NS S-WE Random
Method
Zero-shot 0.607 0.012 0.119 0.024 0.321 0.262 0.060
8-IO w/ MF 1 1 1 1 0.988 0.988 1
8-IO w/o MF 1 1 1 1 1 1 1
8-shot SolverLearner 1 1 1 1 1 1 1
Encryption System
Alphabetically Sorting Cipher Caesar Cipher Morse Cipher
Method
Zero-shot 0.560 0.024 0.988
8-IO w/ MF 0.607 0.167 1
8-IO w/o MF 0.214 0.048 1
8-shot SolverLearner 0.131 0.119 1
Table 13: Results over the arithmetic task with Python interpreter as executor vs. GPT-3.5 as executor
Base
8 9 10 11 16
Executor
Python Interpreter 1 1 1 1 1
GPT-3.5 0.398 0.196 0.934 0.152 0.64
Table 14: Results for the spatial reasoning over GPT-3.5 w.r.t. the number of few-shot examples
Coordinates
Default S-NS S-WE R90 R180 R270 Random
Shot
1 1 1 0 0 0 0 0
2 1 1 0 0 1 0 0
4 1 1 0 0 1 0 0
8 1 1 0 0 1 0 0
16 1 1 0 0 1 0 0