
Mail Valley Reasoning

Uploaded by

immanuelmukuyuni
Copyright
© All Rights Reserved
Mail Valley Reasoning

Attempter Specifications
Table of Contents
Project Introduction
Helpful Links
Task Steps Overview
Task Specifications
Step 1: Write a Prompt to Stump the Model
Step 2: Annotate the Initial Response
Step 3: Rewrite Incorrect Steps
Step 4: Assess Response Quality
Appendix
COMMON ERRORS
Prompt Examples
Domains Overview
Reasoning Types
Prompt Complexity Guide
Grading Rubric
Reviewer Instructions

Project Introduction
Welcome to Mail Valley Reasoning! This prompt-focused project is designed to improve AI in
complex reasoning tasks.
Your job is to create a complex prompt that causes the model to fail, rate the model response,
and rewrite the incorrect steps in the response, justifying why the steps were wrong.
REMEMBER!
The goal of this project is to get the Model to fail.
The more complex your prompts are, the better.
Tasks should be done in English.
If you are not comfortable tasking in English, please skip the task and notify a QM in Discourse.

Helpful Links
• Calculator.Net - Includes a number of useful calculators, including quadratic formula, LCM, GCF, prime factorization, permutations, combinations, triangles, volume, hex, and much more.

• GeoGebra - Similar to Desmos but with more functionality for geometry. We recommend trying it outside of tasking hours to get familiar with the tool, as there is a learning curve.

• WolframAlpha - A computational engine that answers questions and solves problems across various subjects, including math, science, and finance.

Reviewers, please navigate to the Reviewer Instructions.

Task Steps Overview


1. Write a Prompt to Stump the Model (See Task Specifications)

   • Write a complex prompt within the given Domain and Subdomain, Topic and Reasoning Type, or Topic and Subtopic
   • Your prompt should cause the model to fail in at least one of its reasoning, logic, or calculation steps
      • NOTE: If you do not feel comfortable with the Domain/Subdomain, Topic/Reasoning Type, or Topic/Subtopic that you’ve been assigned, please skip the task.

2. Annotate the Initial Response (See Task Specifications)

   • Rate each reasoning step within the model’s response according to various dimensions
   • Justify any incorrect steps you will need to rewrite with a rationale statement

3. Rewrite the Incorrect Steps (See Task Specifications)

   • Suggest a rewrite for any incorrect steps
   • Write a justification for why your rewrite is necessary
   • Provide the Final Answer

4. Assess Response Quality (See Task Specifications)

   • Indicate whether the response was accurate and followed the instructions given in the prompt

Please take a look at the COMMON ERRORS that we frequently see in this project.

Task Specifications
Step 1: Write a Prompt to Stump the Model


1. IDENTIFY DOMAIN & SUBDOMAIN

You will be given a Domain and Subdomain, Topic and Reasoning Type, or a Topic
and Subtopic.

• Examples of domains (subdomains): law (contract law, constitutional law), math (geometry, algebra, calculus, sequences/series)
• See the Domains Overview for the full list, and this official Domains Document for more in-depth guidance.
• See the Subtopic examples
• See Prompt Examples to understand Topic and Reasoning Type

IMPORTANT:
If you are uncomfortable with the Domain or Subdomain provided
PLEASE SKIP THE TASK.
2. WRITE A PROMPT

According to the Domain and Subdomain, Topic and Subtopic, or Topic and Reasoning Type provided, write a prompt that causes the model to make at least one reasoning or calculation error.
Please see the Prompt Complexity Guide in the Appendix.
Your Complex Prompt Should:

• Be specific and include all information necessary to solve the prompt
• Have only one objective and verifiable solution
   • Subjective answers are unacceptable
   • Be careful of:
      • Word prompts that are vague and can be interpreted in multiple ways
      • Math prompts that have multiple answers (e.g. predicting the next number in a sequence)
• Cause a reasoning or calculation error
   • THIS IS PARAMOUNT!
• Be written using LaTeX, if it requires solving an equation or formula

LaTeX Guidance

• Use an LLM to write your expression in LaTeX
• Refer to our LaTeX Style Guide and the Overleaf Style Guide for best practices
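
As a brief illustration of the LaTeX requirement, the snippet below shows how an expression might be written inside a prompt. It reuses the consumption function from the Economics example in this document. The inline `$...$` delimiters follow the prompt examples in the Appendix; the equilibrium condition on the second line is an assumption added only for illustration and is not part of the style guides referenced above.

```latex
The consumption function is $C = 321 + 0.40Y$, where $C$ is consumption and $Y$ is national income.
At equilibrium, $Y = C + I$, so $Y = \frac{321 + I}{1 - 0.40}$.
```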
Good Prompt Examples:

Domain & Subdomain: Logic/Algorithm Puzzles / Computational Thinking
Good Prompt Example: A sequence starts with 3. Each term in the sequence is twice the previous one plus 1 if the resulting number is prime. Otherwise, the resulting number is converted to a given base; the resulting number is treated as if it were written in base 10 after that. The base alternates between 8 and 7, starting with 8. What is the 10th number in the sequence?

Domain & Subdomain: Science / Psychology
Good Prompt Example: Kellyanne does not like the band Phish but all of her friends are excited to get tickets to their concert. She waits in line for hours with them and even starts telling herself she really does enjoy their music. Her sister knows that Kellyanne doesn't actually want to spend all her money on the concert and they fight. Kellyanne admits she has FOMO. What principles of persuasion has Kellyanne fallen subject to?

Domain & Subdomain: Social Sciences / Economics
Good Prompt Example: This economy has a consumption function given by C=321+0.40Y where C is consumption and Y is national income. Investment in the economy is fixed at $243 million.
a) Calculate the equilibrium level of national income
b) Imagine that investment increases by $53 million. Calculate the new equilibrium level of national income.
c) Find out the change in savings caused by this increased investment.
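
To show what "one objective and verifiable solution" looks like for the Economics prompt, here is a minimal sketch of the arithmetic. It assumes a simple closed economy with no government sector, so that equilibrium is Y = C + I, and it assumes the consumption function is expressed in the same units ($ million) as investment. The function names are hypothetical and chosen only for this illustration.

```python
def equilibrium_income(autonomous_consumption, mpc, investment):
    """Solve Y = C + I with C = a + b*Y, which gives Y = (a + I) / (1 - b)."""
    return (autonomous_consumption + investment) / (1 - mpc)


def savings(income, autonomous_consumption, mpc):
    """Savings are income minus consumption: S = Y - (a + b*Y)."""
    return income - (autonomous_consumption + mpc * income)


A, MPC = 321, 0.40          # from C = 321 + 0.40Y
y_initial = equilibrium_income(A, MPC, 243)        # part (a)
y_new = equilibrium_income(A, MPC, 243 + 53)       # part (b)
delta_s = savings(y_new, A, MPC) - savings(y_initial, A, MPC)  # part (c)

print(round(y_initial, 2), round(y_new, 2), round(delta_s, 2))
```

Under these assumptions the three parts come out to roughly 940, 1028.33, and 53 (in $ million); the change in savings equals the change in investment, which is a useful sanity check when reviewing a model's steps.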
Bad Prompt Examples

Bad Prompt Example: What is the area of a rectangle with a length of 5 units and a width of 3 units?
Why it’s Bad: This problem is too simple for the task requirements. It only requires a basic application of the area formula for rectangles (Area = length × width) and does not challenge the model's problem-solving abilities. It lacks complexity and is not suitable for evaluating the model’s capacity to handle more advanced tasks.

Bad Prompt Example: What is the next number in the sequence 1,3,5,7,9,…?
Why it’s Bad: This prompt is bad because it’s too simple.

Bad Prompt Example: Find yesterday’s high sale price for Google stock.
Why it’s Bad: This prompt is bad because it’s reliant on live data in order to be solved.

Bad Prompt Example: You want to inscribe a rectangle into a circle. What is the maximal area of such a rectangle? What are its dimensions?
Why it’s Bad: This prompt is bad because it is ambiguous and does not have an answer; it depends on the radius of the circle.
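
The short derivation below sketches why the inscribed-rectangle prompt has no single numeric answer. The symbols $r$ (circle radius) and $\theta$ (angle between the diagonal and one side) are introduced here for illustration and do not appear in the prompt.

```latex
% A rectangle inscribed in a circle of radius r has a diameter as its diagonal,
% so its sides can be written as 2r\cos\theta and 2r\sin\theta with 0 < \theta < \pi/2.
\[
A(\theta) = (2r\cos\theta)(2r\sin\theta) = 2r^{2}\sin 2\theta
\]
\[
A_{\max} = 2r^{2} \quad \text{at } \theta = \tfrac{\pi}{4}, \text{ i.e. a square with side } r\sqrt{2}.
\]
```

Because the result is expressed in terms of $r$, the prompt only becomes verifiable once a radius is stated.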

3. ASSIGN DIFFICULTY LEVEL

To assign a difficulty level to your prompt, answer the following questions:

Estimate (in minutes) the time it takes a human to solve the problem once the correct approach is identified. (Round up to the nearest multiple of 5):
______________________

Tag a difficulty level based on complexity and the likelihood of model success:

• Easy
   • A high school student can solve this without difficulty.
• Medium
   • An undergraduate student can solve this without help.
• Hard
   • An undergraduate student can solve this but might need to check their notes or look up a hint.
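
The rounding rule for the time estimate above can be sketched as follows. The function name is hypothetical, and the sketch assumes the estimate is a positive number of minutes.

```python
import math


def round_up_to_multiple_of_5(minutes):
    """Round a positive time estimate up to the nearest multiple of 5 minutes."""
    return math.ceil(minutes / 5) * 5


for estimate in (3, 5, 12, 21):
    print(estimate, "->", round_up_to_multiple_of_5(estimate))
```

An estimate that is already a multiple of 5 is left unchanged; anything else moves up to the next multiple.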
4. MATH SPECIFIC GUIDANCE (for math tasks only)

Your Complex Math Prompt Should:

• Lead to an error in reasoning/calculation in at least one of the steps in the model’s initial response
• Use proper LaTeX formatting for all mathematical expressions
• Be solvable
• Have only one correct solution
• Be sufficiently complex, but not contrived

Your Complex Math Prompt Should NOT:

• Be missing any of the information necessary to solve it
• Include impossible scenarios, such as dividing by zero
• Contain any terms, concepts, theorems, or lemmas that do not adhere to mathematical rules
• Have many potential solutions
• Contain any spelling, grammar, or formatting errors

Step 2: Annotate the Initial Response

1. IDENTIFY INCORRECT REASONING or CALCULATION STEPS

The AI Model’s response will be separated into steps for easy analysis.
Evaluate each step as either:

• Correct - when the step has no math or logic issues
• Incorrect Calculation - when the step includes a computational mistake. For example:
   • We next simplify to obtain 2-(3+4) = 2-3+4 = 3
   • If the monthly interest rate is r = 3% then after 5 months your initial balance grows by a factor of (1+0.3)^5 = 3.71293
   • If Mary eats half the pizza and Jose eats one third of the pizza then together they ate ½+⅓ = ⅖ of the pizza.
• Incorrect Reasoning - when the step includes a logical reasoning mistake. For example:
   • In the race we know Bob comes in either 1st or 2nd and John comes in 2nd or 3rd and Abby comes in 3rd then we conclude that Bob comes in 2nd.
   • As ice cream sales and drowning rates both increase in the summer, we conclude that ice cream causes drowning.
   • We know that x < y and y > z so it follows that x < z.
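
The faulty example steps above can be checked mechanically. The sketch below recomputes the three calculation examples and tests two of the reasoning examples; the specific values chosen for x, y, and z are an illustrative assumption (any counterexample would do), and the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Calculation examples: the correct values for the three faulty steps.
simplified = 2 - (3 + 4)                  # the minus sign applies to the whole bracket
growth = (1 + 0.03) ** 5                  # 3% is 0.03, not 0.3
pizza = Fraction(1, 2) + Fraction(1, 3)   # add over the common denominator 6
print(simplified, round(growth, 5), pizza)

# Reasoning example (race): list every finishing order consistent with the facts.
runners = ("Bob", "John", "Abby")
valid_orders = [
    order for order in permutations(runners)
    if order.index("Bob") in (0, 1)       # Bob is 1st or 2nd
    and order.index("John") in (1, 2)     # John is 2nd or 3rd
    and order.index("Abby") == 2          # Abby is 3rd
]
print(valid_orders)

# Reasoning example (inequalities): a single counterexample refutes the claim.
x, y, z = 5, 10, 1
print(x < y and y > z, x < z)
```

The only consistent race order puts Bob 1st and John 2nd, and the counterexample satisfies both premises while x < z is false, so neither conclusion in the faulty steps follows.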


There may be times when a step contains both types of errors: reasoning and calculation.

• In this case, select the most appropriate error.

In many cases, a reasoning error is more significant, as it takes the model down the wrong line of reasoning:
   We know 12 students are in Science and 8 students are in Art then there are 12 - 8 = 3 students taking only Science.
Here the subtraction is wrong (12 - 8 = 4), but the more significant problem is the unstated assumption that every Art student also takes Science.
The model may add preamble or summary steps which seem superfluous, but do not mark them incorrect unless they contain an error.
2. RETURN TO PROMPT (if applicable)

REMEMBER:
Your prompt should be complex enough to cause the model to fail in at least one (1) reasoning or calculation step.
If this did not happen, you must return to your prompt and make it more complex.
Go back to the prompt writing box and click on the “retry chat from here” button.
Refer to our Prompt Complexity Guide.
3. MATH SPECIFIC GUIDANCE

• Only mark steps incorrect for logical or mathematical errors, focusing on the accuracy of each step

Do not mark a step incorrect if

• It contains LaTeX errors or badly formatted solutions
• The solution is inefficient, as long as it’s correct
• It contains written preambles or summaries
• It’s irrelevant to solving the problem, but still correct

Step 3: Rewrite Incorrect Steps

1. REWRITING INCORRECT STEPS

In this part of the task, rewrite any steps that were rated as Incorrect.
Your Rewrite Should

• Either modify the original step or be a completely new step, depending on the degree of correctness
• Address any logical errors in the overall problem approach
• Address any errors in execution
• Maintain the same level of detail as other steps in the response
• Be written clearly, using simple language that can be understood by a high school student
• Include only the information required to solve the problem
• Ensure that the step is self-contained and understandable on its own
• DO NOT introduce any new errors (this is a common mistake)

Next, write a brief but detailed justification for why your rewrite was necessary.
2. PROVIDE RATIONALE

In this part of the task, provide a rationale for any steps that were rated as Incorrect.
Your Rationale Should

• Clearly explain why the step is incorrect
• Use proper spelling and grammar
• Not contain the words “model”, “AI”, “LLM”, etc.
• Not be written in LaTeX

Answer the following question:

• Is this step the final step, and does it contain the solution to the prompt?
   • Yes
   • No
• If yes, the model will stop generating subsequent steps
• If no, the model will generate subsequent steps to solve it
3. PROVIDE FINAL ANSWER

In this part you should supply the correct answer to your prompt.

• Your answer should be in the simplest form possible.
• Examples:

Prompt:
Find the sum of the numbers from 1 to 10.
The final answer should simply be: 55
NOT
“My answer is 55”
or
“The answer to the prompt is 55”

Prompt:
Alex ran a race at a 5 mph pace, Edward ran the race at a 4 mph pace, and Sarah ran the race at a 7 mph pace. Who came in first, second, and third?
The final answer should simply be:
Sarah was first, Alex was second, and Edward was third.
NOT
“By comparing their paces we find that Sarah was first…”
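
Before entering a final answer, it can help to verify it independently. The sketch below checks the two example answers above; it assumes all three runners cover the same distance, so a higher speed means an earlier finish. The variable names are hypothetical.

```python
# Prompt 1: sum of the numbers from 1 to 10.
total = sum(range(1, 11))

# Prompt 2: rank runners by pace (same distance assumed, so the fastest
# runner in mph finishes first).
paces_mph = {"Alex": 5, "Edward": 4, "Sarah": 7}
ranking = sorted(paces_mph, key=paces_mph.get, reverse=True)

print(total)
print(ranking)
```

The checked values (55, and the order Sarah, Alex, Edward) are what belongs in the Final Answer field, without any surrounding explanation.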
Step 4: Assess Response Quality

1. LABEL THE RESPONSE

In this final step of the task, label the response according to the questions below.
Accuracy

• Is the first model response entirely correct?
   • Yes
   • No

Instruction Following

• Does the model follow instructions and understand the specifications of the prompt? (Note: the model can be inaccurate and still follow instructions.)
   • Yes
   • No

DEFINITIONS

• Accuracy
   • Did the model make mistakes either in its calculation or reasoning in the first Turn? If it did, the response is not entirely correct, so answer No to the Accuracy question.
• Instruction Following
   • Thinking of the model as a child, did the model attempt to solve your prompt even if it made mistakes along the way?
Appendix
COMMON ERRORS
COMMON ERRORS in Mail Valley and how to avoid them!
Error: Ambiguity and Open-Ended Prompts
Explanation & Examples: Many prompts lack specificity or clarity, allowing for multiple interpretations, leading to confusion. Some prompts have erroneous steps that impact overall task validity. Others are too basic or common, reducing their effectiveness in challenging the model. Includes unclear phrasing and prompts with multiple possible answers, which confuse the model and reviewers.
How to avoid:
• Use specific, unambiguous language; avoid words or phrases with multiple interpretations to prevent subjective answers.
• Ensure prompts lead to a single, correct solution; avoid open-ended or ambiguous prompts.
• Make prompts challenging yet realistic and feasible, avoiding unnecessary complexity or convoluted scenarios.
• Examples of Clear vs. Vague Prompts:
   • Clear: “Calculate the equilibrium income when investment rises by $53 million, given the initial equation C = 321 + 0.40Y.”
   • Vague: “Explain how income changes when investment goes up.”

Error: Incorrect Assumptions and Inaccurate Domain Matching
Explanation & Examples: Several prompts misrepresent the reasoning type, subtopic or domain, like using history questions as math problems or mixing up reasoning categories (e.g., deductive vs. analytical).
How to avoid:
• Confirm that the prompt requires domain-specific knowledge and adheres to the reasoning type/subtopic provided.
• Ask yourself the following:
   • "Does the prompt require domain-specific knowledge (e.g., mathematical steps in a math domain)?"
   • "Does it match the reasoning type/subtopic assigned (e.g., logical reasoning vs. statistical)?"
• If not, rework it or skip.

Error: Poor Grammar, Syntax, and Typos
Explanation & Examples: We often see grammatical errors, poor phrasing, and even typos in both prompts and rewrites, which sometimes result in inaccurate information or misinterpretations.
How to avoid:
• Review each prompt carefully before submission, ensuring it’s free of grammar, spelling, and syntax errors.
• Use grammar checks to catch simple mistakes.

Error: Incorrect Step Labeling and Poor Rewrite Placement
Explanation & Examples: Steps marked as incorrect are sometimes correct, while others marked correct actually contain errors. Additionally, rewrites are occasionally placed in the justification box instead of in the designated rewrite section, causing confusion.
How to avoid:
• Verify that all "incorrect" labels are applied only to genuinely incorrect steps.
• Rewrites should always go in the designated section, not in the justification box, to maintain clarity.

Error: Not understanding Accuracy vs. Instruction Following
Explanation & Examples: Attempters often conflate "accuracy" with "instruction following," leading to partial responses marked as correct.
How to avoid:
• Evaluate all aspects of the model’s response, including calculations, logic, and relevance, to confirm accuracy.
• Check if the model reasonably followed prompt instructions even if it made mistakes, distinguishing between attempts and outright failures.

Error: Vague Justifications
Explanation & Examples: Justifications often lack sufficient detail, sometimes failing to address the errors comprehensively or provide a clear rationale for corrections.
How to avoid:
• Provide concise but comprehensive justifications, focusing specifically on the nature of the error (e.g., logical or computational) and how the rewrite corrects it. Brief examples could clarify this.
   • Example Justification: “This step contains a logical error because it assumes that a base change occurs only when the result is prime, which was not stated in the prompt. The rewrite corrects this by clarifying the conditions.”

Error: Unclear Rewrites
Explanation & Examples: Unclear rewrites can lead to misunderstandings or new errors, affecting task accuracy.
How to avoid:
• Ensure rewrites are clear, self-contained, and stand alone as correct without needing additional context.

Error: Missed Calculation Errors
Explanation & Examples: Missing calculation errors can result in false conclusions or a flawed final answer.
How to avoid:
• Prioritize errors in steps where multiple issues may occur, giving precedence to logical errors over minor computational mistakes.
• Don’t mark redundant steps as incorrect if they’re accurate. Only mark as an error if the step misleads or contradicts the solution.

Error: Misunderstanding Calculation vs. Reasoning Errors
Explanation & Examples: Some Attempters don’t correctly distinguish between calculation and reasoning errors.
How to avoid:
• Make sure to differentiate Error Types:
   • Calculation Error: Mistakes in mathematical computation.
      • Example: Incorrect simplification, like “2 - (3 + 4) = 2 - 3 + 4 = 3” (should be -5).
   • Reasoning Error: Errors in logic that lead to inaccurate conclusions.
      • Example: Misinterpreting problem constraints, e.g., “Assuming Bob could only be 1st or 2nd place, but concluding he came in 3rd.”
   • If a step contains both errors, prioritize reasoning errors for correction, as these impact the model’s problem-solving approach more broadly.

Prompt Examples
(Do not copy. These are just to give you some inspiration.)
Topic & Reasoning Type: Math / Logical Reasoning
Prompt: Assume you are an explorer. You are currently at point $A$ and you are trying to determine the distance between two landmarks, which we denote $B$ and $C$. You know that the angle $\angle BAC = 80\degree$ and $\angle BCA = 30\degree$. From your current position $A$, you move 30 meters in the direction of $B$, and from there you determine that you are now 60 meters away from $B$. What is the distance between the landmarks $B$ and $C$?

Topic & Reasoning Type: Social Studies / Temporal Reasoning
Prompt: Estimate the year in which the population of Nisos will be approximately 6,000.

Topic & Reasoning Type: Law / Analytical Thinking
Prompt: Interpret the statutory law regarding public protests in the context of a recent event.

Topic & Reasoning Type: Biology / Causal Reasoning
Prompt: Given that cellular respiration involves the conversion of glucose into ATP, what would happen to the energy production in a cell if the supply of glucose was significantly reduced?

Topic & Reasoning Type: Math / Common Sense Reasoning
Prompt: A boat has been moving for 17 hours. For the first 30 minutes, it moved at a speed of 35 knots. Then, it sped up to 85 knots for 5 hours, eventually reducing its speed to 60 knots due to choppy waters, which lasted for 7 hours. Finally, as it neared its destination, it slowed to 25 knots, maintaining that speed until it docked at port. What was the average speed in knots of the journey?

Topic & Reasoning Type: Math / Logical Reasoning
Prompt: What is the smallest root of the polynomial $x^5+3x^4-23x^3-51x^2+94x+120$?
Analytical Thinking: Use integration by parts to find the integral of $f(x) = x^2 * e^x$.

Topic & Reasoning Type: Math / Deductive Reasoning
Prompt: Jack is 4 years younger than their sister Lola and their mom Su is 12 years younger than quadruple their age. In two years, Jack and their sister's ages will sum to ten years less than their mom's current age. How old is Lola?

Topic & Reasoning Type: Math / Inductive Reasoning
Prompt: What is the next number in the sequence: 3,7,15,31,63,_?

Topic & Reasoning Type: Math / Comparative Analysis
Prompt: I live in Austin, Texas and my friend lives in Tulsa, Oklahoma. We both plan to drive to Santa Fe, New Mexico to meet up. We will leave our houses at the same time. I will be driving 75 mph for all of the 686 mi but will have to stop for gas three times. If my friend drives at 65 mph for all of the 641 mi but only has to stop for gas twice, who will get there first?

Topic & Reasoning Type: Math / Causal Reasoning
Prompt: If I take a number less than 10, quadruple it, divide it by 2, and multiply it by 3 with a result of ten tripled, was the original number prime?

Topic & Reasoning Type: Math / Pattern Recognition
Prompt: A light flashes purple twice, then green once, then blue thrice. This repeats three times, and then the light flashes purple twice more. What color will flash next?

Topic & Reasoning Type: Math / Statistical Reasoning
Prompt: A company manufactures washing machines, each of which has a 3% chance of being defective. If 300 washing machines are sold, what is the probability that exactly 5 will be defective?

Topic & Reasoning Type: Games / Temporal Reasoning & Abstract Thinking
Prompt: An adventurer sets out to slay a dragon. The dragon is very magical and so they must make their weapon magical as well in order to have a chance against the majestic beast. They first travel to Neitherfield and gather the poppies that grow under the blue oaks, then they journey to Everdeath and collect moss from the haunted walking redwoods. The journey from Neitherfield to Everdeath takes four days, and the moss collection takes a day. The poppies last only three days, but their efficacy can be doubled by taking a half-day detour into Valiant and spending another half day collecting the sweat from a giant pure black horse there. Preparing the sword-warding potion takes a quarter of a day. Even with this detour, will the poppies last long enough for them to succeed in their quest to magically ward their sword?

Topic & Reasoning Type: Puzzles / Deductive Reasoning
Prompt: Jack is 4 years younger than their sister Lola and their mom Su is 12 years younger than quadruple their age. In two years, Jack and their sister's ages will sum to ten years less than their mom's current age. How old is Lola?

Topic & Reasoning Type: Riddles
Prompt: Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option.

1. 5
2. 11
3. 0
4. 20
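
When using prompts like these, it helps to have an independent check of the expected answer before rating a model's steps. The sketch below does this for three of the math prompts above. It assumes the boat's final 25-knot leg fills the time left in the 17 hours, that washing-machine defects are independent (a binomial model), and that the polynomial's roots can be found among the integer divisors of its constant term. All variable and function names are hypothetical.

```python
import math

# Boat prompt: the final leg fills whatever remains of the 17 hours.
legs = [(0.5, 35), (5, 85), (7, 60)]             # (hours, knots)
total_hours = 17
remaining_hours = total_hours - sum(h for h, _ in legs)
legs.append((remaining_hours, 25))
distance = sum(h * v for h, v in legs)           # nautical miles
average_speed = distance / total_hours

# Washing-machine prompt: binomial probability of exactly 5 defects in 300.
n, k, p = 300, 5, 0.03
prob_exactly_5 = math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Polynomial prompt: test integer candidates that divide the constant term.
def poly(x):
    return x**5 + 3 * x**4 - 23 * x**3 - 51 * x**2 + 94 * x + 120


integer_roots = [
    c for c in range(-120, 121)
    if c != 0 and 120 % c == 0 and poly(c) == 0
]

print(round(average_speed, 2))
print(round(prob_exactly_5, 4))
print(integer_roots, min(integer_roots))
```

The polynomial has degree 5 and the search finds five integer roots, so under that check the list is complete and its minimum is the smallest root.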

Domains Overview

Each Domain is listed below with its Sub-domains and their Definitions.

Domain: Education
• General Education: Focuses on general educational strategies and problem-solving in learning environments.
• K-12: Covers education from kindergarten to 12th grade, including instructional strategies and student development.
• Tutoring: Involves individualized instruction to enhance student understanding in various subjects.

Domain: Finance
• Financial Analysis: This subdomain involves evaluating financial performance, usually through financial metrics such as ROI (Return on Investment) and Residual Income.
• Financial Reasoning: Involves using financial principles to assess investment opportunities and make decisions based on metrics like Net Present Value (NPV).

Domain: Law
• Contract Law: Focuses on legally binding agreements and their enforcement.
• Constitutional Law: Addresses the interpretation and application of constitutional principles, including individual rights and government powers.

Domain: Mathematics
• Optimization: Focuses on maximizing or minimizing functions to find optimal solutions within constraints.
• Problem Decomposition: Involves breaking down complex problems into smaller, more manageable parts to simplify the solution process.

Domain: Science
• Biology: Focuses on understanding life forms, genetics, and ecological systems.
• Physics: Explores the fundamental forces and properties of matter, including the study of motion, energy, and the structure of the universe.
• Psychology: Involves the study of the mind and behavior, exploring how people think, feel, and act.

Domain: Scientific Method
• Experimental Design: Focuses on planning and executing experiments to accurately test hypotheses.

Domain: Social Sciences
• Economics: Focuses on resource allocation and decision-making in various economic systems.

Domain: Technology
• Cyber Security: Deals with defending against threats like hacking, malware, and phishing, while ensuring data protection.
• General Technology: Explores the application of systems and tools to solve problems and enhance human capabilities.

Domain: Logical Reasoning
• Abductive Reasoning: Involves identifying the most likely explanation for an observation based on limited evidence.
• Analogical Reasoning: Involves comparing relationships between two pairs of concepts to infer similar relationships.
• Deductive Reasoning: Involves drawing specific conclusions from general principles.
• Inductive Reasoning: Involves generalizing conclusions from specific observations or patterns.

Domain: Information Processing
• Abstract Thinking: Involves grasping underlying principles and frameworks beyond concrete details.
• Pattern Recognition: Focuses on identifying regularities or trends within data.
• Conceptual Mapping: Involves understanding relationships between entities in a structured manner.

Domain: Commonsense Reasoning
• Reasoning about Physical World: Focuses on understanding physical interactions, such as predicting how objects will behave in specific conditions.
• Social Reasoning: Involves understanding human behavior and social dynamics.
• Session-Level and Multi-Session Dialogue Coherence: Ensures logical consistency across single and multiple conversations.

Domain: Critical Thinking
• Comparative Analysis: Involves evaluating various options based on multiple criteria.
• Questioning Assumptions: Focuses on examining underlying beliefs or premises that are taken for granted.
• Reflective Thinking: Involves introspecting and evaluating one's own thoughts and beliefs.

Domain: Problem-Solving
• Decision-Making: Focuses on creating scenarios where individuals assess various options and determine the most effective solution.
• Analytical Problem-Solving: Involves using logical reasoning and calculations to solve complex problems.

Domain: Causal and Counterfactual Reasoning
• Thought Experiments: Used to assess hypothetical situations and their potential outcomes.
• Counterfactual Reasoning: Involves exploring what would have happened if a prior event had been different.

Domain: Communication
• Pedagogical Reasoning: Involves selecting instructional strategies that correct misconceptions and improve understanding.

Domain: Logic/Algorithm Puzzles
• Sequential Reasoning: Involves solving puzzles that require predicting the order of events based on rules or patterns.
• Cryptography Puzzles: Focuses on deciphering coded messages or creating encryption methods.

Domain: Competition
• Direct Competition: Occurs when businesses compete directly for the same customers.
• Indirect Competition: Involves entities offering different products that satisfy the same customer need.

Domain: Strategy
• Abstract Strategy: Involves games requiring players to think several steps ahead to secure victory.
• Tactical Strategy: Focuses on optimizing resources and planning actions to achieve the best outcome.
Subtopics

Please view this Google sheet


Reasoning Types

Reasoning Types Defined

Common Sense
Create prompts that require everyday reasoning and basic understanding of how the world works. Ask about object interactions, time, space, and cause-and-effect in practical scenarios.

Logical Reasoning
Write prompts that encourage step-by-step problem-solving, where the Model must analyze information, spot relationships, and make reasoned conclusions. Use logic puzzles or scenarios with clear, rational solutions.

Analytical Thinking
Develop prompts that require breaking down complex information into parts. Ask the Model to examine details, compare elements, or assess situations critically to reach a solution.

Deductive Reasoning
Write prompts where the Model must apply general principles or rules to specific cases to reach a conclusion. Start with a broad statement or premise and ask how it applies to particular scenarios.

Inductive Reasoning
Craft prompts that encourage the Model to make generalizations based on specific examples. Your prompt should ask the Model to infer patterns, trends, or rules from a series of observations or data points.

Comparative Analysis
Create prompts that ask the Model to compare two or more entities, such as data sets, ideas, or scenarios. Request the Model to highlight similarities, differences, and make judgments based on their analysis.

Causal Reasoning
Write prompts that involve identifying causes and predicting effects. Ask the Model to explore how one event or action might lead to specific outcomes, focusing on the relationships between actions and results.

Pattern Recognition
Develop prompts that involve identifying recurring themes or trends within a certain scope or referenced document. Ask the Model to spot patterns in information, scenarios, or data that can lead to predictions or insights.

Statistical Reasoning
Write a prompt that requests the Model to make decisions or predictions based on data or probability. Ask the Model to analyze data sets or probability scenarios, making inferences or identifying trends from statistical information.

Temporal Reasoning & Abstract Thinking
Create prompts that involve sequences of events or abstract concepts. Ask the Model to reason about how time affects outcomes. Request answers related to abstract theories or hypothetical situations.
Prompt Complexity Guide

Complexity Guidelines

• Ground your prompts in your personal knowledge of your topic.
• Ideally, your prompt should reflect real-world scenarios that are highly complex and creative.
• In Reasoning Tasks, prompt complexity is based on logical difficulty, rather than constraints (below).
• However, as long as these constraints are “woven” into the actual reasoning prompt, and are not inessential (or asking the model to do busy-work), feel free to use them to add more creativity to your prompts.

BASIC COMPLEXITY

• As a general rule of thumb, basic complexity prompts should take less than 5 minutes to figure out.

EXAMPLE:
Topic: Biology
Reasoning Type: Causal Reasoning

• If smoking causes lung cancer, why did the lung cancer rates go up when the smoking rates went down?

INTERMEDIATE COMPLEXITY

• As a general rule of thumb, intermediate complexity prompts should take more than 5 minutes to figure out.

EXAMPLE:
Topic: Science
Reasoning Type: Logical Reasoning

• Lamarck's theory of evolution says that organisms can acquire traits in their lifetime and pass them onto their offspring. Darwin's theory of evolution says that individuals are born with different traits on a mostly random basis; over time, the traits that enable more successful reproduction end up being more common in the gene pool. If you were an evolutionary biologist, how would you set up an experiment to figure out who was right? Write the proposed experiment for a team of PhD candidates with extensive background knowledge and practice in this area of study.

Grading Rubric

Each Dimension is scored as 1-2 (Fail), 3 (Okay), or 4-5 (Good/Perfect).

Prompt Rubric
How your prompt will be rated

Dimension: Prompt Requirements
• 1-2 (Fail): Major requirements missing. The prompt does not lead to reasoning or calculation errors or doesn't match the domain or subdomain.
• 3 (Okay): Meets requirements but has minor issues. The prompt leads to one reasoning or calculation error.
• 4-5 (Good/Perfect): All requirements met. The prompt matches the domain and subdomain and leads to a reasoning or calculation error.

Dimension: Prompt Clarity & Specificity
• 1-2 (Fail): Major clarity issues; prompt is vague or hard to follow; key details missing.
• 3 (Okay): Mostly clear, but could be interpreted multiple ways or lacks a minor detail.
• 4-5 (Good/Perfect): Clear and specific; all necessary information is included.

Rating Rubric
How your ratings of the Initial Response will be evaluated

Dimension: Correctness of Individual Steps
• 1-2 (Fail): Attempter incorrectly identifies at least one step, and justification is inadequate.
• 3 (Okay): Some steps are correctly identified, but there is minor misjudgment.
• 4-5 (Good/Perfect): All steps are correctly identified.

Dimension: Initial Response Labels
• 1-2 (Fail): Major issue with labeling the response (e.g., incorrect labels for steps).
• 3 (Okay): N/A
• 4-5 (Good/Perfect): All labels are correctly selected; the response is accurately labeled.

Justification Rubric
How reviewers will grade your written ratings justifications

Dimension: Analysis
• 1-2 (Fail): Justification is generic or verdict is skewed.
• 3 (Okay): N/A
• 4-5 (Good/Perfect): Justification is specific and balanced; verdict is reasonable.

Dimension: Support Claims
• 1-2 (Fail): Claims contradict the verdict or are inaccurate; 2+ claims lack evidence.
• 3 (Okay): 1 claim lacks evidence, but the rest are accurate and well-supported.
• 4-5 (Good/Perfect): All claims logically defend the verdict, are accurate, and supported by evidence.

Dimension: Accuracy
• 1-2 (Fail): 1 or more pieces of evidence are inaccurate or misconstrued.
• 3 (Okay): 1 piece of evidence is misconstrued.
• 4-5 (Good/Perfect): All evidence is accurate and not misconstrued.

Rewrite Rubric
How reviewers will grade your rewritten response steps

Dimension: Accuracy
• 1-2 (Fail): 1 or more major factual errors or misleading points.
• 3 (Okay): 2 or more minor factual errors or misleading points.
• 4-5 (Good/Perfect): No major factual errors or misleading points.

Dimension: Rewritten Steps Quality
• 1-2 (Fail): Clearly worse than the model response. Steps should be rewritten but aren't.
• 3 (Okay): About the same quality as the model response.
• 4-5 (Good/Perfect): Step clearly performs better than the model response; rewritten to the correct degree.

Overall Task Quality
How reviewers will grade the Overall Quality of your task

Dimension: Original Work
• 1-2 (Fail): Content contains evidence of chatbot usage or plagiarism.
• 3 (Okay): N/A
• 4-5 (Good/Perfect): No chatbot usage or plagiarism; content is original.

Dimension: Harmful Content
• 1-2 (Fail): Contains harmful content.
• 3 (Okay): N/A
• 4-5 (Good/Perfect): No harmful content is present.

Dimension: Spelling/Grammar/Formatting
• 1-2 (Fail): 4+ minor spelling/grammar/punctuation errors, or major formatting issues.
• 3 (Okay): 1-3 minor errors or 1 egregious spelling error.
• 4-5 (Good/Perfect): No discernible errors; clean formatting.

Dimension: Clarity & Structure
• 1-2 (Fail): Content is hard to follow or unclear.
• 3 (Okay): Minor clarity issues; generally makes sense.
• 4-5 (Good/Perfect): Clear and easy to follow.

Dimension: Repetitiveness/Relevance
• 1-2 (Fail): Contains 3+ repetitive or irrelevant sentences.
• 3 (Okay): 1-2 repetitive or irrelevant sentences.
• 4-5 (Good/Perfect): No unnecessary repetition or irrelevant content.
Reviewers, please navigate to the Reviewer Instructions.
