Mail Valley Reasoning
Attempter Specifications
Table of Contents
Project Introduction
Helpful Links
Task Steps Overview
Task Specifications
Step 1: Write a Prompt to Stump the Model
Step 2: Annotate the Initial Response
Step 3: Rewrite Incorrect Steps
Step 4: Assess Response Quality
Appendix
COMMON ERRORS
Prompt Examples
Domains Overview
Reasoning Types
Prompt Complexity Guide
Grading Rubric
Reviewer Instructions
Project Introduction
Welcome to Mail Valley Reasoning! This prompt-focused project is designed to improve AI
performance on complex reasoning tasks.
Your job is to create a complex prompt that causes the model to fail, rate the model response,
and rewrite the incorrect steps in the response, justifying why the steps were wrong.
REMEMBER!
The goal of this project is to get the Model to fail.
The more complex your prompts are, the better.
Tasks should be done in English.
If you are not comfortable tasking in English,
please skip the task and notify a QM in
Discourse.
Helpful Links
Calculator.Net - Includes a number of useful calculators including quadratic formula,
LCM, GCF, prime factorization, permutations, combinations, triangles, volume, hex, and
much more.
GeoGebra - Similar to Desmos but with more functionality for geometry. We recommend trying it
outside of tasking hours to get familiar with the tool, as there is a learning curve.
Task Steps Overview
Write a complex prompt within the given Domain and Subdomain, Topic and Reasoning Type, or Topic and Subtopic.
Your prompt should cause the model to fail in at least one of its reasoning, logic, or calculation steps.
Rate each reasoning step within the model’s response according to various dimensions.
Justify any incorrect steps you will need to rewrite with a rationale statement.
Indicate whether the response was accurate and followed the instructions given in the prompt.
Please take a look at the COMMON ERRORS that we frequently see in this project.
Task Specifications
Step 1: Write a Prompt to Stump the Model
You will be given a Domain and Subdomain, Topic and Reasoning Type, or a Topic
and Subtopic.
See the Domains Overview for the full list, and this official Domains
Document for more in-depth guidance.
See the Subtopic examples
See Prompt Examples to understand Topic and Reasoning Type
IMPORTANT:
If you are uncomfortable with the Domain or Subdomain provided
PLEASE SKIP THE TASK.
2. WRITE A PROMPT
According to the Domain and Subdomain, Topic and Subtopic, or Topic and Reasoning
Type provided, write a prompt that causes the model to make at least one reasoning
or calculation error.
Please see the Prompt Complexity Guide in the Appendix.
Your Complex Prompt Should:
THIS IS PARAMOUNT!
LaTeX Guidance
Prompt: What is the area of a rectangle with a length of 5 units and a width of 3 units?
This problem is too simple for the task requirements. It only requires a basic application of the area formula for rectangles ($\text{Area} = \text{length} \times \text{width}$) and does not challenge the model's problem-solving abilities. It lacks complexity and is not suitable for evaluating the model’s capacity to handle more advanced tasks.
Prompt: What is the next number in the sequence 1,3,5,7,9,…?
This prompt is bad because it’s too simple.
Prompt: Find yesterday’s high sale price for Google stock.
This prompt is bad because it’s reliant on live data in order to be solved.
Tag a difficulty level based on complexity and the likelihood of model success:
Easy
Medium
Hard
An undergraduate student can solve this but might need to check their notes
or look up a hint.
4. MATH SPECIFIC GUIDANCE (for math tasks only)
The AI Model’s response will be separated into steps for easy analysis.
Evaluate each step as either:
If Mary eats half the pizza and Jose eats one third of the pizza then
together they ate ½+⅓ = ⅖ of the pizza.
Incorrect Reasoning - when the step includes a logical reasoning
mistake. For example:
In the race we know Bob comes in either 1st or 2nd and John comes
in 2nd or 3rd and Abby comes in 3rd then we conclude that Bob
comes in 2nd.
As ice cream sales and drowning rates both increase in the summer,
we conclude that ice cream causes drowning.
In many cases, a reasoning error is more significant, as it takes the model down
the wrong line of reasoning:
We know 12 students are in Science and 8 students are in Art then
there are 12 - 8 = 3 students taking only Science.
The model may add preamble or summary steps which seem superfluous, but do
not mark them incorrect unless they contain an error.
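The calculation claims in the example steps above can be rechecked mechanically before a step is labeled. The following is only a sketch, assuming Python is available; `check_step` and the variable names are hypothetical helpers for illustration, not part of any project tooling.

```python
from fractions import Fraction


def check_step(claimed, recomputed):
    """Return True when the value claimed in a step matches the recomputed value."""
    return Fraction(claimed) == Fraction(recomputed)


# Pizza example: the step claims 1/2 + 1/3 = 2/5.
pizza_total = Fraction(1, 2) + Fraction(1, 3)  # exact arithmetic gives 5/6
pizza_step_ok = check_step(Fraction(2, 5), pizza_total)

# Science/Art example: the step claims 12 - 8 = 3.
difference = 12 - 8  # the subtraction itself gives 4
science_step_ok = check_step(3, difference)

print(pizza_total, pizza_step_ok)
print(difference, science_step_ok)
```

A check like this only catches the arithmetic slip. In the Science/Art step, even the correct difference would not justify "only Science" without knowing how many students take both subjects, so the reasoning still has to be judged by hand.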
2. RETURN TO PROMPT (if applicable)
REMEMBER:
Your prompt should be complex enough
to cause the model to fail
in at least one (1) reasoning or calculation step.
If this did not happen, you must return to your
prompt and make it more complex.
Go back to the prompt writing box and click on the “retry chat from here” button.
Refer to our Prompt Complexity Guide.
3. MATH SPECIFIC GUIDANCE
In this part of the task, rewrite any steps that were rated as Incorrect.
Your Rewrite Should
Next, write a brief but detailed justification for why your rewrite was necessary.
2. PROVIDE RATIONALE
In this part of the task, provide a rationale for any steps that were rated as Incorrect.
Your Rationale Should
Is this step the final step, and does it contain the solution to the prompt?
Yes
No
In this part you should supply the correct answer to your prompt.
Examples:
Prompt:
Find the sum of the numbers from 1 to 10.
The final answer should simply be: 55
NOT
“My answer is 55”
or
“The answer to the prompt is 55”
Prompt:
Alex ran a race at a 5mph pace, Edward ran the race at a 4mph pace,
and Sarah ran the race at a 7mph pace. Who came in first, second
and third?
The final answer should simply be:
Sarah was first, Alex was second, and Edward was third.
NOT
“By comparing their paces we find that Sarah was first…”
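Before supplying a final answer, it can help to recompute it independently. The sketch below rechecks the two example prompts above, assuming Python and assuming all three runners cover the same distance; the variable names are illustrative only.

```python
# Sum of the numbers from 1 to 10.
total = sum(range(1, 11))

# Race example: over the same distance, a higher pace in mph finishes earlier.
paces_mph = {"Alex": 5, "Edward": 4, "Sarah": 7}
finish_order = sorted(paces_mph, key=paces_mph.get, reverse=True)

print(total)
print(finish_order)
```

The recomputed values are only a cross-check; the final answer box should still contain the bare answer, as shown in the examples.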
Step 4: Assess Response Quality
In this final step of the task, you will encounter the following:
Label your response according to the questions below.
Accuracy
Yes
No
Instruction Following
Yes
No
DEFINITIONS
Accuracy
Instruction Following
Thinking of the model as a child, did the model attempt to solve your
prompt even if it made mistakes along the way?
Appendix
COMMON ERRORS
COMMON ERRORS in Mail Valley and how to avoid them!
Error Explanation & How to avoid
Examples
result in inaccurate
information or
misinterpretations.
Error: Incorrect Step Labeling and Poor Rewrite Placement
Explanation: Steps marked as incorrect are sometimes correct, while others marked correct actually contain errors. Additionally, rewrites are occasionally placed in the justification box instead of in the designated rewrite section, causing confusion.
How to avoid: Verify that all "incorrect" labels are applied only to genuinely incorrect steps. Rewrites should always go in the designated section, not in the justification box, to maintain clarity.
Error: Unclear Rewrites
Explanation: Unclear rewrites can lead to misunderstandings or new errors, affecting task accuracy.
How to avoid: Ensure rewrites are clear, self-contained, and stand alone as correct without needing additional context.
Example: Incorrect simplification, like “2 - (3 + 4) = 2 - 3 + 4 = 3” (should be -5).
Example: Misinterpreting problem constraints, e.g., “Assuming Bob could only be 1st or 2nd place, but concluding he came in 3rd.”
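For the incorrect simplification example, a worked version may make the slip easier to spot: the minus sign in front of the parentheses applies to every term inside them. This is a short illustrative derivation, not project-supplied material.

```latex
\begin{align*}
2 - (3 + 4) &= 2 - 7 = -5, \\
2 - (3 + 4) &= 2 - 3 - 4 = -5 \quad \text{(distributing the minus sign over both terms)}.
\end{align*}
```

Both routes agree on $-5$; the flawed version keeps $+4$ after removing the parentheses, which is where the error enters.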
Prompt Examples
(Do not copy. These are just to give you some inspiration.)
Topic & Reasoning Type: Math / Logical Reasoning
Prompt: Assume you are an explorer. You are currently at point $A$ and you are trying to determine the distance between two landmarks, which we denote $B$ and $C$. You know that the angle $\angle BAC = 80\degree$ and $\angle BCA = 30\degree$. From your current position $A$, you move 30 meters in the direction of $B$, and from there you determine that you are now 60 meters away from $B$. What is the distance between the landmarks $B$ and $C$?
Topic & Reasoning Type: Social Studies / Temporal Reasoning
Prompt: Estimate the year in which the population of Nisos will be approximately 6,000.
Topic & Reasoning Type: Law / Analytical Thinking
Prompt: Interpret the statutory law regarding public protests in the context of a recent event.
Topic & Reasoning Type: Biology / Causal Reasoning
Prompt: Given that cellular respiration involves the conversion of glucose into ATP, what would happen to the energy production in a cell if the supply of glucose was significantly reduced?
Topic & Reasoning Type: Math / Common Sense Reasoning
Prompt: A boat has been moving for 17 hours. For the first 30 minutes, it moved at a speed of 35 knots. Then, it sped up to 85 knots for 5 hours, eventually reducing its speed to 60 knots due to choppy waters, which lasted for 7 hours. Finally, as it neared its destination, it slowed to 25 knots, maintaining that speed until it docked at port. What was the average speed in knots of the journey?
Topic & Reasoning Type: Math / Logical Reasoning
Prompt: What is the smallest root of the polynomial $x^5+3x^4-23x^3-51x^2+94x+120$?
Topic & Reasoning Type: Analytical Thinking
Prompt: Use integration by parts to find the integral of $f(x) = x^2 * e^x$.
Topic & Reasoning Type: Math / Deductive Reasoning
Prompt: Jack is 4 years younger than their sister Lola and their mom Su is 12 years younger than quadruple their age. In two years, Jack and their sister's ages will sum to ten years less than their mom's current age. How old is Lola?
Topic & Reasoning Type: Math / Inductive Reasoning
Prompt: What is the next number in the sequence: 3,7,15,31,63,_?
Topic & Reasoning Type: Math / Comparative Analysis
Prompt: I live in Austin, Texas and my friend lives in Tulsa, Oklahoma. We both plan to drive to Santa Fe, New Mexico to meet up. We will leave our houses at the same time. I will be driving 75 mph for all of the 686 mi but will have to stop for gas three times. If my friend drives at 65 mph for all of the 641 mi but only has to stop for gas twice, who will get there first?
Topic & Reasoning Type: Math / Causal Reasoning
Prompt: If I take a number less than 10, quadruple it, divide it by 2, and multiply it by 3 with a result of ten tripled, was the original number prime?
Topic & Reasoning Type: Math / Pattern Recognition
Prompt: A light flashes purple twice, then green once, then blue thrice. This repeats three times, and then the light flashes purple twice more. What color will flash next?
Topic & Reasoning Type: Math / Statistical Reasoning
Prompt: A company manufactures washing machines, each of which has a 3% chance of being defective. If 300 washing machines are sold, what is the probability that exactly 5 will be defective?
Topic & Reasoning Type: Games / Temporal Reasoning & Abstract Thinking
Prompt: An adventurer sets out to slay a dragon. The dragon is very magical and so they must make their weapon magical as well in order to have a chance against the majestic beast. They first travel to Neitherfield and gather the poppies that grow under the blue oaks, then they journey to Everdeath and collect moss from the haunted walking redwoods. The journey from Neitherfield to Everdeath takes four days, and the moss collection takes a day. The poppies last only three days, but their efficacy can be doubled by taking a half-day detour into Valiant and spending another half day collecting the sweat from a giant pure black horse there. Preparing the sword-warding potion takes a quarter of a day. Even with this detour, will the poppies last long enough for them to succeed in their quest to magically ward their sword?
Topic & Reasoning Type: Puzzles / Deductive Reasoning
Prompt: Jack is 4 years younger than their sister Lola and their mom Su is 12 years younger than quadruple their age. In two years, Jack and their sister's ages will sum to ten years less than their mom's current age. How old is Lola?
Topic & Reasoning Type: Riddles
Prompt: Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option.
1. 5
2. 11
3. 0
4. 20
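Several of the math prompt examples above have answers that can be sanity-checked numerically before they are used as a reference answer. The following is a hedged sketch, assuming Python 3.8 or later (for `math.comb`); the variable names and the helper `poly` are illustrative assumptions, and the boat example assumes the final 25-knot leg fills whatever remains of the 17 hours.

```python
import math

# Boat example: (hours, knots) for each leg; the last leg takes the remaining time.
legs = [(0.5, 35), (5, 85), (7, 60)]
remaining_hours = 17 - sum(hours for hours, _ in legs)
legs.append((remaining_hours, 25))
distance_nm = sum(hours * knots for hours, knots in legs)
average_knots = distance_nm / 17  # total distance divided by total time

# Sequence example: confirm each term is double the previous term plus one.
sequence = [3, 7, 15, 31, 63]
rule_holds = all(b == 2 * a + 1 for a, b in zip(sequence, sequence[1:]))
next_term = 2 * sequence[-1] + 1

# Washing machine example: binomial probability of exactly 5 defects out of 300.
p_five_defective = math.comb(300, 5) * 0.03**5 * 0.97**295


# Polynomial example: any integer root must divide the constant term 120.
def poly(x):
    return x**5 + 3 * x**4 - 23 * x**3 - 51 * x**2 + 94 * x + 120


integer_roots = [x for x in range(-120, 121) if poly(x) == 0]
# A degree-5 polynomial has at most 5 roots, so 5 integer roots means none are missed.
smallest_root = min(integer_roots)

print(average_knots, next_term, p_five_defective, integer_roots)
```

A common slip on the boat prompt is averaging the four speeds directly; weighting each speed by its duration, as above, avoids that.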
Domains Overview
Complexity Guidelines
Ideally, your prompt should reflect real-world scenarios that are highly complex
and creative.
However, as long as these constraints are “woven” into the actual reasoning
prompt, and are not inessential (or asking the model to do busy-work), feel free
to use them to add more creativity to your prompts.
BASIC COMPLEXITY
As a general rule of thumb, basic complexity prompts should take less than 5 minutes
to figure out.
EXAMPLE:
Topic: Biology
Reasoning Type: Causal Reasoning
If smoking causes lung cancer, why did the lung cancer rates go up
when the smoking rates went down?
INTERMEDIATE COMPLEXITY
As a general rule of thumb, intermediate complexity prompts should take more than 5
minutes to figure out.
EXAMPLE:
Topic: Science
Reasoning Type: Logical Reasoning
Grading Rubric
Prompt Clarity & Specificity | Major clarity issues; prompt is vague or hard to follow; key details missing. | Mostly clear, but could be interpreted multiple ways or lacks a minor detail. | Clear and specific; all necessary information is included.
Rating Rubric
How your ratings of the Initial Response will be evaluated
Initial Response Labels | Major issue with labeling the response (e.g., incorrect labels for steps). | N/A | All labels are correctly selected; the response is accurately labeled.
Justification Rubric
How reviewers will grade your written ratings justifications
Rewrite Rubric
How reviewers will grade your rewritten response steps
Rewritten Steps Quality | Clearly worse than the model response. Steps should be rewritten but aren't. | About the same quality as the model response. | Step clearly performs better than the model response; rewritten to the correct degree.
Clarity & Structure | Content is hard to follow or unclear. | Minor clarity issues; generally makes sense. | Clear and easy to follow.