ITT
Dear Contributors, please take a moment to review the following customer feedback for the
Image-to-Text (ITT) project! The main takeaway is that prompts must appropriately
challenge the model to reason through the question to get to the answer! In other words,
the answer should NOT be obvious! As always, prompts should draw upon real-world
scenarios where you might ask an AI for guidance, using simple, direct, and natural-
sounding language.
Please do NOT attempt tasking any further before reading this feedback!
Here’s an example of a too-simple prompt that was submitted in the Scene Understanding
competency:
Prompt Example
While this prompt is appropriate for the Scene Understanding competency in that it asks about
the position of the car relative to the garage door, the answer is almost immediately evident just
by looking at the image 👎. Please avoid overly simple prompts like this, particularly in the Scene
Understanding and Counting competencies! This kind of prompt neither challenges the model nor
helps it improve.
Here are some principles that you might consider incorporating into prompts for the Scene
Understanding competency:
● Hierarchical Spatial Reasoning: E.g., “Can you tell which objects are on the floor, which
are on furniture, and which are stacked on each other?”
● Occlusion and Depth Reasoning: E.g., “Which building looks closest to the viewer, and
which looks farthest away?”
● Relational Layout with Directional Anchoring: E.g., “Using the tree in the middle as a
reference, where are the animals located around it: left, right, behind, or in front?”
● Symmetry/Alignment Detection: E.g., “Are the benches lined up evenly with the
fountain? If not, how are they positioned differently?”
● Navigation and Pathfinding Reasoning: E.g., “If someone walks from the red door to
the green gate, what’s the best clear path they can take? Are there any obstacles?”
● Object Orientation and Facing Direction: E.g., “Who’s facing the camera, who’s turned
sideways, and who has their back to us?”
● Nested Spatial Structures: E.g., “What items are inside other items in this image, like a
spoon in a cup or a cup in a cabinet? Can you describe the full nesting?”
● Motion & Temporal Spatial Inference: E.g., “Looking at the positions of the child and the
ball, who’s likely to reach it first?”
(Note that while the prompt example above does test the model’s ability to understand the
relational layout of the image, it only asks about two objects relative to each other. A better
prompt might ask about the relational layout of multiple objects in the image, relative to a
reference point.)
Prompts should draw upon real-world scenarios where you might ask an AI for guidance, using
simple, direct, and natural-sounding language. While it is sometimes necessary to use more
technical language for clarity, prompts should generally sound as if someone in the real world is
asking an AI model about a problem they are encountering and how to solve it.
In the same vein, many Attempters have relied heavily on formatting requests to fulfill
Complexity, resulting in unnatural-sounding prompts. You may still include formatting requests
in your prompts where they make sense, but do NOT include them simply to fulfill Complexity.
The Complexity reviewer measure will be updated soon to reflect that prompts should contain
sufficient complexity without relying on formatting requests.
Make sure your prompts are specific and precise in what you are asking for! Vague
prompts can be interpreted in many ways, leading to responses that may be off-target and thus
difficult to rate. Be as specific as possible about the kind of information that you are
seeking in the response!
4. No Unattainable Requests! 🙅
Prompts should be answerable based on information that is visually provided by the image.
Please do not submit prompts that cannot be answered by information contained in the
image. This includes requests whose correct answer is not among the options shown in an image.
The following “unattainable” prompt was submitted in the Patterns competency. However, it is
not possible to identify a correct answer based on the given options:
Prompt Example
The correct answer is not found in the second image: the actual answer is that the white horse
inside the white diamond is the odd knight.
Additionally, the prompt does not provide any explicit instructions on what the model should do if
none of the options are correct. This lack of clear guidance may mislead the model when the
answer is unattainable.