Vision SFT Handbook
In this project we are helping develop a chatbot’s ability to process and understand
images. We’ll do this by writing prompts that refer to a picture, then creating strong
responses that incorporate details from that picture.
Contributors will help write and edit prompts and responses, and compare the quality
of human-written and model-generated responses.
This data will help our chatbots give us better answers when using images as
references across a variety of use cases and help types.
The Goals 🥇
1. 🌉 Help the model learn to write amazing prompts and responses with an image for
reference.
2. 📈 Write responses that are as good as or better than state-of-the-art chatbot
responses.
3. ⏳ Give quick, complete, and useful answers to users.
a. A "complete" answer that includes every possible detail is less important
than a concise, relevant response that directly addresses the user's need.
Task Steps:
Attempter Workflow (L-1):
📖 Read Variables → 🖼️ Find an Image → ✍️ Write Prompt(s) → ✍️ Write Response(s)
● Goal: As the first person to attempt this task, your goal is to find an approved
image and create a fantastic, lifelike prompt based on your task-specific variables,
then draft a response to go with it that is equally vibrant and engaging. Your
aim is to write an even better response than a chatbot could.
● Steps:
a. Review the task-specific instructions for the variables you need to
incorporate, specifically:
i. Help type
ii. Image type
iii. Text Load
iv. Text Language
v. Locale
b. Review the prompt instructions for the task to ensure you write the
prompt with all the necessary considerations
c. Find an image that meets the requirements and attach it to the prompt
entry box
d. Write an excellent prompt that references the image and asks for the
relevant help type. The model shouldn’t be able to answer your prompt
without knowing what’s in the image. DO NOT describe your image in
the prompt itself. Doing so would defeat the purpose of testing whether
the model can understand the picture.
FOR THE PROMPT-ONLY PROJECT, THIS IS THE FINAL STEP. FOR
THE PROMPT+RESPONSE PROJECTS, KEEP GOING.
e. Write a tailored, high-quality response to your own prompt that fulfills
the needs of the user.
i. If you are performing Multiturn Tasks, an additional Prompt box
will appear after your response box. You do NOT need to choose a
new photo; continue the conversation based on the original photo.
Continue writing prompt/response pairs until you’ve reached the
minimum number of turns or have come to a natural close of the
conversation, whichever is longer, with a maximum of 6 turns.
ii. It can take a few seconds for the next Prompt box or the Submit
Task button to load after the last Response. When you’re done
writing pairs, select End Session.
f. Submit the task
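The Multiturn stopping rule in step e.i can be read as a small decision function. The sketch below is one interpretation of that rule, not project tooling: `should_continue`, `min_turns`, and `reached_natural_close` are hypothetical names, and the minimum number of turns is assumed to come from the task-specific instructions.

```python
MAX_TURNS = 6  # upper limit stated in step e.i


def should_continue(turns_written: int, min_turns: int, reached_natural_close: bool) -> bool:
    """Return True if another prompt/response pair should be written."""
    if turns_written >= MAX_TURNS:
        return False  # never exceed the maximum
    if turns_written < min_turns:
        return True  # the minimum has not been reached yet
    # Past the minimum: keep going only until the conversation closes naturally.
    return not reached_natural_close
```

For example, with a minimum of 3 turns, a conversation that feels finished after 1 turn still continues, while one that is still open after 6 turns stops.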
Reviewer Workflow:
📖 Read Variables → 📖 Review the image & prompt(s) → ✍️ Edit if necessary → 📖 Review the Response(s) → ✍️ Edit if necessary → 🎚️ Rate Responses → ⚖️ Choose the Best Answer(s) → ✔️ Log Your Edits → 💯 Give Feedback
● Goal: Quality-check the prompt/response pairs from both the Attempter and a
chatbot, and make quality improvements. You’ll rank the responses side by side
and tell us which response is better.
● Steps:
a. Review the task-specific instructions and variables to ensure you have a
good understanding of what the prompt and response need to include
b. Review the prompt(s), potentially adding edits to improve them
■ FOR THE PROMPT-ONLY PROJECT, THIS IS THE ONLY PIECE
YOU WILL REVIEW – THERE WILL BE NO RESPONSE OR
RATINGS.
c. Review the response(s), potentially adding edits to improve them
d. Grade each response on specific quality criteria, such as image
understanding, instruction following, etc.
■ If you are performing Multiturn tasks, click on each turn’s response
to open up the rubric and preference ranking for that turn. All turns
need to be edited/rated.
e. Provide a preference rating for which response is better and a
justification.
■ On Multiturn tasks, you’ll give a preference rating for EACH
answer.
f. Note how much editing you needed to do on the prompt/response, and
give an overall quality grade for the task.
g. Give Feedback to the Attempter, if needed. Grade the quality of the task
with a justification and leave inline feedback wherever it’s needed.
h. Submit the task (“Approve with changes”)
Task-Specific Variables
Each task has five (5) variables that need to be considered when writing and reviewing
the prompt and responses.
● 📷 Image type:
○ What type of picture you (as the user) are using for reference with the
chatbot. Each Image Type is a broad category, so get creative with what
you search for within each type!
● 📚 Text Load:
○ Each Image Type is also associated with a Text Load – how much text
should be in the image. For image types that are Text-Heavy, you’ll be
given a Text Language.
● 🗣️ Text Language:
○ If your image is Text-Heavy, the language it is in will matter. Make sure to
pay attention to the text language that needs to be in your photo!
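One way to keep the task variables straight is to treat them as a small record with a consistency check. The sketch below is illustrative only: the `TaskVariables` class, its field names, and the example values are hypothetical, and the only rule it encodes is the one stated above, that a Text-Heavy image comes with a Text Language.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskVariables:
    """The five variables listed in the Attempter steps (hypothetical field names)."""
    help_type: str
    image_type: str
    text_load: str
    locale: str
    text_language: Optional[str] = None  # only expected for Text-Heavy images

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means consistent."""
        issues = []
        if self.text_load == "Text-Heavy" and not self.text_language:
            issues.append("Text-Heavy images need a Text Language")
        return issues


# Example values are made up for illustration.
task = TaskVariables("Extraction", "menu", "Text-Heavy", "en-IE", "English")
```

Calling `task.problems()` on the example returns an empty list; dropping the `"English"` argument would return one issue.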
Prompt Creation Guidelines:
When crafting prompts, always adhere to the intended Help Type. This means
understanding what the user expects from the model.
● For instance, in a Creative Writing scenario, the user seeks an original text
output, regardless of its perceived "creativity" or "artistic" merit.
● In Extraction, the user needs a rapid answer derived from a provided source
text, without having to read it entirely.
● For Chatbot use cases, the user aims to engage in interactive experiences, such
as role-playing, game-playing, or general entertainment-focused conversations.
A task suitable for Extraction may also be appropriate for Closed Q&A. Similarly,
Chatbot or Brainstorming tasks can function as Open Q&A. Focus on creating tasks
that fit the primary use case, rather than trying to pick the only "correct" use case.
Tasks involving faces and people must not ask the model to identify a person.
Specifically:
1. Do NOT ask for direct identification of people, such as: “Who is this?” + [a photo
of a person] or “What is this person known for?” + [a photo of a person], even for
public/historical figures.
2. Do NOT ask resemblance questions, such as: “Which famous soccer player does
this person look like?” + [a photo of a person]
3. Do NOT ask the model to infer protected status from images, such as: “What is the
sexual orientation of the person in the image?” + [a photo of a person]
However, if the person is clearly identified (for example, by name in the image or
in the prompt), you are allowed to ask questions about them.
Examples are:
1. “What’s the sexual orientation of this person?” + [a photo of Alan Turing with the
text “Alan Turing” beside him]
2. “What’s the sexual orientation of this person?” + [an image with only the text “Alan
Turing” on it].
3. “This is a picture of Barack Obama. What were his greatest accomplishments?”
Chatbot tasks should not be scripted conversations unless explicitly requested. The
interaction should resemble a basic conversation where the prompt represents one
side, and the model represents the other.
Good Example:
● Prompt:
"I’ve often wondered why the sea is so blue around the Amalfi
Coast. Please answer in English in the voice of Jacques
Cousteau."
● Response:
"The blue sea of Amalfi is famous, and one of my favorites to go
diving in! It is so blue because the water is so clear and
because of the brightness of the sun."
Bad Example:
● Prompt:
"I’ve often wondered why the sea is so blue around the Amalfi
Coast. Please answer in English in the voice of Jacques
Cousteau."
Localization Highlights:
Tasks must be tailored to your specific language and country. This means:
● Do not include tasks about global celebrities (unless they have a local connection, such
as Bono's house in Dublin).
● Steer clear of universally common topics like Hollywood movies, major video games, or
widely available car models.
● Do not create tasks about locales other than your own (e.g., do not write about Hong
Kong if you are in Singapore).
Responses should:
● Lead with the answer, using the pyramid principle to provide immediate value.
● Avoid meaningless pleasantries like "Of course!" or "Sure, I’d be happy to help!"
(except in Chatbot use cases where conversational politeness is expected).
● Eliminate unhelpful repetition, such as restating the prompt or summarizing the
response at the end.
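The pleasantry rule above lends itself to a quick self-check before submitting. The sketch below is a rough heuristic, not a project tool: `opens_with_pleasantry` and the `PLEASANTRIES` list are hypothetical, and the list only contains the two openers quoted in the guideline.

```python
# Openers quoted in the guideline, lowercased with straight apostrophes.
PLEASANTRIES = ("of course", "sure, i'd be happy to help")


def opens_with_pleasantry(response: str, help_type: str) -> bool:
    """Flag a response that leads with a filler opener instead of the answer."""
    if help_type.lower() == "chatbot":
        return False  # conversational politeness is expected in Chatbot use cases
    # Normalize case and curly apostrophes before comparing.
    opening = response.strip().lower().replace("’", "'")
    return opening.startswith(PLEASANTRIES)
```

A flagged response is a cue to rewrite the opening so it leads with the answer; the check does not cover repetition or closing summaries, which still need a manual read.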