
Welcome to the Extensions project! We are so excited to have you.

Agenda
Project Introduction and Setup [12 min]
Intro & Overview of the Extensions project
Task Overview
New Dimensions
Prompt Requests
Multi-turn Conversations
Punts and Blank Responses
Key Model Assumptions
Knowledge Checks

Intro & Overview: Extensions project [2 min]


Project Context

In this project, the customer () is interested in user-generated prompts that have
their models interact with tools like Google Maps, Google Flights, Google Hotels,
Google Search, etc. You'll see many prompts that ask for directions, itineraries,
hotels, and so on. These prompts come from real-life users like yourself. With
this context, you can see why the quality bar must be so high: if the model
provides false directions, the user will almost certainly never use the chatbot.
For each task, you will:
Rate two (2) responses separately across various dimensions
Rate a "tool log" for accuracy
Provide two preference ranking scores:
A score for how much you prefer one response over the other
A score for how much you prefer one tool log over the other
Write a justification for the response preference ranking, explaining the specifics
and logic behind your decision.
Mark error categories for both the response and the tool log

Task Walkthrough (1/3)


A step-by-step guide to each part of the task.
Step 1: If there is a chat history, familiarize yourself with the prior
conversation between the user and the chatbot. Use the history as context for the
latest prompt.
See below for an example chat history.
Start a Chat
Please read the previous conversation carefully to gather context.
User: You are a indie hacker with experience in finding leads for software
engineering project. I'm trying to brainstorm ideas for generating leads and
attract clients and close deals for my custom software solutions company. I am a
fullstack developer with more than 9 years experience in NodeJS, Typescript,
Kubernetes, React, Astro, React, Serverless and AWS. I'm also a certified AWS Cloud
developer and AWS Solutions Architect with hands on experience designing,
provisioning and developing custom platforms.
Generate ideas on how a professional like me could land new jobs in the IT
industry. Make sure to include sources and tools to help in searching and finding
new clients, how to contact and how to attract in further discussions the leads,
and think of other methods that could land a job. After you suggestions I will give
feedback and you'll have to ask questions to find out how your suggestions help.
Adjust the ideas based on the received feedback.
Model:
Leveraging Your Expertise to Attract Clients
As a seasoned full-stack developer with extensive experience in NodeJS, Typescript,
Kubernetes, React, Astro, React, Serverless, and AWS, you possess a valuable
skillset that can be highly sought after by businesses seeking custom software
solutions. To effectively attract clients and land new jobs, consider implementing
a multi-pronged approach that combines online strategies with targeted offline
engagements.
1. Establish a Strong Online Presence
Create a Compelling Website: Craft a professional website that showcases your
expertise, projects, and case studies. Highlight your technical proficiency,
problem-solving skills, and ability to deliver successful software solutions.

Step 2: Read the prompt carefully to understand what the user needs. Look up terms
you don't know. If there is a link, look through it as you evaluate each response.
See below for an example prompt.
Make an extensive and effective summary of the following video: https://youtu.be/0-3SlufYHwA?si=nX4ORzISCGR5p3vj

Step 3: Code and Code Output Section. [UPDATED]


Read the code and code output for each response. Note that all code that invokes
Tools APIs has four primary components: (1) Tools, (2) Tool Methods, (3)
Parameters, and (4) business logic / reasoning in code comments. Click the pencil
in the top right to make the code easier to read.
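To make the four components concrete, here is a minimal hypothetical sketch in Python. The tool name, method, and parameters below are invented for illustration; the actual tasks show the customer's own tool-call format:

```python
# Hypothetical tool call, illustrating the four components raters should find.
# (4) Business logic / reasoning in comments: the user asked for hotels in LA
#     under $300, so the code queries the hotel tool with a price cap.
tool_call = {
    "tool": "Google Hotels",       # (1) Tool
    "method": "search_hotels",     # (2) Tool Method
    "parameters": {                # (3) Parameters
        "location": "Los Angeles, CA",
        "max_price_usd": 300,
    },
}

def has_required_components(call: dict) -> bool:
    """Check that a tool call names a tool, a method, and parameters."""
    return all(key in call for key in ("tool", "method", "parameters"))

print(has_required_components(tool_call))  # → True
```

When reading a real task, checking off these four components one at a time is an easy way to spot a tool log that is missing something.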

Task Walkthrough (2/3)


Step 4: Model Response Section
Read each of the two responses carefully and provide ratings on the quality
dimensions.
See below for an example model response.

Answer the questions below


Label your response according to the questions below.
Response 1
This is an article about Sundar Pichai's goals for Google in 2024. It discusses
seven goals, including delivering the world's most advanced Al and building the
most helpful personal computing platforms and devices. Some of the important points
are that Google will be launching a new platform with Android XR and that Gemini
Ultra should be nearing launch.
Writing Quality
The correctness of spelling and grammar, and the quality of sentence structure. Determine whether the different ideas flow from one to another.
1. Major Issues
2. Minor Issues
3. No Issues
Verbosity
How effectively the written response conveys information without unnecessary repetition or wordiness. Look for length, relevance, and repetition.
1. Too Verbose
2. Just Right
3. Too Short
Verbosity Justification*
In 1-2 sentences, explain why you think there are issues with Verbosity. Please use
evidence. Please make sure any nuance/details are still provided in the final
justification at the end. The mini justifications are strictly for the
reviewers/auditors. Repetition is okay between the mini justifications and the
final justification.
This response provides a very brief summary of the main points.

Step 5: Rate the four new dimensions for the response: Content Conciseness &
Relevance, Content Completeness, Collaborativity, and Contextual Awareness (more
details on these new dimensions in the rubric course).

Step 6: Write a justification for your decision on which model response you prefer

Step 7: Select the error categories for the response and the tool log

Prompt Requests
Links in the prompt and "@"

Links – A user usually includes a link in the prompt for a very specific reason. Be
sure to always read through the link in the prompt! It's a critical part of the
user's request.
"@" – If you see the "@" symbol before the name of a tool like "@Google Hotels" or
"@travel arrow," that means the user is requesting the model to use that specific
tool to fulfill their request.

Request Importance

Not all requests are made equal! Some are more important than others.

Imagine a prompt requests 10 hotels under $300 in LA. Say one response offers
5 hotels under $300 in LA, and another response offers 10 hotels above $300 in LA.
Which response would a user probably prefer? Probably the first: in this context,
price is the key constraint, and a response that ignores it is far less useful.
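To make the comparison concrete, here is a minimal sketch in Python with invented hotel data. It simply counts how many of each response's offerings actually satisfy the user's price constraint:

```python
# Invented example data: each response lists (hotel name, nightly price in USD).
response_1 = [("Hotel A", 250), ("Hotel B", 280), ("Hotel C", 199),
              ("Hotel D", 275), ("Hotel E", 290)]   # 5 hotels, all under $300
response_2 = [(f"Hotel {c}", 310 + 5 * i)
              for i, c in enumerate("FGHIJKLMNO")]  # 10 hotels, all above $300

def usable_results(hotels, max_price=300):
    """Count how many offered hotels satisfy the user's price constraint."""
    return sum(1 for _, price in hotels if price < max_price)

print(usable_results(response_1))  # → 5
print(usable_results(response_2))  # → 0
```

The second response offers more hotels, but none of them are usable: quantity does not outweigh a violated hard constraint.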

Multi-turn Conversations
Multi-turn conversations (chat/conversation history) are where the user and chatbot
have a conversation before the final prompt. Sometimes, the conversation history
will appear in the response itself as "==Conversation History==". Sometimes, it
will appear above the other responses.
Always read the chat history for multi-turn conversations carefully: it provides
important context on the latest prompt and the 2 responses you'll rate.

Refusal to Answer Prompt and Blank Responses


Refusal to Answer Prompt
A model might refuse to answer the prompt because the request is outside of its
capabilities. In this project, we assume that if a model refuses to answer due to
model limitations, it had no other choice – whatever the model says about its
limitations is true. When a model refuses to answer, it is not helpful to the user.

Complete Refusal Examples:


I'm sorry, but I'm not able to access the video/website you've provided. Can I help
with anything else?
I'm just a language model, so I can't help you with that.
Partial Refusal Examples:
I'm not able to access the video/website you've provided. However, based on the
website title, I've searched the web and found that …
I'm not able to search for flights directly. However, you can use the following
websites to find direct flights ...
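One way to see the distinction: a complete refusal offers nothing beyond the refusal itself, while a partial refusal follows the limitation statement with an attempt to help. The rough keyword heuristic below is illustrative only, not the project's actual rule:

```python
def refusal_type(response: str) -> str:
    """Roughly classify a response as a complete refusal, a partial
    refusal, or not a refusal. Illustrative heuristic only: a partial
    refusal pairs a stated limitation with a pivot toward helping."""
    refuses = "not able to" in response or "I can't" in response
    pivots_to_help = "However" in response
    if refuses and pivots_to_help:
        return "partial refusal"
    if refuses:
        return "complete refusal"
    return "not a refusal"

print(refusal_type("I'm not able to access the video you've provided."))
# → complete refusal
print(refusal_type("I'm not able to search for flights directly. "
                   "However, you can use the following websites ..."))
# → partial refusal
```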
Blank Responses
A blank response is one where the model has no output, or the output consists
strictly of googleusercontent links.
If the entire response is blank, mark Instruction Following as Major Issues and
Truthfulness as N/A.
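A simple way to operationalize the rule: a response is blank if, after stripping googleusercontent links, no text remains. A hypothetical sketch:

```python
import re

def is_blank_response(text: str) -> bool:
    """A response is blank if it is empty or consists strictly of
    googleusercontent links (which should be ignored, not penalized)."""
    without_links = re.sub(r"\S*googleusercontent\S*", "", text)
    return without_links.strip() == ""

print(is_blank_response(""))                                          # → True
print(is_blank_response("http://lh3.googleusercontent.com/abc123"))  # → True
print(is_blank_response("Here is your itinerary for Chicago."))       # → False
```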

Key Model Assumptions [2 min]


Response 1 and Response 2 are the outputs of two different models. These models
have different limitations, sources, data, etc.
The models DON'T have the ability to generate images: they render as [Image]. Don't
penalize these tags if you see them: we are simply not rendering them.
If location is NOT specified anywhere in the prompt (including the small text above
it), the models should default to Mountain View, CA or Chicago, IL
Assume that the limitations of the model are true (e.g. "I cannot access that
link"). If the model refuses to answer because of a model limitation, assume it had
no other choice.
If time is NOT specified anywhere in the prompt, the models default to when the
prompt was asked.
If you see a googleusercontent link, assume it's not there. googleusercontent links
are garbage links that the model outputs. However, we should NOT penalize these
links.
If the task requires specific domain knowledge you don't have (e.g. math, coding),
please skip the task.

Embedded UI Guidance
Table of Contents

Embedded UI Guidance

Tips

Conclusion

When does this apply?

When there are no flights, maps, videos, pictures, or music listed in the response.

In such a response, the model says it has options for flights or hotels, but none
are shown…

No Issues

If the text says there is an embedded UI, check the tool log output. If the tool
log output makes sense and is accurate, rate Truthfulness as “No Issues.”

Major Issues / Cannot Assess

If the tool log exists and the text DOES NOT say there is an embedded UI, rate
“Major Issues.”
If no tool log exists and the text DOES SAY that an embedded UI exists, rate
“Cannot Assess.”
If no tool log exists and the text DOES NOT say there is an embedded UI, rate
“Major Issues.”
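The four cases above form a simple decision table. The sketch below just makes the combinations explicit (it assumes that when a tool log exists and the text claims an embedded UI, the log has already been checked for accuracy):

```python
def truthfulness_rating(tool_log_exists: bool, text_claims_embedded_ui: bool) -> str:
    """Map the embedded-UI cases to a Truthfulness rating."""
    if tool_log_exists and text_claims_embedded_ui:
        return "No Issues"       # provided the tool log output is accurate
    if tool_log_exists:
        return "Major Issues"    # log exists but the text never mentions the UI
    if text_claims_embedded_ui:
        return "Cannot Assess"   # UI claimed but there is no log to verify
    return "Major Issues"        # no log and no claim of an embedded UI

print(truthfulness_rating(True, True))    # → No Issues
print(truthfulness_rating(False, True))   # → Cannot Assess
```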
Tips

Other possible prompts/scenarios...


Play me music from…
Show me where I can find “X” on a map…
Show me the video of…
I want to see a picture of…

Whenever the prompt requests that the model show, play, or display something, you
should check for an embedded UI.

Extensions - Common Errors (2024-10-02)


Hello, team! This week's mandatory course on common errors covers the following
topics:
1) Handling punt ratings
2) Handling embedded UI
3) Checking Tool/Code Quality and each code step within the Code and Code Output
4) Handling an empty code step
Please digest the information before contributing. Unfortunately, these mistakes
are leading to demotions and/or removal from the project. If you still have
uncertainties, please ask in Discourse! It's much better to ask and clarify than
to assume.

Introduction
In this supplemental course, we will cover the most common tasking errors we see on
this project and how to address them.

Table of Contents
Common Errors
Verifying Prompt Requirements
Handling Punt Ratings
Embedded UI
Carefully Checking the Code and Code Output
Ignoring the Last Step with an Empty `code`, `tool_executions`, `error`, and
`observation`
Conclusion

Reminder: Whether you are new to this project or have been tasking for a while,
please review the material carefully. Even simple mistakes can create bad training
data, and contributors who consistently submit low-quality tasks may be removed
from the project.
Common Errors
The common errors that we will address in this course are:
Not verifying that all requirements in the prompt were addressed
Not rating punts correctly according to the instructions
Not knowing how to rate responses with an Embedded UI
Not checking the Code and Code Output carefully
Not checking every code step in the Code and Code Output
Not knowing to ignore the last code step with all fields set to blank values
