Quick Start Guide to Large Language Models
Strategies and Best Practices for Using ChatGPT and Other LLMs
Sinan Ozdemir
Addison-Wesley Professional (2023)
Contents at a Glance
Preface
Part I: Introduction to Large Language Models
1. Overview of Large Language Models
2. Launching an Application with Proprietary Models
3. Prompt Engineering with GPT-3
4. Optimizing LLMs with Customized Fine-Tuning
Part II: Getting the most out of LLMs
5. Advanced Prompt Engineering
6. Customizing Embeddings and Model Architectures
7. Moving Beyond Foundation Models
8. Fine-Tuning Open-Source LLMs
9. Deploying Custom LLMs to the Cloud
Table of Contents
Preface
Part I: Introduction to Large Language Models
1. Overview of Large Language Models
What Are Large Language Models (LLMs)?
Popular Modern LLMs
Domain-Specific LLMs
Applications of LLMs
Summary
2. Launching an Application with Proprietary Models
Introduction
The Task
Solution Overview
The Components
Putting It All Together
The Cost of Closed-Source
Summary
3. Prompt Engineering with GPT-3
Introduction
Prompt Engineering
Working with Prompts Across Models
Building a Q/A bot with ChatGPT
Summary
4. Optimizing LLMs with Customized Fine-Tuning
Introduction
Transfer Learning and Fine-Tuning: A Primer
A Look at the OpenAI Fine-Tuning API
Preparing Custom Examples with the OpenAI CLI
Our First Fine-Tuned LLM!
Case Study 2: Amazon Review Category Classification
Summary
Part II: Getting the most out of LLMs
5. Advanced Prompt Engineering
Introduction
Prompt Injection Attacks
Input/Output Validation
Batch Prompting
Prompt Chaining
Chain of Thought Prompting
Re-visiting Few-shot Learning
Testing and Iterative Prompt Development
Conclusion
6. Customizing Embeddings and Model Architectures
Introduction
Case Study – Building a Recommendation System
Conclusion
7. Moving Beyond Foundation Models
Introduction
Case Study—Visual Q/A
Case Study—Reinforcement Learning from Feedback
Conclusion
8. Fine-Tuning Open-Source LLMs
Overview of T5
Building Translation/Summarization Pipelines with T5
9. Deploying Custom LLMs to the Cloud
Overview of Cloud Deployment
Best Practices for Cloud Deployment
Preface
Topics covered:
9. Combining Transformers
from transformers import pipeline

def classify_text(email):
    """
    Use Facebook's BART model to classify an email.

    Args:
        email (str): The email to classify
    Returns:
        str: The classification of the email
    """
    # COPILOT START. EVERYTHING BEFORE THIS COMMENT WAS INPUT TO COPILOT
    classifier = pipeline(
        'zero-shot-classification', model='facebook/bart-large-mnli')
    labels = ['spam', 'not spam']
    hypothesis_template = 'This email is {}.'
    results = classifier(
        email, labels, hypothesis_template=hypothesis_template)

    return results['labels'][0]
    # COPILOT END
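As a quick sanity check, a call like the following (the email text here is invented purely for illustration) should return one of the two candidate labels:

print(classify_text("You have won a free cruise! Click here to claim your prize."))
# most likely prints 'spam'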
Note
Definition of LLMs
Note
Pre-training
• 800M words.
Note
Transfer Learning
Transfer learning is a technique used in machine learning to
leverage the knowledge gained from one task to improve
performance on another related task. Transfer learning for
LLMs involves taking an LLM that has been pre-trained on one
corpus of text data and then fine-tuning it for a specific
“downstream” task, such as text classification or text
generation, by updating the model’s parameters with task-
specific data.
Fine-tuning
If some of that went over your head, not to worry: we will rely
on pre-built tools from Hugging Face’s Transformers package
(Figure 1.9) and OpenAI’s Fine-tuning API to abstract away a lot
of this so we can really focus on our data and our models.
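To make the transfer learning + fine-tuning idea concrete, here is a minimal sketch of fine-tuning a pre-trained model with the Transformers package. The checkpoint, dataset, and hyperparameters are illustrative choices on my part, not a prescribed recipe:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Start from a pre-trained checkpoint (the pre-training) and update its weights
# on a labeled downstream task (the fine-tuning).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")  # an illustrative labeled classification dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-demo", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()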
Note
Attention
Embeddings
Tokenization
Note
Figure 1.14 Any LLM has to deal with words it has never seen before. How an LLM tokenizes text can matter if we care about the token limit of an LLM.
Some LLMs limit the number of tokens we can input at any one time, so how an LLM tokenizes text matters if we are trying to stay mindful of that limit.
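A small illustration of both points, using BERT's tokenizer as an example (any Hugging Face tokenizer would behave analogously): a rare word gets broken into several subword pieces, and every piece counts against the token limit.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Pneumonoultramicroscopicsilicovolcanoconiosis is a long word")
print(tokens)       # the rare word is split into many subword pieces
print(len(tokens))  # each piece counts toward the model's token limit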
BERT
T5
These three LLMs (BERT, GPT, and T5) are highly versatile and are used for various NLP tasks, such as text classification, text generation, machine translation, and sentiment analysis, among others. These three LLMs, along with flavors (variants) of them, will be the main focus of this book and our applications.
Domain-Specific LLMs
Applications of LLMs
These methods use LLMs in different ways and while all options
take advantage of an LLM’s pre-training, only option 2 requires
any fine-tuning. Let’s take a look at some specific applications of
LLMs.
Text Classification
Translation Tasks
SQL Generation
What first caught the world's eye in terms of modern LLMs like ChatGPT was their ability to freely write blogs, emails, and even academic papers. This notion of text generation is why many LLMs are affectionately referred to as "Generative AI," although that term is a bit reductive and imprecise. I will not often use the term "Generative AI," as the specific word "generative" has its own meaning in machine learning as the counterpart to "discriminative" modeling. (For more on that, check out my first book, The Principles of Data Science.)
Note
Chatbots
Figure 1.24 ChatGPT isn't the only LLM that can hold a conversation. We can use GPT-3 to construct a simple conversational chatbot. The text highlighted in green represents GPT-3's output. Note that before the chat even begins, I inject context to GPT-3 that would not be shown to the end-user but that GPT-3 needs in order to provide accurate responses.
We have our work cut out for us. I’m excited to be on this
journey with you and I’m excited to get started!
Summary
Introduction
Now you might be thinking that it’s time to finally learn the best
ways to talk to ChatGPT and GPT-4 to get the optimal results,
and we will start to do that in the next chapter, I promise. In the
meantime, I want to show you what else we can build on top of
this novel transformer architecture. While text-to-text
generative models like GPT are extremely impressive in their
own right, one of the most versatile solutions that AI companies
offer is the ability to generate text embeddings based on
powerful LLMs.
The Task
The terms you input into a search engine may not always align
with the exact words used in the items you want to see. It could
be that the words in the query are too general, resulting in a
slew of unrelated findings. This issue often extends beyond just
differing words in the results; the same words might carry
different meanings than what was searched for. This is where
semantic search comes into play, as exemplified by the earlier-
mentioned Magic: The Gathering cards scenario.
Solution Overview
The Components
Text Embedder
OpenAI’s embedding
Getting embeddings from OpenAI is as simple as a few lines of code (Listing 2.1). As mentioned previously, this entire system relies on an embedding mechanism that places semantically similar items near each other so that the cosine similarity is large when the items are actually similar. There are multiple methods we could use to create these embeddings, but for now we will rely on OpenAI's embedding engines to do this work for us. Engines are the different embedding mechanisms that OpenAI offers. We will use their most recent engine, which they recommend for most use cases.
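Listing 2.1 itself didn't survive into this excerpt, but a minimal sketch of the idea, assuming the pre-1.0 openai Python client and the text-embedding-ada-002 engine, looks roughly like this:

import openai

# Assumes openai.api_key has already been set from the environment.
def get_embedding(text, engine="text-embedding-ada-002"):
    response = openai.Embedding.create(input=[text.replace("\n", " ")], engine=engine)
    return response["data"][0]["embedding"]

vector = get_embedding("A Magic: The Gathering card about dragons")
print(len(vector))  # ada-002 returns 1,536-dimensional vectors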
Document Chunker
chunk.append(sentence)
tokens_so_far += token + 1
return chunks
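The chunking loop above is only an excerpt. A self-contained sketch of the same idea, assuming tiktoken for token counting and a naive sentence split (the full version lives in the book's repository), might look like this:

import tiktoken

def split_into_chunks(text, max_tokens=500, encoding_name="cl100k_base"):
    # Greedily pack sentences into chunks that stay under max_tokens.
    encoding = tiktoken.get_encoding(encoding_name)
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks, chunk, tokens_so_far = [], [], 0
    for sentence in sentences:
        tokens = len(encoding.encode(" " + sentence))
        if tokens_so_far + tokens > max_tokens and chunk:
            chunks.append(". ".join(chunk) + ".")
            chunk, tokens_so_far = [], 0
        chunk.append(sentence)
        tokens_so_far += tokens + 1
    if chunk:
        chunks.append(". ".join(chunk) + ".")
    return chunks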
Cluster 0: 2 embeddings
Cluster 1: 3 embeddings
Cluster 2: 4 embeddings
...
Pinecone
Open-source Alternatives
API
FastAPI
import hashlib
import os

import openai
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
openai.api_key = os.environ.get('OPENAI_API_KEY', '')
pinecone_key = os.environ.get('PINECONE_KEY', '')

def my_hash(s):
    # Return the MD5 hash of the input string as a hexadecimal string
    return hashlib.md5(s.encode()).hexdigest()

class DocumentInputRequest(BaseModel):
    # define input to /document/ingest
    ...

class DocumentInputResponse(BaseModel):
    # define output from /document/ingest
    ...

class DocumentRetrieveRequest(BaseModel):
    # define input to /document/retrieve
    ...

class DocumentRetrieveResponse(BaseModel):
    # define output from /document/retrieve
    ...

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000)  # port number assumed
For the full file, be sure to check out the code repository for this
book!
With all of these moving parts, let’s take a look at our final
system architecture in Figure 2.10.
Performance
I’ve outlined a solution to the problem of semantic search, but I
want to also talk about how to test how these different
components work together. For this, let’s use a well-known
dataset to run against: the BoolQ dataset, a question-answering dataset for yes/no questions containing nearly 16K examples. This dataset has pairs of (question, passage) indicating, for each question, the passage that best answers it.
Table 2.2 outlines a few trials I ran and coded up in the code for
this book. I use combinations of embedders, re-ranking
solutions, and a bit of fine-tuning to try and see how well the
system performs on two fronts:
Note that the models I used for the cross-encoder and the bi-
encoder were both specifically pre-trained on data that is
similar to asymmetric semantic search. This is important
because we want the embedder to produce vectors for both
short queries and long documents and place them near each
other when they are related.
We have a few components in play, and not all of them are free. Fortunately, FastAPI is an open-source framework and does not require any licensing fees. Our cost with FastAPI is hosting, which could be on a free tier depending on what service we use. I like Render, which has a free tier, but its pricing starts at $7/month for 100% uptime. At the time of writing, Pinecone offers a free tier with a limit of 100,000 embeddings and up to 3 indexes, but beyond that, they charge based on the number of embeddings and indexes used. Their Standard plan charges $49/month for up to 1 million embeddings and 10 indexes.
FastAPI Cost = $7
These costs can quickly add up as the system scales, and it may
be worth exploring open-source alternatives or other strategies
to reduce costs - like using open-source bi-encoders for
embedding or Pgvector as your vector database.
Summary
Stay tuned for our next chapter where we will build on this API
with a chatbot built using GPT-4 and our retrieval system.
3
Introduction
Prompt Engineering
By the end of this chapter, you will have the skills and
knowledge needed to create powerful LLM-based applications
that leverage the full potential of these cutting-edge models.
Just Ask
The first and most important rule of prompt engineering for
instruction aligned language models is to be clear and direct in
what you are asking for. When we give an LLM a task to
complete, we want to make sure that we are communicating
that task as clearly as possible. This is especially true for simple
tasks that are straightforward for the LLM to accomplish.
Figure 3.3 The best way to get started with an LLM aligned to
answer queries from humans is to simply ask.
Note
Many figures are screenshots of an LLM’s
playground. Experimenting with prompt formats
in the playground or via an online interface can
help identify effective approaches, which can then
be tested more rigorously using larger data batches
and the code/API for optimal output.
Figure 3.4 This more fleshed-out version of our "just ask" prompt has three components: a clear and concise set of instructions, our input prefixed by an explanatory label, and a prefix for our output followed by a colon and no further whitespace.
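The exact figure isn't reproduced here, but a prompt in that shape, using translation as an illustrative task, would read:

Translate the following English text to Turkish.

English: Where is the nearest restaurant?
Turkish: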
Few-shot Learning
Output Structuring
Prompting Personas
Specific word choices in our prompts can greatly influence the
output of the model. Even small changes to the prompt can lead
to vastly different results. For example, adding or removing a
single word can cause the LLM to shift its focus or change its
interpretation of the task. In some cases, this may result in
incorrect or irrelevant responses, while in other cases, it may
produce the exact output desired.
Personas may not always be used for positive purposes. Just like any tool or technology, some people may use LLMs to evoke harmful messages, such as asking an LLM to imitate an anti-Semite, as in the previous figure. By feeding LLMs prompts that promote hate speech or other harmful content, individuals can generate text that perpetuates harmful ideas and reinforces negative stereotypes. Creators of LLMs tend to take steps to mitigate this potential misuse, such as implementing content filters and working with human moderators to review the output of the model. Individuals who use LLMs must also be responsible and ethical and consider the potential impact of their actions (or the actions the LLM takes on their behalf) on others.
ChatGPT
Cohere
We’ve already seen Cohere’s command series of models in action earlier in this chapter. As an alternative to OpenAI, Cohere is a good way to show that prompts cannot always be simply ported over from one model to another; usually we need to alter the prompt slightly to allow another LLM to do its work.
Note
Figure 3.12 A 10,000 foot view of our chatbot that uses ChatGPT
to provide a conversational interface in front of our semantic
search API.
To dig into it one step deeper, Figure 3.13 shows how this will
work at the prompt level, step by step:
Figure 3.13 Starting from the top left and reading left to right,
these four states represent how our bot is architected. Every time
a user says something that surfaces a confident document from
our knowledge base, that document is inserted directly into the
system prompt where we tell ChatGPT to only use documents
from our knowledge base.
Let’s wrap all of this logic into a Python class that will have a
skeleton like in Listing 3.1.
Listing 3.1 A ChatGPT Q/A bot
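The listing itself isn't reproduced in this excerpt. A rough skeleton of such a class, assuming the pre-1.0 openai ChatCompletion API and a retrieval helper (here called query_knowledge_base, a stand-in for the Chapter 2 semantic search API), could look like this:

import openai

class ChatbotGPT:
    """A ChatGPT-based Q/A bot grounded in documents from our knowledge base."""

    def __init__(self, system_prompt, threshold=0.8):
        self.conversation = [{"role": "system", "content": system_prompt}]
        self.threshold = threshold  # minimum confidence for injecting a document

    def display_conversation(self):
        for turn in self.conversation:
            print(f"{turn['role']}: {turn['content']}")

    def user_turn(self, message):
        self.conversation.append({"role": "user", "content": message})
        # Retrieve the best-matching document from the semantic search API (helper assumed)
        best_result = query_knowledge_base(message)
        if best_result and best_result["score"] >= self.threshold:
            # Inject the confident document into the system prompt
            self.conversation[0]["content"] += f"\n\nFrom the knowledge base: {best_result['text']}"
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=self.conversation,
        )
        assistant_message = response["choices"][0]["message"]["content"]
        self.conversation.append({"role": "assistant", "content": assistant_message})
        return assistant_message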
Summary
Introduction
Let’s introduce our first case study. We will be working with the
amazon_reviews_multi dataset (previewed in Figure 4.3). This
dataset is a collection of product reviews from Amazon,
spanning multiple product categories and languages (English,
Japanese, German, French, Chinese and Spanish). Each review
in the dataset is accompanied by a rating on a scale of 1 to 5
stars, with 1 being the lowest and 5 being the highest. The goal
of this case study is to fine-tune a pre-trained model from
OpenAI to perform sentiment classification on these reviews,
enabling it to predict the number of stars given to a review.
Taking a page out of my own book (albeit one just a few pages ago), let’s start by taking a look at the data.
Figure 4.3 A snippet of the amazon_reviews_multi dataset shows our input context (review titles and bodies) and our response, the thing we are trying to predict: the number of stars the review received (1-5).
Our goal will be to use the context of the title and body of the
review and predict the rating that was given.
Guidelines and Best Practices for Data
I should note that for our input data, I have concatenated the title and the body of the review into a single input. This was a personal choice, and I made it because I believe the title can have more direct language indicating general sentiment, while the body likely has more nuanced language to pinpoint the exact number of stars. Feel free to explore different ways of combining text fields! We are going to explore this further in later case studies, along with other ways of formatting fields for a single text input.
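For reference, a hedged sketch of what that concatenation might look like when building OpenAI-style prompt/completion pairs. The separator conventions here are my own assumptions; the column names follow the amazon_reviews_multi schema:

def prepare_df_for_openai(df):
    # Concatenate the review title and body into a single prompt, and use the
    # star rating as the completion. Separator/suffix conventions are assumed.
    df = df.copy()
    df["prompt"] = df["review_title"] + "\n\n" + df["review_body"] + "\n\n###\n\n"
    df["completion"] = " " + df["stars"].astype(str)
    return df[["prompt", "completion"]].drop_duplicates()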
Figure 4.5 Unshuffled data makes for bad training data! It gives
the model room to overfit on specific batches of data and overall
lowers the quality of the responses. The top three graphs
represent a model trained on unshuffled training data and the
accuracy is horrible compared to a model trained on shuffled
data, seen in the bottom three graphs.
# Note: the DataFrame argument and output filename below are assumed
english_training_df = prepare_df_for_openai(training_df)

# export the prompts and completions to a JSONL file
english_training_df.to_json("amazon-english-full-train.jsonl", orient='records', lines=True)
To install the OpenAI CLI, you can use pip, the Python package manager. First, make sure you have Python 3.6 or later installed on your system. Then install the openai package, which ships with the CLI (pip install --upgrade openai).
Before you can use the OpenAI CLI, you need to configure it
with your API key. To do this, set the OPENAI_API_KEY
environment variable to your API key value. You can find your
API key in your OpenAI account dashboard.
OpenAI has done a lot of work to find optimal settings for most
cases, so we will lean on their recommendations for our first
attempt. The only thing we will change is to train for 1 epoch
instead of the default 4. We're doing this because we want to
see how the performance looks before investing too much time
and money. Experimenting with different values and using
techniques like grid search will help you find the optimal
hyperparameter settings for your task and dataset, but be
mindful that this process can be time-consuming and costly.
Let’s kick off our first fine-tuning! Listing 4.2 makes a call to
OpenAI to train an ada model (fastest, cheapest, weakest) for 1
epoch on our training and validation data.
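The listing itself isn't reproduced here; a minimal sketch of that call, using the pre-1.0 openai Python client, would be along these lines (the file IDs are placeholders returned by a prior file upload):

import openai

# training_file_id / validation_file_id are placeholders for IDs returned by
# openai.File.create(...) when uploading the JSONL files prepared earlier.
fine_tune = openai.FineTune.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="ada",   # fastest, cheapest, weakest base model
    n_epochs=1,    # a single epoch for this first, inexpensive attempt
)
print(fine_tune["id"])  # used to check on the job's status later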
Figure 4.6 Our model is performing pretty well after only one
epoch on de-duplicated shuffled training data
63% accuracy might sound low to you, but hear me out: predicting the exact number of stars is tricky because people aren’t always consistent in what they write and how they ultimately rate the product, so I’ll offer two more metrics:
import math

# Select a random prompt from the test dataset
prompt = english_test_df['prompt'].sample(1).iloc[0]
Output:
Prompt:
Great pieces of jewelry for the price
Predicted Star: 4
Probabilities:
4: 0.9831
5: 0.0165
3: 0.0002
2: 0.0001
1: 0.0001
Summary
Introduction
There are different ways to phrase this attack text, but the above method is on the simpler side. Using this method of prompt injection, one could potentially steal the prompt of a popular application using a popular LLM and create a clone with nearly identical quality of responses. There are already websites out there that document prompts that popular companies use (which we won’t link to, out of respect), so this issue is already on the rise.
Input/Output Validation
'''
{'sequence': " What do you mean you can't access ...",
 'labels': ['offensive', 'safe'],
 'scores': [0.7064529657363892, 0.000636537268292...]}
'''
'''
{'sequence': ' Absolutely! I can help you get int...',
 'labels': ['safe', 'offensive'],
 'scores': [0.36239179968833923, 0.02562042325735...]}
'''
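Output of that shape typically comes from a zero-shot classification pipeline used as an input/output validator. A minimal sketch, assuming Facebook's BART MNLI checkpoint (the same family used earlier), looks like this:

from transformers import pipeline

# Zero-shot classifier used to screen LLM inputs/outputs for unsafe content
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def screen_text(text, labels=("offensive", "safe")):
    result = classifier(text, list(labels), multi_label=True)
    return dict(zip(result["labels"], result["scores"]))

print(screen_text("What do you mean you can't access my account?"))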
Batch Prompting
Prompt Chaining
Figure 5.6 A two-prompt chain where the first call to the LLM asks the model to describe the email sender’s emotional state and the second call takes in the whole context from the first call and asks the LLM to respond to the email with interest. The resulting email is more attuned to Charles’s emotional state.
2. The second call to the LLM asks for the response but now has
insight into how the other person is feeling and can write a
more empathetic and appropriate response.
The original prompt sees the attack input text and outputs the prompt, which would be unfortunate; however, the second call to the LLM generates the output actually seen by the user, and that output no longer contains the original prompt.
You can also use output sanitization to ensure that your LLM
outputs are free from injection attacks. For example, you can
use regular expressions or other validation criteria like the
Levenshtein distance or some semantic model to check that the
output of the model is not too similar to the prompt and block
any output that does not conform to that criteria from reaching
the end-user.
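As a rough illustration of that idea, here is a small check using Python's standard-library difflib as a stand-in for a Levenshtein or semantic similarity measure; the threshold value is an arbitrary assumption you would tune for your application:

from difflib import SequenceMatcher

def output_leaks_prompt(output, prompt, threshold=0.8):
    # Block the response if it is too similar to (i.e., likely contains) the system prompt
    similarity = SequenceMatcher(None, output.lower(), prompt.lower()).ratio()
    return similarity >= threshold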
This prompt has at least a dozen different tasks for the LLM, ranging from writing an entire marketing plan to outlining potential concerns from key stakeholders. This is likely too much for the LLM to do in one shot.
That’s not to say the marketing plan itself wasn’t alright. It was
a bit generic but it hit most of the key points we asked it to. The
problem was that when we ask too much of an LLM, it often
simply starts to select which tasks to solve and ignores the
others.
In extreme cases, prompt stuffing can arise when a user fills the LLM’s input token limit with too much information in the hopes that the LLM will simply “figure it out,” which can lead to incorrect or incomplete responses or hallucinations of facts. As an example of reaching the token limit, if we want an LLM to output a SQL statement to query a database given the database’s structure and a natural language query, that request could quickly reach the input limit if the database has many tables and fields.
There are a few ways to try and avoid the problem of prompt
stuffing. First and foremost, it is important to be concise and
specific in the prompt and only include the necessary
information for the LLM. This allows the LLM to focus on the
specific task at hand and produce more accurate results that
address all the points you want it to. Additionally, we can implement chaining to break up the multi-task workflow into multiple prompts (as shown in Figure 5.9). We could, for example, have one prompt generate the marketing plan, and then use that plan as input to ask the LLM to identify key people, and so on.
Figure 5.9 A potential workflow of chained prompts would have
one prompt generate the plan, another generate the stakeholders,
and a final prompt to create ways to address those concerns.
Figure 5.10 visualizes this example. The full code for this
example can be found in our code repository.
Figure 5.10 Our multimodal prompt chain - starting with a user
in the top left submitting an image - uses 4 LLMs (3 open source
and Cohere) to take in an image, caption it, categorize it, come up
with follow up questions, and answer them with a given
confidence.
Example—Basic Arithmetic
More recent LLMs like ChatGPT and GPT-4 are more likely than
their predecessors to output chains of thought even without
being prompted to. Figure 5.11 shows the same exact prompt in
GPT-3 and ChatGPT.
Figure 5.11 (Top) A basic arithmetic question with multiple choice proves to be too difficult for DaVinci. (Middle) When we ask DaVinci to first think about the question by adding “Reason through step by step” at the end of the prompt, we are using a “chain of thought” prompt, and it gets it right! (Bottom) ChatGPT and GPT-4 don’t need to be told to reason through the problem because they are already aligned to think through the chain of thought.
Note how the dataset includes << >> markers for equations, just like how ChatGPT and GPT-4 do it. This is because they were in part trained using similar datasets with similar notation.
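For instance, the worked answers in that dataset wrap each intermediate calculation in this calculator-style notation, roughly like:

She sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.
She makes 9 * 2 = $<<9*2=18>>18 every day at the farmers' market.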
ChatGPT (gpt-3.5-turbo)
DaVinci (text-davinci-003)
Cohere (command-xlarge-nightly)
'''
Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
--------------
Figure 5.18 shows what this new prompt would look like.
Figure 5.18 This third variant selects the most semantically similar examples from the training set. We can see that our examples are also about Easter egg hunting.
Things are looking good, but let me ask one more question to
really be rigorous.
Conclusion
Happy Prompting!
6
Introduction
import numpy as np
# 90/10 train/test split of the shuffled ratings (the left-hand names are assumed)
rating_train_df, rating_test_df = np.split(rating_complete.sample(frac=1),
                                           [int(.9 * len(rating_complete))])
With our data loaded up and split, let’s take some time to better
define what we are actually trying to solve.
relevant_animes = []
for each promoted_anime in promoted_animes:
    add the k animes to relevant_animes with the highest cosine similarity to promoted_anime
The GitHub repository has the full code to run this step, with examples too!
For example, given k=3 and user id 205282 , the result of step
two would result in the following dictionary where each key
represents a different embedding model used and the values
are anime title ids and corresponding cosine similarity scores to
promoted titles the user liked:
final_relevant_animes = {
    'text-embedding-ada-002': { '6351': 0.921, ... },
    'paraphrase-distilroberta-base-v1': { '17835': ..., ... },
}
If the rating in the testing set for the user and the recommended anime was 9 or 10, the anime is considered a “Promoter” and the system receives +1 point.
import re
import string

def clean_text(text):
    # Remove non-printable characters
    text = ''.join(filter(lambda x: x in string.printable, text))
    # Replace multiple whitespace characters with a single space
    text = re.sub(r'\s{2,}', ' ', text).strip()
    return text.strip()

def get_anime_description(anime_row):
    """
    Generates a custom description for an anime title from its row in the dataset.
    """
    ...
    description = (
        f"{anime_row['Name']} is a {anime_type}.\n"
        ...  # NOTE: I am omitting over a dozen other rows here for brevity
        f"Its genres are {anime_row['Genres']}\n"
    )
    return clean_text(description)
Next, we find the total number of distinct people who like either
Anime A or Anime B. Here, we have Alice, Bob, Carol, David,
Ethan, and Frank.
Now, we can calculate the Jaccard similarity by dividing the
number of common elements (2, as Bob and Carol like both
shows) by the total number of distinct elements (6, as there are
6 unique people in total).
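In code, that calculation is just the size of the intersection divided by the size of the union of the two fan sets:

# Worked example of the Jaccard similarity between fans of two shows
anime_a_fans = {"Alice", "Bob", "Carol"}
anime_b_fans = {"Bob", "Carol", "David", "Ethan", "Frank"}

jaccard = len(anime_a_fans & anime_b_fans) / len(anime_a_fans | anime_b_fans)
print(jaccard)  # 2 shared fans / 6 distinct fans = 0.333...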
• For what it’s worth, I did initially try this process with an
embedding size of 512 and got worse results while taking about
20% longer on my machine.
...
Summary of Results
Figure 6.9 shows our final results for our four embedder
candidates across lengthening recommendation windows (how
many recommendations we show the user).
Each tick on the x-axis represents showing the user a list of that many anime titles, and the y-axis is an aggregated score for the embedder using the scoring system outlined before, where we further reward the model if a correct recommendation was placed closer to the front of the list and likewise punish it more if it recommends something the user is a detractor for closer to the beginning of the list.
The model might require more than 384 tokens to capture all
possible relationships.
All models start to degrade in performance as they are expected to recommend more and more titles, which is fair: the more titles anything recommends, the less confident it will be as it goes down the list.
Exploring Exploration
Honestly, that last one is a bit pie in the sky and would really work best if we could also combine it with some chain-of-thought prompting on a different LLM. But still, this is a big question, and sometimes that means we need big ideas and big answers. So I leave it to you now; go have big ideas!
Conclusion
Introduction
We should note that when we use ViT, we should try to use the
same image preprocessing steps that it used during pre-training
so that the model has an easier time learning the new image
sets. This is not strictly necessary and has its pros and cons.
We will re-use the ViT image preprocessor for now. Figure 7.2 shows a sample of an image before preprocessing and the same image after it has gone through ViT’s standard preprocessing steps.
Figure 7.2 Image systems like the Vision Transformer (ViT) generally have to standardize images to a set format with pre-defined normalization steps so that each image is processed as fairly and consistently as possible. For some images (like the downed tree in the top row), the preprocessing takes away context for the sake of standardization across all images.
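A hedged sketch of re-using ViT's preprocessing via the Transformers library (the checkpoint name here is the standard base ViT and is my assumption; older versions expose the same functionality as ViTFeatureExtractor):

from PIL import Image
from transformers import ViTImageProcessor

# Re-use the resizing/normalization that ViT saw during its pre-training
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # placeholder path
inputs = image_processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224])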
When we feed our text and image inputs into their respective
models (DistilBERT and Vision Transformer), they produce
output tensors that contain useful feature representations of
the inputs. However, these features are not necessarily in the
same format, and they may have different dimensionalities.
But how will GPT-2 accept these inputs from the encoding
models? The answer to that is a type of attention mechanism
known as cross-attention.
Figure 7.5 Our VQA system needs to fuse the encoded knowledge from the image and text encoders and pass that fusion to the GPT-2 model via the cross-attention mechanism, which takes the fused key and value vectors from the image and text encoders (see Figure 7.4 for more on that) and passes them on to our GPT-2 decoder to scale its own attention calculations.
# 768, 768, 768  (the hidden state sizes of the text encoder, image encoder, and decoder)
In our case, all models have the same hidden state size so in
theory we don’t need to project anything but it is still good
practice to include projection layers so that the model has a
trainable layer that translates our text/image representations
into something more meaningful for the decoder.
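Concretely, those projection layers can be as simple as single linear layers mapping each encoder's hidden size into the decoder's hidden size (the 768 dimensions below mirror the sizes reported above; this is a sketch, not the book's exact module):

import torch.nn as nn

# Trainable projections from each encoder's hidden states into the decoder's space.
# All three hidden sizes happen to be 768 here, but the layers remain useful as
# learnable "translators" between representation spaces.
text_projection = nn.Linear(768, 768)
image_projection = nn.Linear(768, 768)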
Before getting deeper into the code, I should note that not all of the code that powers this example is in these pages, but all of it lives in the notebooks on GitHub. I highly recommend following along using both!
class MultiModalModel(nn.Module):
    ...
                param.requires_grad = False
        if freeze in ('decoder', 'all'):
            ...
            for name, param in self.decoder.named_parameters():
                if "crossattention" not in name:
                    param.requires_grad = False
    ...
        return self.image_projection(image_encoder_outputs)

    # Forward pass: encode text and image, combine them, and decode
    def forward(self, input_text, input_image, decoder_input_ids):
        # Check decoder input for NaN or infinite values
        self.check_input(decoder_input_ids, "decoder_input_ids")
        ...
With a model defined and properly adjusted for cross-attention,
let’s take a look at the data that will power our engine.
Our Data—Visual QA
Listing 7.3 shows a function I wrote to parse the image files and
creates a dataset that we can use with HuggingFace’s Trainer
object.
    data = []
    images_used = defaultdict(int)

    # Create a dictionary to map question_id to the annotation
    annotations_dict = {annotation["question_id"]: annotation for annotation in annotations}
    ...
            data.append(
                {
                    "image_id": image_id,
                    "question_id": question_id,
                    "question": question["question"],
                    "answer": decoder_tokenizer.bos_token + ' ' + answer,  # exact concatenation assumed
                    "all_answers": all_answers,
                    "image": image,
                }
            )
            ...
            # Break the loop if the max_images limit is reached
    ...
    return data
model = MultiModalModel(
    image_encoder_model=IMAGE_ENCODER_MODEL,
    text_encoder_model=TEXT_ENCODER_MODEL,
    decoder_model=DECODER_MODEL,
    freeze='nothing'
)

# HuggingFace Trainer setup (most arguments omitted here)
trainer = Trainer(
    ...
    data_collator=data_collator
)
Summary of Results
Let’s step away from the idea of pure language modeling and
image processing for just a moment and step into the world of a
novel way of fine-tuning language models using its powerful
cousin - reinforcement learning.
A reward model has to take in the output of an LLM (in our case, a sequence of text) and return a scalar (single number) reward that numerically represents feedback on the output. This feedback can come from an actual human, which would be very slow to run, or it could come from another language model or even a more complicated system that ranks potential model outputs and converts those rankings into rewards. As long as we are assigning a scalar reward for each output, it is a viable reward system.
texts = [
'The Eiffel Tower in Paris is the tallest str
'This is a bad book',
'this is a bad books'
]
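As a minimal illustration of a scalar reward, we can score texts like the ones above with an off-the-shelf sentiment classifier and treat one class's probability as the reward. The checkpoint and weighting here are my own assumptions for the sketch, not the book's exact setup:

from transformers import pipeline

# Any model that maps a piece of text to a single number can serve as a reward signal.
sentiment = pipeline("text-classification",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def reward_fn(text):
    result = sentiment(text)[0]
    score = result["score"]
    # Use the positive-class probability as the scalar reward
    return score if result["label"] == "POSITIVE" else 1.0 - score

rewards = [reward_fn(t) for t in texts]
print(rewards)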
The TRL library supports both pure decoder models like GPT-2 and GPT-Neo (more on those in the next chapter) as well as sequence-to-sequence models like FLAN-T5. All models can be optimized using what is known as Proximal Policy Optimization (PPO). Honestly, I won’t go into how it works in this book, but it’s definitely something for you to look up if you’re curious. TRL also has many examples on its GitHub page if you want to see even more applications.
Figure 7.11 shows the high level process of our (for now)
simplified RLF loop.
Figure 7.11 Our first Reinforcement Learning from Feedback
loop has our pre-trained LLM (FLAN-T5) learning from a pre-
curated dataset and a pre-built reward system. In the next
chapter, we will see this loop performed with much more
customization and rigor.
Let’s jump into defining our training loop with some code to
really see some results here.
b. Our “current” model, which will get updated after every batch of data
2. Grab a batch of data from a source (in our case, we will use a corpus of news articles I found on HuggingFace)
5. TRL updates the “current” model from the batch of data, logs anything to a reporting system (I like the free Weights & Biases platform), and starts over from the beginning of the steps!
game_data["response"] = [flan_t5_tokenize
# Calculate rewards from the cleaned resp
game_data["clean_response"] = [flan_t5_to
game_data['cola_scores'] = get_cola_score
game_data['neutral_scores'] = get_sentime
rewards = game_data['neutral_scores']
transposed_lists = zip(game_data['cola_sc
# Calculate the averages for each index
rewards = [1 * values[0] + 0.5 * values
rewards = [torch.tensor([_]) for _ in rew
Summary of Results
Figure 7.13 shows how rewards were given over the training loop of 2 epochs. We can see that as the system progressed, we were giving out more rewards, which is generally a good sign. I should note that the rewards started out pretty high, so FLAN-T5 was already giving relatively neutral and readable responses, and I would not expect drastic changes in the summaries.
Figure 7.13 Our system is giving out more rewards as training
progresses (the graph is smoothed to see the overall movement).
Conclusion