Fine-Tuning Models

Fine-tuning OpenAI's text generation models enhances their performance and efficiency by allowing training on more examples than can fit in a prompt, resulting in higher quality outputs and reduced costs. The process involves preparing training data, training a new model, evaluating results, and iterating as necessary. Common use cases for fine-tuning include improving style and tone, handling edge cases, and performing complex tasks, with specific guidelines provided for dataset preparation and job creation.


Fine-tuning

Fine-tune models for better results and efficiency.


Fine-tuning lets you get more out of the models available through the API by
providing:

Higher quality results than prompting
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests

OpenAI's text generation models have been pre-trained on a vast amount of text. To
use the models effectively, we include instructions and sometimes several examples
in a prompt. Using demonstrations to show how to perform a task is often called
"few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than
can fit in the prompt, letting you achieve better results on a wide number of
tasks. Once a model has been fine-tuned, you won't need to provide as many examples
in the prompt. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

Prepare and upload training data
Train a new fine-tuned model
Evaluate results and go back to step 1 if needed
Use your fine-tuned model
Visit our pricing page to learn more about how fine-tuned model training and usage
are billed.

Which models can be fine-tuned?


Most of our recent models can be fine-tuned. Refer to the model docs for the
model you would like to fine-tune to see whether it supports this feature.

When to use fine-tuning


Fine-tuning OpenAI text generation models can make them better for specific
applications, but it requires a careful investment of time and effort. We recommend
first attempting to get good results with prompt engineering, prompt chaining
(breaking complex tasks into multiple prompts), and function calling, with the key
reasons being:

There are many tasks at which our models may not initially appear to perform well,
but results can be improved with the right prompts - thus fine-tuning may not be
necessary
Iterating over prompts and other tactics has a much faster feedback loop than
iterating with fine-tuning, which requires creating datasets and running training
jobs
In cases where fine-tuning is still necessary, initial prompt engineering work is
not wasted - we typically see best results when using a good prompt in the fine-
tuning data (or combining prompt chaining / tool use with fine-tuning)
Our prompt engineering guide provides a background on some of the most effective
strategies and tactics for getting better performance without fine-tuning. You may
find it helpful to iterate quickly on prompts in our playground.

Common use cases


Some common use cases where fine-tuning can improve results:

Setting the style, tone, format, or other qualitative aspects
Improving reliability at producing a desired output
Correcting failures to follow complex prompts
Handling many edge cases in specific ways
Performing a new skill or task that’s hard to articulate in a prompt
One high-level way to think about these cases is when it’s easier to "show, not
tell". In the sections to come, we will explore how to set up data for fine-tuning
and various examples where fine-tuning improves the performance over the baseline
model.

Another scenario where fine-tuning is effective is reducing cost and/or latency by
replacing a more expensive model like gpt-4o with a fine-tuned gpt-4o-mini model.
If you can achieve good results with gpt-4o, you can often reach similar quality
with a fine-tuned gpt-4o-mini model by fine-tuning on the gpt-4o completions,
possibly with a shortened instruction prompt.
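
As a hedged sketch of that distillation workflow (the prompts, model names, and
output file name below are placeholders, not a prescribed pipeline), you could
collect gpt-4o completions and save them in the chat fine-tuning format:

from openai import OpenAI
import json

client = OpenAI()

# Production-style inputs you want the smaller model to learn; placeholders here.
prompts = [
    "Summarize this support ticket in one sentence: ...",
    "Classify the sentiment of this review: ...",
]

with open("distilled_train.jsonl", "w") as f:
    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        # Generate the "teacher" completion with the larger model.
        completion = client.chat.completions.create(model="gpt-4o", messages=messages)
        messages.append({
            "role": "assistant",
            "content": completion.choices[0].message.content,
        })
        # Each line becomes one training example for the gpt-4o-mini job.
        f.write(json.dumps({"messages": messages}) + "\n")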

Preparing your dataset


Once you have determined that fine-tuning is the right solution (i.e. you’ve
optimized your prompt as far as it can take you and identified problems that the
model still has), you’ll need to prepare data for training the model. You should
create a diverse set of demonstration conversations that are similar to the
conversations you will ask the model to respond to at inference time in production.

Each example in the dataset should be a conversation in the same format as our Chat
Completions API, specifically a list of messages where each message has a role,
content, and optional name. At least some of the training examples should directly
target cases where the prompted model is not behaving as desired, and the provided
assistant messages in the data should be the ideal responses you want the model to
provide.

Example format
In this example, our goal is to create a chatbot that occasionally gives sarcastic
responses. Here are three training examples (conversations) we could create for such
a dataset:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also


sarcastic."}, {"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "Paris, as if everyone doesn't know that
already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
{"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever
heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"},
{"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few,
like that really matters."}]}
Multi-turn chat examples
Examples in the chat format can have multiple messages with the assistant role. The
default behavior during fine-tuning is to train on all assistant messages within a
single example. To skip fine-tuning on specific assistant messages, a weight key
can be added to disable fine-tuning on that message, allowing you to control which
assistant messages are learned. The allowed values for weight are currently 0 or 1.
Some examples using weight for the chat format are below.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also


sarcastic."}, {"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content":
"Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if
everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
{"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role":
"user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content":
"Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"},
{"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role":
"user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content":
"Around 384,400 kilometers. Give or take a few, like that really matters.",
"weight": 1}]}
Crafting prompts
We generally recommend taking the set of instructions and prompts that you found
worked best for the model prior to fine-tuning, and including them in every
training example. This should let you reach the best and most general results,
especially if you have relatively few (e.g. under a hundred) training examples.

If you would like to shorten the instructions or prompts that are repeated in every
example to save costs, keep in mind that the model will likely behave as if those
instructions were included, and it may be hard to get the model to ignore those
"baked-in" instructions at inference time.

It may take more training examples to arrive at good results, as the model has to
learn entirely through demonstration and without guided instructions.

Example count recommendations


To fine-tune a model, you are required to provide at least 10 examples. We
typically see clear improvements from fine-tuning on 50 to 100 training examples
with gpt-4o-mini and gpt-3.5-turbo, but the right number varies greatly based on
the exact use case.

We recommend starting with 50 well-crafted demonstrations and seeing if the model
shows signs of improvement after fine-tuning. In some cases that may be sufficient,
but even if the model is not yet production quality, clear improvements are a good
sign that providing more data will continue to improve the model. No improvement
suggests that you may need to rethink how to set up the task for the model or
restructure the data before scaling beyond a limited example set.

Train and test splits


After collecting the initial dataset, we recommend splitting it into a training and
test portion. When submitting a fine-tuning job with both training and test files,
we will provide statistics on both during the course of training. These statistics
will be your initial signal of how much the model is improving. Additionally,
constructing a test set early on will be useful in making sure you are able to
evaluate the model after training, by generating samples on the test set.
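
For example, a minimal sketch of an 80/20 split over a JSONL dataset (the file
names and split ratio are illustrative, not required values):

import json
import random

with open("dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(examples)
split = int(0.8 * len(examples))

for path, subset in [("train.jsonl", examples[:split]), ("test.jsonl", examples[split:])]:
    with open(path, "w") as out:
        for example in subset:
            out.write(json.dumps(example) + "\n")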

Token limits
Token limits depend on the model you select. Here is an overview of the maximum
inference context length and training examples context length for our models:

Model                      Inference context length   Training examples context length
gpt-4.1-2025-04-14         128,000 tokens             65,536 tokens
gpt-4.1-mini-2025-04-14    128,000 tokens             65,536 tokens
gpt-4o-2024-08-06          128,000 tokens             65,536 tokens
gpt-4o-mini-2024-07-18     128,000 tokens             65,536 tokens
gpt-3.5-turbo-0125         16,385 tokens              16,385 tokens
gpt-3.5-turbo-1106         16,385 tokens              16,385 tokens
gpt-3.5-turbo-0613         16,385 tokens              4,096 tokens
Examples longer than the maximum will be truncated to the training context length,
which removes tokens from the end of the training example(s). To be sure that your
entire training example fits in context, consider checking that the total token
counts in the message contents are under the limit.

You can compute token counts using our counting tokens notebook from the OpenAI
cookbook.
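
If you prefer a quick local check, here is a rough sketch using the tiktoken
library; the o200k_base encoding and the per-message overhead are assumptions that
approximate the gpt-4o family, so treat the counts as estimates:

import json
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")

def num_tokens_in_example(messages, tokens_per_message=3):
    # Sum content tokens plus a rough per-message formatting overhead.
    total = 0
    for message in messages:
        total += tokens_per_message
        for value in message.values():
            if isinstance(value, str):
                total += len(encoding.encode(value))
    return total

with open("training_data.jsonl") as f:
    counts = [num_tokens_in_example(json.loads(line)["messages"]) for line in f]

print(f"Longest example: {max(counts)} tokens")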

Estimate costs
For detailed pricing on training costs, as well as input and output costs for a
deployed fine-tuned model, visit our pricing page. Note that we don't charge for
tokens used for training validation. To estimate the cost of a specific fine-tuning
training job, use the following formula:

(base training cost per 1M input tokens ÷ 1M) × number of tokens in the input file
× number of epochs trained

For a training file with 100,000 tokens trained over 3 epochs, the expected cost
would be:

~$0.90 USD with gpt-4o-mini-2024-07-18 after the free period ends on October 31,
2024.
~$2.40 USD with gpt-3.5-turbo-0125.
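
Applying the formula to the gpt-4o-mini example above (the $3.00 per 1M token rate
is implied by the figures quoted here, not a quoted price; check the pricing page
for current rates):

base_cost_per_1m_tokens = 3.00   # implied gpt-4o-mini training rate in USD
tokens_in_input_file = 100_000
n_epochs = 3

estimated_cost = (base_cost_per_1m_tokens / 1_000_000) * tokens_in_input_file * n_epochs
print(f"Estimated training cost: ${estimated_cost:.2f}")  # -> $0.90
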
Check data formatting
Once you have compiled a dataset and before you create a fine-tuning job, it is
important to check the data formatting. To do this, we created a simple Python
script which you can use to find potential errors, review token counts, and
estimate the cost of a fine-tuning job.
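
A minimal sketch of such a check (this is not OpenAI's official script; it only
verifies that each line is valid JSON in the text chat format):

import json

ALLOWED_ROLES = {"system", "user", "assistant"}

errors = []
with open("training_data.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: invalid JSON")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {i}: missing or empty 'messages' list")
            continue
        for message in messages:
            if message.get("role") not in ALLOWED_ROLES:
                errors.append(f"line {i}: unexpected role {message.get('role')!r}")
            if not isinstance(message.get("content"), str):
                errors.append(f"line {i}: message missing string content")

print("\n".join(errors) if errors else "No format errors found")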


Upload a training file


Once you have validated the data, the file needs to be uploaded using the Files API
in order to be used with a fine-tuning job:

Upload a training file


from openai import OpenAI
client = OpenAI()

client.files.create(
    file=open("mydata.jsonl", "rb"),
    purpose="fine-tune"
)
After you upload the file, it may take some time to process. While the file is
processing, you can still create a fine-tuning job but it will not start until the
file processing has completed.

Size limits and large uploads


The maximum upload size is 512 MB using the Files API. You can upload files up to 8
GB in multiple parts using the Uploads API. We recommend starting small, as you
don't need a lot of data to see improvements.
Create a fine-tuned model
After ensuring you have the right amount and structure for your dataset, and have
uploaded the file, the next step is to create a fine-tuning job. We support
creating fine-tuning jobs via the fine-tuning UI or programmatically.

To start a fine-tuning job using the OpenAI SDK:

Create a fine-tuning job


from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18"
)
In this example, model is the name of the model you want to fine-tune. Note that
only specific model snapshots (like gpt-4o-mini-2024-07-18 in this case) can be
used for this parameter, as listed in our supported models. The training_file
parameter is the file ID that was returned when the training file was uploaded to
the OpenAI API. You can customize your fine-tuned model's name using the suffix
parameter.

If you choose not to specify a method, the default is Supervised Fine-Tuning (SFT).

To set additional fine-tuning parameters like the validation_file or
hyperparameters, please refer to the API specification for fine-tuning.
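
As a sketch, the same job creation call with some of these optional parameters set
(the file IDs and suffix are placeholders):

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    validation_file="file-def456",  # optional held-out set used for validation metrics
    model="gpt-4o-mini-2024-07-18",
    suffix="my-experiment",         # becomes part of the fine-tuned model's name
    method={
        "type": "supervised",
        "supervised": {"hyperparameters": {"n_epochs": 3}},
    },
)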

After you've started a fine-tuning job, it may take some time to complete. Your job
may be queued behind other jobs in our system, and training a model can take
minutes or hours depending on the model and dataset size. After the model training
is completed, the user who created the fine-tuning job will receive an email
confirmation.

In addition to creating a fine-tuning job, you can also list existing jobs,
retrieve the status of a job, or cancel a job.

Work with fine-tuning jobs


from openai import OpenAI
client = OpenAI()

# List 10 fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the state of a fine-tune
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-abc123")

# List up to 10 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)

# Delete a fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
Use a fine-tuned model
When a job has succeeded, you will see the fine_tuned_model field populated with
the name of the model when you retrieve the job details. You may now specify this
model as a parameter in the Chat Completions API, and make requests to it using
the Playground.
After your job is completed, the model should be available right away for inference
use. In some cases, it may take several minutes for your model to become ready to
handle requests. If requests to your model time out or the model name cannot be
found, it is likely because your model is still being loaded. If this happens, try
again in a few minutes.

Use fine-tuned model


from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model="ft:gpt-4o-mini:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
You can start making requests by passing the model name as shown above and in our
GPT guide.
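
If you hit the loading window described above, a simple retry loop is one option;
the model name and backoff schedule in this sketch are placeholders:

import time
from openai import OpenAI, NotFoundError

client = OpenAI()

completion = None
for attempt in range(5):
    try:
        completion = client.chat.completions.create(
            model="ft:gpt-4o-mini:my-org:custom_suffix:id",
            messages=[{"role": "user", "content": "Hello!"}],
        )
        break
    except NotFoundError:
        # The model may still be loading; wait and try again.
        time.sleep(30 * (attempt + 1))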

Use a checkpointed model


In addition to creating a final fine-tuned model at the end of each fine-tuning
job, OpenAI will create one full model checkpoint for you at the end of each
training epoch. These checkpoints are themselves full models that can be used
within our completions and chat-completions endpoints. Checkpoints are useful as
they potentially provide a version of your fine-tuned model from before it
experienced overfitting.

To access these checkpoints,

Wait until a job succeeds, which you can verify by querying the status of a job.
Query the checkpoints endpoint with your fine-tuning job ID to access a list of
model checkpoints for the fine-tuning job.
For each checkpoint object, you will see the fine_tuned_model_checkpoint field
populated with the name of the model checkpoint. You may now use this model just
like you would with the final fine-tuned model.
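
For example, a short sketch using the checkpoints endpoint in the Python SDK (the
job ID is a placeholder):

from openai import OpenAI
client = OpenAI()

# List the saved checkpoints for a finished fine-tuning job.
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123")
for checkpoint in checkpoints.data:
    print(checkpoint.step_number, checkpoint.fine_tuned_model_checkpoint)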

{
    "object": "fine_tuning.job.checkpoint",
    "id": "ftckpt_zc4Q7MP6XxulcVzj4MZdwsAB",
    "created_at": 1519129973,
    "fine_tuned_model_checkpoint": "ft:gpt-3.5-turbo-0125:my-org:custom-suffix:96olL566:ckpt-step-2000",
    "metrics": {
        "full_valid_loss": 0.134,
        "full_valid_mean_token_accuracy": 0.874
    },
    "fine_tuning_job_id": "ftjob-abc123",
    "step_number": 2000
}
Each checkpoint will specify its:

step_number: The step at which the checkpoint was created (where each epoch is the
number of steps in the training set divided by the batch size)
metrics: An object containing the metrics for your fine-tuning job at the step when
the checkpoint was created.
Currently, only the checkpoints for the last 3 epochs of the job are saved and
available for use. We plan to release more complex and flexible checkpointing
strategies in the near future.

Analyzing your fine-tuned model


We provide the following training metrics computed over the course of training:

training loss
training token accuracy
valid loss
valid token accuracy
Valid loss and valid token accuracy are computed in two different ways - on a small
batch of the data during each step, and on the full valid split at the end of each
epoch. The full valid loss and full valid token accuracy metrics are the most
accurate measures of the overall performance of your model. These statistics
are meant to provide a sanity check that training went smoothly (loss should
decrease, token accuracy should increase). While an active fine-tuning job is
running, you can view an event object which contains some useful metrics:

{
    "object": "fine_tuning.job.event",
    "id": "ftevent-abc-123",
    "created_at": 1693582679,
    "level": "info",
    "message": "Step 300/300: training loss=0.15, validation loss=0.27, full validation loss=0.40",
    "data": {
        "step": 300,
        "train_loss": 0.14991648495197296,
        "valid_loss": 0.26569826706596045,
        "total_steps": 300,
        "full_valid_loss": 0.4032616495084362,
        "train_mean_token_accuracy": 0.9444444179534912,
        "valid_mean_token_accuracy": 0.9565217391304348,
        "full_valid_mean_token_accuracy": 0.9089635854341737
    },
    "type": "metrics"
}
After a fine-tuning job has finished, you can also see metrics around how the
training process went by querying a fine-tuning job, extracting a file ID from the
result_files, and then retrieving that file's content. Each results CSV file has the
following columns: step, train_loss, train_accuracy, valid_loss, and
valid_mean_token_accuracy.

step,train_loss,train_accuracy,valid_loss,valid_mean_token_accuracy
1,1.52347,0.0,,
2,0.57719,0.0,,
3,3.63525,0.0,,
4,1.72257,0.0,,
5,1.52379,0.0,,
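
A short sketch of fetching that CSV with the Python SDK (the job ID is a
placeholder):

from openai import OpenAI
client = OpenAI()

job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
result_file_id = job.result_files[0]
# Download the results CSV and print the first few rows.
content = client.files.content(result_file_id)
print(content.read().decode()[:500])
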
While metrics can be helpful, evaluating samples from the fine-tuned model provides
the most relevant sense of model quality. We recommend using the Evals product to
compare your base model to your fine-tuned model. Alternatively, you could manually
generate samples from both models on a test set, and compare the samples side by
side. The test set should ideally include the full distribution of inputs that you
might send to the model in a production use case.
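
A hedged sketch of the manual side-by-side approach (the model names and test file
are placeholders; in practice you would use the Evals product or log outputs for
human review):

import json
from openai import OpenAI

client = OpenAI()

models = ["gpt-4o-mini-2024-07-18", "ft:gpt-4o-mini:my-org:custom_suffix:id"]

with open("test.jsonl") as f:
    for line in f:
        messages = json.loads(line)["messages"]
        # Drop the reference assistant reply and regenerate it with each model.
        prompt = [m for m in messages if m["role"] != "assistant"]
        for model in models:
            completion = client.chat.completions.create(model=model, messages=prompt)
            print(f"[{model}] {completion.choices[0].message.content}")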

Iterating on data quality


If the results from a fine-tuning job are not as good as you expected, consider the
following ways to adjust the training dataset:

Collect examples to target remaining issues
If the model still isn’t good at certain aspects, add training examples that
directly show the model how to do these aspects correctly
Scrutinize existing examples for issues
If your model has grammar, logic, or style issues, check if your data has any of
the same issues. For instance, if the model now says "I will schedule this meeting
for you" (when it shouldn’t), see if existing examples teach the model to say it
can do new things that it can’t do
Consider the balance and diversity of data
If 60% of the assistant responses in the data say "I cannot answer this", but at
inference time only 5% of responses should say that, you will likely get an
overabundance of refusals
Make sure your training examples contain all of the information needed for the
response
If we want the model to compliment a user based on their personal traits and a
training example includes assistant compliments for traits not found in the
preceding conversation, the model may learn to hallucinate information
Look at the agreement / consistency in the training examples
If multiple people created the training data, it’s likely that model performance
will be limited by the level of agreement / consistency between people. For
instance, in a text extraction task, if people only agreed on 70% of extracted
snippets, the model would likely not be able to do better than this
Make sure all of your training examples are in the same format, as expected
for inference
Iterating on data quantity
Once you’re satisfied with the quality and distribution of the examples, you can
consider scaling up the number of training examples. This tends to help the model
learn the task better, especially around possible "edge cases". We expect a similar
amount of improvement every time you double the number of training examples. You
can loosely estimate the expected quality gain from increasing the training data
size by:

Fine-tuning on your current dataset
Fine-tuning on half of your current dataset
Observing the quality gap between the two
If you have to make a trade-off, a smaller amount of high-quality data is
generally more effective than a larger amount of low-quality data.

Iterating on hyperparameters
We allow you to specify the following hyperparameters:

epochs
learning rate multiplier
batch size
We recommend initially training without specifying any of these, allowing us to
pick a default for you based on dataset size, then adjusting if you observe the
following:

If the model does not follow the training data as much as expected, increase the
number of epochs by 1 or 2
This is more common for tasks for which there is a single ideal completion (or a
small set of ideal completions which are similar). Some examples include
classification, entity extraction, or structured parsing. These are often tasks for
which you can compute a final accuracy metric against a reference answer.
If the model becomes less diverse than expected, decrease the number of epochs by 1
or 2
This is more common for tasks for which there are a wide range of possible good
completions
If the model does not appear to be converging, increase the learning rate
multiplier
You can set the hyperparameters as shown below:

Setting hyperparameters
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {"n_epochs": 2},
        },
    },
)
Vision fine-tuning
Fine-tuning is also possible with images in your JSONL files. Just as you can send
one or many image inputs to Chat Completions, you can include those same message
types within your training data. Images can be provided either as HTTP URLs or data
URLs containing base64 encoded images.

Here's an example of an image message on a line of your JSONL file. Below, the JSON
object is expanded for readability, but typically this JSON would appear on a
single line in your data file:

{
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant that identifies uncommon cheeses."
        },
        {
            "role": "user",
            "content": "What is this cheese?"
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": "Danbo"
        }
    ]
}
Image dataset requirements
Size
Your training file can contain a maximum of 50,000 examples that contain images
(not including text examples).
Each example can have at most 10 images.
Each image can be at most 10 MB.
Format
Images must be JPEG, PNG, or WEBP format.
Your images must be in the RGB or RGBA image mode.
You cannot include images as output from messages with the assistant role.
Content moderation policy
We scan your images before training to ensure that they comply with our usage
policy. This may introduce latency in file validation before fine-tuning begins.

Images containing the following will be excluded from your dataset and not used for
training:

People
Faces
Children
CAPTCHAs
What to do if your images get skipped
Your images can get skipped for the following reasons:

contains CAPTCHAs, contains people, contains faces, contains children
Remove the image. For now, we cannot fine-tune models with images containing these
entities.
inaccessible URL
Ensure that the image URL is publicly accessible.
image too large
Please ensure that your images fall within our dataset size limits.
invalid image format
Please ensure that your images match our supported dataset formats.
Reducing training cost
If you set the detail parameter for an image to low, the image is resized to 512 by
512 pixels and is only represented by 85 tokens regardless of its size. This will
reduce the cost of training. See here for more information.

{
    "type": "image_url",
    "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg",
        "detail": "low"
    }
}
Other considerations for vision fine-tuning
To control the fidelity of image understanding, set the detail parameter of
image_url to low, high, or auto for each image. This will also affect the number of
tokens per image that the model sees during training time, and will affect the cost
of training. See here for more information.
Preference fine-tuning
Direct Preference Optimization (DPO) fine-tuning allows you to fine-tune models
based on prompts and pairs of responses. This approach enables the model to learn
from human preferences, optimizing for outputs that are more likely to be favored.
Note that we currently support text-only DPO fine-tuning.

Preparing your dataset for DPO


Each example in your dataset should contain:

A prompt, like a user message.


A preferred output (an ideal assistant response).
A non-preferred output (a suboptimal assistant response).
The data should be in JSONL format, with each line representing an
example in the following structure:

{
    "input": {
        "messages": [
            {
                "role": "user",
                "content": "Hello, can you tell me how cold San Francisco is today?"
            }
        ],
        "tools": [],
        "parallel_tool_calls": true
    },
    "preferred_output": [
        {
            "role": "assistant",
            "content": "Today in San Francisco, it is not quite as cold as expected. Morning clouds will give way to sunshine, with a high near 68°F (20°C) and a low around 57°F (14°C)."
        }
    ],
    "non_preferred_output": [
        {
            "role": "assistant",
            "content": "It is not particularly cold in San Francisco today."
        }
    ]
}
Currently, we only train on one-turn conversations for each example, where the
preferred and non-preferred messages need to be the last assistant message.

Stacking methods: Supervised + DPO


Currently, OpenAI offers Supervised Fine-Tuning (SFT) as the default method for
fine-tuning jobs. Performing SFT on your preferred responses (or a subset) before
running a DPO job can significantly enhance model alignment and
performance. By first fine-tuning the model on the desired responses, it can better
identify correct patterns, providing a strong foundation for DPO to refine
behavior.

A recommended workflow is as follows:

Fine-tune the base model with SFT using a subset of your preferred responses. Focus
on ensuring the data quality and representativeness of the tasks.
Use the SFT fine-tuned model as the starting point, and apply DPO to adjust the
model based on preference comparisons.
Configuring a DPO fine-tuning job
We have introduced a method field in the fine-tuning job creation endpoint, where
you can specify type as well as any associated hyperparameters. For DPO:

set the type parameter to dpo
optionally set the hyperparameters property with any options you'd like to
configure.
The beta hyperparameter is a new option that is only available for DPO. It's a
floating point number between 0 and 2 that controls how strictly the new model will
adhere to its previous behavior, versus aligning with the provided preferences. A
high number will be more conservative (favoring previous behavior), and a lower
number will be more aggressive (favoring the newly provided preferences more often).

You can also set this value to auto (the default) to use a value configured by the
platform.

The example below shows how to configure a DPO fine-tuning job using the OpenAI
SDK. For more information on creating fine-tuning jobs in general, please refer to
the section on creating a fine-tuned model above.

Create a fine-tuning job with DPO


import OpenAI from "openai";

const openai = new OpenAI();

const job = await openai.fineTuning.jobs.create({
    training_file: "file-all-about-the-weather",
    model: "gpt-4o-2024-08-06",
    method: {
        type: "dpo",
        dpo: {
            hyperparameters: { beta: 0.1 },
        },
    },
});
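
The equivalent call in Python (this mirrors the JavaScript example above; the
training file ID is a placeholder):

from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-all-about-the-weather",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": 0.1},
        },
    },
)
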
Fine-tuning examples
Now that we have explored the basics of the fine-tuning API, let’s look at going
through the fine-tuning lifecycle for a few different use cases.

Style and tone
Structured output
Tool calling
Function calling
Fine-tuning integrations
OpenAI provides the ability for you to integrate your fine-tuning jobs with 3rd
parties via our integration framework. Integrations generally allow you to track
job state, status, metrics, hyperparameters, and other job-related information in a
3rd party system. You can also use integrations to trigger actions in a 3rd party
system based on job state changes. Currently, the only supported integration is
with Weights and Biases, but more are coming soon.

Weights and Biases Integration


Weights and Biases (W&B) is a popular tool for tracking machine learning
experiments. You can use the OpenAI integration with W&B to track your fine-tuning
jobs in W&B. This integration will automatically log metrics, hyperparameters, and
other job-related information to the W&B project you specify.

To integrate your fine-tuning jobs with W&B, you'll need to

Provide authentication credentials for your Weights and Biases account to OpenAI
Configure the W&B integration when creating new fine-tuning jobs
Authenticate your Weights and Biases account with OpenAI
Authentication is done by submitting a valid W&B API key to OpenAI. Currently, this
can only be done via the Account Dashboard, and only by account administrators.
Your W&B API key will be stored encrypted within OpenAI and will allow OpenAI to
post metrics and metadata on your behalf to W&B when your fine-tuning jobs are
running. Attempting to enable a W&B integration on a fine-tuning job without first
authenticating your OpenAI organization with WandB will result in an error.

The Weights and Biases API key in the Account Dashboard


Enable the Weights and Biases integration
When creating a new fine-tuning job, you can enable the W&B integration by
including a new "wandb" integration under the integrations field in the job
creation request. This integration allows you to specify the W&B Project that you
wish the newly created W&B Run to show up under.

Here's an example of how to enable the W&B integration when creating a new fine-
tuning job:

curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini-2024-07-18",
    "training_file": "file-ABC123",
    "validation_file": "file-DEF456",
    "integrations": [
      {
        "type": "wandb",
        "wandb": {
          "project": "custom-wandb-project",
          "tags": ["project:tag", "lineage"]
        }
      }
    ]
  }' https://api.openai.com/v1/fine_tuning/jobs
By default, the Run ID and Run display name are the ID of your fine-tuning job
(e.g. ftjob-abc123). You can customize the display name of the run by including a
"name" field in the wandb object. You can also include a "tags" field in the wandb
object to add tags to the W&B Run (tags must be <= 64 character strings and there
is a maximum of 50 tags).

Sometimes it is convenient to explicitly set the W&B Entity to be associated with
the run. You can do this by including an "entity" field in the wandb object. If you
do not include an "entity" field, the W&B entity will default to the default W&B
entity associated with the API key you registered previously.

The full specification for the integration can be found in our fine-tuning job
creation documentation.

View your fine-tuning job in Weights and Biases


Once you've created a fine-tuning job with the W&B integration enabled, you can
view the job in W&B by navigating to the W&B project you specified in the job
creation request. Your run should be located at the URL:

https://wandb.ai/<WANDB-ENTITY>/<WANDB-PROJECT>/runs/ftjob-ABCDEF
You should see a new run with the name and tags you specified in the job creation
request. The Run Config will contain relevant job metadata such as:

model: The model you are fine-tuning
training_file: The ID of the training file
validation_file: The ID of the validation file
hyperparameters: The hyperparameters used for the job (e.g. n_epochs,
learning_rate_multiplier, batch_size)
seed: The random seed used for the job
Likewise, OpenAI will set some default tags on the run to make it easier for you
to search and filter. These tags will be prefixed with "openai/" and will include:

openai/fine-tuning: Tag to let you know this run is a fine-tuning job
openai/ft-abc123: The ID of the fine-tuning job
openai/gpt-4o-mini: The model you are fine-tuning
An example W&B run generated from an OpenAI fine-tuning job is shown below:

An example W&B run


Metrics for each step of the fine-tuning job will be logged to the W&B run. These
metrics are the same metrics provided in the fine-tuning job event object and are
the same metrics you can view via the OpenAI fine-tuning dashboard. You can use
W&B's visualization tools to track the progress of your fine-tuning job and compare
it to other fine-tuning jobs you've run.

An example of the metrics logged to a W&B run is shown below:

Charts for valid_loss, train_loss, valid_mean_token_accuracy, and
train_mean_token_accuracy
FAQ
Can I fine-tune models that support long context (1M+ tokens)?
You can fine-tune models that support long context windows, but note that during
training, the maximum context size for an individual input is 64k tokens. You
should also note that the context window size for fine-tuned models (even if they
otherwise could support long context) for inference (making an API request to
generate content from your fine-tuned model) is still 128k tokens.

When should I use fine-tuning vs embeddings / retrieval augmented generation?


Embeddings with retrieval is best suited for cases when you need to have a large
database of documents with relevant context and information.

By default, OpenAI’s models are trained to be helpful generalist assistants. Fine-
tuning can be used to make a model which is narrowly focused, and exhibits specific
ingrained behavior patterns. Retrieval strategies can be used to make new
information available to a model by providing it with relevant context before
generating its response. Retrieval strategies are not an alternative to fine-tuning
and can in fact be complementary to it.
