Fine-Tuning Models
Before committing to fine-tuning, we recommend first trying to get good results with prompt engineering, prompt chaining, and tool use. The main reasons are:
There are many tasks at which our models may not initially appear to perform well, but results can often be improved with the right prompts, so fine-tuning may not be necessary.
Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs.
In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted. We typically see the best results when using a good prompt in the fine-tuning data (or combining prompt chaining / tool use with fine-tuning).
Our prompt engineering guide provides a background on some of the most effective
strategies and tactics for getting better performance without fine-tuning. You may
find it helpful to iterate quickly on prompts in our playground.
Each example in the dataset should be a conversation in the same format as our Chat
Completions API, specifically a list of messages where each message has a role,
content, and optional name. At least some of the training examples should directly
target cases where the prompted model is not behaving as desired, and the provided
assistant messages in the data should be the ideal responses you want the model to
provide.
Example format
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. These are three training examples (conversations) we could create for such a dataset:
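Each conversation occupies one line of the JSONL training file; the system prompt and the specific questions and answers below are illustrative placeholders you would replace with your own:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}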
If you would like to shorten the instructions or prompts that are repeated in every
example to save costs, keep in mind that the model will likely behave as if those
instructions were included, and it may be hard to get the model to ignore those
"baked-in" instructions at inference time.
It may take more training examples to arrive at good results, as the model has to
learn entirely through demonstration and without guided instructions.
Token limits
Token limits depend on the model you select. Here is an overview of the maximum inference context length and training example context length for our models:
You can compute token counts using our counting tokens notebook from the OpenAI
cookbook.
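As a rough local approximation, here is a minimal sketch using the tiktoken library (the encoding name and per-message overhead follow the cookbook's approach and are approximations, not exact billing counts):
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    # Approximate the token count of one chat example.
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with an assistant header
    return num_tokens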
Estimate costs
For detailed pricing on training costs, as well as input and output costs for a
deployed fine-tuned model, visit our pricing page. Note that we don't charge for
tokens used for training validation. To estimate the cost of a specific fine-tuning
training job, use the following formula:
(base training cost per 1M input tokens ÷ 1M) × number of tokens in the input file
× number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost
would be:
~$0.90 USD with gpt-4o-mini-2024-07-18 after the free period ends on October 31,
2024.
~$2.40 USD with gpt-3.5-turbo-0125.
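The same arithmetic in code (the per-1M-token rate below is a placeholder that happens to reproduce the ~$0.90 example above; check the pricing page for current rates):
# Placeholder training rate in USD per 1M tokens; consult the pricing page for your model.
base_cost_per_1m_tokens = 3.00
tokens_in_training_file = 100_000
n_epochs = 3

estimated_cost = (base_cost_per_1m_tokens / 1_000_000) * tokens_in_training_file * n_epochs
print(f"Estimated training cost: ${estimated_cost:.2f}")  # -> $0.90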
Check data formatting
Once you have compiled a dataset and before you create a fine-tuning job, it is
important to check the data formatting. To do this, we created a simple Python
script which you can use to find potential errors, review token counts, and
estimate the cost of a fine-tuning job.
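The full script lives in the OpenAI cookbook; a minimal sketch of the kind of format checks it performs (the file name and error categories here are illustrative) might look like:
import json

def check_format(path="mydata.jsonl"):
    # Sketch: basic structural checks for a chat fine-tuning dataset.
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: not valid JSON")
                continue
            messages = example.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            for m in messages:
                if m.get("role") not in ("system", "user", "assistant", "tool"):
                    errors.append(f"line {i}: unexpected role {m.get('role')!r}")
            if not any(m.get("role") == "assistant" for m in messages):
                errors.append(f"line {i}: no assistant message to learn from")
    print("\n".join(errors) or "No errors found")

check_format()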
The snippet below creates a fine-tuning job that uses the DPO method (covered later in the Preference fine-tuning section of this guide):
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-all-about-the-weather",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": 0.1},
        },
    },
)
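To upload your training file in the first place, you can use the Files API (a minimal sketch; the file name is illustrative):
from openai import OpenAI

client = OpenAI()

# Upload the JSONL training data; the returned file ID is used when creating the job.
training_file = client.files.create(
    file=open("mydata.jsonl", "rb"),
    purpose="fine-tune",
)
print(training_file.id)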
After you upload the file, it may take some time to process. While the file is
processing, you can still create a fine-tuning job but it will not start until the
file processing has completed.
client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18"
)
In this example, model is the name of the model you want to fine-tune. Note that
only specific model snapshots (like gpt-4o-mini-2024-07-18 in this case) can be
used for this parameter, as listed in our supported models. The training_file
parameter is the file ID that was returned when the training file was uploaded to
the OpenAI API. You can customize your fine-tuned model's name using the suffix
parameter.
If you choose not to specify a method, the default is Supervised Fine-Tuning (SFT).
After you've started a fine-tuning job, it may take some time to complete. Your job
may be queued behind other jobs in our system, and training a model can take
minutes or hours depending on the model and dataset size. After the model training
is completed, the user who created the fine-tuning job will receive an email
confirmation.
In addition to creating a fine-tuning job, you can also list existing jobs,
retrieve the status of a job, or cancel a job.
# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-abc123")
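Listing and retrieving jobs work the same way, for example (job ID is a placeholder; assumes the same OpenAI client as above):
# List the 10 most recent fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the state of a specific job
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# List up to 10 events from a job to monitor progress
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)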
completion = client.chat.completions.create(
    model="ft:gpt-4o-mini:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
You can start making requests by passing the model name as shown above and in our
GPT guide.
Wait until a job succeeds, which you can verify by querying the status of a job.
Query the checkpoints endpoint with your fine-tuning job ID to access a list of
model checkpoints for the fine-tuning job.
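With the Python SDK, for example (the job ID is a placeholder; assumes the same OpenAI client as above):
# List checkpoints saved for a fine-tuning job.
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123")
for checkpoint in checkpoints:
    print(checkpoint.step_number, checkpoint.fine_tuned_model_checkpoint)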
For each checkpoint object, you will see the fine_tuned_model_checkpoint field
populated with the name of the model checkpoint. You may now use this model just
like you would with the final fine-tuned model.
{
    "object": "fine_tuning.job.checkpoint",
    "id": "ftckpt_zc4Q7MP6XxulcVzj4MZdwsAB",
    "created_at": 1519129973,
    "fine_tuned_model_checkpoint": "ft:gpt-3.5-turbo-0125:my-org:custom-suffix:96olL566:ckpt-step-2000",
    "metrics": {
        "full_valid_loss": 0.134,
        "full_valid_mean_token_accuracy": 0.874
    },
    "fine_tuning_job_id": "ftjob-abc123",
    "step_number": 2000
}
Each checkpoint will specify its:
step_number: The step at which the checkpoint was created (where the number of steps per epoch is the number of examples in the training set divided by the batch size)
metrics: an object containing the metrics for your fine-tuning job at the step when
the checkpoint was created.
Currently, only the checkpoints for the last 3 epochs of the job are saved and
available for use. We plan to release more complex and flexible checkpointing
strategies in the near future.
During training, we provide the following metrics:
training loss
training token accuracy
valid loss
valid token accuracy
Valid loss and valid token accuracy are computed in two different ways: on a small batch of the data during each step, and on the full valid split at the end of each epoch. The full valid loss and full valid token accuracy metrics are the most accurate measures of the overall performance of your model. These statistics provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase). While an active fine-tuning job is running, you can view an event object which contains some useful metrics:
{
    "object": "fine_tuning.job.event",
    "id": "ftevent-abc-123",
    "created_at": 1693582679,
    "level": "info",
    "message": "Step 300/300: training loss=0.15, validation loss=0.27, full validation loss=0.40",
    "data": {
        "step": 300,
        "train_loss": 0.14991648495197296,
        "valid_loss": 0.26569826706596045,
        "total_steps": 300,
        "full_valid_loss": 0.4032616495084362,
        "train_mean_token_accuracy": 0.9444444179534912,
        "valid_mean_token_accuracy": 0.9565217391304348,
        "full_valid_mean_token_accuracy": 0.9089635854341737
    },
    "type": "metrics"
}
After a fine-tuning job has finished, you can also see metrics around how the training process went by querying a fine-tuning job, extracting a file ID from the result_files, and then retrieving that file's content. Each results CSV file has the following columns: step, train_loss, train_accuracy, valid_loss, and valid_mean_token_accuracy.
step,train_loss,train_accuracy,valid_loss,valid_mean_token_accuracy
1,1.52347,0.0,,
2,0.57719,0.0,,
3,3.63525,0.0,,
4,1.72257,0.0,,
5,1.52379,0.0,,
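A minimal sketch of pulling this results file down with the Python SDK (the job ID is a placeholder; assumes the same OpenAI client as above):
# Retrieve the finished job, then download its results CSV for inspection.
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
result_file_id = job.result_files[0]
client.files.content(result_file_id).write_to_file("results.csv")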
While metrics can be helpful, evaluating samples from the fine-tuned model provides
the most relevant sense of model quality. We recommend using the Evals product to
compare your base model to your fine-tuned model. Alternatively, you could manually
generate samples from both models on a test set, and compare the samples side by
side. The test set should ideally include the full distribution of inputs that you
might send to the model in a production use case.
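For example, a minimal sketch of a manual side-by-side comparison (the model names and test prompts are placeholders):
from openai import OpenAI

client = OpenAI()

# Placeholder test prompts; in practice, sample from your production input distribution.
test_prompts = ["Summarize this support ticket: ...", "Classify the sentiment of: ..."]

base_model = "gpt-4o-mini-2024-07-18"
fine_tuned_model = "ft:gpt-4o-mini:my-org:custom_suffix:id"  # placeholder

for prompt in test_prompts:
    for model in (base_model, fine_tuned_model):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(model, "->", completion.choices[0].message.content)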
Iterating on hyperparameters
We allow you to specify the following hyperparameters:
epochs
learning rate multiplier
batch size
We recommend initially training without specifying any of these, allowing us to
pick a default for you based on dataset size, then adjusting if you observe the
following:
If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2.
This is more common for tasks for which there is a single ideal completion (or a
small set of ideal completions which are similar). Some examples include
classification, entity extraction, or structured parsing. These are often tasks for
which you can compute a final accuracy metric against a reference answer.
If the model becomes less diverse than expected, decrease the number of epochs by 1 or 2.
This is more common for tasks for which there is a wide range of possible good completions.
If the model does not appear to be converging, increase the learning rate
multiplier
You can set the hyperparameters as shown below:
Setting hyperparameters
from openai import OpenAI

client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {"n_epochs": 2},
        },
    },
)
Vision fine-tuning
Fine-tuning is also possible with images in your JSONL files. Just as you can send
one or many image inputs to Chat Completions, you can include those same message
types within your training data. Images can be provided either as HTTP URLs or data
URLs containing base64 encoded images.
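If you want to embed images directly, a minimal sketch of building a base64 data URL (the file name and MIME type are illustrative):
import base64

# Read a local image and build a data URL usable as the image_url value.
with open("cheese.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"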
Here's an example of an image message on a line of your JSONL file. Below, the JSON object is expanded for readability, but typically this JSON would appear on a single line in your data file:
{
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant that identifies uncommon cheeses."
        },
        {
            "role": "user",
            "content": "What is this cheese?"
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": "Danbo"
        }
    ]
}
Image dataset requirements
Size
Your training file can contain a maximum of 50,000 examples that contain images
(not including text examples).
Each example can have at most 10 images.
Each image can be at most 10 MB.
Format
Images must be JPEG, PNG, or WEBP format.
Your images must be in the RGB or RGBA image mode.
You cannot include images as output from messages with the assistant role.
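A minimal sketch of checking these requirements locally before uploading (uses the Pillow library; the file path is illustrative):
import os
from PIL import Image

def check_image(path, max_bytes=10 * 1024 * 1024):
    # Sketch: verify the size, format, and mode requirements for one training image.
    assert os.path.getsize(path) <= max_bytes, "image larger than 10 MB"
    with Image.open(path) as img:
        assert img.format in ("JPEG", "PNG", "WEBP"), f"unsupported format: {img.format}"
        assert img.mode in ("RGB", "RGBA"), f"unsupported mode: {img.mode}"

check_image("cheese.jpg")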
Content moderation policy
We scan your images before training to ensure that they comply with our usage
policy. This may introduce latency in file validation before fine-tuning begins.
Images containing the following will be excluded from your dataset and not used for
training:
People
Faces
Children
CAPTCHAs
What to do if your images get skipped
Your images can get skipped for the following reasons:
Contains CAPTCHAs, people, faces, or children: these images are removed by the content moderation scan described above.
Inaccessible URL: ensure that the image URL can be accessed.
Image is too large: ensure that your images fall within the dataset size limits (at most 10 MB each).
Invalid image format: ensure that your images are JPEG, PNG, or WEBP in RGB or RGBA mode.
Other considerations for vision fine-tuning
To control the fidelity of image understanding, set the detail parameter of image_url to low, high, or auto for each image. This will also affect the number of tokens per image that the model sees during training time, and will affect the cost of training. See here for more information.
{
    "type": "image_url",
    "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg",
        "detail": "low"
    }
}
Preference fine-tuning
Direct Preference Optimization (DPO) fine-tuning allows you to fine-tune models
based on prompts and pairs of responses. This approach enables the model to learn
from human preferences, optimizing for outputs that are more likely to be favored.
Note that we currently support text-only DPO fine-tuning.
{
    "input": {
        "messages": [
            {
                "role": "user",
                "content": "Hello, can you tell me how cold San Francisco is today?"
            }
        ],
        "tools": [],
        "parallel_tool_calls": true
    },
    "preferred_output": [
        {
            "role": "assistant",
            "content": "Today in San Francisco, it is not quite as cold as expected. Morning clouds will give way to sunshine, with a high near 68°F (20°C) and a low around 57°F (14°C)."
        }
    ],
    "non_preferred_output": [
        {
            "role": "assistant",
            "content": "It is not particularly cold in San Francisco today."
        }
    ]
}
Currently, we only train on one-turn conversations for each example, where the
preferred and non-preferred messages need to be the last assistant message.
A common workflow is to first run SFT and then apply DPO:
Fine-tune the base model with SFT using a subset of your preferred responses. Focus on ensuring the data quality and representativeness of the tasks.
Use the SFT fine-tuned model as the starting point, and apply DPO to adjust the model based on preference comparisons.
Configuring a DPO fine-tuning job
We have introduced a method field in the fine-tuning job creation endpoint, where you can specify type as well as any associated hyperparameters. For DPO, set type to "dpo" and optionally configure hyperparameters such as beta, which controls how strictly the new model adheres to its previous behavior versus the provided preferences (a higher value is more conservative, a lower value favors the preferences more aggressively). You can also set beta to auto (the default) to use a value configured by the platform.
The example below shows how to configure a DPO fine-tuning job using the OpenAI SDK. For more information on creating fine-tuning jobs in general, refer to the earlier section of this guide on creating a fine-tuned model.
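A minimal sketch (the training file ID is a placeholder, and beta is left at the platform default):
from openai import OpenAI

client = OpenAI()

# Create a DPO fine-tuning job, leaving beta at the platform default ("auto").
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": "auto"},
        },
    },
)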
To enable the Weights and Biases (W&B) integration, you need to:
Provide authentication credentials for your Weights and Biases account to OpenAI
Configure the W&B integration when creating new fine-tuning jobs
Authenticate your Weights and Biases account with OpenAI
Authentication is done by submitting a valid W&B API key to OpenAI. Currently, this
can only be done via the Account Dashboard, and only by account administrators.
Your W&B API key will be stored encrypted within OpenAI and will allow OpenAI to
post metrics and metadata on your behalf to W&B when your fine-tuning jobs are
running. Attempting to enable a W&B integration on a fine-tuning job without first
authenticating your OpenAI organization with WandB will result in an error.
Here's an example of how to enable the W&B integration when creating a new fine-
tuning job:
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini-2024-07-18",
    "training_file": "file-ABC123",
    "validation_file": "file-DEF456",
    "integrations": [
      {
        "type": "wandb",
        "wandb": {
          "project": "custom-wandb-project",
          "tags": ["project:tag", "lineage"]
        }
      }
    ]
  }' https://api.openai.com/v1/fine_tuning/jobs
By default, the Run ID and Run display name are the ID of your fine-tuning job
(e.g. ftjob-abc123). You can customize the display name of the run by including a
"name" field in the wandb object. You can also include a "tags" field in the wandb
object to add tags to the W&B Run (tags must be <= 64 character strings and there
is a maximum of 50 tags).
The full specification for the integration can be found in our fine-tuning job
creation documentation.
Once your fine-tuning job launches, metrics are posted to a W&B run at a URL of the form:
https://wandb.ai/<WANDB-ENTITY>/<WANDB-PROJECT>/runs/ftjob-ABCDEF
You should see a new run with the name and tags you specified in the job creation
request. The Run Config will contain relevant job metadata such as: