Fine-Tune & Evaluate LLMs in 2024 With Amazon SageMaker
If you are going to use a gated model like Llama 2 or Gemma, you need to log in to your Hugging Face account so that your token can be used to access the gated repository. We can do this by running the following command:
!huggingface-cli login --token YOUR_TOKEN
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
instruction format
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
In our example we are going to load our open-source dataset using the 🤗 Datasets library, convert it into the conversational format, where we include the schema definition in the system message for our assistant, and then save the dataset as a jsonl file, which we can use to fine-tune our model. We randomly downsample the dataset to only 10,000 samples.
Note: This step can be different for your use case. For example, if you already have a dataset from, e.g., working with OpenAI, you can skip this step and go directly to the fine-tuning step.
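As a rough sketch of the loading step, assuming a public text-to-SQL dataset with "question", "context" (schema) and "answer" columns; the dataset id, sample count, and system prompt below are illustrative assumptions, not copied from the original notebook:

from datasets import load_dataset

# load a text-to-SQL dataset (assumption: columns are "question", "context", "answer")
dataset = load_dataset("b-mc2/sql-create-context", split="train")
# randomly downsample to 10,000 samples
dataset = dataset.shuffle(seed=42).select(range(10000))

# system prompt that injects the table schema for the assistant (illustrative wording)
system_message = """You are a text-to-SQL assistant. Given the database schema below, write a SQL query that answers the user's question.
SCHEMA:
{schema}"""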
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": system_message.format(schema=sample["context"])},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }
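The conversion and split could then look like this; the split size is an assumption chosen to match the 2,500 test samples used later in the evaluation:

# convert every sample into the conversational format defined above
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
# split into train and test set (2,500 test samples)
dataset = dataset.train_test_split(test_size=2500)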
print(dataset["train"][345]["messages"])
After we have processed the dataset, we will use the FileSystem integration to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
# save train_dataset to s3 using our SageMaker session
training_input_path = f's3://{sess.default_bucket()}/datasets/text-to-sql'
# save datasets to s3
dataset["train"].to_json(f"{training_input_path}/train_dataset.json", orient="records")
dataset["test"].to_json(f"{training_input_path}/test_dataset.json", orient="records")
The SFTTrainer from `trl` is a subclass of the Trainer from the 🤗 Transformers library and supports all the same features, including logging, evaluation, and checkpointing, but adds additional quality-of-life features, including:
Dataset formatting, including conversational and instruction format
Training on completions only, ignoring prompts
Packing datasets for more efficient training
PEFT (parameter-efficient fine-tuning) support, including Q-LoRA
Preparing the model and tokenizer for conversational fine-tuning (e.g. adding special tokens)
We will use the dataset formatting, packing and PEFT features in our example. As PEFT method we will use QLoRA, a technique to reduce the memory footprint of large language models during fine-tuning, without sacrificing performance, by using quantization. If you want to learn more about QLoRA and how it works, check out the Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA blog post. In addition to QLoRA we will leverage the new Flash Attention 2 integration with Transformers to speed up the training. Flash Attention 2 is a new, efficient attention mechanism that is up to 3x faster than the standard attention mechanism.
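To make this concrete, here is a rough sketch of what such a QLoRA + Flash Attention 2 training setup looks like, assuming recent `transformers`, `peft`, `bitsandbytes` and `trl` releases with trl ≤0.8-style `SFTTrainer` arguments (newer trl versions move `packing`/`max_seq_length` into `SFTConfig`). The LoRA values and the `setup_chat_format` call are illustrative, not copied from the actual training script:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer, setup_chat_format

model_id = "codellama/CodeLlama-7b-hf"
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

# QLoRA: load the base model in 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2, requires the flash-attn package
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# add ChatML special tokens and chat template to model + tokenizer (matches the <|im_end|> stop token used later)
model, tokenizer = setup_chat_format(model, tokenizer)

# LoRA adapter configuration (values are illustrative)
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

# training arguments mirroring the hyperparameters used below
training_args = TrainingArguments(
    output_dir="/tmp/tun",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    max_seq_length=3072,  # matches the max_seq_len hyperparameter below
    tokenizer=tokenizer,
    packing=True,         # pack multiple short samples into one sequence
)
trainer.train()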
We prepared a run_sft.py script, which uses `trl` with all of the features described above. The script is re-usable but still hackable if you want to make changes. Parameters are provided via CLI arguments using the HFArgumentParser, which can parse any CLI argument from the TrainingArguments or from our ScriptArguments.
This means you can easily adjust the `hyperparameters` below and change the model_id from `codellama/CodeLlama-7b-hf` to `mistralai/Mistral-7B-v0.1`, and similarly for other parameters. The parameters we selected should work for any 7B model, but you can adjust them to your needs.
# hyperparameters, which are passed into the training job
hyperparameters = {
  ### SCRIPT PARAMETERS ###
  'dataset_path': '/opt/ml/input/data/training/train_dataset.json', # path where sagemaker will save the dataset
  'model_id': "codellama/CodeLlama-7b-hf", # or `mistralai/Mistral-7B-v0.1`
  'max_seq_len': 3072,                     # max sequence length for model and packing
  'use_qlora': True,                       # use QLoRA model
  ### TRAINING PARAMETERS ###
  'num_train_epochs': 3,                   # number of training epochs
  'per_device_train_batch_size': 1,        # batch size per device during training
  'gradient_accumulation_steps': 4,        # number of steps before performing a backward/update pass
  'gradient_checkpointing': True,          # use gradient checkpointing to save memory
  'optim': "adamw_torch_fused",            # use fused adamw optimizer
  'logging_steps': 10,                     # log every 10 steps
  'save_strategy': "epoch",                # save checkpoint every epoch
  'learning_rate': 2e-4,                   # learning rate, based on QLoRA paper
  'bf16': True,                            # use bfloat16 precision
  'tf32': True,                            # use tf32 precision
  'max_grad_norm': 0.3,                    # max gradient norm based on QLoRA paper
  'warmup_ratio': 0.03,                    # warmup ratio based on QLoRA paper
  'lr_scheduler_type': "constant",         # use constant learning rate scheduler
  'report_to': "tensorboard",              # report metrics to tensorboard
  'output_dir': '/tmp/tun',                # temporary output directory for model checkpoints
  'merge_adapters': True,                  # merge LoRA adapters into model for easier deployment
}
“You can also use a `g5.2xlarge` instead of the `g5.4xlarge` instance type, but then it is not possible to use the `merge_adapters` parameter, since merging the LoRA weights into the model weights requires the model to fit into memory. But you could save the adapter weights and merge them using merge_adapter_weights.py after training.”
We can now start our training job with the `.fit()` method, passing our S3 path to the training script.
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}
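A sketch of the HuggingFace estimator such a job uses; the container versions, source directory, and job name below are assumptions, so check the accompanying repository for the exact values:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='run_sft.py',              # training script described above
    source_dir='scripts',                  # assumption: directory containing the script and requirements.txt
    instance_type='ml.g5.2xlarge',         # instance used in the cost calculation below
    instance_count=1,
    role=role,                             # IAM role created/loaded earlier
    base_job_name='codellama-7b-text-to-sql',  # assumption: any descriptive name works
    transformers_version='4.36',           # assumption: versions of the Hugging Face training container
    pytorch_version='2.1',
    py_version='py310',
    hyperparameters=hyperparameters,       # hyperparameters defined above
    disable_output_compression=True,       # keep the model as raw files in S3 (used below)
)

# start the training job; SageMaker mounts the S3 data under /opt/ml/input/data/training
huggingface_estimator.fit(data, wait=True)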
In our example for CodeLlama 7B, the SageMaker training job took `6162 seconds`, which is about `1.7 hours`. The ml.g5.2xlarge instance we used costs `$1.515 per hour` for on-demand usage. As a result, the total cost for training our fine-tuned Code Llama model was only about `$2.6`.
Now let's make sure SageMaker has successfully uploaded the model to S3. We can use the `model_data` property of the estimator to get the S3 path to the model. Since we used `merge_adapters=True` and `disable_output_compression=True`, the model is stored as raw files in the S3 bucket.
huggingface_estimator.model_data["S3DataSource"]["S3Uri"].replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")
You should see a similar folder structure and files in your S3 bucket:
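Since the adapters were merged into the base model, the folder should contain the usual Hugging Face model files, roughly like this (shard names and counts will vary):

config.json
generation_config.json
model-00001-of-00003.safetensors
model-00002-of-00003.safetensors
model-00003-of-00003.safetensors
model.safetensors.index.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json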
To evaluate the fine-tuned model you could, for example, use Hugging Face Lighteval on Amazon SageMaker, or you can deploy the model to an endpoint and interactively test it. We are going to use the latter approach in this example. We will load our eval dataset and evaluate the model on those samples, using a simple loop and accuracy as our metric.
Note: Evaluating generative AI models is not a trivial task, since one input can have multiple correct outputs. If you want to learn more about evaluating generative models, check out the Evaluate LLMs and RAG, a practical example using Langchain and Hugging Face blog post.
We are going to use the Hugging Face LLM Inference DLC, a purpose-built inference container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a solution for deploying and serving Large Language Models (LLMs).
from sagemaker.huggingface import get_huggingface_llm_image_uri
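For completeness, retrieving the container uri typically looks like the following; the TGI version is an assumption, pick a current one:

# retrieve the image uri of the Hugging Face LLM (TGI) container
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.4.0")
print(f"llm image uri: {llm_image}")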
We can now create a `HuggingFaceModel` using the container uri and the S3 path to our model. We also need to set our TGI configuration, including the number of GPUs and max input tokens. You can find a full list of configuration options here.
import json
from sagemaker.huggingface import HuggingFaceModel
# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
# TGI environment config (entries other than MAX_TOTAL_TOKENS are typical values and may differ from the original notebook)
config = {
  'HF_MODEL_ID': "/opt/ml/model",           # path to the model inside the container, populated by SageMaker from model_data
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),     # max length of the input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),     # max length of the generation (including input text)
}
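A sketch of the model creation and deployment under the configuration above (the exact call in the original notebook may differ slightly):

# create HuggingFaceModel pointing at the raw model files in S3
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data=huggingface_estimator.model_data,  # S3DataSource dict, since output compression was disabled
    env=config,
)

# deploy the model to a SageMaker endpoint
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # give TGI time to load the weights
)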
After the model is deployed, we can use the `predict` method to evaluate our model on the full 2,500 samples of our test dataset.
Note: As mentioned above, evaluating generative models is not a trivial task. In our example we use the accuracy of the generated SQL, compared against the ground-truth SQL query, as our metric. An alternative would be to automatically execute the generated SQL query and compare the results with the ground truth; this would be a more accurate metric but requires more work to set up.
But first, let's test a simple request to our endpoint to see if everything is working as expected. To correctly template our prompt, we need to load the tokenizer of our trained model from S3 and then template an example from our `test_dataset`. We can then use the `predict` method to send a request to our endpoint.
from transformers import AutoTokenizer
from sagemaker.s3 import S3Downloader
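Downloading the test split and the tokenizer files from S3 could be done like this; the file names and local paths are assumptions:

from datasets import load_dataset

# download the test dataset that we uploaded to S3 earlier
S3Downloader.download(f"{training_input_path}/test_dataset.json", ".")

# download the tokenizer files of the trained model from S3
model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"].rstrip("/")
for f in ["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"]:
    S3Downloader.download(f"{model_s3_path}/{f}", "tokenizer")

tokenizer = AutoTokenizer.from_pretrained("tokenizer")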
test_dataset = load_dataset("json", data_files="test_dataset.json",split="train")
random_sample = test_dataset[345]
def request(sample):
    prompt = tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True)
    outputs = llm.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 512,
            "do_sample": False,
            "return_full_text": False,
            "stop": ["<|im_end|>"],
        }
    })
    return {"role": "assistant", "content": outputs[0]["generated_text"].strip()}
print(random_sample["messages"][1])
request(random_sample["messages"][:2])
Awesome! Our model is working as expected. Now we can evaluate our model on 1,000 samples from the test dataset.
from tqdm import tqdm

def evaluate(sample):
    predicted_answer = request(sample["messages"][:2])
    if predicted_answer["content"] == sample["messages"][2]["content"]:
        return 1
    else:
        return 0

success_rate = []
number_of_eval_samples = 1000
# iterate over eval dataset and predict
for s in tqdm(test_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# compute accuracy
accuracy = sum(success_rate) / len(success_rate)
print(f"Accuracy: {accuracy*100:.2f}%")
We evaluated our model on 1,000 samples from the evaluation dataset and got an accuracy of 77.40%, which took about 25 minutes. This is quite good, but as mentioned you need to take this metric with a grain of salt, since there might be different "correct" SQL queries for the same instruction. It would be better to evaluate our model by running the generated queries against a real database and comparing the results. There are also several ways to improve performance, e.g. few-shot learning, RAG, or self-healing approaches for generating the SQL query.
Don't forget to delete your endpoint once you are done.
llm.delete_model()
llm.delete_endpoint()
Thanks for reading! If you have any questions, feel free to contact me on Twitter or
LinkedIn.