
February 2025

Whitepaper

LLM fine-tuning with MLflow: a practical guide
Table of contents
Introduction

MLflow: Brief product overview

LLM fine-tuning for GenAI applications

Stage 1. Model selection

Stage 2. Fine-tuning design

Stage 3. Training and evaluation

Stage 4. Model management and deployment

Conclusion

Managed MLflow in Nebius AI Cloud

References

Introduction
Large Language Models (LLMs) have revolutionized how we use AI
by making text generation and interpretation easier than ever. They are not
just for experts anymore — now, anyone can create AI tools for problem-
solving or boosting productivity.

But just having an LLM is not enough. To succeed, your AI application
has to meet user needs precisely, which is why fine-tuning has become
essential. Fine-tuning lets ML teams add value by tailoring models
to specific audiences, but it’s not without its challenges:

• Resource intensity: LLM fine-tuning demands substantial computational resources, requiring careful cost management.

• Experiment management: As teams scale their efforts, tracking numerous experiments, hyperparameters and model versions becomes increasingly complex.

• Reproducibility: Ensuring consistent results across runs and maintaining reproducible environments is critical for production deployments.

• Collaboration complexity: Sharing results, comparing experiments and maintaining version consistency are challenging tasks without the right tools.

In this white paper, we explore how MLflow simplifies these challenges
and helps streamline your LLM fine-tuning process.

MLflow: Brief product overview
MLflow is a popular open-source tool for experiment tracking and simplified
model management, enabling you to streamline and enhance your
MLOps pipeline. Many ML teams around the world use MLflow in various
configurations to develop AI models and deploy them to production. Below
are three core advantages this tool offers to ML engineers and data scientists:

1. Pipeline organization: MLflow helps you structure your MLOps pipeline by providing essential functionality, such as experiment tracking and a model parameter registry, to support the model development process.

2. Process transparency: By collecting and organizing metadata from your model training, MLflow makes your ML pipeline more transparent and visible. This enables your team to monitor progress and implement changes with a high level of precision.

3. Improved collaboration: Developing ML models generates a significant amount of information and assets that need to be accessible to various stakeholders. MLflow provides efficient tools for organizing and sharing these resources across the team.

Today, there are many MLOps platforms and tools that offer functionality
similar to MLflow. Some are available as software-as-a-service solutions,
while others come as on-premises installations. As an open-source tool,
MLflow remains an affordable MLOps solution, allowing small and medium
ML teams to compete effectively in the global AI race. However, its
open-source distribution model does come with certain limitations, such
as a lack of stability guarantees and added complexity in installation
and maintenance.

Reducing complexity for ML teams


Nebius offers MLflow as a fully managed and ready-to-work cloud solution.
This enables you to deliver production-ready models faster without having to worry
about server maintenance and provisioning.

LLM fine-tuning for GenAI applications
Let’s imagine your ML team received a task to develop a chatbot
for the upcoming GenAI feature in your SaaS product. According
to the initial specification, the foundation of this service should
be one of the existing LLMs, like Llama 3.

The diagram below shows four main stages your team will likely
go through to accomplish this goal.

Figure 1. Simplified version of LLM fine-tuning workflow.

You may notice that the data preparation stage is not included
in the diagram. This is a large and complex activity that we have
intentionally chosen to omit to keep the focus on the fine-tuning
part of the process.

Stage 1. Model selection
At this stage, the goal is to choose a pre-trained LLM that
is most suitable for our needs. First, define and formulate the technical
requirements, considering the parameters and conditions of your business
use case. Then, shortlist the most suitable models and conduct several
runs to evaluate their performance quality.

Challenges of the model selection stage

Performance-consumption tradeoff
Bigger LLMs usually deliver better answer accuracy, but their inference requires more compute resources, significantly impacting the unit economics of the project.

Evaluation complexity
Selecting an LLM requires considering multiple parameters across different dimensions. Without a systematic approach, this can become confusing.

Variety of models and applications
There are many LLMs, each with unique strengths and characteristics that may or may not suit your specific use case.

License limitations
Different models come with varying license and usage restrictions. Carefully investigate the applicable scope for the selected model.

Step 1: Define model requirements
Before selecting a base model, clearly define your task and specific
requirements:

• Define the task type (e.g., classification, generation, summarization).

• Specify the domain (e.g., general, medical, legal, financial).

• Outline the required output format and model context size.

• Select performance metrics (e.g., accuracy, latency, efficiency).

Step 2: Identify potential LLM candidates

After defining your requirements, systematically evaluate potential models across key criteria to find the best match for your needs:

• Choose a model architecture suitable for your tasks (e.g., classification, generation).

• Select model size based on performance needs and available compute resources.

• Check pre-training data relevance to your target domain.

• Calculate required compute resources for fine-tuning and inference.

• Verify licensing and usage rights for your intended application.

Good practices:

• Create a shortlist of potential models based on the criteria above.

• Balance model size with available computational resources.

• Select an instruct/chat model if you have a small dataset.

• Look for models trained on data similar to your target domain for better initial performance.

Step 3: Assess model performance

Test shortlisted models to evaluate their actual performance on your specific use case and data:

• Review benchmark results for tasks similar to yours.

• Run preliminary tests on a subset of your data to compare model performance.

• Compare models using both standardized metrics and custom evaluation criteria.

Good practices:

• Use model characteristics as MLflow run parameters to make it easy to compare different options.

• Evaluate results considering both performance and resource utilization.

• Make your final selection based on the best balance of performance, efficiency and practicality for your specific use case.
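To keep candidate comparison systematic, each shortlisted model can be tracked as its own MLflow run. The Python sketch below is a minimal illustration; the model names, sizes and scores are placeholders for values measured on your own evaluation subset.

import mlflow

# Hypothetical shortlist; sizes and scores are placeholders for values
# measured on a subset of your own data.
candidates = {
    "llama-3.2-3b-instruct": {"params_b": 3, "accuracy": 0.71, "latency_ms": 180},
    "mistral-7b-instruct": {"params_b": 7, "accuracy": 0.78, "latency_ms": 320},
}

mlflow.set_experiment("1 - Model Selection")

for model_name, result in candidates.items():
    # One run per candidate, named after the model, so the MLflow UI
    # can compare the options side by side.
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("params_b", result["params_b"])
        mlflow.log_metric("accuracy", result["accuracy"])
        mlflow.log_metric("latency_ms", result["latency_ms"])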

Figure 2. MLflow can visually show the difference between models in tabular format or as a diagram.
(Source: DSC 2024 Tutorial RAG Evaluation with MLflow, GitHub)

Helpful MLflow practices

Organize model selection as a separate MLflow experiment
• Create a separate MLflow experiment to keep model selection metrics. You can give it a clear name (e.g., “1 - Model Selection”).
• Use separate runs to track metrics, parameters and artifacts for every model candidate. By default, MLflow generates run names automatically, but you can specify a model name (e.g., “llama-3.2”) as a run name, or set it as a parameter or tag.

Log metrics and parameters about the model, data and resource consumption
• Use MLflow Metrics Tracking functionality to track all relevant information about the model selection process.
• Use mlflow.autolog() for starting MLflow tracking.
• Use manual logging functions to track custom metrics and parameters.
• Enable mlflow.enable_system_metrics_logging() to track CPU, RAM, GPU, etc.
• Track datasets and custom artifacts for lineage and reproducibility.

Evaluate LLM with built-in and custom metrics
• Use MLflow built-in metrics for quantitative and qualitative model measurement.
• Implement custom metrics using MLflow’s flexible metric logging system with heuristic-based metrics and LLM-as-a-Judge metrics.

Make data-driven decisions on the performance-consumption tradeoff
• Explore the MLflow UI for easy comparison of different model architectures and sizes.
• Use MLflow’s tagging feature to categorize experiments for easier filtering and analysis.
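As an illustration of these practices, the following sketch enables system metrics logging and scores a static table of candidate responses with mlflow.evaluate(). It assumes MLflow 2.x with the evaluation dependencies installed; the experiment name, parameters and example data are illustrative, and some built-in metrics may require extra packages.

import mlflow
import pandas as pd

# Hypothetical evaluation table: prompts, reference answers and the
# candidate model's responses collected offline.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What does fine-tuning do?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "Fine-tuning adapts a pre-trained model to a specific task or domain.",
        ],
        "outputs": [
            "MLflow is an open-source MLOps platform.",
            "It adapts a base model to a narrower task.",
        ],
    }
)

mlflow.set_experiment("1 - Model Selection")
mlflow.enable_system_metrics_logging()  # CPU / RAM / GPU utilization per run

with mlflow.start_run(run_name="llama-3.2"):
    mlflow.log_params({"model_family": "llama", "size_b": 3, "context_window": 131072})
    # Evaluate a static table of responses; MLflow computes built-in
    # text metrics for the chosen model type.
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="outputs",
        model_type="question-answering",
    )
    print(results.metrics)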
Stage 2. Fine-tuning design
This stage focuses on shaping your fine-tuning strategy and outlining how the fine-tuning
process will operate. Starting with the definition of project objectives, you can establish
proper dataset requirements, choose the appropriate fine-tuning method and plan
resources for further experimentation.

Challenges of the fine-tuning design stage

Design complexity
Changes in some parameters of the design process can cause unpredictable
(and sometimes undetectable) outcomes for the final fine-tuning strategy.

Reproducibility
Even minor differences in how you run your training can lead to different
outcomes, making it difficult to ensure consistency across iterations.

Collaboration complexity
Having multiple ML engineers work on the same project can turn the fine-
tuning design into a poorly organized and overly complicated process.

Scalability limitations
Practices valid for small projects may not translate effectively as the scale
of your efforts increases.

Step 4: Define project objectives

Clearly define the goals for your LLM fine-tuning project:

• Specify desired changes in model outputs (e.g., style, format, domain-specific terminology).

• Define target task success criteria with measurable metrics.

• Set minimum acceptable performance thresholds for deployment.

• Establish GPU resource constraints.

Good practices:

• Document objectives with measurable outcomes.

• Set realistic targets based on the base model’s capabilities.

• Define clear success metrics for fine-tuning.

• Consider deployment constraints early.

• Review objectives with stakeholders.

Step 5: Define model evaluation metrics

Select and implement LLM-specific evaluation metrics:

• Set up prompt-response evaluation.

• Implement task-specific metrics.

• Track inference speed.

• Design domain-specific metrics.

• Configure automated metric logging.

• Set up MLflow experiment tracking.

• Define metric visualization needs.

• Plan A/B testing structure.

Good practices:

• Define clear evaluation criteria aligned with business goals.

• Establish baseline performance metrics for model comparison.

• Implement automated evaluation pipelines for consistent testing.

• Configure MLflow to automatically log system metrics (CPU, GPU, RAM usage).

Figure 3. MLflow can automatically track dozens of LLM fine-tuning parameters and metrics.
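To make the design concrete, here is a minimal, hypothetical sketch of a prompt-response evaluation loop that logs a task metric and latency to MLflow; generate() and exact_match() stand in for your model call and your chosen metric.

import time
import mlflow

# Hypothetical stand-in for your model's inference call.
def generate(prompt: str) -> str:
    return "stub answer for " + prompt

eval_prompts = ["Summarize our refund policy.", "Draft a greeting message."]
references = ["Refunds are issued within 14 days.", "Hello! How can I help you today?"]

def exact_match(prediction: str, reference: str) -> float:
    # Placeholder task metric; swap in ROUGE, BLEU or an LLM-as-a-Judge score.
    return float(prediction.strip().lower() == reference.strip().lower())

mlflow.set_experiment("2 - Fine-tuning design")

with mlflow.start_run(run_name="baseline-evaluation"):
    scores, latencies = [], []
    for step, (prompt, reference) in enumerate(zip(eval_prompts, references)):
        start = time.perf_counter()
        answer = generate(prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(exact_match(answer, reference))
        # Per-example logging makes regressions visible in the MLflow UI.
        mlflow.log_metric("exact_match", scores[-1], step=step)
        mlflow.log_metric("latency_s", latencies[-1], step=step)
    mlflow.log_metric("exact_match_mean", sum(scores) / len(scores))
    mlflow.log_metric("latency_s_mean", sum(latencies) / len(latencies))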

Step 6: Define dataset strategy

Plan your fine-tuning dataset approach:

• Select appropriate training data sources.

• Define data quality requirements.

• Design prompt-completion formats.

• Plan data preprocessing and augmentation methods.

• Create validation splits.

Good practices:

• Codify and document data preparation steps.

• Test the data processing pipeline.

• Monitor data quality metrics.

• Log dataset metadata in MLflow.

• Control versions for training and validation datasets.

• Track data lineage.
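The dataset-related practices can be wired into MLflow with its dataset tracking APIs. The sketch below assumes MLflow 2.x and pandas; the example rows, split ratio and dataset names are illustrative.

import mlflow
import mlflow.data
import pandas as pd

# Hypothetical prompt-completion pairs; in practice these come from your
# curated training data sources.
df = pd.DataFrame(
    {
        "prompt": [
            "Classify the ticket: 'App crashes on login'",
            "Classify the ticket: 'How do I export my data?'",
            "Classify the ticket: 'Billing page shows an error'",
            "Classify the ticket: 'Feature request: dark mode'",
        ],
        "completion": ["bug", "question", "bug", "feature_request"],
    }
)

train_df = df.sample(frac=0.75, random_state=42)
val_df = df.drop(train_df.index)

with mlflow.start_run(run_name="dataset-v1"):
    # Log dataset metadata (schema, row counts, digest) for lineage tracking.
    train_ds = mlflow.data.from_pandas(train_df, name="support-tickets-train")
    val_ds = mlflow.data.from_pandas(val_df, name="support-tickets-val")
    mlflow.log_input(train_ds, context="training")
    mlflow.log_input(val_ds, context="validation")
    mlflow.log_param("dataset_version", "v1")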

Step 7: Select fine-tuning method

Choose the most appropriate fine-tuning approach for your use case:

• Evaluate feature-based fine-tuning options.

• Consider full fine-tuning requirements.

• Assess instruction fine-tuning needs.

• Explore parameter-efficient fine-tuning (PEFT) methods as a resource-efficient alternative to full fine-tuning, including for instruction fine-tuning.

• Determine memory-performance tradeoffs.

Good practices:

• Start with PEFT methods for resource efficiency.

• Consider QLoRA for memory-constrained scenarios. QLoRA offers 33% memory savings at the cost of a 39% runtime increase. Source: Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation).

• Document memory-speed tradeoffs of each method in MLflow.

• Test multiple approaches on sample data.

• Monitor resource utilization for each method.
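As a sketch of how a PEFT choice can be captured for later comparison, the snippet below defines a LoRA configuration with the peft library and records its hyperparameters in MLflow. The rank, alpha and target modules are illustrative and depend on your base model; the config would later be applied with peft.get_peft_model() before training.

import mlflow
from peft import LoraConfig, TaskType

# Illustrative LoRA hyperparameters; tune rank, alpha and target modules
# for your base model and memory budget.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

with mlflow.start_run(run_name="lora-design"):
    # Record the design decision so methods can be compared across runs.
    mlflow.log_param("finetuning_method", "LoRA")
    mlflow.log_params(
        {
            "lora_r": lora_config.r,
            "lora_alpha": lora_config.lora_alpha,
            "lora_dropout": lora_config.lora_dropout,
            "target_modules": ",".join(lora_config.target_modules),
        }
    )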

Step 8: Plan and organize experiments

Design a structured experimentation strategy:

• Define the hyperparameter search space.

• Implement automated training pipelines for consistent and reproducible results.

• Set up experiment tracking.

• Plan resource allocation.

• Maintain experiment versioning.

Good practices:

• Start with small-scale validation experiments.

• Begin with smaller models before scaling up.

• Use MLflow to track all experiment metrics.

• Document resource consumption for each run.

• Decide which artifacts to log in MLflow for each run to optimize disk space.

Helpful MLflow practices

Centralize experiment tracking
• Configure MLflow’s tracking URI to centralize experiment logging across teams (e.g., mlflow.set_tracking_uri() for remote tracking).
• Save fine-tuning results systematically using MLflow logging functions: mlflow.log_metric(), mlflow.log_param() and mlflow.log_artifact().
• Enable automated tracking with mlflow.autolog() to capture framework-specific metrics and parameters automatically.
• Organize MLflow UI dashboards for effective team communication with custom tags, meaningful run names and customized metric columns.
• Document experiment steps and insights in the MLflow run description field.

Ensure reproducibility
• Log models with MLflow Tracking APIs to automatically infer required dependencies for the model flavor.
• Control dataset versions used in experiments.
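A minimal setup for these practices might look as follows; the tracking URI, experiment name, parameter values and file path are placeholders for your own environment.

import mlflow

# Point all team members at the same tracking server; the URI here is
# a placeholder for your deployment (e.g., a managed MLflow endpoint).
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")
mlflow.set_experiment("2 - Fine-tuning design")

mlflow.autolog()  # capture framework-specific params and metrics automatically

with mlflow.start_run(
    run_name="design-baseline",
    description="Baseline configuration agreed during the fine-tuning design review.",
):
    mlflow.set_tag("owner", "ml-team")
    mlflow.log_param("base_model", "llama-3.2-3b-instruct")
    mlflow.log_metric("target_exact_match", 0.80)
    # Attach the design document or config file as a run artifact.
    mlflow.log_artifact("finetuning_design.md")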

Stage 3. Training and evaluation
This stage involves training the model based on the designed fine-tuning
strategy. Typically, this stage includes running several training iterations
and evaluating the resulting metrics. As the most resource-intensive part
of the workflow, it requires careful planning to minimize errors and ensure
optimal results. The outcome of this stage is a fine-tuned LLM evaluated
against the initial requirements.

Challenges of the training and evaluation stage

Hardware failures
Hardware issues, connection losses or resource limits can disrupt the training
process. Without proper checkpointing and recovery systems, hours of costly
progress can be lost.

Complex metric monitoring


Simultaneous tracking of training loss, validation metrics, resource usage
and learning progress demands robust monitoring systems to preempt issues.

High resource consumption


Running multiple training jobs requires extensive GPU capacity and can lead
to budget overruns if not carefully managed.

Hyperparameter optimization
Finding the right combination of learning rate, batch size and other
parameters is particularly challenging with LLMs. Each test run is expensive
and time-consuming, making traditional optimization approaches impractical.

Step 9: Prepare data

Before starting fine-tuning, prepare your data to ensure it meets quality standards and model requirements:

• Download the required dataset.

• Preprocess the data: clean and format for model compatibility.

• Split the data into training, validation and test sets.

• Save and version processed datasets.

Good practices:

• Prioritize data quality over quantity.

• Document all preprocessing steps for reproducibility.

• Implement data validation checks.
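A hypothetical data-preparation run could look like the sketch below, assuming the Hugging Face datasets library; the dataset name and split sizes are examples only.

import mlflow
from datasets import load_dataset

# Illustrative dataset name; replace with your own prompt-completion data.
raw = load_dataset("yahma/alpaca-cleaned", split="train")

# Create reproducible train/validation/test splits.
split = raw.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]

with mlflow.start_run(run_name="data-preparation-v1"):
    mlflow.log_params({"dataset": "yahma/alpaca-cleaned", "seed": 42, "test_size": 0.2})
    mlflow.log_metrics(
        {"train_rows": len(train_ds), "val_rows": len(val_ds), "test_rows": len(test_ds)}
    )
    # Persist the processed splits and attach them to the run for versioning.
    for name, ds in [("train", train_ds), ("val", val_ds), ("test", test_ds)]:
        path = f"{name}.jsonl"
        ds.to_json(path)
        mlflow.log_artifact(path, artifact_path="datasets")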

Step 10: Fine-tune the LLM

Launch the fine-tuning process with careful monitoring of training dynamics:

• Initialize the model with pre-trained weights.

• Configure task-specific architecture modifications (if applicable).

• Set training parameters such as learning rate, batch size and epochs.

• Launch the fine-tuning job and track its progress.

Good practices:

• Optimize training parameters on a small sample of the dataset to find the optimal settings for your model.

• Optimize learning rate and batch size before other parameters.

• Adjust training duration based on learning curves.

• Monitor training and validation metrics via MLflow dashboards.

• Save checkpoints regularly to prevent data loss.

Figure 4. The results of model fine-tuning in tabular format.
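For illustration, the sketch below launches a small supervised fine-tuning job with the Hugging Face transformers Trainer and reports progress to MLflow. It is not the exact setup behind the figures in this paper: the base model, dataset, prompt format and hyperparameters are assumptions, and gated models such as Llama require access approval.

import mlflow
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative base model and dataset; adjust to your shortlisted LLM.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

def tokenize(batch):
    # Naive prompt-completion formatting; real pipelines use a chat template.
    texts = [f"{i}\n{o}" for i, o in zip(batch["instruction"], batch["output"])]
    return tokenizer(texts, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="finetune-output",
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,            # regular checkpoints to avoid losing progress
    report_to=["mlflow"],      # stream loss and hyperparameters to MLflow
)

mlflow.set_experiment("3 - Training and evaluation")
with mlflow.start_run(run_name="llama-3.2-1b-sft-sample"):
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()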

Step 11: Evaluate the LLM

Systematically assess model performance using both automated and human evaluation:

• Run an automated evaluation pipeline using predefined metrics (refer to Step 5).

• Conduct human evaluation.

• Document evaluation results.

• Compare against baseline metrics.

Good practices:

• Use a separate test dataset for final evaluation.

• Implement a comprehensive evaluation suite, including bias and security testing.

• Log evaluation results in MLflow for tracking.

• Compare results across different training runs.

Figure 5. The results of model fine-tuning in linear charts.
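To compare a fine-tuned run against a baseline, run data can be pulled directly from the tracking server. The experiment and metric names below are illustrative and assume they were logged in earlier runs.

import mlflow

# Fetch all runs from the training experiment, sorted by the task metric.
runs = mlflow.search_runs(
    experiment_names=["3 - Training and evaluation"],
    order_by=["metrics.exact_match_mean DESC"],
)

cols = ["tags.mlflow.runName", "metrics.exact_match_mean", "metrics.latency_s_mean"]
print(runs[cols].head(10))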

Step 12: Iterate and improve

Review evaluation results and refine the model to enhance its performance:

• Analyze the outcomes of automated and human evaluations.

• Identify areas for improvement.

• Adjust model parameters or architecture as needed.

• Update and refine the training dataset if necessary.

Good practices:

• Document all changes made between iterations.

• Track performance improvements using MLflow.

• Set clear goals for each iteration cycle.

• Maintain a comprehensive history of experiments and iterations.

Helpful MLflow practices

Enable real-time metric logging
• Use mlflow.autolog() to capture training progress metrics automatically.
• Monitor training progress via MLflow UI dashboards.
• Create custom visualizations for LLM-specific metrics.
• Configure distributed training logging.
• Aggregate metrics from multiple GPUs.

Configure checkpointing and artifact logging
• Automate checkpoint saving via MLflow.
• Log model weights, configurations and other artifacts.

Track resource utilization
• Track GPU utilization and memory consumption.
• Monitor training speed and throughput.
• Log batch processing metrics.
• Identify performance bottlenecks.
• Compare resource usage across runs.

Integrate hyperparameter tuning
• Integrate optimization libraries like Optuna or Hyperopt.
• Track parallel optimization runs:
  • Use parent-child runs: mlflow.start_run(nested=True).
  • Group related runs with tags: mlflow.set_tag("optimization_group", "batch_1").
  • Compare runs using the MLflow UI’s parallel coordinates plot.
• Log optimization results:
  • Save the best parameter configuration as an artifact.
  • Track optimization history.
  • Document parameter importance metrics.
  • Store the search space configuration.
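Putting the hyperparameter-tuning practices together, a minimal sketch with Optuna and nested MLflow runs might look like this; the search space and the placeholder validation loss stand in for a real fine-tuning and evaluation cycle.

import mlflow
import optuna

mlflow.set_experiment("3 - Training and evaluation")

def objective(trial: optuna.Trial) -> float:
    # Search space is illustrative; align it with your fine-tuning setup.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [2, 4, 8])

    # A child run per trial keeps the sweep grouped under one parent run.
    with mlflow.start_run(run_name=f"trial-{trial.number}", nested=True):
        mlflow.set_tag("optimization_group", "batch_1")
        mlflow.log_params({"learning_rate": lr, "batch_size": batch_size})

        # Placeholder for a short fine-tuning and evaluation cycle; here we
        # fake a validation loss so the sketch runs end to end.
        val_loss = (lr * 1000 - 0.1) ** 2 + 1.0 / batch_size
        mlflow.log_metric("val_loss", val_loss)
        return val_loss

with mlflow.start_run(run_name="hparam-search"):
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=10)
    mlflow.log_params({f"best_{k}": v for k, v in study.best_params.items()})
    mlflow.log_metric("best_val_loss", study.best_value)
    mlflow.log_dict(study.best_params, "best_params.json")  # best config as an artifact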

Stage 4. Model management and deployment
This stage focuses on preparing the fine-tuned model for production
deployment and ensuring it can deliver business value efficiently.
Additionally, this stage includes key procedures and routines
for maintaining a streamlined and scalable delivery environment.

Challenges of the model management and deployment stage

Heavy model assets


Managing and sharing large model files (often several gigabytes) across
environments can pose significant technical challenges.

Prompt template inconsistency


Even minor changes in prompt templates and inference parameters across
deployments can lead to inconsistent model behavior.

Lineage tracking complexity


Tracking model versions, training datasets and parameter changes becomes
more complex with each iteration.

Environment management complexity


Different models may require specific software dependencies and runtime
configurations, complicating development and production processes.

Step 13: Manage environment and model settings

Ensure consistency and reproducibility by systematically managing model settings:

• Set up model configurations and parameters.

• Create environment-specific settings.

• Define inference parameters.

• Document configuration dependencies.

• Establish a version control system.

Good practices:

• Store prompt templates in separate configuration files.

• Use version control for all configurations.

• Document each parameter’s impact systematically.

• Apply standardized naming conventions.

• Track configuration changes in MLflow.
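A minimal sketch of tracking inference settings and a prompt template with MLflow is shown below; the parameter values, template text and artifact paths are illustrative.

import mlflow

# Illustrative inference configuration and prompt template, kept outside
# the application code so they can be versioned and compared per run.
inference_config = {
    "temperature": 0.2,
    "top_p": 0.9,
    "max_new_tokens": 512,
}
prompt_template = (
    "You are a support assistant for our SaaS product.\n"
    "Answer the user's question concisely.\n\nQuestion: {question}\nAnswer:"
)

with mlflow.start_run(run_name="release-candidate-config"):
    mlflow.log_params(inference_config)
    # Store both as artifacts so every deployment can trace back to the
    # exact settings it was evaluated with.
    mlflow.log_dict(inference_config, "configs/inference.json")
    mlflow.log_text(prompt_template, "configs/prompt_template.txt")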

Step 14: Store models and artifacts

Implement structured storage and tracking of all model-related components:

• Organize model files and adapters.

• Create metadata documentation.

• Establish backup procedures.

• Configure access controls.

Good practices:

• Create clear directory hierarchies.

• Implement consistent naming patterns.

• Document all dependencies.

Figure 6. An MLflow Model is a standardized packaging format that contains all metadata about the model and its dependencies.

Step 15: Control model versions

Design a comprehensive version control strategy for managing the model lifecycle:

• Establish a centralized model repository.

• Implement a version tracking system.

• Develop a model registration process.

• Define promotion and deployment workflows.

• Set up archival procedures.

Good practices:

• Use meaningful version tags.

• Maintain detailed changelogs to document updates.

• Track performance across different versions.

• Use MLflow’s Model Registry.

Figure 7. Model Registry helps version models and manage lifecycle with aliases.
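As an illustration, a newly logged model can be registered and promoted with an alias as follows; the run ID, registry name and alias are placeholders.

import mlflow
from mlflow import MlflowClient

# Register a logged model under a registry name; the run ID and model
# name are placeholders for your own values.
run_id = "YOUR_RUN_ID"
model_uri = f"runs:/{run_id}/model"
registered = mlflow.register_model(model_uri, name="support-chatbot-llm")

client = MlflowClient()
# Point the 'staging' alias at the new version; deployment code can then
# load "models:/support-chatbot-llm@staging" without hardcoding versions.
client.set_registered_model_alias(
    name="support-chatbot-llm",
    alias="staging",
    version=registered.version,
)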

Step 16: Automate deployment pipelines

Create reliable, automated deployment workflows for model releases:

• Design an efficient deployment architecture.

• Set up automated pipelines.

• Implement monitoring systems.

• Create rollback procedures.

Good practices:

• Automate deployment processes.

• Implement pre-deployment validation checks.

• Create health monitoring dashboards.

• Configure alerting systems.

Helpful MLflow practices

Automate model and artifact logging
• Use MLflow’s automatic logging features to log models and artifacts alongside experiment runs.
• Use the mlflow.pyfunc module to package LLMs with their dependencies and configurations.
• Log model cards and associated documentation.

Register and version successful models
• Register models using mlflow.register_model() or manually through the Model Registry UI.
• Stick to clear naming conventions.
• Maintain version history.

Use aliases for model lifecycle management
• Deploy and organize models with aliases and tags.
• Set up automated promotion workflows.
• Serve MLflow-registered models directly from the Model Registry.
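To illustrate the packaging practice, the sketch below wraps a stubbed chatbot in an mlflow.pyfunc model and registers it; the wrapper class, prompt template and registry name are hypothetical, and the predict() method returns placeholder responses instead of real generations.

import mlflow
import mlflow.pyfunc

class ChatbotWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical wrapper that bundles the prompt template and inference
    settings with the model so every deployment behaves identically."""

    def load_context(self, context):
        # In a real setup you would load the fine-tuned weights from
        # context.artifacts here; this sketch keeps it self-contained.
        self.template = "Question: {question}\nAnswer:"

    def predict(self, context, model_input):
        prompts = [self.template.format(question=q) for q in model_input["question"]]
        # Placeholder responses instead of a real generation call.
        return [f"(generated answer for: {p})" for p in prompts]

with mlflow.start_run(run_name="package-for-deployment"):
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ChatbotWrapper(),
        registered_model_name="support-chatbot-llm",
    )

A registered version can then be served straight from the registry, for example with the MLflow CLI: mlflow models serve -m "models:/support-chatbot-llm@staging" -p 8080.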

Conclusion
In this white paper, we explored the standard steps involved in fine-tuning
an LLM for a generative AI application. We also highlighted the value that
MLflow brings to every step of this process.

Key takeaways:

• MLflow gives you various options to organize the data about your
runs, experiments and results in the most convenient way.
• MLflow collects and stores model development
metadata in a standardized way, ensuring
the reproducibility of runs and experiments.

• Automated logging and versioning significantly reduce the overhead of managing fine-tuning pipelines.

• Integration with popular LLM frameworks makes MLflow a versatile choice for diverse fine-tuning scenarios.

• MLflow fosters a transparent and collaborative environment, streamlining teamwork and improving workflows.

We hope this starter guide proves useful for ML teams and individuals
customizing existing models to extract additional value. MLflow’s robust
functionality is more than enough to support large-scale training
and fine-tuning for ML teams of any size.

Managed MLflow in Nebius AI Cloud
Considering how useful and convenient MLflow is for ML teams,
we decided to launch it in our cloud as a fully-managed solution.

From the user’s perspective, this means you don’t have to worry about
software version control, updates or server maintenance. Nebius handles
all necessary compute and supporting services to ensure MLflow runs
seamlessly and is available out of the box.

Figure 8. How Managed MLflow works in Nebius AI Cloud.

References
• MLflow documentation: Tutorials and Use Case Guides for GenAI applications in MLflow

• MLflow documentation: Metrics Tracking

• MLflow documentation: MLflow Model Registry

• What is supervised fine-tuning in LLMs? Unveiling the process (Nebius blog)

• Budget Instruction Fine-tuning of Llama 3 8B Instruct (MLOps Community blog)

• The Ultimate Guide to LLM Fine Tuning: Best Practices & Tools (Lakera blog)

© 2025 Nebius B.V.
