LLaMA-Factory: Merging LoRA Adapters

flyfish

Command to merge a LoRA adapter into the base model:

llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

Contents of the config file (llama3_lora_sft.yaml is the example config shipped with LLaMA-Factory; here it has been adapted for Qwen2.5-VL and, as shown in Section IV, saved as qwen2_5vl_lora_sft.yaml):

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
adapter_name_or_path: saves/qwen2_5vl-7b/lora/sft
template: qwen2_vl
trust_remote_code: true

### export
export_dir: output/qwen2_5vl_lora_sft
export_size: 5
export_device: cpu  # choices: [cpu, auto]
export_legacy_format: false

I. Parameter Descriptions

1. [model] section
  • model_name_or_path:
    Path or name of the base model, e.g. Qwen/Qwen2.5-VL-7B-Instruct.
    Note: this must be the original, unquantized weights (not a 4-bit/8-bit compressed model); otherwise the merge will fail or produce degraded precision.

  • adapter_name_or_path:
    Path to the LoRA adapter, e.g. saves/qwen2_5vl-7b/lora/sft.
    These are the low-rank parameters produced by training; they are merged into the base model's weights.

  • template:
    Name of the chat template the model uses (e.g. qwen2_vl). It mainly affects tokenization, in particular format compatibility for multimodal (mixed image-text) inputs.

  • trust_remote_code:
    Set to true to trust custom scripts or tokenizer code that may ship with the model repository; this is usually required for training and inference.

2. [export] section
  • export_dir:
    Directory where the merged full model is saved, e.g. output/qwen2_5vl_lora_sft.

  • export_size:
    Maximum size, in GB, of each exported weight shard. A value of 5 caps each file at roughly 5 GB, so the roughly 16 GB bf16 model ends up split across several files (e.g. model-00001-of-00004.safetensors; a quick check is sketched after this list).
    Purpose: avoid one oversized file and ease storage and transfer, which matters for large models (7B/13B and up).

  • export_device:
    Device used for the merge, either cpu or auto.

    • cpu: merge in CPU memory (slower but safer; avoids running out of GPU memory).
    • auto: use the GPU (if it has enough memory) to speed up the merge.
  • export_legacy_format:
    false uses the newer safetensors format, which modern tooling expects; true writes the legacy PyTorch .bin format (generally no longer recommended).
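
A minimal way to confirm the shard sizes after the export has finished; this is just an illustrative check (the directory name mirrors export_dir above), not part of LLaMA-Factory itself:

```python
# List the exported safetensors shards and their sizes (each should be <= ~5 GB).
import os

export_dir = "output/qwen2_5vl_lora_sft"  # export_dir from the YAML above
for name in sorted(os.listdir(export_dir)):
    if name.endswith(".safetensors"):
        size_gb = os.path.getsize(os.path.join(export_dir, name)) / 1024**3
        print(f"{name}: {size_gb:.2f} GB")
```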

II. Notes

  1. Do not use a quantized model

    • As the comment at the top of the config stresses: merging requires the original, uncompressed weights (here the bf16 checkpoint; fp16/fp32 checkpoints behave the same way). With a quantized base model the LoRA parameters cannot be folded correctly into the base weights.
  2. Merge procedure in brief (see the sketch after this list)

    1. Load the unquantized base model (e.g. Qwen2.5-VL-7B-Instruct).
    2. Load the trained LoRA adapter weights.
    3. Merge the LoRA parameters into the base weights, producing a complete "new model".
    4. Save the merged model as multiple shard files according to the config.
  3. What sharding (export_size) does
    A large model (e.g. 7B parameters) would otherwise end up in one very large file; splitting it into several shards makes loading and transfer easier. For example:

    • File names: model-00001-of-00004.safetensors (4 shards in this run, per the log below).
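
Conceptually, the export step does roughly the following. Below is a minimal sketch using PEFT and transformers directly; LLaMA-Factory handles all of this internally, so the class and argument choices here are assumptions based on the standard PEFT/transformers APIs rather than LLaMA-Factory's actual code, and the paths simply mirror the YAML above:

```python
# Sketch of the merge flow: base weights + LoRA deltas -> full standalone model.
import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# 1. Load the unquantized base model (bf16, on CPU to avoid GPU OOM, cf. export_device: cpu).
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="cpu"
)

# 2. Attach the trained LoRA adapter.
model = PeftModel.from_pretrained(base, "saves/qwen2_5vl-7b/lora/sft")

# 3. Fold the low-rank updates (B·A, scaled by alpha/r) into the frozen base weights.
merged = model.merge_and_unload()

# 4. Save as sharded safetensors, capping each shard at 5 GB (cf. export_size: 5).
merged.save_pretrained("output/qwen2_5vl_lora_sft",
                       max_shard_size="5GB", safe_serialization=True)
AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct").save_pretrained(
    "output/qwen2_5vl_lora_sft")  # keep tokenizer/processor alongside the weights
```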

III. Why merge LoRA

  • How LoRA training works: training updates only a small number of parameters (the low-rank matrices) while the base model's weights stay frozen.
  • Purpose of merging:
    • Permanently fold the LoRA adapter into the base model, producing a complete, standalone set of model files in export_dir.
    • The merged model can be deployed for inference directly, with no extra LoRA weight files to load (a minimal loading sketch follows).
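
For example, once the merge has finished, the export directory can be loaded like any regular Hugging Face checkpoint. A minimal text-only sketch; the class names follow the standard transformers API for Qwen2.5-VL, and the prompt and generation settings are purely illustrative:

```python
# Load the merged model straight from export_dir -- no adapter files needed.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

merged_dir = "output/qwen2_5vl_lora_sft"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    merged_dir, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(merged_dir)

messages = [{"role": "user",
             "content": [{"type": "text", "text": "Explain LoRA in one sentence."}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```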

IV. Merge command

llamafactory-cli export examples/merge_lora/qwen2_5vl_lora_sft.yaml

Log output from the merge run:

[2025-06-14 10:40:03,136] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 06-14 10:40:05 [__init__.py:239] Automatically detected platform cuda.
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,236 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,236 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,236 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,236 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,236 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,236 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,237 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-06-14 10:40:08,551 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:378] 2025-06-14 10:40:08,552 >> loading configuration file /media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:378] 2025-06-14 10:40:08,553 >> loading configuration file /media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/preprocessor_config.json
[WARNING|logging.py:328] 2025-06-14 10:40:08,553 >> Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[INFO|image_processing_base.py:433] 2025-06-14 10:40:08,553 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-06-14 10:40:08,554 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-06-14 10:40:08,863 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:884] 2025-06-14 10:40:09,405 >> Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='/media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

{
  "processor_class": "Qwen2_5_VLProcessor"
}

[INFO|configuration_utils.py:691] 2025-06-14 10:40:09,440 >> loading configuration file /media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:765] 2025-06-14 10:40:09,442 >> Model config Qwen2_5_VLConfig {
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.3",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "depth": 32,
    "fullatt_block_indexes": [
      7,
      15,
      23,
      31
    ],
    "hidden_act": "silu",
    "hidden_size": 1280,
    "in_channels": 3,
    "in_chans": 3,
    "intermediate_size": 3420,
    "model_type": "qwen2_5_vl",
    "num_heads": 16,
    "out_hidden_size": 3584,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "temporal_patch_size": 2,
    "tokens_per_second": 2,
    "window_size": 112
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}

[INFO|2025-06-14 10:40:09] llamafactory.model.model_utils.kv_cache:143 >> KV cache is enabled for faster generation.
[INFO|modeling_utils.py:1121] 2025-06-14 10:40:09,455 >> loading weights file /media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2025-06-14 10:40:09,455 >> Instantiating Qwen2_5_VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1142] 2025-06-14 10:40:09,457 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}

[INFO|modeling_utils.py:2167] 2025-06-14 10:40:09,457 >> Instantiating Qwen2_5_VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.95it/s]
[INFO|modeling_utils.py:4930] 2025-06-14 10:40:12,225 >> All model checkpoint weights were used when initializing Qwen2_5_VLForConditionalGeneration.

[INFO|modeling_utils.py:4938] 2025-06-14 10:40:12,225 >> All the weights of Qwen2_5_VLForConditionalGeneration were initialized from the model checkpoint at /media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2_5_VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1095] 2025-06-14 10:40:12,228 >> loading configuration file /media/user/model/Qwen/Qwen2___5-VL-7B-Instruct/generation_config.json
[INFO|configuration_utils.py:1142] 2025-06-14 10:40:12,228 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}

[INFO|2025-06-14 10:40:12] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
/home/user/anaconda3/envs/llamafactory/lib/python3.12/site-packages/awq/__init__.py:21: DeprecationWarning: 
I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/

  warnings.warn(_FINAL_DEV_MESSAGE, category=DeprecationWarning, stacklevel=1)

INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.                                                                                                                                                                                     
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.                                                                                                                                                                                                             
[INFO|2025-06-14 10:40:12] llamafactory.model.adapter:143 >> Merged 1 adapter(s).
[INFO|2025-06-14 10:40:12] llamafactory.model.adapter:143 >> Loaded adapter(s): saves/qwen2_5vl-7b/lora/sft
[INFO|2025-06-14 10:40:12] llamafactory.model.loader:143 >> all params: 8,292,166,656
[INFO|2025-06-14 10:40:12] llamafactory.train.tuner:143 >> Convert model dtype to: torch.bfloat16.
[INFO|configuration_utils.py:419] 2025-06-14 10:40:12,856 >> Configuration saved in output/qwen2_5vl_lora_sft/config.json
[INFO|configuration_utils.py:911] 2025-06-14 10:40:12,857 >> Configuration saved in output/qwen2_5vl_lora_sft/generation_config.json
[INFO|modeling_utils.py:3580] 2025-06-14 10:40:33,311 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at output/qwen2_5vl_lora_sft/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-06-14 10:40:33,312 >> tokenizer config file saved in output/qwen2_5vl_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-06-14 10:40:33,312 >> Special tokens file saved in output/qwen2_5vl_lora_sft/special_tokens_map.json
[INFO|image_processing_base.py:260] 2025-06-14 10:40:33,430 >> Image processor saved in output/qwen2_5vl_lora_sft/preprocessor_config.json
[INFO|tokenization_utils_base.py:2510] 2025-06-14 10:40:33,451 >> tokenizer config file saved in output/qwen2_5vl_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-06-14 10:40:33,451 >> Special tokens file saved in output/qwen2_5vl_lora_sft/special_tokens_map.json
[INFO|processing_utils.py:648] 2025-06-14 10:40:34,023 >> chat template saved in output/qwen2_5vl_lora_sft/chat_template.json
[INFO|2025-06-14 10:40:34] llamafactory.train.tuner:143 >> Ollama modelfile saved in output/qwen2_5vl_lora_sft/Modelfile
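
As the log shows, the run merged 1 adapter, converted the model to bfloat16, split the weights into 4 safetensors shards (each under the 5 GB export_size cap), and saved the config, tokenizer, image processor, chat template and an Ollama Modelfile alongside them in output/qwen2_5vl_lora_sft. The directory is now a self-contained model that can be loaded as in the sketch in Section III.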