安装LLaMA-Factory微调chatglm3，修改自我认知

MC数据局

已于 2023-12-13 22:24:08 修改

阅读量5.2k

点赞数 16

文章标签： python 人工智能深度学习

于 2023-12-11 09:32:25 首次发布

本文链接：https://blog.csdn.net/WASEFADG/article/details/134909922

版权

本文详细介绍了如何通过Git克隆LLaMA-Factory项目，使用Conda环境管理Python3.10版本，设置CUDA和命令行参数进行单卡训练。涉及了模型训练、模型合并和使用DeepSpeed进行分布式训练，以及可能出现的问题及解决方法。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

安装git clone https://github.com/hiyouga/LLaMA-Factory.git
conda create -n llama_factory python=3.10
conda activate llama_factory
cd LLaMA-Factory
pip install -r requirements.txt

之后运行

单卡训练，

CUDA_VISIBLE_DEVICES=0 python src/train_web.py，按如下配置

demo_tran.sh

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path /data/models/llm/chatglm3-lora/ \
    --do_train \
    --overwrite_output_dir \
    --dataset self_cognition \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir export_chatglm3 \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-3 \
    --num_train_epochs 10.0 \
    --plot_loss \
    --fp16

export_model.sh

python src/export_model.py \
    --model_name_or_path /data/models/llm/chatglm3-lora/ \
    --template chatglm3 \
    --finetuning_type lora \
    --checkpoint_dir /data/projects/LLaMA-Factory/export_chatglm3 \
    --export_dir lora_merge_chatglm3

cli_demo.sh

python src/cli_demo.py \
    --model_name_or_path /data/models/llm/chatglm3-lora/ \
    --template default \
    --finetuning_type lora

注意合并模型的时候，最后复制chatglm3的tokenizer.model和tokenizer_config.json到合并后模型覆盖之后，要修改

不覆盖会有这个错误，

Use DeepSpeed方法

deepspeed --num_gpus 3 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path /media/cys/65F33762C14D581B/chatglm2-6b \
    --do_train True \
    --finetuning_type lora \
    --template chatglm2 \
    --flash_attn False \
    --shift_attn False \
    --dataset_dir data \
    --dataset self_cognition,sharegpt_zh \
    --cutoff_len 1024 \
    --learning_rate 0.001 \
    --num_train_epochs 10.0 \
    --max_samples 1000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 1000 \
    --warmup_steps 0 \
    --neft_alpha 0 \
    --train_on_prompt False \
    --upcast_layernorm False \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target query_key_value \
    --resume_lora_training True \
    --output_dir saves/ChatGLM2-6B-Chat/lora/train_2023-12-12-23-26-49 \
    --fp16 True \
    --plot_loss True

ds_config.json的格式下面的：

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },  
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}

跑成功的效果图：

如果出现下面这个问题，

[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: CUDA error: the launch timed out and was terminated CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

可能原因是显卡坏了或者显卡不是同一个型号！