Environment setup
- Basic environment
# Create and activate the environment
conda create -n llama_factory python=3.10
conda activate llama_factory

# Pick the PyTorch CUDA build that matches your setup
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install the fine-tuning toolkit
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt

# Install the distributed training acceleration library
pip install deepspeed
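A quick sanity check after installation can catch a broken CUDA or DeepSpeed build before the first multi-GPU launch; a minimal sketch, assuming the llama_factory environment is active:

# Confirm the CUDA-enabled PyTorch build is picked up
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"

# DeepSpeed's own environment report (torch/CUDA versions and op compatibility)
ds_report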
- Download the model
# Mirror site: https://hf-mirror.com/
conda activate base
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat
# If the download is interrupted and file checksums fail, you can restrict the download with --include or --exclude; multiple file patterns are separated by spaces
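For example, to re-fetch only the weight shards and tokenizer files after a checksum failure, something along these lines should work (the glob patterns below are illustrative and need to match the actual file names in the repository):

# Re-download only the files matching the given patterns (patterns are examples)
huggingface-cli download --resume-download --local-dir-use-symlinks False \
    01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat \
    --include "*.safetensors" "tokenizer*"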
Training process
- Configuration file
# Reference: https://github.com/hiyouga/LLaMA-Factory/issues/256
vi ds_config_lora.json

{
  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 1e5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
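A stray comma or quote in this file only surfaces as a parse error at launch time, so it can be worth validating the JSON first, for example:

# Quick syntax check of the DeepSpeed config
python -m json.tool ds_config_lora.json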
- Training script
deepspeed --include localhost:0,1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"
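The effective batch size of this run is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs = 4 × 4 × 8 = 128. On a machine with fewer GPUs, the same effective batch size can be kept by scaling gradient_accumulation_steps; a sketch for a 4-GPU node (only --include and --gradient_accumulation_steps differ from the command above):

# Sketch: 4-GPU run with the same effective batch size (4 x 8 x 4 = 128)
deepspeed --include localhost:0,1,2,3 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"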
ZeRO-1 partitions the optimizer states; ZeRO-2 partitions the optimizer states and gradients; ZeRO-3 partitions the optimizer states, gradients, and parameters.
- Quantized training
# Add to the training command to enable 4-bit quantized (QLoRA) training
--quantization_bit 4

# ValueError: DeepSpeed ZeRO-3 is incompatible with quantization.
# Note: do not leave comment lines between the argument lines of the launch command; any arguments that follow such a comment silently stop taking effect.
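Since ZeRO-3 cannot be combined with quantized training, one workaround is to fall back to a ZeRO-2 config and pass that file via --deepspeed instead. Below is a minimal sketch derived from the ZeRO-3 config above; the file name ds_config_zero2_lora.json is my own choice, and whether the 4-bit run then fits in memory depends on your GPUs and library versions:

# Sketch: ZeRO-2 variant of the config (drops the stage-3-only parameter offload)
cat > ds_config_zero2_lora.json <<'EOF'
{
  "bfloat16": { "enabled": false },
  "fp16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
EOF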
Common issues
Extremely high memory usage
Since you are offloading both parameters and optimizer state to CPU, you would need roughly 18 bytes of CPU memory per model parameter. That means for a 7B model you would need ~126 GB of CPU memory. Please see page 3 of https://arxiv.org/pdf/1910.02054.pdf for a discussion of the memory breakdown.

With DeepSpeed's ZeRO-3 optimization enabled, the model can instead be partitioned and loaded directly into GPU memory, so host (CPU) memory usage stays low throughout initialization and training.
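For the 34B model used here, the 18-bytes-per-parameter rule of thumb works out to roughly 18 × 34×10^9 ≈ 612 GB of CPU memory with full CPU offload. DeepSpeed also ships a memory estimator that can be run before training; a sketch, where the parameter counts are rough figures for Yi-34B (the largest layer taken as the ~64000 × 7168 token embedding) and should be checked against the model's config.json:

# Rough ZeRO-3 memory estimate for a ~34B-parameter model on one 8-GPU node
python -c "
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold
estimate_zero3_model_states_mem_needs_all_cold(
    total_params=34e9,            # approximate parameter count of Yi-34B
    largest_layer_params=0.46e9,  # approximate size of the token embedding
    num_gpus_per_node=8,
    num_nodes=1,
)
"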