[LLM] DeepSeek-R1: SFT and GRPO training notes

Original article, last modified 2025-07-10 14:30:28

License: CC 4.0 BY-SA

Tags: #LLM #GRPO #RL #LargeModels

First published 2025-05-04 22:50:34

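Notes

Framework comparison:
- Fine-tuning a model with limited resources → Unsloth;
- Small-scale local inference where privacy comes first → Ollama;
- Complex logic or multimodal tasks → SGLang;
- High-concurrency production environments → vLLM.
Fine-tuning with SFT and GRPO does let the model learn new knowledge.
When using the swift framework, all four dataset formats (messages, sharegpt, alpaca, query-response) are converted by AutoPreprocessor into the messages field of the ms-swift standard format, so any of them can be passed directly with --dataset <dataset-path> and JSON data can be used as-is.
Simple, verifiable, outcome-based rewards (for example, checking whether an answer is right or wrong) are effective and reduce the risk of reward hacking.
Reasoning models introduce new safety challenges, such as reward hacking, overthinking, and specific jailbreaking vulnerabilities.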

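I. The Swift framework

Dataset definition

Countdown Game task: given several numbers, combine them with addition, subtraction, multiplication, and division to reach a target value.
Dataset size: 50,000 examples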

[INFO:swift] train_dataset: Dataset({
    features: ['nums', 'messages', 'target'],
    num_rows: 49500
})
[INFO:swift] val_dataset: Dataset({
    features: ['nums', 'messages', 'target'],
    num_rows: 500
})

The template uses numbers and target to define the task and provides a query field for the model to sample from. We also keep the nums and target fields, which the reward function needs later.

from typing import Any, Dict

from swift.llm import DatasetMeta, ResponsePreprocessor, register_dataset


class CountdownTaskPreprocessor(ResponsePreprocessor):

    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        numbers = row['nums']
        # The target value is read from the 'response' key of the row
        target = row.pop('response', None)
        query = f"""
        Using the numbers {numbers}, create an equation that equals {target}.
        You can use basic arithmetic operations (+, -, *, /) and each number can only be used once.
        Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags,
        for example <answer> (1 + 2) / 3 * 4 = 4 </answer>.
        """
        # Keep 'target' (alongside 'nums') on the row so the reward function can read it later
        row.update({'target': target, 'query': query})
        return super().preprocess(row)


# Register the dataset with the custom preprocessor
register_dataset(
    DatasetMeta(
        ms_dataset_id='zouxuhong/Countdown-Tasks-3to4',
        subsets=['default'],
        preprocess_func=CountdownTaskPreprocessor(),
        tags=['math']))

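Reward functions

Format reward function: the format reward described in the DeepSeek-R1 paper is built into swift and can be used directly via --reward_funcs format.
Accuracy reward function: defined through external_plugin, with the code placed in swift/examples/train/grpo/plugin/plugin.py.
- The reward function takes three inputs, completions, target, and nums, which are the model-generated text, the target answer, and the available numbers.
- Each is a list, so multiple completions can be scored at once. Note that every argument other than completions is passed through from the fields defined in the dataset; if the task changes, adjust the dataset and the reward function accordingly.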
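As a rough illustration of what a format reward checks, here is a minimal sketch. It is not swift's built-in implementation: the function name `format_reward` and the exact regular expression are assumptions made for this example. It returns 1.0 when a completion consists of a `<think>` block followed by an `<answer>` block, and 0.0 otherwise.

```python
import re
from typing import List

# Hypothetical pattern: a <think> block, optional whitespace, then an <answer> block.
_FORMAT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def format_reward(completions: List[str]) -> List[float]:
    """Return 1.0 for each completion that follows the think/answer layout, else 0.0."""
    return [1.0 if _FORMAT_PATTERN.match(c) else 0.0 for c in completions]


samples = [
    "<think>1 + 2 = 3, 3 / 3 = 1, 1 * 4 = 4</think> <answer>(1 + 2) / 3 * 4 = 4</answer>",
    "The answer is (1 + 2) / 3 * 4 = 4",
]
print(format_reward(samples))
```

The first sample follows the layout and the second does not, so only the first would earn the format reward under this sketch.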

import re
from typing import List

from swift.plugin import ORM, orms


class CountdownORM(ORM):
    def __call__(self, completions, target, nums, **kwargs) -> List[float]:
        """
        Evaluates completions based on mathematical correctness of the answer
        Args:
            completions (list[str]): Generated outputs
            target (list[str]): Expected answers
            nums (list[list[int]]): Available numbers for each completion
        Returns:
            list[float]: Reward scores
        """
        rewards = []
        for completion, gt, numbers in zip(completions, target, nums):
            try:
                # Check if the format is correct
                match = re.search(r"<answer>(.*?)<\/answer>", completion)
                if match is None:
                    rewards.append(0.0)
                    continue
                # Extract the "answer" part from the completion
                equation = match.group(1).strip()
                if '=' in equation:
                    equation = equation.split('=')[0]
                # Extract all numbers from the equation
                used_numbers = [int(n) for n in re.findall(r'\d+', equation)]
                # Check if all numbers are used exactly once
                if sorted(used_numbers) != sorted(numbers):
                    rewards.append(0.0)
                    continue
                # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace
                allowed_pattern = r'^[\d+\-*/().\s]+$'
                if not re.match(allowed_pattern, equation):
                    rewards.append(0.0)
                    continue
                # Evaluate the equation with restricted globals and locals
                result = eval(equation, {"__builtins__": None}, {})
                # Check if the equation is correct and matches the ground truth
                if abs(float(result) - float(gt)) < 1e-5:
                    rewards.append(1.0)
                else:
                    rewards.append(0.0)
            except Exception:
                # If evaluation fails, reward is 0
                rewards.append(0.0)
        return rewards
orms['external_countdown'] = CountdownORM

GRPO objective

$\begin{aligned} \mathcal{J}_{\mathrm{GRPO}}(\theta) & = \mathbb{E}\left[q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)\right] \\ & \quad \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min\left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \operatorname{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right\} \end{aligned}$
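The objective uses the advantage $\hat{A}_{i,t}$ without defining it. With outcome rewards, GRPO typically normalizes each completion's reward within its group of $G$ samples, $\hat{A}_i = (r_i - \operatorname{mean}(r)) / \operatorname{std}(r)$, and shares that value across all tokens of the completion. The sketch below works this out in plain Python. The function names, the `eps` guard, the use of the population standard deviation, and the sample rewards are assumptions for illustration; real implementations differ in such details.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# One prompt, G = 4 completions, binary correctness rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)

# If every completion in the group gets the same reward, all advantages are zero,
# so that group contributes no policy-gradient signal.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)

# A probability ratio above 1 + eps is clipped when the advantage is positive.
print(clipped_term(ratio=1.5, advantage=1.0))
```

In this sketch correct completions get an advantage close to +1 and incorrect ones close to -1. The uniform case also shows why a group whose rewards are all identical stops contributing to the update.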

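Training parameters

We use Qwen2.5-3B-Instruct as the base model for training. The main reason for choosing the Instruct model rather than the base model is that it starts earning the format reward sooner. We run the experiment on three GPUs: vLLM inference is deployed on the last GPU, the number of processes is set to 2, and gradient updates run on the remaining two GPUs.

Because the task is fairly simple, we set max_completion_length and vllm_max_model_len to 1024; for more complex tasks, increase the output length as appropriate. Note that the larger these two parameters are, the more GPU memory training needs and the slower it runs; the training time of a single step scales linearly with max_completion_length.

In our experiment, the total batch size is $num\_processes \times per\_device\_train\_batch\_size \times gradient\_accumulation\_steps = 2 \times 8 \times 8 = 128$. The settings are subject to one constraint: $num\_processes \times per\_device\_train\_batch\_size$ must be divisible by $num\_generations$, where $num\_generations$ is the $G$ in the GRPO objective, so we set it to 8.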

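Note:

The per-device batch_size is also closely tied to GPU memory; pick a value that fits within your memory limit.
Total number of steps: $num\_steps = epochs \times len(datasets) \times num\_generations \div batch\_size$; plan the learning rate and warmup settings around this number.
The key settings are the learning rate and beta. The learning rate is self-explanatory; beta is the $\beta$ in the objective above, that is, the weight on the KL-divergence term. In principle, the larger these two parameters are, the faster the model converges, but training tends to become unstable. After experimenting, we set them to 5e-7 and 0.001 respectively. In actual training, adjust both according to whether unstable oscillation appears.
The community has discussed the KL-divergence term at length; see "Why does GRPO insist on using KL divergence?".
Parameter reference: https://swift.readthedocs.io/zh-cn/latest/Instruction/%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0.html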
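The batch-size and step-count arithmetic above can be checked with a few lines of Python. All values are taken from this section (2 processes, per-device batch size 8, 8 gradient-accumulation steps, 8 generations, 1 epoch, 49,500 training rows, warmup ratio 0.01); the variable names simply mirror the command-line flags and are not an API.

```python
num_processes = 2
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_generations = 8
num_train_epochs = 1
train_rows = 49500  # num_rows reported for train_dataset in the log
warmup_ratio = 0.01

# Completions handled per forward/backward pass across all training processes.
completions_per_pass = num_processes * per_device_train_batch_size
assert completions_per_pass % num_generations == 0, "must be divisible by num_generations"

total_batch_size = completions_per_pass * gradient_accumulation_steps
num_steps = num_train_epochs * train_rows * num_generations / total_batch_size
warmup_steps = round(warmup_ratio * num_steps)

print(total_batch_size)
print(num_steps)
print(warmup_steps)
```

With these settings the total batch size is 128, the run is roughly 3,094 optimizer steps, and warmup covers about 31 of them.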

CUDA_VISIBLE_DEVICES=0,1,2 \
WANDB_API_KEY=your_wandb_key \
NPROC_PER_NODE=2 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-3B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_countdown format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'zouxuhong/Countdown-Tasks-3to4#50000' \
    --max_length 2048 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 5e-7 \
    --gradient_accumulation_steps 8 \
    --eval_steps 500 \
    --save_steps 100 \
    --save_total_limit 20 \
    --logging_steps 1 \
    --output_dir output/GRPO_COUNTDOWN \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --num_generations 8 \
    --temperature 1.0 \
    --system 'You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.' \
    --deepspeed zero3 \
    --log_completions true \
    --vllm_max_model_len 1024 \
    --report_to wandb \
    --beta 0.001 \
    --num_iterations 1

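Training results

(1) reward_std fluctuates at first and drops to around 0 after about 300 steps, which suggests that training has largely converged.
[Figure: reward_std over training steps]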

References:
[1] Fine-tuning with the swift framework: https://github.com/modelscope/ms-swift/tree/main/examples/train/think_model
[2] Qwen3 MoE distributed training: [Fine-tuning] Qwen3-MoE Megatron Training Implementation and Best Practices👋 #1278
[3] Swift fine-tuning command-line parameters
[4] Fine-tuning Qwen3 with MS-SWIFT

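II. The Unsloth framework

Link: https://docs.unsloth.ai/basics/reasoning-grpo-and-rl/tutorial-train-your-own-reasoning-model-with-grpo

1. Introduction to Unsloth

The open-source project Unsloth AI reports a major improvement: by optimizing GRPO training it cuts memory use by 80%, so that a GPU with 7GB of VRAM is enough to run a DeepSeek-R1-style reasoning model locally;
Unsloth is deeply integrated with vLLM, which can raise model throughput by 20x while needing only half the VRAM, so that a single 48GB GPU can fine-tune Llama 3.3 70B;
The project has more than 20,000 stars on GitHub, and its core team consists of just two brothers; it has substantially lowered the barrier to deploying reasoning models. The "aha" moment can now be experienced locally: with only 7GB of VRAM you can reproduce DeepSeek-R1-style reasoning on a local device, and 15GB of VRAM is enough to turn Llama-3.1 (8B) and Phi-4 (14B) into reasoning models.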

2. Usage

Unsloth is an all-in-one framework for inference and fine-tuning. It speeds up fine-tuning of Llama 3.3, Mistral, Phi-4, Qwen 2.5, and Gemma by 2x while saving 80% of the memory. Project page: https://github.com/unslothai/unsloth

Install it with the following commands:

pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

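3. Training parameters

SFTTrainer performs supervised fine-tuning (SFT) and is suitable for fine-tuning models in the transformers and Unsloth ecosystems.

(1) Relevant libraries

SFTTrainer (from the trl library):
- trl (Transformer Reinforcement Learning) is a Hugging Face library that provides supervised fine-tuning (SFT) and reinforcement learning (RLHF) functionality.
- SFTTrainer is mainly used for supervised fine-tuning and works with low-rank adaptation methods such as LoRA.
TrainingArguments (from the transformers library):
- This class defines training hyperparameters such as batch size, learning rate, optimizer, and number of training steps.
is_bfloat16_supported (from unsloth):
- This function checks whether the current GPU supports bfloat16 (BF16) and returns True if it does, False otherwise.
- bfloat16 is a more efficient numeric format that performs better on newer NVIDIA GPUs such as the A100 and H100.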

SFTTrainer section
[Figure: SFTTrainer section]

TrainingArguments section
[Figure: TrainingArguments section]

Reference: "DeepSeek fine-tuning from scratch: a hands-on guide (SFT)", Alibaba Cloud Developer Community

III. The Open R1 project

One parquet file of the dataset: /root/paddlejob/workspace/env_run/gtest/rl_train/data/OpenR1-Math-220k/open-r1/OpenR1-Math-220k/all/train-00001-of-00020.parquet

SFT training:

# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
    --dataset_name open-r1/OpenR1-Math-220k \
    --learning_rate 1.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --max_seq_length 16384 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill

# Train via YAML config
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml

GRPO training:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml

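Data generation

To build OpenR1-220k, we use the DeepSeek R1 large language model to generate solutions for 400k problems from NuminaMath 1.5. We follow the parameters recommended in the model card and prepend the following instruction to the user prompt: "Please reason step by step, and put your final answer within \boxed{}."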

from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration


prompt_template = """\
You will be given a problem. Please reason step by step, and put your final answer within \\boxed{}:
{{ instruction }}"""

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # Exchange with another smol distilled r1

with Pipeline(
    name="distill-qwen-7b-r1",
    description="A pipeline to generate data from a distilled r1 model",
) as pipeline:

    llm = vLLM(
        model=model_id,
        tokenizer=model_id,
        extra_kwargs={
            "tensor_parallel_size": 1,
            "max_model_len": 8192,
        },
        generation_kwargs={
            "temperature": 0.6,
            "max_new_tokens": 8192,
        },
    )
    prompt_column = "problem"
    text_generation = TextGeneration(
        llm=llm, 
        template=prompt_template,
        num_generations=4,
        input_mappings={"instruction": prompt_column} if prompt_column is not None else {}
    )


if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="username/numina-deepseek-r1-qwen-7b")

Prompt (for validating answers):

You are a mathematical answer validator. You will be provided with a mathematical problem and you need to compare the answer in the reference solution, and the final answer in a model's solution to determine if they are equivalent, even if formatted differently.

PROBLEM:

{problem}

REFERENCE SOLUTION:

{answer}

MODEL'S SOLUTION:

{generation}

Focus ONLY on comparing the final mathematical answer provided by the model while ignoring differences in:

- Formatting (e.g., \\boxed{{}} vs plain text)
- Multiple choice formatting (e.g., "A" vs full solution)
- Order of coordinate pairs or solutions
- Equivalent mathematical expressions or notation variations
- If the model's answer is nonsense, return "Verdict: AMBIGUOUS"

Start with a brief explanation of your comparison (2-3 sentences). Then output your final answer in one of the following formats:

- "Verdict: EQUIVALENT"
- "Verdict: DIFFERENT"
- "Verdict: AMBIGUOUS"
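The doubled braces in `\\boxed{{}}` suggest the prompt is filled in with Python's `str.format`. A minimal sketch of that step, plus parsing the verdict out of the judge's reply, might look like the following. The names `JUDGE_TEMPLATE`, `VERDICT_RE`, and `parse_verdict`, the abbreviated template, and the sample reply are all hypothetical and not taken from the Open R1 code.

```python
import re
from typing import Optional

# Abbreviated stand-in for the full prompt above; only the placeholders matter here.
JUDGE_TEMPLATE = (
    "PROBLEM:\n\n{problem}\n\n"
    "REFERENCE SOLUTION:\n\n{answer}\n\n"
    "MODEL'S SOLUTION:\n\n{generation}\n"
)

VERDICT_RE = re.compile(r"Verdict:\s*(EQUIVALENT|DIFFERENT|AMBIGUOUS)")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last verdict label found in the judge's reply, or None if absent."""
    matches = VERDICT_RE.findall(judge_output)
    return matches[-1] if matches else None


prompt = JUDGE_TEMPLATE.format(
    problem="What is 1/2 + 1/4?",
    answer="3/4",
    generation="... so the result is \\boxed{0.75}",
)
reply = "Both answers denote the same value, since 3/4 = 0.75. Verdict: EQUIVALENT"
print(parse_verdict(reply))
```

Taking the last match rather than the first is a design choice for this sketch: the prompt asks for an explanation first and the verdict at the end, so a label quoted inside the explanation should not win.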

Model training

Model evaluation

Evaluation on several standard benchmarks:

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
TASK=aime24
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# MATH-500
TASK=math_500
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# LiveCodeBench
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

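IV. Lessons learned from GRPO

Reward functions

Recall that in 2023 most teams could not get RLHF to work well, and reward-model accuracy was for a time stuck below 70%. Everyone knows reinforcement learning is powerful, yet it feels unrelated to our own business, so how do we apply RLHF to our projects? Set aside the algorithmic details of RL and put the effort into the reward: it is simple, and it matters.

1. RL and reward

The essence of RLHF is turning human preferences into a quantifiable reward signal. The reward function tells the model "what a good output is", and the RL algorithm merely trains that feedback into the model parameters. For the final result of RL, the reward matters as much as building high-quality data. DeepSeek-R1's GRPO, for instance, uses rule-based reward functions designed for math and code tasks.

2. Strategies for constructing rewards

Task relevance: the reward signal should match the task objective. For example, math problems are about correctness, writing values diversity, and a sales assistant needs emotional intelligence.
Quantifiability: only quantifiable metrics can be handed to RL for training. Whether an answer is right or wrong is judged by rules and given a 0/1 value; "this answer is good" is turned by a reward model into a score between 0 and 1.0.
Relativity: PPO assigns an absolute reward to each answer (from a reward model trained on pairwise comparisons), DPO builds relative preference relations between answers, and GRPO computes the relative reward advantage within a group of samples.
Reasoning process: whether to give one final reward for the whole process or to evaluate every reasoning step, and whether a reasoning process is required at all, can each be made part of the reward signal.

In a customer-service scenario, knowledge accuracy (relevance between the knowledge base and the answer), conciseness (a length constraint), and problem resolution (a trained reward model) can all serve as reward signals, and sometimes several reward models are trained.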
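One simple way to combine several signals like these is a weighted sum. The sketch below is only an illustration: the weights, the length budget, and the function names are hypothetical, and the accuracy and resolution scores are assumed to come from elsewhere (a retrieval-relevance score and a reward model, respectively), each already scaled to [0, 1].

```python
from typing import Dict

# Hypothetical weights; in practice these would be tuned for the business scenario.
WEIGHTS: Dict[str, float] = {"accuracy": 0.5, "brevity": 0.2, "resolution": 0.3}


def brevity_score(answer: str, max_chars: int = 200) -> float:
    """1.0 within the length budget, decaying linearly to 0.0 at twice the budget."""
    overflow = max(0, len(answer) - max_chars)
    return max(0.0, 1.0 - overflow / max_chars)


def combined_reward(accuracy: float, resolution: float, answer: str) -> float:
    """Weighted sum of three signals, each expected to lie in [0, 1]."""
    signals = {
        "accuracy": accuracy,
        "brevity": brevity_score(answer),
        "resolution": resolution,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


reward = combined_reward(
    accuracy=0.9,
    resolution=0.8,
    answer="Please restart the router and try again.",
)
print(reward)
```

Because the weights sum to 1 and every signal lies in [0, 1], the combined reward stays in [0, 1] as well, which keeps it comparable with a single 0/1 rule-based reward.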

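Many interviewers who are not familiar with the details of RL algorithms also like to ask: "How would you design the reward function in RLHF?" A good answer should cover an analysis of the specific task, how the reward is quantified, and how bias is handled (the KL-divergence constraint). Talking only about how PPO works is not enough.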

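17 takeaways about DeepSeek-R1

100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models, https://arxiv.org/pdf/2505.00551

The 17 takeaways:
1. High-quality, verified chain-of-thought (CoT) data is effective for supervised fine-tuning (SFT).
2. Selecting harder problems for SFT (for example, filtering by low pass rate on a weaker model) significantly improves model performance.
3. Open datasets are contaminated with benchmark samples, so careful decontamination is needed for fair evaluation.
4. Datasets that lean toward longer CoT (which usually means harder problems) tend to yield better reasoning performance after SFT.
5. SFT effectively gives the model a reasoning structure and lays the necessary foundation for subsequent reinforcement learning (RL).
6. Compared with base models, instruction-tuned models learn reasoning patterns more effectively during SFT.
7. RL datasets benefit from strict verification (for example, math solvers or code execution) and from filtering out "uncertain" samples that the model may get wrong.
8. Simple, verifiable, outcome-based rewards (for example, checking whether an answer is right or wrong) are effective and reduce the risk of reward hacking.
9. In RL for reasoning models (RL for Verification/Reasoning), the necessity and benefit of explicit format or length rewards remain debated; models can sometimes learn these aspects implicitly.
10. PPO and GRPO are the most commonly used RL algorithms, while their variants (such as DAPO, Dr. GRPO, VC-PPO, and VAPO) are designed to address biases (such as length bias and difficulty bias) and training instability.
11. KL loss is commonly used to improve training stability, but in RL training of reasoning models it is sometimes omitted, or is found to limit the model's exploration and final performance gains.
12. Gradually increasing the difficulty of training samples, or the maximum response length the model is allowed, during RL training helps performance and stability.
13. Focusing training on harder samples, or removing easy samples the model has already "learned to solve", can improve RL training efficiency.
14. Methods that incorporate a value function (such as VC-PPO and VAPO) may outperform value-free methods (such as GRPO) on long-CoT problems.
15. RL training can improve out-of-domain generalization, possibly beyond what SFT alone achieves, and can even generalize to seemingly unrelated tasks (for example, math/code training improving poetry writing).
16. Reasoning models introduce new safety challenges, such as reward hacking, overthinking, and specific jailbreaking vulnerabilities.
17. For smaller models (for example, <32B parameters), reproducing the best performance with RL alone is usually more challenging than using distilled checkpoints.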

Why GRPO is prone to reward collapse

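To see why GRPO runs into this problem, you need a detailed picture of the basic iterative architecture of reinforcement learning (RL), namely the Actor-Critic architecture. Many Chinese books translate AC literally as the "actor-critic" (演员-评论家) architecture, which honestly feels crude and loses any sense of a faithful, expressive, and elegant translation. I prefer another Chinese rendering, the "knowing-acting interaction" (知行互动) architecture. It is inspired by Wang Yangming's "unity of knowing and acting" (知行合一) and carries more of the historical depth of Chinese culture.

"Knowing" is the Critic, which evaluates and guides the action; "acting" is the Actor, which improves based on what is known. "Interaction" reflects the iterative nature of the algorithm. Why does the knowing-acting (AC) architecture need a Critic? This comes down to the stability of RL algorithms. Compared with supervised learning (SL), RL is a training mechanism that is genuinely hard to stabilize.

The main reasons are roughly as follows:

RL solves an optimal-control problem for a dynamic system, whereas SL solves a static optimization problem. A moving target is harder to handle than a static one.
On top of that, RL data is non-stationary, and the environment-agent interaction loop collects relatively little data, so the gradient estimates have higher variance; with high variance the updates drift away from the intended objective and training diverges easily.

How do mainstream RL algorithms address this?

They add a Critic and use a state-value function or an action-value function to stabilize the policy-gradient computation. More advanced algorithms use an advantage function, that is, they add a baseline, which makes the gradient computation more stable. This is one of the reasons AC algorithms consistently outperform REINFORCE. GRPO, however, has no Critic, and the reason is fairly simple: GRPO is meant for training large models (on the order of 100 billion parameters), and with the knowing-acting architecture you would need to store two large models, a Critic network and an Actor network, which places extremely high demands on memory.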
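The effect of a baseline can be seen exactly in a toy one-step problem. The sketch below is an assumed setup, not anything from GRPO itself: a Bernoulli policy picks action 1 with probability p, the reward is 10 + a, and the single-sample estimator is g = (r(a) - b) multiplied by the score (a - p). The function name `estimator_stats` and all numbers are made up for illustration.

```python
from typing import Callable, Tuple


def estimator_stats(
    p: float, reward: Callable[[int], float], baseline: float
) -> Tuple[float, float]:
    """Exact mean and variance of g = (r(a) - b) * (a - p) for a Bernoulli policy.

    The policy picks a = 1 with probability p = sigmoid(theta), so the score
    function with respect to the logit theta is (a - p).
    """
    outcomes = [(1, p), (0, 1.0 - p)]
    mean = sum(prob * (reward(a) - baseline) * (a - p) for a, prob in outcomes)
    second = sum(prob * ((reward(a) - baseline) * (a - p)) ** 2 for a, prob in outcomes)
    return mean, second - mean ** 2


def reward(a: int) -> float:
    """Reward with a large constant offset, which inflates variance without a baseline."""
    return 10.0 + a


no_baseline = estimator_stats(0.5, reward, baseline=0.0)
with_baseline = estimator_stats(0.5, reward, baseline=10.5)  # 10.5 is the mean reward
print(no_baseline)
print(with_baseline)
```

Both estimators have the same expected value, so the baseline does not bias the gradient, but subtracting the mean reward removes the variance caused by the constant offset. A Critic plays this role with a learned, state-dependent baseline.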

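How can memory be saved?

Remove the Critic network and replace it with an online estimate of the advantage function, trading time (compute) for space (memory). That is the design idea behind GRPO. By contrast, in OpenAI's PPO algorithm (which GRPO builds on), the value function is usually a model about as large as the policy model, which brings a significant memory and compute burden. Since OpenAI is short of neither compute nor memory, it can use PPO with great success even though the algorithm is designed this badly.

Apart from DeepSeek, quite a few large-model teams in China have simply copied the recipe as-is, and in doing so have chosen a suboptimal technical path, because they forgot what the biggest difference between us and OpenAI is. Back to the original question: in principle GRPO is not perfect. Compared with PPO it is roughly on par, and its design has a "stability" flaw. So why does DeepSeek still manage to use it well?

Because DeepSeek has enough data, enough to sidestep GRPO's stability flaw "perfectly". For each policy-gradient computation, as long as the batch is large enough, the variance of the policy gradient is effectively reduced and the iteration becomes fairly stable.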
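The claim that a larger batch lowers the variance of the gradient estimate can be checked with a small simulation. This is a toy model under stated assumptions: each single-sample gradient estimate is drawn from a Gaussian with mean 0.25 and standard deviation 5.0 (made-up numbers), and the batch gradient is their average. The function names are hypothetical.

```python
import random
import statistics


def batch_gradient(batch_size: int, rng: random.Random) -> float:
    """Average of `batch_size` noisy single-sample gradient estimates."""
    return statistics.fmean([rng.gauss(0.25, 5.0) for _ in range(batch_size)])


def spread(batch_size: int, trials: int = 200, seed: int = 0) -> float:
    """Standard deviation of the batch-averaged estimate across repeated trials."""
    rng = random.Random(seed)
    return statistics.stdev([batch_gradient(batch_size, rng) for _ in range(trials)])


small = spread(batch_size=8)
large = spread(batch_size=512)
print(small, large)
```

For independent samples the spread shrinks roughly as one over the square root of the batch size, so going from 8 to 512 samples should cut it by a factor of about 8. That is the sense in which enough data per update can mask the missing Critic.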

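Clearly, for university research teams and for small-to-medium RL training (on the order of millions or tens of millions of parameters), GRPO is not a good choice, especially when each update uses a small batch; in that case its stability flaw is fatal. For policy training at this scale, prefer RL algorithms that include a Critic.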