Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

This article is part of the LLM paper series; it is a translation of "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling".

Abstract

Training on high-quality synthetic data from strong language models (LMs) is a common strategy for improving the reasoning performance of LMs. In this work, we revisit whether this strategy is compute-optimal under a fixed inference budget (e.g., FLOPs). To do so, we study the trade-off between generating synthetic data with a stronger but more expensive (SE) model and a weaker but cheaper (WC) model. We evaluate the generated data along three key metrics: coverage, diversity, and false positive rate, and show that data from WC models may have higher coverage and diversity, but also exhibits a higher false positive rate. We then finetune models on data from SE and WC models in different settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup in which a weaker LM teaches reasoning to a stronger LM. Our findings show that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models. These results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC may be the compute-optimal approach for training advanced LM reasoners.
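The fixed-budget comparison in the abstract is easiest to see as a small calculation. The sketch below is not the paper's code; it only illustrates the compute-matched sampling idea under the standard ~2P-FLOPs-per-token cost approximation, and the 9B/27B parameter counts are illustrative stand-ins for a WC/SE model pair.

```python
# A minimal sketch (assumption: not the paper's code) of compute-matched sampling.
# Under a fixed sampling FLOPs budget, a weaker-but-cheaper (WC) model can produce
# more solutions per question than a stronger-but-more-expensive (SE) model,
# because generating one token costs roughly 2 * P FLOPs for a P-parameter model.

def wc_samples_at_fixed_budget(p_wc: float, p_se: float, samples_se: int) -> int:
    """Number of WC samples per question that matches the FLOPs of `samples_se`
    SE samples, using samples_wc / samples_se = p_se / p_wc."""
    return int(samples_se * p_se / p_wc)


if __name__ == "__main__":
    # Illustrative sizes only: a 9B WC model vs. a 27B SE model.
    print(wc_samples_at_fixed_budget(p_wc=9, p_se=27, samples_se=1))   # -> 3
    print(wc_samples_at_fixed_budget(p_wc=9, p_se=27, samples_se=10))  # -> 30
```

In words: at the same compute, every SE solution can be traded for roughly three WC solutions in this example, which is what drives the higher coverage and diversity of WC-generated data discussed in the abstract.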

1 Introduction

2 Preliminaries

3 Compute-Matched Sampling and Training
