【Accelerate】accelerate-large-models (RuntimeError: Expected all tensors to be on the same device…)

accelerate-large-models

1. Loading and Running Large Models

1.1 Regular models

  1. Create the model
  2. Load the weights into memory
  3. Load those weights into the model
  4. Move the model onto the inference device

1.2 Large models

  1. Create an empty (i.e. weightless) model
  2. Decide where each layer goes (when several devices are available)
  3. Load part of the weights into memory
  4. Load those weights into the empty model
  5. Move the weights onto the device for inference
  6. Repeat from step 3 with the next part of the weights, until all weights are loaded (a minimal sketch of this loop follows the list)
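
The loop in steps 3-6 can be sketched in a few lines. This is a simplified illustration, not Accelerate's actual implementation: the shard filenames and the flat device_map lookup are assumptions, and disk offload is ignored here. set_module_tensor_to_device is Accelerate's helper for materializing a weight on its target device.

import torch
from accelerate.utils import set_module_tensor_to_device

def load_sharded(model, shard_paths, device_map):
    for path in shard_paths:                          # e.g. ["shard_0.pth", "shard_1.pth"] (hypothetical)
        shard = torch.load(path, map_location="cpu")  # step 3: load one part of the weights into memory
        for name, weight in shard.items():
            module_name = name.rsplit(".", 1)[0]      # crude lookup; real maps key on module prefixes
            device = device_map.get(module_name, "cpu")
            # steps 4-5: put the weight into the (empty) model, directly on its target device
            set_module_tensor_to_device(model, name, device, value=weight)
        del shard                                     # step 6: discard this part before loading the next
    return model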

2. Creating an Empty Model

For example, the following tensor cannot be created on most machines (at the default FP32 precision, a 100,000 × 100,000 tensor needs about 40GB of RAM):

import torch

large_tensor = torch.randn(100000, 100000)

whereas on the meta device it works:

import torch

large_tensor = torch.randn(100000, 100000, device="meta")
# tensor(..., device='meta', size=(100000, 100000))
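
Meta tensors carry shape and dtype but no data, so operations on them only do shape inference. A quick illustration (my own example, not from the original post):

x = torch.randn(4, 100000, device="meta")
y = x @ large_tensor   # no memory is allocated; only the output shape is computed
print(y.shape)         # torch.Size([4, 100000])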

In practice, we can't go through the modeling code and change the device of every tensor it creates by hand.

So we usually instantiate the model like this (using BLOOM as an example):

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
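
Under the hood, init_empty_weights is a context manager that temporarily overrides nn.Module.register_parameter so that every parameter is created on the meta device. A simplified sketch, following the snippet in the referenced blog post (the real implementation also handles buffers and extra parameter attributes):

import torch
from contextlib import contextmanager

@contextmanager
def init_empty_weights_sketch():
    old_register_parameter = torch.nn.Module.register_parameter

    def register_empty_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            # re-create the parameter on the meta device so no memory is allocated
            module._parameters[name] = torch.nn.Parameter(
                module._parameters[name].to("meta"), requires_grad=param.requires_grad
            )

    torch.nn.Module.register_parameter = register_empty_parameter
    try:
        yield
    finally:
        torch.nn.Module.register_parameter = old_register_parameter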

3. Computing a Device Map

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(model)

Running this returns a result like the following:

{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10.self_attn': 0,
 'model.decoder.layers.10.activation_fn': 0,
 'model.decoder.layers.10.self_attn_layer_norm': 0,
 'model.decoder.layers.10.fc1': 'cpu',
 'model.decoder.layers.10.fc2': 'cpu',
 'model.decoder.layers.10.final_layer_norm': 'cpu',
 'model.decoder.layers.11': 'cpu',
 ...
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18.self_attn': 'cpu',
 'model.decoder.layers.18.activation_fn': 'cpu',
 'model.decoder.layers.18.self_attn_layer_norm': 'cpu',
 'model.decoder.layers.18.fc1': 'disk',
 'model.decoder.layers.18.fc2': 'disk',
 'model.decoder.layers.18.final_layer_norm': 'disk',
 'model.decoder.layers.19': 'disk',
 ...
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

From this result we can see that:

  • Layers 0 to 9 are on GPU 0.
  • The first part of layer 10 is on GPU 0 and the rest is on the CPU.
  • Layers 11 to 17 are on the CPU.
  • The first part of layer 18 is on the CPU and the rest is on disk.
  • Layers 19 to 39 are on disk.

This map is not usable as-is, since each layer needs all of its weights on the same device to run its forward pass.

Therefore, we should tell Accelerate not to split the decoder layers:

device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"])

The full code:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"])

This returns:

{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10': 'cpu',
 'model.decoder.layers.11': 'cpu',
 ...
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18': 'disk',
 ...
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

Explanation of the device_map modes:

  • "auto" or "balanced": Accelerate splits the weights so that each GPU carries an equal load.
  • "balanced_low_0": Accelerate splits the weights so that each GPU carries an equal load, except the first GPU, which it tries to keep as light as possible (useful when you want to process the model's outputs on one GPU, for instance when using the generate function).
  • "sequential": Accelerate fills the GPUs in order (so the last ones may not be used at all). An example of capping per-device memory follows this list.
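
You can also cap how much memory each device may hold when computing the map, via the max_memory argument of infer_auto_device_map. The budgets below are placeholder assumptions; adjust them to your hardware:

from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", "cpu": "30GiB"},  # assumed budgets for GPU 0 and CPU RAM
    no_split_module_classes=["OPTDecoderLayer"],
)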

4. Sharded State Dicts

4.1 Traditional weight saving/loading

# Save the model weights
torch.save(my_model.state_dict(), 'model_weights.pth')

# Reload them
new_model = ModelClass()
new_model.load_state_dict(torch.load('model_weights.pth'))

4.2 Large models

Large models on the Hugging Face Hub are not saved and shared as one big file containing all the weights; instead, they are split into several shards, each holding part of the weights.

If you go to the BLOOM model page, you will see 72 files named pytorch_model_xxxxx-of-00072.bin (each around 7.19GB), each containing part of the model weights. With this format, we can load one part of the state dict into memory, put its weights into the model, move them to the right device, and then discard that part before moving on to the next, as sketched below.
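
Accelerate exposes this loop directly through load_checkpoint_and_dispatch, which the referenced blog post uses; "sharded-checkpoint-folder" is a placeholder for a local folder containing the downloaded shards:

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# load the shards one by one and send each weight to its device
model = load_checkpoint_and_dispatch(
    model, "sharded-checkpoint-folder", device_map="auto", no_split_module_classes=["BloomBlock"]
)

Transformers wires the same machinery into from_pretrained; trying it naively on Colab-sized hardware: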

import torch
from transformers import AutoModelForCausalLM

# Will error
checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.float16)

Without enough GPU and CPU RAM, you will get an error message telling you to pass a folder where the weights that should live on disk can be offloaded.

The error message looks like this:

ValueError: The current `device_map` had weights offloaded to the disk. Please provide an 
`offload_folder` for them.

The fix:

import torch
from transformers import AutoModelForCausalLM

# Will go out of RAM on Colab
checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", torch_dtype=torch.float16
)

If you try to load a very large model that needs some disk offload on top of the CPU offload, you may run out of RAM when the last few shards of the checkpoint are loaded, because part of the model is still sitting on the CPU and taking up space. If that is the case, use the option offload_state_dict=True to temporarily offload the part of the model that stays on the CPU while the weights are being loaded, and reload it into RAM once all of the weights have been processed.

import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float16
)

This will fit in Colab, but it will be so close to using all the available RAM that it may still run out when you try to generate a prediction. To get a model we can actually use, we need to offload one more layer to disk. We can do so by taking the device_map computed in the previous section, tweaking it a bit, and passing it to the from_pretrained call.

import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
device_map["model.decoder.layers.37"] = "disk"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map=device_map, offload_folder="offload", offload_state_dict = True, torch_dtype=torch.float16
)
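
Once loaded this way, the model can be used for inference like any other. This usage snippet follows the referenced blog post; inputs go to GPU 0, where the first layers (including the embeddings) sit:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
inputs = inputs.to(0)  # the embedding layer lives on GPU 0
output = model.generate(inputs["input_ids"])
tokenizer.decode(output[0].tolist())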

5. Running a Model Split Across Several Devices

Hooks are a PyTorch API that adds functions executed just before each forward call.

We couldn't use that API directly, since it only supports models with regular positional arguments and no keyword arguments in their forward pass, but we took the same idea. Once the model is loaded, the dispatch_model function adds hooks to every module and submodule that are executed before and after each forward pass (a minimal sketch follows the list). They will:

  • make sure all the inputs of the module are on the same device as its weights;
  • if the weights have been offloaded to the CPU, move them to GPU 0 before the forward pass and back to the CPU just after;
  • if the weights have been offloaded to disk, load them into RAM and then onto GPU 0 before the forward pass, and free this memory just after.
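
A minimal sketch of the first bullet, using a standard forward pre-hook. This illustrates the idea only; it is not dispatch_model's actual implementation:

import torch

def move_inputs_to_weight_device(module, args):
    # runs just before module.forward; returns the (possibly moved) inputs
    device = next(module.parameters()).device
    return tuple(a.to(device) if isinstance(a, torch.Tensor) else a for a in args)

layer = torch.nn.Linear(4, 4).to("cuda:0" if torch.cuda.is_available() else "cpu")
layer.register_forward_pre_hook(move_inputs_to_weight_device)
out = layer(torch.randn(2, 4))  # a CPU input is moved to the layer's device automatically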

6. Summary

Video walkthrough: https://www.youtube.com/watch?v=MWCSGj9jEAo

This approach requires planning the placement ahead of time, and every layer must end up entirely on a single device; otherwise you hit errors like the "Expected all tensors to be on the same device" RuntimeError from the title.

Reference: https://github.com/huggingface/blog/blob/main/accelerate-large-models.md
