Background
While testing inference with pytorch-tpu/llama, the model's Attention layers repeatedly call torch.view_as_complex and torch.view_as_real in the forward pass.
Both operators run fine on torch-xla 2.1, but on 2.0 they produce the following warnings:
…/pytorch-tpu/llama/llama/model.py:78: UserWarning: The operator aten::view_as_complex appears to be a view operator, but it has no implementation for the backend "xla:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices. (Triggered internally at …/pytorch/aten/src/ATen/native/CPUFallback.cpp:179.)
xk_ = torch.view_as_complex(
…/pytorch-tpu/llama/llama/model.py:81: UserWarning: The operator aten::view_as_real appears to be a view operator, but it has no implementation for the backend "xla:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices. (Triggered internally at …/pytorch/aten/src/ATen/native/CPUFallback.cpp:179.)
xq_out = torch.view_as_real(xq_ * freqs_cis)
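For context, the semantics of the two ops can be sketched in plain Python. This is a sketch only: the real torch operators reinterpret the same storage as a view rather than building new values, which is exactly why they cannot fall back to CPU.

```python
# Illustrative semantics of torch.view_as_real / torch.view_as_complex,
# written in plain Python (these helpers are NOT torch APIs).
# view_as_real: complex (N,) -> real (N, 2), pairing (real, imag);
# view_as_complex is the inverse.

def view_as_real(xs):
    """[a+bj, ...] -> [[a, b], ...]"""
    return [[z.real, z.imag] for z in xs]

def view_as_complex(pairs):
    """[[a, b], ...] -> [a+bj, ...]"""
    return [complex(a, b) for a, b in pairs]

zs = [1 + 2j, 3 - 4j]
print(view_as_real(zs))  # [[1.0, 2.0], [3.0, -4.0]]
assert view_as_complex(view_as_real(zs)) == zs
```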
A look through the torch-xla issues shows that 2.0 has no op lowering (implementation) for these operators. Since 2.0 also runs extremely slowly (roughly tens of times slower than 2.1), the suspicion was that the missing lowerings fragment the graph. The Op Lowering Guide confirms that this is exactly why inference is so slow:
Operations that are forwarded to the CPU will cause a significant slowdown. We must lower all operations used in the model to achieve the best performance.
For the fix, see MR 4887 or the 2.1 implementation.
Update
After merging MR 4887 into the 2.0 branch, re-testing pytorch-tpu/llama gives the following results:
- The number of graphs drops sharply (from 72 to 8). This confirms the guess above that the missing lowerings fragment the graph.
- Runtime improves dramatically (from 900+ ms down to about 70 ms).
Note that because pytorch itself has no corresponding update, the operator names in the Python script must be replaced with their _copy variants.
pytorch 2.1 ships both variants, and their lowerings are identical; see torch/_C/_VariableFunctions.pyi.
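One way to keep a script portable across both situations is to resolve the op names once at startup. `resolve_view_ops` below is a hypothetical helper, demonstrated with a stand-in module rather than a patched torch build:

```python
import types

def resolve_view_ops(mod):
    """Return (view_as_real, view_as_complex) from `mod`, preferring the
    plain names and falling back to the *_copy variants when absent."""
    var = getattr(mod, "view_as_real", None) or getattr(mod, "view_as_real_copy", None)
    vac = getattr(mod, "view_as_complex", None) or getattr(mod, "view_as_complex_copy", None)
    return var, vac

# Stand-in for a patched 2.0 torch that only exposes the _copy names:
fake_torch = types.SimpleNamespace(
    view_as_real_copy=lambda x: "real_copy",
    view_as_complex_copy=lambda x: "complex_copy",
)
var, vac = resolve_view_ops(fake_torch)
assert var(None) == "real_copy" and vac(None) == "complex_copy"
```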
Code to reproduce
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# view_as_real: complex tensor of shape (4,) -> real tensor of shape (4, 2)
x = torch.randn(4, dtype=torch.cfloat)
print(x)
print(torch.view_as_real(x))      # on CPU: fine
x_xla = x.to(device)
print(torch.view_as_real(x_xla))  # on XLA 2.0: triggers the warning above

print("------------------------------------")

# view_as_complex: real tensor of shape (4, 2) -> complex tensor of shape (4,)
x = torch.randn(4, 2)
print(x)
print(torch.view_as_complex(x))
x_xla = x.to(device)
print(torch.view_as_complex(x_xla))
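To confirm on a given install which ops actually fell back to CPU, torch-xla's metrics can be inspected after running the snippet above. This sketch assumes the `torch_xla.debug.metrics` module of recent releases and degrades to an empty list when torch_xla is not installed:

```python
def cpu_fallback_counters():
    """List aten::* counters, which indicate ops that were not lowered
    and therefore fell back to CPU (empty if torch_xla is missing)."""
    try:
        import torch_xla.debug.metrics as met  # assumption: recent torch-xla API
    except ImportError:
        return []
    return [name for name in met.counter_names() if name.startswith("aten::")]

print(cpu_fallback_counters())
```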
Error message
HLO
Below is a side-by-side HLO comparison of 2.1 (left) and 2.0 (right).
view_as_complex:
view_as_real:
References
OP_LOWERING_GUIDE
issue 2239
issue 4838 on op lowering
MR 4887
torch-view-as-real
torch-view-as-complex
Appendix: torch and torch-xla symbol issues
If a freshly built torch or torch-xla fails with missing-symbol errors, or you need to confirm whether a symbol exists, run the following from the root of the corresponding project:
# torch
# Locate the .o file: the symbol is indeed there
find build -name lazy_graph_executor.cpp.o
strings build/caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/lazy/core/lazy_graph_executor.cpp.o | grep ShouldSyncTensor
# The .so also contains the symbol
find build -name "*torch*.so"
strings build/lib.linux-x86_64-cpython-310/torch/lib/libtorch_cpu.so | grep ShouldSyncTensor
# Question: why is the symbol reported missing when it clearly exists? Possibly
# because the build used `python setup.py develop`. Building a wheel instead
# (output lands in ./dist) avoids the problem:
USE_KINETO=0 python setup.py bdist_wheel
# Tested: with the wheel install, the missing-symbol error is gone and torch runs.
# If this then fails with "ImportError: libopenblas.so.0: cannot open shared
# object file", install OpenBLAS:
sudo apt-get install libopenblas-dev
# torch-xla
strings build/lib.linux-x86_64-cpython-310/_XLAC.cpython-310-x86_64-linux-gnu.so | grep view_as_real
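When binutils' `strings` is unavailable, the same symbol check can be approximated in a few lines of Python. `contains_symbol` is an illustrative helper, not an existing tool:

```python
import os
import tempfile

def contains_symbol(path, symbol):
    """Rough equivalent of `strings path | grep symbol`: report whether
    the ASCII bytes of `symbol` occur anywhere in the binary."""
    with open(path, "rb") as f:
        return symbol.encode("ascii") in f.read()

# Demo against a throwaway file standing in for a .so:
fd, path = tempfile.mkstemp()
os.write(fd, b"\x7fELF...ShouldSyncTensor...")
os.close(fd)
assert contains_symbol(path, "ShouldSyncTensor")
assert not contains_symbol(path, "MissingSymbol")
os.remove(path)
```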