Background
While testing inference with pytorch-tpu/llama, the model's Attention layers repeatedly call torch.view_as_complex and torch.view_as_real in the forward pass.
Both operators run fine on torch-xla 2.1, but on 2.0 they produce the following warnings:
…/pytorch-tpu/llama/llama/model.py:78: UserWarning: The operator aten::view_as_complex appears to be a view operator, but it has no implementation for the backend "xla:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices. (Triggered internally at …/pytorch/aten/src/ATen/native/CPUFallback.cpp:179.)
xk_ = torch.view_as_complex(
…/pytorch-tpu/llama/llama/model.py:81: UserWarning: The operator aten::view_as_real appears to be a view operator, but it has no implementation for the backend "xla:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices. (Triggered internally at …/pytorch/aten/src/ATen/native/CPUFallback.cpp:179.)
xq_out = torch.view_as_real(xq_ * freqs_cis)
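For context, the semantics of the two ops can be sketched in plain Python. This is a sketch only: the real torch operators reinterpret the same storage as a view rather than building new values, which is exactly why they cannot fall back to CPU.

```python
# Illustrative semantics of torch.view_as_real / torch.view_as_complex,
# written in plain Python (these helpers are NOT torch APIs).
# view_as_real: complex (N,) -> real (N, 2), pairing (real, imag);
# view_as_complex is the inverse.

def view_as_real(xs):
    """[a+bj, ...] -> [[a, b], ...]"""
    return [[z.real, z.imag] for z in xs]

def view_as_complex(pairs):
    """[[a, b], ...] -> [a+bj, ...]"""
    return [complex(a, b) for a, b in pairs]

zs = [1 + 2j, 3 - 4j]
print(view_as_real(zs))  # [[1.0, 2.0], [3.0, -4.0]]
assert view_as_complex(view_as_real(zs)) == zs
```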
A look through the torch-xla issues shows that 2.0 has no op lowering (implementation) for these operators. Since 2.0 also runs extremely slowly (roughly tens of times slower than 2.1), the suspicion was that the missing lowerings fragment the graph. The Op Lowering Guide confirms that this is exactly why inference is so slow:
Operations that are forwarded to the CPU will cause a significant slowdown. We must lower all operations used in the model to achieve the best performance.
For the fix, see MR 4887 or the 2.1 implementation.
Update
After merging MR 4887 into the 2.0 branch, re-testing pytorch-tpu/llama gives the following results:
- The number of graphs drops sharply (from 72 to 8). This confirms the guess above that the missing lowerings fragment the graph.
- Runtime improves dramatically (from 900+ ms down to about 70 ms).
Note that because pytorch itself has no corresponding update, the operator names in the Python script must be replaced with their _copy variants.
pytorch 2.1 ships both variants, and their lowerings are identical; see torch/_C/_VariableFunctions.pyi.
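One way to keep a script portable across both situations is to resolve the op names once at startup. `resolve_view_ops` below is a hypothetical helper, demonstrated with a stand-in module rather than a patched torch build:

```python
import types

def resolve_view_ops(mod):
    """Return (view_as_real, view_as_complex) from `mod`, preferring the
    plain names and falling back to the *_copy variants when absent."""
    var = getattr(mod, "view_as_real", None) or getattr(mod, "view_as_real_copy", None)
    vac = getattr(mod, "view_as_complex", None) or getattr(mod, "view_as_complex_copy", None)
    return var, vac

# Stand-in for a patched 2.0 torch that only exposes the _copy names:
fake_torch = types.SimpleNamespace(
    view_as_real_copy=lambda x: "real_copy",
    view_as_complex_copy=lambda x: "complex_copy",
)
var, vac = resolve_view_ops(fake_torch)
assert var(None) == "real_copy" and vac(None) == "complex_copy"
```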
Code to reproduce
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# view_as_real: complex tensor of shape (4,) -> real tensor of shape (4, 2)
x = torch.randn(4, dtype=torch.cfloat)
print(x)
print(torch.view_as_real(x))      # on CPU: fine
x_xla = x.to(device)
print(torch.view_as_real(x_xla))  # on XLA 2.0: triggers the warning above

print("------------------------------------")

# view_as_complex: real tensor of shape (4, 2) -> complex tensor of shape (4,)
x = torch.randn(4, 2)
print(x)
print(torch.view_as_complex(x))
x_xla = x.to(device)
print(torch.view_as_complex(x_xla))
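To confirm on a given install which ops actually fell back to CPU, torch-xla's metrics can be inspected after running the snippet above. This sketch assumes the `torch_xla.debug.metrics` module of recent releases and degrades to an empty list when torch_xla is not installed:

```python
def cpu_fallback_counters():
    """List aten::* counters, which indicate ops that were not lowered
    and therefore fell back to CPU (empty if torch_xla is missing)."""
    try:
        import torch_xla.debug.metrics as met  # assumption: recent torch-xla API
    except ImportError:
        return []
    return [name for name in met.counter_names() if name.startswith("aten::")]

print(cpu_fallback_counters())
```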
Error message
HLO
Below is a side-by-side HLO comparison of 2.1 (left) and 2.0 (right).
view_as_complex:
view_as_real:
References
OP_LOWERING_GUIDE
issue 2239
issue 4838 on op lowering
MR 4887
torch-view-as-real
torch-view-as-complex
Appendix: torch and torch-xla symbol issues
If a freshly built torch or torch-xla fails with missing-symbol errors, or you need to confirm whether a symbol exists, run the following from the root of the corresponding project:
# torch
# Locate the .o file: the symbol is indeed there
find build -name lazy_graph_executor.cpp.o
strings build/caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/lazy/core/lazy_graph_executor.cpp.o | grep ShouldSyncTensor
# The .so also contains the symbol
find build -name "*torch*.so"
strings build/lib.linux-x86_64-cpython-310/torch/lib/libtorch_cpu.so | grep ShouldSyncTensor
# Question: why is the symbol reported missing when it clearly exists? Possibly
# because the build used `python setup.py develop`. Building a wheel instead
# (output lands in ./dist) avoids the problem:
USE_KINETO=0 python setup.py bdist_wheel
# Tested: with the wheel install, the missing-symbol error is gone and torch runs.
# If this then fails with "ImportError: libopenblas.so.0: cannot open shared
# object file", install OpenBLAS:
sudo apt-get install libopenblas-dev
# torch-xla
strings build/lib.linux-x86_64-cpython-310/_XLAC.cpython-310-x86_64-linux-gnu.so | grep view_as_real
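When binutils' `strings` is unavailable, the same symbol check can be approximated in a few lines of Python. `contains_symbol` is an illustrative helper, not an existing tool:

```python
import os
import tempfile

def contains_symbol(path, symbol):
    """Rough equivalent of `strings path | grep symbol`: report whether
    the ASCII bytes of `symbol` occur anywhere in the binary."""
    with open(path, "rb") as f:
        return symbol.encode("ascii") in f.read()

# Demo against a throwaway file standing in for a .so:
fd, path = tempfile.mkstemp()
os.write(fd, b"\x7fELF...ShouldSyncTensor...")
os.close(fd)
assert contains_symbol(path, "ShouldSyncTensor")
assert not contains_symbol(path, "MissingSymbol")
os.remove(path)
```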