[Offload] regression on sm_60 #138560

Closed
ye-luo opened this issue May 5, 2025 · 7 comments · Fixed by #138589

@ye-luo
Contributor

ye-luo commented May 5, 2025

I see a regression on sm_60 caused by #122781.
Right after the merge I got:

fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "seq_cst".

On the development branch, the error becomes:

fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "acquire".

Reproducer code:
https://github.com/TApplencourt/OvO/blob/master/test_src/cpp/hierarchical_parallelism/memcopy-complex_double/target_teams_distribute_parallel_for.cpp

clang++ -fopenmp --offload-arch=sm_60 target_teams_distribute_parallel_for.cpp
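
For readers who cannot open the link, here is a rough sketch of the kind of kernel the file name suggests: a copy of `std::complex<double>` values inside a `target teams distribute parallel for` region. It is not the linked OvO file. The function name `copy_complex_on_target` and the problem size are made up for illustration, and this sketch has not been verified to trigger the error above.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Hypothetical stand-in for the linked test, not a copy of it.
// Copies src into dst inside an offloaded region. Without -fopenmp the
// pragma is ignored and the loop simply runs on the host.
bool copy_complex_on_target(int n) {
  assert(n >= 0);
  std::vector<std::complex<double>> src(n), dst(n);
  for (int i = 0; i < n; ++i)
    src[i] = std::complex<double>(i, -i);

  // Raw pointers are used so the map clauses can name array sections.
  std::complex<double> *ps = src.data();
  std::complex<double> *pd = dst.data();

#pragma omp target teams distribute parallel for map(to : ps[0:n]) map(from : pd[0:n])
  for (int i = 0; i < n; ++i)
    pd[i] = ps[i];

  // Verify on the host that every element was copied.
  for (int i = 0; i < n; ++i)
    if (dst[i] != src[i])
      return false;
  return true;
}
```

Calling `copy_complex_on_target(1024)` from `main` and building with the command line above would be the natural way to try it on an sm_60 device.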
@jhuber6
Contributor

jhuber6 commented May 5, 2025

Hm, I didn't realize this was a literal limitation of older SMs. I don't think we even use this, but likely with -O0 it's not getting optimized out. @AlexMaclean would it be possible to make all non-monotonic scopes go to seq_cst or is monotonic literally the only one that's supported? Ideally this is solved in the backend.

@ye-luo
Contributor Author

ye-luo commented May 5, 2025

clang++ -fopenmp --offload-arch=sm_60 target_teams_distribute_parallel_for.cpp
fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "seq_cst".

with -O3

clang++ -fopenmp --offload-arch=sm_60 -O3 target_teams_distribute_parallel_for.cpp
fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "acquire".

with -ffast-math

clang++ -fopenmp --offload-arch=sm_60 -ffast-math target_teams_distribute_parallel_for.cpp # no error.

@jhuber6
Contributor

jhuber6 commented May 5, 2025

> clang++ -fopenmp --offload-arch=sm_60 target_teams_distribute_parallel_for.cpp
> fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "seq_cst".
>
> with -O3
>
> clang++ -fopenmp --offload-arch=sm_60 -O3 target_teams_distribute_parallel_for.cpp
> fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "acquire".
>
> with -ffast-math
>
> clang++ -fopenmp --offload-arch=sm_60 -ffast-math target_teams_distribute_parallel_for.cpp # no error.

That's interesting. I guess for now we can just revert that patch, since I erroneously thought it wasn't necessary because the backend handled it, but I didn't consider whether or not ptxas handled it.

@ye-luo
Contributor Author

ye-luo commented May 5, 2025

Other code has been modified on top of that patch. Could you take care of the revert?

jhuber6 added a commit to jhuber6/llvm-project that referenced this issue May 5, 2025
Summary:
Different ordering modes aren't supported for an atomic load, so we just
do an add of zero as the same thing. It's less efficient, but it works.

Fixes llvm#138560
@jhuber6 jhuber6 closed this as completed in dfcb8cb May 5, 2025
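
As a host-side analogue of the workaround described in the commit message, the sketch below expresses an atomic load as an atomic add of zero. It uses `std::atomic` rather than the actual device runtime code, and the helper names `load_direct` and `load_via_add_zero` are hypothetical. It only illustrates why the two forms return the same value, not how the fix is implemented.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Plain atomic load with an explicit ordering. C++ does not allow release
// or acq_rel orderings on a pure load, so those are rejected here.
inline std::uint32_t load_direct(const std::atomic<std::uint32_t> &v,
                                 std::memory_order order) {
  assert(order != std::memory_order_release &&
         order != std::memory_order_acq_rel);
  return v.load(order);
}

// Same observable value, expressed as a read-modify-write: adding zero
// leaves the stored value unchanged and returns what was there before.
inline std::uint32_t load_via_add_zero(std::atomic<std::uint32_t> &v,
                                       std::memory_order order) {
  return v.fetch_add(0u, order);
}
```

The trade-off matches the commit message: the read-modify-write form needs write access to the location (note the non-const reference) and is generally more expensive than a plain load.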
@akshayrdeodhar
Contributor

akshayrdeodhar commented May 8, 2025

> Hm, I didn't realize this was a literal limitation of older SMs. I don't think we even use this, but likely with -O0 it's not getting optimized out. @AlexMaclean would it be possible to make all non-monotonic scopes go to seq_cst or is monotonic literally the only one that's supported? Ideally this is solved in the backend.

Looking at the PTX spec for fence: the fence instruction with memory orderings is only supported on sm_70+. Here's the snippet which inserts a fence to implement a memory ordering for load/store:

Perhaps we could provide a brute-force emulation that uses membar.sys (which is supported) instead of a fence with a memory order, but that will likely be very inefficient.
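
To make the brute-force idea concrete, here is a host-side sketch in standard C++, not NVPTX backend code. A relaxed (monotonic) access is bracketed by full fences, with `std::atomic_thread_fence(std::memory_order_seq_cst)` standing in for a heavyweight barrier such as membar.sys. The helper names are hypothetical, and the sketch only aims for a conservative ordering (at least acquire on the load and release on the store). It makes no claim about what the backend would actually emit.

```cpp
#include <atomic>

// Hypothetical helper: full fence, relaxed load, full fence.
// The trailing fence keeps later accesses from moving before the load,
// which gives at least acquire behaviour.
template <typename T>
T load_with_full_fences(const std::atomic<T> &v) {
  std::atomic_thread_fence(std::memory_order_seq_cst);
  T value = v.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return value;
}

// Hypothetical helper: full fence, relaxed store, full fence.
// The leading fence keeps earlier accesses from moving after the store,
// which gives at least release behaviour.
template <typename T>
void store_with_full_fences(std::atomic<T> &v, T value) {
  std::atomic_thread_fence(std::memory_order_seq_cst);
  v.store(value, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
}
```

Every access pays for two full barriers regardless of the ordering actually requested, which is the inefficiency mentioned above.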

@jhuber6
Contributor

jhuber6 commented May 8, 2025

> > Hm, I didn't realize this was a literal limitation of older SMs. I don't think we even use this, but likely with -O0 it's not getting optimized out. @AlexMaclean would it be possible to make all non-monotonic scopes go to seq_cst or is monotonic literally the only one that's supported? Ideally this is solved in the backend.
>
> Looking at the PTX spec for fence: the fence instruction with memory orderings is only supported on sm_70+. Here's the snippet which inserts a fence to implement a memory ordering for load/store: Perhaps we could provide a brute-force emulation that uses membar.sys (which is supported) instead of a fence with a memory order, but that will likely be very inefficient.

Considering that the alternative is the program not compiling, that doesn't seem so bad.
