====================
``<atomic>`` Design
====================

There were originally three designs under consideration. They differ in where
most of the implementation work is done. The functionality exposed to the
customer should be identical (and conforming) for all three designs.


Design A: Minimal work for the library
======================================
The compiler supplies all of the intrinsics as described below. This list of
intrinsics roughly parallels the requirements of the C and C++ atomics proposals.
The C and C++ library implementations simply drop through to these intrinsics.
For anything the platform does not support in hardware, the compiler arranges
for a (compiler-rt) library call to be made which will do the job with a mutex,
ignoring the memory ordering parameter in that case (effectively implementing
``memory_order_seq_cst``).

Ultimate efficiency is preferred over run-time error checking. Undefined
behavior is acceptable when the inputs do not conform as defined below.

.. code-block:: cpp

    // In every intrinsic signature below, type* atomic_obj may be a pointer to a
    // volatile-qualified type. Memory ordering values map to the following meanings:
    //   memory_order_relaxed == 0
    //   memory_order_consume == 1
    //   memory_order_acquire == 2
    //   memory_order_release == 3
    //   memory_order_acq_rel == 4
    //   memory_order_seq_cst == 5

    // type must be trivially copyable
    // type represents a "type argument"
    bool __atomic_is_lock_free(type);

    // type must be trivially copyable
    // Behavior is defined for mem_ord = 0, 1, 2, 5
    type __atomic_load(const type* atomic_obj, int mem_ord);

    // type must be trivially copyable
    // Behavior is defined for mem_ord = 0, 3, 5
    void __atomic_store(type* atomic_obj, type desired, int mem_ord);

    // type must be trivially copyable
    // Behavior is defined for mem_ord = [0 ... 5]
    type __atomic_exchange(type* atomic_obj, type desired, int mem_ord);

    // type must be trivially copyable
    // Behavior is defined for mem_success = [0 ... 5],
    //   mem_failure <= mem_success
    //   mem_failure != 3
    //   mem_failure != 4
    bool __atomic_compare_exchange_strong(type* atomic_obj,
                                          type* expected, type desired,
                                          int mem_success, int mem_failure);

    // type must be trivially copyable
    // Behavior is defined for mem_success = [0 ... 5],
    //   mem_failure <= mem_success
    //   mem_failure != 3
    //   mem_failure != 4
    bool __atomic_compare_exchange_weak(type* atomic_obj,
                                        type* expected, type desired,
                                        int mem_success, int mem_failure);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    // Behavior is defined for mem_ord = [0 ... 5]
    type __atomic_fetch_add(type* atomic_obj, type operand, int mem_ord);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    // Behavior is defined for mem_ord = [0 ... 5]
    type __atomic_fetch_sub(type* atomic_obj, type operand, int mem_ord);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    // Behavior is defined for mem_ord = [0 ... 5]
    type __atomic_fetch_and(type* atomic_obj, type operand, int mem_ord);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    // Behavior is defined for mem_ord = [0 ... 5]
    type __atomic_fetch_or(type* atomic_obj, type operand, int mem_ord);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    // Behavior is defined for mem_ord = [0 ... 5]
    type __atomic_fetch_xor(type* atomic_obj, type operand, int mem_ord);

    // Behavior is defined for mem_ord = [0 ... 5]
    void* __atomic_fetch_add(void** atomic_obj, ptrdiff_t operand, int mem_ord);
    void* __atomic_fetch_sub(void** atomic_obj, ptrdiff_t operand, int mem_ord);

    // Behavior is defined for mem_ord = [0 ... 5]
    void __atomic_thread_fence(int mem_ord);
    void __atomic_signal_fence(int mem_ord);

If desired, the intrinsics taking a single ``mem_ord`` parameter can default
this argument to 5.

If desired, the intrinsics taking two ordering parameters can default ``mem_success``
to 5, and ``mem_failure`` to ``translate_memory_order(mem_success)``, where
``translate_memory_order(mem_success)`` is defined as:

.. code-block:: cpp

    int translate_memory_order(int o) {
        switch (o) {
        case 4:
            return 2;
        case 3:
            return 0;
        }
        return o;
    }

Below are representative C++ implementations of all of the operations. Their
purpose is to document the desired semantics of each operation, assuming
``memory_order_seq_cst``. This is essentially the code that will be called
if the front end calls out to compiler-rt.

.. code-block:: cpp

    template <class T>
    T __atomic_load(T const volatile* obj) {
        unique_lock<mutex> _(some_mutex);
        return *obj;
    }

    template <class T>
    void __atomic_store(T volatile* obj, T desr) {
        unique_lock<mutex> _(some_mutex);
        *obj = desr;
    }

    template <class T>
    T __atomic_exchange(T volatile* obj, T desr) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj = desr;
        return r;
    }

    template <class T>
    bool __atomic_compare_exchange_strong(T volatile* obj, T* exp, T desr) {
        unique_lock<mutex> _(some_mutex);
        if (std::memcmp(const_cast<T*>(obj), exp, sizeof(T)) == 0) { // if (*obj == *exp)
            std::memcpy(const_cast<T*>(obj), &desr, sizeof(T));      // *obj = desr;
            return true;
        }
        std::memcpy(exp, const_cast<T*>(obj), sizeof(T));            // *exp = *obj;
        return false;
    }

    // May spuriously return false (even if *obj == *exp)
    template <class T>
    bool __atomic_compare_exchange_weak(T volatile* obj, T* exp, T desr) {
        unique_lock<mutex> _(some_mutex);
        if (std::memcmp(const_cast<T*>(obj), exp, sizeof(T)) == 0) { // if (*obj == *exp)
            std::memcpy(const_cast<T*>(obj), &desr, sizeof(T));      // *obj = desr;
            return true;
        }
        std::memcpy(exp, const_cast<T*>(obj), sizeof(T));            // *exp = *obj;
        return false;
    }

    template <class T>
    T __atomic_fetch_add(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj += operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_sub(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj -= operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_and(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj &= operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_or(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj |= operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_xor(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj ^= operand;
        return r;
    }

    void* __atomic_fetch_add(void* volatile* obj, ptrdiff_t operand) {
        unique_lock<mutex> _(some_mutex);
        void* r = *obj;
        (char*&)(*obj) += operand;
        return r;
    }

    void* __atomic_fetch_sub(void* volatile* obj, ptrdiff_t operand) {
        unique_lock<mutex> _(some_mutex);
        void* r = *obj;
        (char*&)(*obj) -= operand;
        return r;
    }

    void __atomic_thread_fence() {
        unique_lock<mutex> _(some_mutex);
    }

    void __atomic_signal_fence() {
        unique_lock<mutex> _(some_mutex);
    }
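
To make the "drop through" concrete, here is a minimal, hypothetical sketch of
the library side of Design A. The class shape and the member name ``__a_`` are
illustrative assumptions, not the actual libc++ source; the point is only that
the library forwards directly to the intrinsics above, passing the ordering as
the 0-5 integer encoding listed earlier.

.. code-block:: cpp

    // Illustrative sketch only: std::atomic<T> under Design A simply forwards
    // to the compiler intrinsics declared above.
    template <class T>
    struct atomic {
        T __a_;  // the underlying object manipulated by the intrinsics

        T load(memory_order m = memory_order_seq_cst) const {
            // The memory_order enumerators correspond to the integers 0-5
            // expected by the intrinsic.
            return __atomic_load(&__a_, static_cast<int>(m));
        }

        void store(T desired, memory_order m = memory_order_seq_cst) {
            __atomic_store(&__a_, desired, static_cast<int>(m));
        }
    };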


Design B: Something in between
==============================
This is a variation of Design A which puts the burden on the library to arrange
for the correct manipulation of the run-time memory ordering arguments, and only
calls the compiler for well-defined memory orderings. I think of this design as
the worst of A and C, instead of the best of A and C, but I offer it as an
option in the spirit of completeness. (A sketch of the run-time dispatch this
requires from the library follows the intrinsic list below.)

.. code-block:: cpp

    // type must be trivially copyable
    bool __atomic_is_lock_free(const type* atomic_obj);

    // type must be trivially copyable
    type __atomic_load_relaxed(const volatile type* atomic_obj);
    type __atomic_load_consume(const volatile type* atomic_obj);
    type __atomic_load_acquire(const volatile type* atomic_obj);
    type __atomic_load_seq_cst(const volatile type* atomic_obj);

    // type must be trivially copyable
    type __atomic_store_relaxed(volatile type* atomic_obj, type desired);
    type __atomic_store_release(volatile type* atomic_obj, type desired);
    type __atomic_store_seq_cst(volatile type* atomic_obj, type desired);

    // type must be trivially copyable
    type __atomic_exchange_relaxed(volatile type* atomic_obj, type desired);
    type __atomic_exchange_consume(volatile type* atomic_obj, type desired);
    type __atomic_exchange_acquire(volatile type* atomic_obj, type desired);
    type __atomic_exchange_release(volatile type* atomic_obj, type desired);
    type __atomic_exchange_acq_rel(volatile type* atomic_obj, type desired);
    type __atomic_exchange_seq_cst(volatile type* atomic_obj, type desired);

    // type must be trivially copyable
    bool __atomic_compare_exchange_strong_relaxed_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_consume_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_consume_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_acquire_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_acquire_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_acquire_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_release_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_release_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_release_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_acq_rel_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_acq_rel_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_acq_rel_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_seq_cst_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_seq_cst_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_seq_cst_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_strong_seq_cst_seq_cst(
        volatile type* atomic_obj, type* expected, type desired);

    // type must be trivially copyable
    bool __atomic_compare_exchange_weak_relaxed_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_consume_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_consume_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_acquire_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_acquire_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_acquire_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_release_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_release_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_release_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_acq_rel_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_acq_rel_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_acq_rel_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_seq_cst_relaxed(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_seq_cst_consume(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_seq_cst_acquire(
        volatile type* atomic_obj, type* expected, type desired);
    bool __atomic_compare_exchange_weak_seq_cst_seq_cst(
        volatile type* atomic_obj, type* expected, type desired);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    type __atomic_fetch_add_relaxed(volatile type* atomic_obj, type operand);
    type __atomic_fetch_add_consume(volatile type* atomic_obj, type operand);
    type __atomic_fetch_add_acquire(volatile type* atomic_obj, type operand);
    type __atomic_fetch_add_release(volatile type* atomic_obj, type operand);
    type __atomic_fetch_add_acq_rel(volatile type* atomic_obj, type operand);
    type __atomic_fetch_add_seq_cst(volatile type* atomic_obj, type operand);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    type __atomic_fetch_sub_relaxed(volatile type* atomic_obj, type operand);
    type __atomic_fetch_sub_consume(volatile type* atomic_obj, type operand);
    type __atomic_fetch_sub_acquire(volatile type* atomic_obj, type operand);
    type __atomic_fetch_sub_release(volatile type* atomic_obj, type operand);
    type __atomic_fetch_sub_acq_rel(volatile type* atomic_obj, type operand);
    type __atomic_fetch_sub_seq_cst(volatile type* atomic_obj, type operand);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    type __atomic_fetch_and_relaxed(volatile type* atomic_obj, type operand);
    type __atomic_fetch_and_consume(volatile type* atomic_obj, type operand);
    type __atomic_fetch_and_acquire(volatile type* atomic_obj, type operand);
    type __atomic_fetch_and_release(volatile type* atomic_obj, type operand);
    type __atomic_fetch_and_acq_rel(volatile type* atomic_obj, type operand);
    type __atomic_fetch_and_seq_cst(volatile type* atomic_obj, type operand);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    type __atomic_fetch_or_relaxed(volatile type* atomic_obj, type operand);
    type __atomic_fetch_or_consume(volatile type* atomic_obj, type operand);
    type __atomic_fetch_or_acquire(volatile type* atomic_obj, type operand);
    type __atomic_fetch_or_release(volatile type* atomic_obj, type operand);
    type __atomic_fetch_or_acq_rel(volatile type* atomic_obj, type operand);
    type __atomic_fetch_or_seq_cst(volatile type* atomic_obj, type operand);

    // type is one of: char, signed char, unsigned char, short, unsigned short, int,
    //     unsigned int, long, unsigned long, long long, unsigned long long,
    //     char16_t, char32_t, wchar_t
    type __atomic_fetch_xor_relaxed(volatile type* atomic_obj, type operand);
    type __atomic_fetch_xor_consume(volatile type* atomic_obj, type operand);
    type __atomic_fetch_xor_acquire(volatile type* atomic_obj, type operand);
    type __atomic_fetch_xor_release(volatile type* atomic_obj, type operand);
    type __atomic_fetch_xor_acq_rel(volatile type* atomic_obj, type operand);
    type __atomic_fetch_xor_seq_cst(volatile type* atomic_obj, type operand);

    void* __atomic_fetch_add_relaxed(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_add_consume(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_add_acquire(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_add_release(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_add_acq_rel(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_add_seq_cst(void* volatile* atomic_obj, ptrdiff_t operand);

    void* __atomic_fetch_sub_relaxed(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_sub_consume(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_sub_acquire(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_sub_release(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_sub_acq_rel(void* volatile* atomic_obj, ptrdiff_t operand);
    void* __atomic_fetch_sub_seq_cst(void* volatile* atomic_obj, ptrdiff_t operand);

    void __atomic_thread_fence_relaxed();
    void __atomic_thread_fence_consume();
    void __atomic_thread_fence_acquire();
    void __atomic_thread_fence_release();
    void __atomic_thread_fence_acq_rel();
    void __atomic_thread_fence_seq_cst();

    void __atomic_signal_fence_relaxed();
    void __atomic_signal_fence_consume();
    void __atomic_signal_fence_acquire();
    void __atomic_signal_fence_release();
    void __atomic_signal_fence_acq_rel();
    void __atomic_signal_fence_seq_cst();
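
For illustration, here is a minimal sketch of the run-time-to-compile-time
dispatch the library would have to perform under Design B, shown for ``load``.
The wrapper name ``__libcpp_atomic_load`` is a hypothetical placeholder, not
the actual libc++ spelling; only the named intrinsics come from the list above.

.. code-block:: cpp

    // Hypothetical library-side dispatch under Design B: the run-time ordering
    // argument is translated into a call to one of the named intrinsics above.
    // memory_order_seq_cst (and any ordering that is not well defined for a
    // load) maps to the seq_cst intrinsic here.
    template <class T>
    T __libcpp_atomic_load(const volatile T* obj, memory_order m) {
        switch (m) {
        case memory_order_relaxed: return __atomic_load_relaxed(obj);
        case memory_order_consume: return __atomic_load_consume(obj);
        case memory_order_acquire: return __atomic_load_acquire(obj);
        default:                   return __atomic_load_seq_cst(obj);
        }
    }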

Design C: Minimal work for the front end
========================================
The ``<atomic>`` header is one of the headers most closely coupled to the compiler.
Ideally when you invoke any function from ``<atomic>``, it should result in highly
optimized assembly being inserted directly into your application -- assembly that
is not otherwise representable by higher level C or C++ expressions. The design of
the libc++ ``<atomic>`` header started with this goal in mind. A secondary, but
still very important, goal is that the compiler should have to do minimal work to
facilitate the implementation of ``<atomic>``. Without this second goal,
practically speaking, the libc++ ``<atomic>`` header would be doomed to be a
barely supported, second-class citizen on almost every platform.

Goals:

- Optimal code generation for atomic operations
- Minimal effort for the compiler to achieve goal 1 on any given platform
- Conformance to the C++0X draft standard

The purpose of this document is to inform compiler writers what they need to do
to enable a high-performance libc++ ``<atomic>`` with minimal effort.

The minimal work that must be done for a conforming ``<atomic>``
-----------------------------------------------------------------
The only "atomic" operations that must actually be lock-free in
``<atomic>`` are represented by the following compiler intrinsics:

.. code-block:: cpp

    __atomic_flag__ __atomic_exchange_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr) {
        unique_lock<mutex> _(some_mutex);
        __atomic_flag__ result = *obj;
        *obj = desr;
        return result;
    }

    void __atomic_store_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr) {
        unique_lock<mutex> _(some_mutex);
        *obj = desr;
    }

Where:

- If ``__has_feature(__atomic_flag)`` evaluates to 1 in the preprocessor, then
  the compiler must define ``__atomic_flag__`` (e.g. as a typedef to ``int``).
- If ``__has_feature(__atomic_flag)`` evaluates to 0 in the preprocessor, then
  the library defines ``__atomic_flag__`` as a typedef to ``bool``.
- To communicate that the above intrinsics are available, the compiler must
  arrange for ``__has_feature`` to return 1 when fed the intrinsic name
  appended with an '_' and the mangled type name of ``__atomic_flag__``.

For example, if ``__atomic_flag__`` is ``unsigned int``:

.. code-block:: cpp

    // __has_feature(__atomic_flag) == 1
    // __has_feature(__atomic_exchange_seq_cst_j) == 1
    // __has_feature(__atomic_store_seq_cst_j) == 1

    typedef unsigned int __atomic_flag__;

    unsigned int __atomic_exchange_seq_cst(unsigned int volatile*, unsigned int) {
        // ...
    }

    void __atomic_store_seq_cst(unsigned int volatile*, unsigned int) {
        // ...
    }

That's it! Compiler writers do the above and you've got a fully conforming
(though sub-par performance) ``<atomic>`` header!


Recommended work for a higher performance ``<atomic>``
-------------------------------------------------------
It would be good if the above intrinsics worked with all integral types plus
``void*``. Because this may not be possible to do in a lock-free manner for
all integral types on all platforms, a compiler must communicate each type that
an intrinsic works with. For example, if ``__atomic_exchange_seq_cst`` works
for all types except for ``long long`` and ``unsigned long long``, then:

.. code-block:: cpp

    __has_feature(__atomic_exchange_seq_cst_b) == 1  // bool
    __has_feature(__atomic_exchange_seq_cst_c) == 1  // char
    __has_feature(__atomic_exchange_seq_cst_a) == 1  // signed char
    __has_feature(__atomic_exchange_seq_cst_h) == 1  // unsigned char
    __has_feature(__atomic_exchange_seq_cst_Ds) == 1 // char16_t
    __has_feature(__atomic_exchange_seq_cst_Di) == 1 // char32_t
    __has_feature(__atomic_exchange_seq_cst_w) == 1  // wchar_t
    __has_feature(__atomic_exchange_seq_cst_s) == 1  // short
    __has_feature(__atomic_exchange_seq_cst_t) == 1  // unsigned short
    __has_feature(__atomic_exchange_seq_cst_i) == 1  // int
    __has_feature(__atomic_exchange_seq_cst_j) == 1  // unsigned int
    __has_feature(__atomic_exchange_seq_cst_l) == 1  // long
    __has_feature(__atomic_exchange_seq_cst_m) == 1  // unsigned long
    __has_feature(__atomic_exchange_seq_cst_Pv) == 1 // void*

Note that only the ``__has_feature`` flag is decorated with the argument
type. The name of the compiler intrinsic itself is not decorated; instead it
works like a C++ overloaded function.
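
For instance (an illustrative pairing, not a required set), a compiler that
supports the exchange intrinsic for both ``int`` and ``unsigned int`` would
advertise two decorated feature names but declare a single undecorated,
overloaded intrinsic:

.. code-block:: cpp

    // __has_feature(__atomic_exchange_seq_cst_i) == 1   // int
    // __has_feature(__atomic_exchange_seq_cst_j) == 1   // unsigned int

    // One undecorated name, overloaded on the argument type:
    int          __atomic_exchange_seq_cst(int volatile* obj, int desr);
    unsigned int __atomic_exchange_seq_cst(unsigned int volatile* obj, unsigned int desr);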

Additionally, there are other intrinsics besides ``__atomic_exchange_seq_cst``
and ``__atomic_store_seq_cst``. They are optional, but if the compiler can
generate faster code than the library provides, then clients will benefit
from the compiler writer's expertise and knowledge of the targeted platform.

Below is the complete list of *sequentially consistent* intrinsics and
their library implementations. Template syntax is used to indicate the desired
overloading for integral and ``void*`` types. The template does not represent a
requirement that the intrinsic operate on **any** type!

.. code-block:: cpp

    // T is one of:
    //   bool, char, signed char, unsigned char, short, unsigned short,
    //   int, unsigned int, long, unsigned long,
    //   long long, unsigned long long, char16_t, char32_t, wchar_t, void*

    template <class T>
    T __atomic_load_seq_cst(T const volatile* obj) {
        unique_lock<mutex> _(some_mutex);
        return *obj;
    }

    template <class T>
    void __atomic_store_seq_cst(T volatile* obj, T desr) {
        unique_lock<mutex> _(some_mutex);
        *obj = desr;
    }

    template <class T>
    T __atomic_exchange_seq_cst(T volatile* obj, T desr) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj = desr;
        return r;
    }

    template <class T>
    bool __atomic_compare_exchange_strong_seq_cst_seq_cst(T volatile* obj, T* exp, T desr) {
        unique_lock<mutex> _(some_mutex);
        if (std::memcmp(const_cast<T*>(obj), exp, sizeof(T)) == 0) {
            std::memcpy(const_cast<T*>(obj), &desr, sizeof(T));
            return true;
        }
        std::memcpy(exp, const_cast<T*>(obj), sizeof(T));
        return false;
    }

    template <class T>
    bool __atomic_compare_exchange_weak_seq_cst_seq_cst(T volatile* obj, T* exp, T desr) {
        unique_lock<mutex> _(some_mutex);
        if (std::memcmp(const_cast<T*>(obj), exp, sizeof(T)) == 0) {
            std::memcpy(const_cast<T*>(obj), &desr, sizeof(T));
            return true;
        }
        std::memcpy(exp, const_cast<T*>(obj), sizeof(T));
        return false;
    }

    // T is one of:
    //   char, signed char, unsigned char, short, unsigned short,
    //   int, unsigned int, long, unsigned long,
    //   long long, unsigned long long, char16_t, char32_t, wchar_t

    template <class T>
    T __atomic_fetch_add_seq_cst(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj += operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_sub_seq_cst(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj -= operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_and_seq_cst(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj &= operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_or_seq_cst(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj |= operand;
        return r;
    }

    template <class T>
    T __atomic_fetch_xor_seq_cst(T volatile* obj, T operand) {
        unique_lock<mutex> _(some_mutex);
        T r = *obj;
        *obj ^= operand;
        return r;
    }

    void* __atomic_fetch_add_seq_cst(void* volatile* obj, ptrdiff_t operand) {
        unique_lock<mutex> _(some_mutex);
        void* r = *obj;
        (char*&)(*obj) += operand;
        return r;
    }

    void* __atomic_fetch_sub_seq_cst(void* volatile* obj, ptrdiff_t operand) {
        unique_lock<mutex> _(some_mutex);
        void* r = *obj;
        (char*&)(*obj) -= operand;
        return r;
    }

    void __atomic_thread_fence_seq_cst() {
        unique_lock<mutex> _(some_mutex);
    }

    void __atomic_signal_fence_seq_cst() {
        unique_lock<mutex> _(some_mutex);
    }

One should consult the (currently draft) `C++ Standard <https://wg21.link/n3126>`_
for the details of the definitions of these operations. For example,
``__atomic_compare_exchange_weak_seq_cst_seq_cst`` is allowed to fail
spuriously while ``__atomic_compare_exchange_strong_seq_cst_seq_cst`` is not.

If on your platform the lock-free definition of ``__atomic_compare_exchange_weak_seq_cst_seq_cst``
would be the same as ``__atomic_compare_exchange_strong_seq_cst_seq_cst``, you may omit the
``__atomic_compare_exchange_weak_seq_cst_seq_cst`` intrinsic without a performance cost. The
library will prefer your implementation of ``__atomic_compare_exchange_strong_seq_cst_seq_cst``
over its own definition for implementing ``__atomic_compare_exchange_weak_seq_cst_seq_cst``.
That is, the library will arrange for ``__atomic_compare_exchange_weak_seq_cst_seq_cst`` to call
``__atomic_compare_exchange_strong_seq_cst_seq_cst`` if you supply an intrinsic for the strong
version but not the weak.
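
As a sketch of one way the library could arrange this (not the actual libc++
source), the selection can be made entirely in the preprocessor using the
decorated feature names. Here it is shown for ``unsigned int`` (feature-name
suffix ``_j``):

.. code-block:: cpp

    #if !__has_feature(__atomic_compare_exchange_weak_seq_cst_seq_cst_j) && \
         __has_feature(__atomic_compare_exchange_strong_seq_cst_seq_cst_j)
    // No weak intrinsic for unsigned int, but a strong one exists: implement
    // the weak operation in terms of the strong intrinsic rather than falling
    // back to the mutex-based library definition.
    inline bool
    __atomic_compare_exchange_weak_seq_cst_seq_cst(unsigned int volatile* obj,
                                                   unsigned int* exp,
                                                   unsigned int desr) {
        return __atomic_compare_exchange_strong_seq_cst_seq_cst(obj, exp, desr);
    }
    #endif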

Taking advantage of weaker memory synchronization
--------------------------------------------------
So far, all of the intrinsics presented require a **sequentially consistent** memory ordering.
That is, no loads or stores can move across the operation (just as if the library had locked
that internal mutex). But ``<atomic>`` supports weaker memory ordering operations. In all,
there are six memory orderings (listed here from strongest to weakest):

.. code-block:: cpp

    memory_order_seq_cst
    memory_order_acq_rel
    memory_order_release
    memory_order_acquire
    memory_order_consume
    memory_order_relaxed

(See the `C++ Standard <https://wg21.link/n3126>`_ for the detailed definitions of each of these orderings).

On some platforms, the compiler vendor can offer some or even all of the above
intrinsics at one or more weaker levels of memory synchronization. This might
lead, for example, to not issuing an ``mfence`` instruction on x86.

If the compiler does not offer a given operation at a given memory ordering
level, the library will automatically attempt to call the next highest memory
ordering operation. This continues up to ``seq_cst``, and if that doesn't
exist, then the library takes over and does the job with a ``mutex``. This
is a compile-time search and selection operation. At run time, the application
will only see the few inlined assembly instructions for the selected intrinsic.
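
As an illustration of that compile-time search (a sketch only; the wrapper name
``__libcpp_load_relaxed`` is a hypothetical placeholder, and the ``_i`` suffix
denotes ``int`` as described above), a relaxed load of an ``int`` might be
selected like this:

.. code-block:: cpp

    // Hypothetical compile-time selection for a relaxed load of int: use the
    // weakest available intrinsic that is at least as strong as requested,
    // escalating relaxed -> consume -> acquire -> seq_cst, and finally falling
    // back to the mutex-based library implementation.
    inline int __libcpp_load_relaxed(int const volatile* obj) {
    #if __has_feature(__atomic_load_relaxed_i)
        return __atomic_load_relaxed(obj);
    #elif __has_feature(__atomic_load_consume_i)
        return __atomic_load_consume(obj);
    #elif __has_feature(__atomic_load_acquire_i)
        return __atomic_load_acquire(obj);
    #elif __has_feature(__atomic_load_seq_cst_i)
        return __atomic_load_seq_cst(obj);
    #else
        unique_lock<mutex> _(some_mutex);  // library fallback, as in the code above
        return *obj;
    #endif
    }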

Each intrinsic is appended with the 7-letter name of the memory ordering it
addresses. For example, a ``load`` with ``relaxed`` ordering is defined by:

.. code-block:: cpp

    T __atomic_load_relaxed(const volatile T* obj);

And announced with:

.. code-block:: cpp

    __has_feature(__atomic_load_relaxed_b) == 1  // bool
    __has_feature(__atomic_load_relaxed_c) == 1  // char
    __has_feature(__atomic_load_relaxed_a) == 1  // signed char
    ...

The ``__atomic_compare_exchange_strong(weak)`` intrinsics are parameterized
on two memory orderings. The first ordering applies when the operation returns
``true`` and the second ordering applies when the operation returns ``false``.

Not every memory ordering is appropriate for every operation. ``exchange``
and the ``fetch_XXX`` operations support all six. But ``load`` only supports
``relaxed``, ``consume``, ``acquire`` and ``seq_cst``. ``store`` only supports
``relaxed``, ``release``, and ``seq_cst``. The ``compare_exchange`` operations
support the following 16 combinations out of the possible 36:

.. code-block:: cpp

    relaxed_relaxed
    consume_relaxed
    consume_consume
    acquire_relaxed
    acquire_consume
    acquire_acquire
    release_relaxed
    release_consume
    release_acquire
    acq_rel_relaxed
    acq_rel_consume
    acq_rel_acquire
    seq_cst_relaxed
    seq_cst_consume
    seq_cst_acquire
    seq_cst_seq_cst

Again, the compiler supplies intrinsics only for the strongest orderings where
it can make a difference. The library takes care of calling the weakest
supplied intrinsic that is as strong or stronger than the customer asked for.

Note about ABI
==============
With any design, the (back end) compiler writer should note that the decision to
implement lock-free operations on any given type (or not) is an ABI-binding decision.
One cannot change a type from not lock-free to lock-free (or vice versa)
without breaking the ABI.

For example:

**TU1.cpp**:

.. code-block:: cpp

    extern atomic<long long> A;
    int foo() { return A.compare_exchange_strong(w, x); }


**TU2.cpp**:

.. code-block:: cpp

    extern atomic<long long> A;
    void bar() { A.compare_exchange_strong(y, z); }

If only **one** of these calls to ``compare_exchange_strong`` is implemented with
mutex-locked code, then that mutex-locked code will not be executed mutually
exclusively of the one implemented in a lock-free manner.