.. _omp111:

Replaced globalized variable with X bytes of shared memory. [OMP111]
====================================================================

This optimization occurs when a globalized variable's data is shared between
multiple threads, but requires a constant amount of memory that can be
determined at compile time. This is the case when only a single thread creates
the memory, which is then shared between every thread. The memory can then be
pushed to a static buffer of shared memory on the device. This optimization
allows users to declare shared memory on the device without using OpenMP's
custom allocators.

Globalization occurs when a pointer to a thread-local variable escapes the
current scope. If a single thread is known to be responsible for creating and
sharing the data, it can instead be mapped directly to the device's shared
memory. Checking if only a single thread can execute an instruction requires
that the parent functions have internal linkage. Otherwise, an external caller
could invalidate this analysis by having multiple threads call that function.
The optimization pass will make internal copies of each function to use for this
reason, but it is still recommended to mark them as internal using keywords like
``static`` whenever possible.

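
As an illustration of the linkage recommendation, the sketch below passes the
address of a variable declared in a target region to a helper marked
``static``. The names ``fill``, ``sum_tile`` and ``TILE`` are hypothetical and
chosen only for this sketch. Whether the remark is actually emitted for it
depends on the compiler version and flags.

.. code-block:: c++

  constexpr int TILE = 64; // hypothetical size, known at compile time

  // `static` gives the helper internal linkage, so no caller outside this
  // translation unit can invoke it from multiple threads.
  static void fill(double *Buf, int Base) {
  #pragma omp parallel for
    for (int I = 0; I < TILE; ++I)
      Buf[I] = Base + I;
  }

  double sum_tile(int Base) {
    double Result = 0.0;
  #pragma omp target map(tofrom : Result)
    {
      double Buf[TILE]; // created by the single thread running the target region
      fill(Buf, Base);  // the pointer escapes into the parallel region in fill
      for (int I = 0; I < TILE; ++I)
        Result += Buf[I];
    }
    return Result;
  }

Compiled without OpenMP the pragmas are ignored and the code runs serially,
which makes it easy to check the result on the host before offloading.
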
Example
-------

This optimization should apply to any variable declared in an OpenMP target
region that is then shared with every thread in a parallel region. This allows
the user to declare shared memory without using custom allocators. A simple
stencil calculation shows how this can be used.

.. code-block:: c++

  // MC, NC and dX are assumed to be compile-time constants defined earlier
  // in omp111.cpp; their definitions are not shown here.
  void stencil(int M, int N, double *X, double *Y) {
  #pragma omp target teams distribute collapse(2) \
    map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
    for (int i0 = 0; i0 < M; i0 += MC) {
      for (int j0 = 0; j0 < N; j0 += NC) {
        double sX[MC][NC];

  #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
        for (int i1 = 0; i1 < MC; ++i1)
          for (int j1 = 0; j1 < NC; ++j1)
            sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];

  #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
        for (int i1 = 1; i1 < MC - 1; ++i1)
          for (int j1 = 1; j1 < NC - 1; ++j1)
            Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
                                            sX[i1][j1 + 1] + sX[i1][j1 - 1] +
                                            -4.0 * sX[i1][j1]) / (dX * dX);
      }
    }
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
  omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
        double sX[MC][NC];
               ^

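
The byte count in the remark is the size of the replaced variable, which is
fixed at compile time. As a rough cross-check, the sketch below assumes a tile
of 32 by 32 elements and an 8-byte ``double``. This is only one choice of
``MC`` and ``NC`` that would be consistent with the 8192 bytes reported above;
the example does not state the actual values, and ``tile_bytes`` is a
hypothetical helper.

.. code-block:: c++

  #include <cstddef>

  // Assumed tile dimensions; the original example does not state them.
  constexpr int MC = 32;
  constexpr int NC = 32;

  // Size in bytes of a tile declared as `double sX[MC][NC]`.
  constexpr std::size_t tile_bytes() { return sizeof(double[MC][NC]); }

  // The size is a constant expression, so it can be checked at compile time.
  static_assert(tile_bytes() == MC * NC * sizeof(double),
                "tile size is known at compile time");
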
The default data-sharing attribute for variables captured in an OpenMP parallel
region is ``shared``. Sharing a variable means taking a pointer to the object,
which ultimately results in globalization; the variable is then mapped to shared
memory even when it could have been placed in registers. To avoid this, make
sure each variable that can be copied into the region is marked
``firstprivate``, either explicitly or using the OpenMP 5.1 feature
``default(firstprivate)``.

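
The difference between the two attributes can be sketched as follows. The
function ``scale_shift`` and its parameters are hypothetical. ``Factor`` is
captured with the default ``shared`` attribute, while ``Shift`` is explicitly
copied into the region; how each variable is finally placed is up to the
compiler.

.. code-block:: c++

  // Hypothetical helper: stores Factor * X[I] + Shift into Y[I].
  void scale_shift(int N, const double *X, double *Y) {
  #pragma omp target map(to : X[0 : N]) map(from : Y[0 : N])
    {
      double Factor = 2.0;
      double Shift = 1.0;

      // Default attribute: Factor is shared, so the parallel region refers
      // to it through a pointer and it may have to be globalized.
  #pragma omp parallel for
      for (int I = 0; I < N; ++I)
        Y[I] = Factor * X[I];

      // firstprivate: every thread gets its own copy of Shift, so no pointer
      // to the original variable is needed.
  #pragma omp parallel for firstprivate(Shift)
      for (int I = 0; I < N; ++I)
        Y[I] += Shift;
    }
  }

Both loops compute the same kind of result; only the way the scalar reaches the
threads differs.
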
Diagnostic Scope
----------------

OpenMP target offloading optimization remark.