peter klausler | c2bccd6 | 2020-08-25 16:43:16 | [diff] [blame] | 1 | <!--===- docs/DoConcurrent.md |
| 2 | |
| 3 | Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. |
| 4 | See https://llvm.org/LICENSE.txt for license information. |
| 5 | SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception |
| 6 | |
| 7 | --> |
| 8 | |
| 9 | # `DO CONCURRENT` isn't necessarily concurrent |
| 10 | |
| 11 | ```eval_rst |
| 12 | .. contents:: |
| 13 | :local: |
| 14 | ``` |
| 15 | |
| 16 | A variant form of Fortran's primary looping construct was |
| 17 | added to the Fortran 2008 language standard with the apparent |
| 18 | intent of enabling more effective automatic parallel execution of code |
| 19 | written in the standard language without the use of |
| 20 | non-standard directives. |
| 21 | Spelled `DO CONCURRENT`, the construct takes a rectilinear iteration |
| 22 | space specification like `FORALL` and allows us to write |
| 23 | a multidimensional loop nest construct with a single `DO CONCURRENT` |
| 24 | statement and a single terminating `END DO` statement. |
| 25 | |
| 26 | Within the body of a `DO CONCURRENT` loop the program must respect |
| 27 | a long list of restrictions on its use of Fortran language features. |
| 28 | Actions that obviously can't be executed in parallel or that |
| 29 | don't allow all iterations to execute are prohibited. |
| 30 | These include: |
| 31 | * Control flow statements that would prevent the loop nest from |
| 32 | executing all its iterations: `RETURN`, `EXIT`, and any |
| 33 | `GOTO` or `CYCLE` that leaves the construct. |
| 34 | * Image control statements: `STOP`, `SYNC`, `LOCK`/`UNLOCK`, `EVENT`, |
| 35 | and `ALLOCATE`/`DEALLOCATE` of a coarray. |
| 36 | * Calling a procedure that is not `PURE`. |
| 37 | * Deallocation of any polymorphic entity, as that could cause |
| 38 | an impure FINAL subroutine to be called. |
| 39 | * Messing with the IEEE floating-point control and status flags. |
| 40 | * Accepting some restrictions on data flow between iterations |
| 41 | (i.e., none) and on liveness of modified objects after the loop. |
| 42 | (The details are spelled out later.) |
| 43 | |
| 44 | In return for accepting these restrictions, a `DO CONCURRENT` might |
| 45 | compile into code that exploits the parallel features of the target |
| 46 | machine to run the iterations of the `DO CONCURRENT` construct. |
| 47 | One needn't necessarily require OpenACC or OpenMP directives. |
| 48 | |
| 49 | But it turns out that these rules, though *necessary* for safe parallel |
| 50 | execution, are not *sufficient*. |
| 51 | One may write conforming `DO CONCURRENT` constructs that cannot |
| 52 | be safely parallelized by a compiler; worse, one may write conforming |
| 53 | `DO CONCURRENT` constructs whose parallelizability a compiler cannot |
| 54 | determine even in principle -- forcing a conforming compiler to |
| 55 | assume the worst and generate sequential code. |
| 56 | |
| 57 | ## Localization |
| 58 | |
| 59 | The Fortran language standard does not actually define `DO CONCURRENT` as a |
| 60 | concurrent construct, or even as a construct that imposes sufficient |
| 61 | requirements on the programmer to allow for parallel execution. |
| 62 | `DO CONCURRENT` is instead defined as executing the iterations |
| 63 | of the loop in some arbitrary order (see subclause 11.1.7.4.3 paragraph 3). |
| 64 | |
| 65 | A `DO CONCURRENT` construct cannot modify an object in one iteration |
| 66 | and expect to be able to read it in another, or read it in one before it gets |
| 67 | modified by another -- there's no way to synchronize inter-iteration |
| 68 | communication with critical sections or atomics. |
| 69 | |
| 70 | But a conforming `DO CONCURRENT` construct *can* modify an object in |
| 71 | multiple iterations of the loop so long as its only reads from that |
| 72 | object *after* having modified it earler in the *same* iteration. |
| 73 | (See 11.1.7.5 paragraph 4 for the details.) |
| 74 | |
| 75 | For example: |
| 76 | |
| 77 | ``` |
| 78 | DO CONCURRENT (J=1:N) |
| 79 | TMP = A(J) + B(J) |
| 80 | C(J) = TMP |
| 81 | END DO |
| 82 | ! And TMP is undefined afterwards |
| 83 | ``` |
| 84 | |
| 85 | The scalar variable `TMP` is used in this loop in a way that conforms |
| 86 | to the standard, as every use of `TMP` follows a definition that appears |
| 87 | earlier in the same iteration. |
| 88 | |
| 89 | The idea, of course, is that a parallelizing compiler isn't required to |
| 90 | use the same word of memory to hold the value of `TMP`; |
| 91 | for parallel execution, `TMP` can be _localized_. |
| 92 | This means that the loop can be internally rewritten as if it had been |
| 93 | ``` |
| 94 | DO CONCURRENT (J=1:N) |
| 95 | BLOCK |
| 96 | REAL :: TMP |
| 97 | TMP = A(J) + B(J) |
| 98 | C(J) = TMP |
| 99 | END BLOCK |
| 100 | END DO |
| 101 | ``` |
| 102 | and thus any risk of data flow between the iterations is removed. |
| 103 | |
| 104 | ## The identification problem |
| 105 | |
| 106 | The automatic localization rules of `DO CONCURRENT` that allow |
| 107 | usage like `TMP` above are not limited to simple local scalar |
| 108 | variables. |
| 109 | They also apply to arbitrary variables, and thus may apply |
| 110 | in cases that a compiler cannot determine exactly due to |
| 111 | the presence of indexing, indirection, and interprocedural data flow. |
| 112 | |
| 113 | Let's see why this turns out to be a problem. |
| 114 | |
| 115 | Examples: |
| 116 | ``` |
| 117 | DO CONCURRENT (J=1:N) |
| 118 | T(IX(J)) = A(J) + B(J) |
| 119 | C(J) = T(IY(J)) |
| 120 | END DO |
| 121 | ``` |
| 122 | This loop conforms to the standard language if, |
| 123 | whenever `IX(J)` equals `IY(J')` for any distinct pair of iterations |
| 124 | `J` and `J'`, |
| 125 | then the load must be reading a value stored earlier in the |
| 126 | same iteration -- so `IX(J')==IY(J')`, and hence `IX(J)==IX(J')` too, |
| 127 | in this example. |
| 128 | Otherwise, a load in one iteration might depend on a store |
| 129 | in another. |
| 130 | |
| 131 | When all values of `IX(J)` are distinct, and the program conforms |
| 132 | to the restrictions of `DO CONCURRENT`, a compiler can parallelize |
| 133 | the construct easily without applying localization to `T(...)`. |
| 134 | And when some values of `IX(J)` are duplicates, a compiler can parallelize |
| 135 | the loop by forwarding the stored value to the load in those |
| 136 | iterations. |
| 137 | But at compilation time, there's _no way to distinguish_ these |
| 138 | cases in general, and a conservative implementation has to assume |
| 139 | the worst and run the loop's iterations serially. |
| 140 | (Or compare `IX(J)` with `IY(J)` at runtime and forward the |
| 141 | stored value conditionally, which adds overhead and becomes |
| 142 | quickly impractical in loops with multiple loads and stores.) |
| 143 | |
| 144 | In |
| 145 | ``` |
| 146 | TYPE :: T |
| 147 | REAL, POINTER :: P |
| 148 | END TYPE |
| 149 | TYPE(T) :: T1(N), T2(N) |
| 150 | DO CONCURRENT (J=1:N) |
| 151 | T1(J)%P = A(J) + B(J) |
| 152 | C(J) = T2(J)%P |
| 153 | END DO |
| 154 | ``` |
| 155 | we have the same kind of ambiguity from the compiler's perspective. |
| 156 | Are the targets of the pointers used for the stores all distinct |
| 157 | from the targets of the pointers used for the loads? |
| 158 | The programmer may know that they are so, but a compiler |
| 159 | cannot; and there is no syntax by which one can stipulate |
| 160 | that they are so. |
| 161 | |
| 162 | ## The global variable localization problem |
| 163 | |
| 164 | Here's another case: |
| 165 | ``` |
| 166 | MODULE M |
| 167 | REAL :: T |
| 168 | END MODULE |
| 169 | ... |
| 170 | USE M |
| 171 | INTERFACE |
| 172 | PURE REAL FUNCTION F(X) |
| 173 | REAL, INTENT(IN) :: X |
| 174 | END FUNCTION |
| 175 | END INTERFACE |
| 176 | DO CONCURRENT (J=1:N) |
| 177 | T = A(J) + B(J) |
| 178 | D(J) = F(A(J)) + T |
| 179 | END DO |
| 180 | ``` |
| 181 | The variable `T` is obviously meant to be localized. |
| 182 | However, a compiler can't be sure that the pure function `F` |
| 183 | doesn't read from `T`; if it does, there wouldn't be a |
| 184 | practical way to convey the localized copy to it. |
| 185 | |
| 186 | In summary, standard Fortran defines `DO CONCURRENT` as a serial |
| 187 | construct with a sheaf of constraints that we assume are intended |
| 188 | to enable straightforward parallelization without |
| 189 | all of the complexity of defining threading models or shared memory semantics, |
| 190 | with the addition of an automatic localization rule that provides |
| 191 | convenient temporaries objects without requiring the use of nested |
| 192 | `BLOCK` or `ASSOCIATE` constructs. |
| 193 | But the language allows ambiguous cases in which a compiler can neither |
| 194 | 1. prove that automatic localization *is* required for a given |
| 195 | object in every iteration, nor |
| 196 | 1. prove that automatic localization *isn't* required in any iteration. |
| 197 | |
| 198 | ## Locality specifiers |
| 199 | |
| 200 | The Fortran 2018 standard added "locality specifiers" to the |
| 201 | `DO CONCURRENT` statement. |
| 202 | These allow one to define some variable names as being `LOCAL` or |
| 203 | `SHARED`, overriding the automatic localization rule so that it |
| 204 | applies only in the remaining cases of "unspecified" locality. |
| 205 | |
| 206 | `LOCAL` variables are those that can be defined by more than one |
| 207 | iteration but are referenced only after having been defined |
| 208 | earlier in the same iteration. |
| 209 | `SHARED` variables are those that, if defined in |
| 210 | any iteration, are not defined or referenced in any other iteration. |
| 211 | |
| 212 | (There is also a `LOCAL_INIT` specifier that is not relevant to the |
| 213 | problem at hand, and a `DEFAULT(NONE)` specifier that requires a |
| 214 | locality specifier be present for every variable mentioned in the |
| 215 | `DO CONCURRENT` construct.) |
| 216 | |
| 217 | These locality specifiers can help resolve some otherwise ambiguous |
| 218 | cases of localization, but they're not a complete solution to the problems |
| 219 | described above. |
| 220 | |
| 221 | First, the specifiers allow explicit localization of objects |
| 222 | (like the scalar `T` in `MODULE M` above) that are not local variables |
| 223 | of the subprogram. |
| 224 | `DO CONCURRENT` still allows a pure procedure called from the loop |
| 225 | to reference `T`, and so explicit localization just confirms the |
| 226 | worst-case assumptions about interprocedural data flow |
| 227 | within an iteration that a compiler must make anyway. |
| 228 | |
| 229 | Second, the specifiers allow arbitary variables to be localized, |
| 230 | not just scalars. |
| 231 | One may localize a million-element array of derived type |
| 232 | with allocatable components to be created in each iteration, |
| 233 | for example. |
| 234 | (It is not clear whether localized objects are finalized; |
| 235 | probably not.) |
| 236 | |
| 237 | Third, as Fortran uses context to distinguish references to |
| 238 | pointers from (de)references to their targets, it's not clear |
| 239 | whether `LOCAL(PTR)` localizes a pointer, its target, or both. |
| 240 | |
| 241 | Fourth, the specifiers can be applied only to variable _names_, |
| 242 | not to any designator with subscripts or component references. |
| 243 | One may have defined a derived type to hold a representation |
| 244 | of a sparse matrix, using `ALLOCATABLE` components to store its |
| 245 | packed data and indexing structures, but a program cannot localize |
| 246 | some parts of it and share the rest. |
| 247 | (Perhaps one may wrap `ASSOCIATE` constructs around the |
| 248 | `DO CONCURRENT` construct; |
| 249 | the interaction between locality specifiers and construct entities is |
| 250 | not clearly defined in the language.) |
| 251 | |
| 252 | In the example above that defines `T(IX(J))` and reads from `T(IY(J))`, |
| 253 | the locality specifiers can't be used to share those elements of `T()` |
| 254 | that are modified at most once and localize the cases where |
| 255 | `IX(J)` is a duplicate and `IY(J)==IX(J)`. |
| 256 | |
| 257 | Last, when a loop both defines and references many shared objects, |
| 258 | including potential references to globally accessible object |
| 259 | in called procedures, one may need to name all of them in a `SHARED` |
| 260 | specifier. |
| 261 | |
| 262 | ## What to do now |
| 263 | |
| 264 | These problems have been presented to the J3 Fortran language |
| 265 | standard committee. |
| 266 | Their responses in |
| 267 | recent [e-mail discussions](https://mailman.j3-fortran.org/pipermail/j3/2020-July/thread.html) |
| 268 | did not include an intent to address them in future standards or corrigenda. |
| 269 | The most effective-looking response -- which was essentially "just use |
| 270 | `DEFAULT(SHARED)` to disable all automatic localization" -- is not an |
| 271 | viable option, since the language does not include such a specifier! |
| 272 | |
| 273 | Programmers writing `DO CONCURRENT` loops that are safely parallelizable |
| 274 | need an effective means to convey to compilers that those compilers |
| 275 | do not have to assume only the weaker stipulations required by |
| 276 | today's `DO CONCURRENT` without having to write verbose and |
| 277 | error-prone locality specifiers (when those would suffice). |
| 278 | Specifically, an easy means is required that stipulates that localization |
| 279 | should apply at most only to the obvious cases of local non-pointer |
| 280 | non-allocatable scalars. |
| 281 | |
| 282 | In the LLVM Fortran compiler project (a/k/a "flang", "f18") we considered |
| 283 | several solutions to this problem. |
| 284 | 1. Add syntax (e.g., `DO PARALLEL` or `DO CONCURRENT() DEFAULT(PARALLEL)`) |
| 285 | by which one can inform the compiler that it should localize only |
| 286 | the obvious cases of simple local scalars. |
| 287 | Such syntax seems unlikely to ever be standardized, so its usage |
| 288 | would be nonportable. |
| 289 | 1. Add a command-line option &/or a source directive to stipulate |
| 290 | the stronger guarantees. Obvious non-parallelizable usage in the construct |
| 291 | would elicit a stern warning. The `DO CONCURRENT` loops in the source |
| 292 | would continue to be portable to other compilers. |
| 293 | 1. Assume that these stronger conditions hold by default, and add a command-line |
| 294 | option &/or a source directive to "opt out" back to the weaker |
| 295 | requirements of the standard language |
| 296 | in the event that the program contains one of those inherently |
| 297 | non-parallelizable `DO CONCURRENT` loops that perhaps should never have |
| 298 | been possible to write in a conforming program in the first place. |
| 299 | Actual parallel `DO CONCURRENT` constructs would produce parallel |
| 300 | code for users who would otherwise be surprised to learn about these |
| 301 | problems in the language. |
| 302 | But this option could lead to non-standard behavior for codes that depend, |
| 303 | accidentally or not, on non-parallelizable implicit localization. |
| 304 | 1. Accept the standard as it exists, do the best job of automatic |
| 305 | parallelization that can be done, and refer dissatisfied users to J3. |
| 306 | This would be avoiding the problem. |
| 307 | |
| 308 | None of these options is without a fairly obvious disadvantage. |
| 309 | The best option seems to be the one that assumes that users who write |
| 310 | `DO CONCURRENT` constructs are doing so with the intent to write parallel code. |
| 311 | |
| 312 | ## Other precedents |
| 313 | |
| 314 | As of August 2020, we observe that the GNU Fortran compiler (10.1) does not |
| 315 | yet implement the Fortran 2018 locality clauses, but will parallelize some |
| 316 | `DO CONCURRENT` constructs without ambiguous data dependences when the automatic |
| 317 | parallelization option is enabled. |
| 318 | |
| 319 | The Intel Fortran compiler supports the new locality clauses and will parallelize |
| 320 | some `DO CONCURRENT` constructs when automatic parallelization option is enabled. |
| 321 | When OpenMP is enabled, ifort reports that all `DO CONCURRENT` constructs are |
| 322 | parallelized, but they seem to execute in a serial fashion when data flow |
| 323 | hazards are present. |