Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 1 | ============================= |
| 2 | Offloading Design & Internals |
| 3 | ============================= |
| 4 | |
| 5 | .. contents:: |
| 6 | :local: |
| 7 | |
| 8 | Introduction |
| 9 | ============ |
| 10 | |
| 11 | This document describes the Clang driver and code generation steps for creating |
| 12 | offloading applications. Clang supports offloading to various architectures |
| 13 | using programming models like CUDA, HIP, and OpenMP. The purpose of this |
| 14 | document is to illustrate the steps necessary to create an offloading |
| 15 | application using Clang. |
| 16 | |
| 17 | OpenMP Offloading |
| 18 | ================= |
| 19 | |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 20 | Clang supports OpenMP target offloading to several different architectures such |
| 21 | as NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated by |
| 22 | Clang and then executed using the ``libomptarget`` runtime and the associated |
| 23 | plugin for the target architecture, e.g. ``libomptarget.rtl.cuda``. This section |
| 24 | describes the steps necessary to create a functioning device image that can be |
| 25 | loaded by the OpenMP runtime. More information on the OpenMP runtimes can be |
| 26 | found at the `OpenMP documentation page <https://openmp.llvm.org>`__. |
| 27 | |
| 28 | .. _Offloading Overview: |
| 29 | |
| 30 | Offloading Overview |
| 31 | ------------------- |
| 32 | |
| 33 | The goal of offloading compilation is to create an executable device image that |
| 34 | can be run on the target device. OpenMP offloading creates executable images by |
| 35 | compiling the input file for both the host and the target device. The output |
| 36 | from the device phase then needs to be embedded into the host to create a fat |
| 37 | object. A special tool then needs to extract the device code from the fat |
| 38 | objects, run the device linking step, and embed the final image in a symbol the |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 39 | host runtime library can use to register the library and access the symbols on |
| 40 | the device. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 41 | |
| 42 | Compilation Process |
| 43 | ^^^^^^^^^^^^^^^^^^^ |
| 44 | |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 45 | The compiler performs the following high-level actions to generate OpenMP |
| 46 | offloading code: |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 47 | |
| 48 | * Compile the input file for the host to produce a bitcode file. Lower ``#pragma |
| 49 | omp target`` declarations to :ref:`offloading entries <Generating Offloading |
| 50 | Entries>` and create metadata to indicate which entries are on the device. |
| 51 | * Compile the input file for the target :ref:`device <Device Compilation>` using |
| 52 | the :ref:`offloading entry <Generating Offloading Entries>` metadata created |
| 53 | by the host. |
| 54 | * Link the OpenMP device runtime library and run the backend to create a device |
| 55 | object file. |
| 56 | * Run the backend on the host bitcode file and create a :ref:`fat object file |
| 57 | <Creating Fat Objects>` using the device object file. |
| 58 | * Pass the fat object file to the :ref:`linker wrapper tool <Device Linking>` |
| 59 | and extract the device objects. Run the device linking action on the extracted |
| 60 | objects. |
| 61 | * :ref:`Wrap <Device Binary Wrapping>` the :ref:`device images <Device linking>` |
| 62 | and :ref:`offload entries <Generating Offloading Entries>` in a symbol that |
| 63 | can be accessed by the host. |
| 64 | * Add the :ref:`wrapped binary <Device Binary Wrapping>` to the linker input and |
| 65 | run the host linking action. Link with ``libomptarget`` to register and |
| 66 | execute the images. |
| 67 | |
| 68 | .. _Generating Offloading Entries: |
| 69 | |
| 70 | Generating Offloading Entries |
| 71 | ----------------------------- |
| 72 | |
| 73 | The first step in compilation is to generate offloading entries for the host. |
| 74 | This information is used to identify function kernels or global values that will |
| 75 | be provided by the device. Blocks contained in a ``#pragma omp target`` or |
| 76 | symbols inside a ``#pragma omp declare target`` directive will have offloading |
| 77 | entries generated. The following table shows the :ref:`offload entry structure |
| 78 | <table-tgt_offload_entry_structure>`. |
| 79 | |
| 80 | .. table:: __tgt_offload_entry Structure |
| 81 | :name: table-tgt_offload_entry_structure |
| 82 | |
| 83 | +---------+------------+------------------------------------------------------------------------+ |
| 84 | | Type | Identifier | Description | |
| 85 | +=========+============+========================================================================+ |
| 86 | | void* | addr | Address of global symbol within device image (function or global) | |
| 87 | +---------+------------+------------------------------------------------------------------------+ |
| 88 | | char* | name | Name of the symbol | |
| 89 | +---------+------------+------------------------------------------------------------------------+ |
| 90 | | size_t | size | Size of the entry info (0 if it is a function) | |
| 91 | +---------+------------+------------------------------------------------------------------------+ |
| 92 | | int32_t | flags | Flags associated with the entry (see :ref:`table-offload_entry_flags`) | |
| 93 | +---------+------------+------------------------------------------------------------------------+ |
| 94 | | int32_t | reserved | Reserved, to be used by the runtime library. | |
| 95 | +---------+------------+------------------------------------------------------------------------+ |
| 96 | |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 97 | The address of the global symbol will be set to the device pointer value by the |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 98 | runtime once the device image is loaded. The flags are set to indicate the |
| 99 | handling required for the offloading entry. If the offloading entry is an entry |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 100 | to a target region it can have one of the following :ref:`entry flags |
| 101 | <table-offload_entry_flags>`. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 102 | |
| 103 | .. table:: Target Region Entry Flags |
| 104 | :name: table-offload_entry_flags |
| 105 | |
| 106 | +----------------------------------+-------+-----------------------------------------+ |
| 107 | | Name | Value | Description | |
| 108 | +==================================+=======+=========================================+ |
| 109 | | OMPTargetRegionEntryTargetRegion | 0x00 | Mark the entry as generic target region | |
| 110 | +----------------------------------+-------+-----------------------------------------+ |
| 111 | | OMPTargetRegionEntryCtor | 0x02 | Mark the entry as a global constructor | |
| 112 | +----------------------------------+-------+-----------------------------------------+ |
| 113 | | OMPTargetRegionEntryDtor | 0x04 | Mark the entry as a global destructor | |
| 114 | +----------------------------------+-------+-----------------------------------------+ |
| 115 | |
| 116 | If the offloading entry is a global variable, indicated by a non-zero size, it |
| 117 | will instead have one of the following :ref:`global |
| 118 | <table-offload_global_flags>` flags. |
| 119 | |
| 120 | .. table:: Target Region Global |
| 121 | :name: table-offload_global_flags |
| 122 | |
| 123 | +-----------------------------+-------+---------------------------------------------------------------+ |
| 124 | | Name | Value | Description | |
| 125 | +=============================+=======+===============================================================+ |
| 126 | | OMPTargetGlobalVarEntryTo | 0x00 | Mark the entry as a 'to' attribute (w.r.t. the to clause) | |
| 127 | +-----------------------------+-------+---------------------------------------------------------------+ |
| 128 | | OMPTargetGlobalVarEntryLink | 0x01 | Mark the entry as a 'link' attribute (w.r.t. the link clause) | |
| 129 | +-----------------------------+-------+---------------------------------------------------------------+ |
| 130 | |
| 131 | The target offload entries are used by the runtime to access the device kernels |
| 132 | and globals that will be provided by the final device image. Each offloading |
| 133 | entry is set to use the ``omp_offloading_entries`` section. When the final |
| 134 | application is created the linker will provide the |
| 135 | ``__start_omp_offloading_entries`` and ``__stop_omp_offloading_entries`` symbols |
| 136 | which are used to create the :ref:`final image <Device Binary Wrapping>`. |
| 137 | |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 138 | This information is used by the device compilation stage to determine which |
| 139 | symbols need to be exported from the device. We use the ``omp_offload.info`` |
| 140 | metadata node to pass this information device compilation stage. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 141 | |
| 142 | Accessing Entries on the Device |
| 143 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 144 | |
| 145 | Accessing the entries in the device is done using the address field in the |
| 146 | :ref:`offload entry<table-tgt_offload_entry_structure>`. The runtime will set |
| 147 | the address to the pointer associated with the device image during runtime |
| 148 | initialization. This is used to call the corresponding kernel function when |
| 149 | entering a ``#pragma omp target`` region. For variables, the runtime maintains a |
| 150 | table mapping host pointers to device pointers. Global variables inside a |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 151 | ``#pragma omp target declare`` directive are first initialized to the host's |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 152 | address. Once the device address is initialized we insert it into the table to |
| 153 | map the host address to the device address. |
| 154 | |
| 155 | Debugging Information |
| 156 | ^^^^^^^^^^^^^^^^^^^^^ |
| 157 | |
| 158 | We generate structures to hold debugging information that is passed to |
| 159 | ``libomptarget``. This allows the front-end to generate information the runtime |
| 160 | library uses for more informative error messages. This is done using the |
| 161 | standard :ref:`identifier structure <table-ident_t_structure>` used in |
| 162 | ``libomp`` and ``libomptarget``. This is used to pass information and source |
| 163 | locations to the runtime. |
| 164 | |
| 165 | .. table:: ident_t Structure |
| 166 | :name: table-ident_t_structure |
| 167 | |
| 168 | +---------+------------+-----------------------------------------------------------------------------+ |
| 169 | | Type | Identifier | Description | |
| 170 | +=========+============+=============================================================================+ |
| 171 | | int32_t | reserved | Reserved, to be used by the runtime library. | |
| 172 | +---------+------------+-----------------------------------------------------------------------------+ |
| 173 | | int32_t | flags | Flags used to indicate some features, mostly unused. | |
| 174 | +---------+------------+-----------------------------------------------------------------------------+ |
| 175 | | int32_t | reserved | Reserved, to be used by the runtime library. | |
| 176 | +---------+------------+-----------------------------------------------------------------------------+ |
| 177 | | int32_t | reserved | Reserved, to be used by the runtime library. | |
| 178 | +---------+------------+-----------------------------------------------------------------------------+ |
| 179 | | char* | psource | Program source information, stored as ";filename;function;line;column;;\\0" | |
| 180 | +---------+------------+-----------------------------------------------------------------------------+ |
| 181 | |
| 182 | If debugging information is enabled, we will also create strings to indicate the |
| 183 | names and declarations of variables mapped in target regions. These have the |
| 184 | same format as the source location in the :ref:`identifier structure |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 185 | <table-ident_t_structure>`, but the function name is replaced with the variable |
| 186 | name. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 187 | |
| 188 | .. _Device Compilation: |
| 189 | |
| 190 | Offload Device Compilation |
| 191 | -------------------------- |
| 192 | |
| 193 | The input file is compiled for each active device toolchain. The device |
| 194 | compilation stage is performed differently from the host stage. Namely, we do |
| 195 | not generate any offloading entries. This is set by passing the |
Sergio Afonso | 63ca93c | 2023-07-06 09:28:24 | [diff] [blame] | 196 | ``-fopenmp-is-target-device`` flag to the front-end. We use the host bitcode to |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 197 | determine which symbols to export from the device. The bitcode file is passed in |
| 198 | from the previous stage using the ``-fopenmp-host-ir-file-path`` flag. |
| 199 | Compilation is otherwise performed as it would be for any other target triple. |
| 200 | |
| 201 | When compiling for the OpenMP device, we set the visibility of all device |
| 202 | symbols to be ``protected`` by default. This improves performance and prevents a |
| 203 | class of errors where a symbol in the target device could preempt a host |
| 204 | library. |
| 205 | |
| 206 | The OpenMP runtime library is linked in during compilation to provide the |
| 207 | implementations for standard OpenMP functionality. For GPU targets this is done |
| 208 | by linking in a special bitcode library during compilation, (e.g. |
| 209 | ``libomptarget-nvptx64-sm_70.bc``) using the ``-mlink-builtin-bitcode`` flag. |
| 210 | Other device libraries, such as CUDA's libdevice, are also linked this way. If |
| 211 | the target is a standard architecture with an existing ``libomp`` |
| 212 | implementation, that will be linked instead. Finally, device tools are used to |
| 213 | create a relocatable device object file that can be embedded in the host. |
| 214 | |
| 215 | .. _Creating Fat Objects: |
| 216 | |
| 217 | Creating Fat Objects |
| 218 | -------------------- |
| 219 | |
| 220 | A fat binary is a binary file that contains information intended for another |
| 221 | device. We create a fat object by embedding the output of the device compilation |
| 222 | stage into the host as a named section. The output from the device compilation |
| 223 | is passed to the host backend using the ``-fembed-offload-object`` flag. This |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 224 | embeds the device image into the ``.llvm.offloading`` section using a special |
| 225 | binary format that behaves like a string map. This binary format is used to |
| 226 | bundle metadata about the image so the linker can associate the proper device |
| 227 | linking action with the image. Each device image will start with the magic bytes |
| 228 | ``0x10FF10AD``. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 229 | |
| 230 | .. code-block:: llvm |
| 231 | |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 232 | @llvm.embedded.object = private constant [1 x i8] c"\00", section ".llvm.offloading" |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 233 | |
| 234 | The device code will then be placed in the corresponding section one the backend |
| 235 | is run on the host, creating a fat object. Using fat objects allows us to treat |
| 236 | offloading objects as standard host objects. The final object file should |
| 237 | contain the following :ref:`offloading sections <table-offloading_sections>`. We |
| 238 | will use this information when :ref:`Device Linking`. |
| 239 | |
| 240 | .. table:: Offloading Sections |
| 241 | :name: table-offloading_sections |
| 242 | |
Aaron Ballman | 16cea04 | 2023-01-24 13:21:53 | [diff] [blame] | 243 | +----------------------------------+------------------------------------------------------------------------------+ |
| 244 | | Section | Description | |
| 245 | +==================================+==============================================================================+ |
| 246 | | omp_offloading_entries | Offloading entry information (see :ref:`table-tgt_offload_entry_structure`) | |
| 247 | +----------------------------------+------------------------------------------------------------------------------+ |
| 248 | | .llvm.offloading | Embedded device object file for the target device and architecture | |
| 249 | +----------------------------------+------------------------------------------------------------------------------+ |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 250 | |
| 251 | .. _Device Linking: |
| 252 | |
| 253 | Linking Target Device Code |
| 254 | -------------------------- |
| 255 | |
| 256 | Objects containing :ref:`table-offloading_sections` require special handling to |
| 257 | create an executable device image. This is done using a Clang tool, see |
| 258 | :doc:`ClangLinkerWrapper` for more information. This tool works as a wrapper |
| 259 | over the host linking job. It scans the input object files for the offloading |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 260 | section ``.llvm.offloading``. The device files stored in this section are then |
Jose Manuel Monsalve Diaz | d6f6cd5 | 2022-06-08 23:35:11 | [diff] [blame] | 261 | extracted and passed to the appropriate linking job. The linked device image is |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 262 | then :ref:`wrapped <Device Binary Wrapping>` to create the symbols used to load |
| 263 | the device image and link it with the host. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 264 | |
| 265 | The linker wrapper tool supports linking bitcode files through link time |
| 266 | optimization (LTO). This is used whenever the object files embedded in the host |
| 267 | contain LLVM bitcode. Bitcode will be embedded for architectures that do not |
| 268 | support a relocatable object format, such as AMDGPU or SPIR-V, or if the user |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 269 | requested it using the ``-foffload-lto`` flag. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 270 | |
| 271 | .. _Device Binary Wrapping: |
| 272 | |
| 273 | Device Binary Wrapping |
| 274 | ---------------------- |
| 275 | |
| 276 | Various structures and functions are used to create the information necessary to |
| 277 | offload code on the device. We use the :ref:`linked device executable <Device |
| 278 | Linking>` with the corresponding offloading entries to create the symbols |
| 279 | necessary to load and execute the device image. |
| 280 | |
| 281 | Structure Types |
| 282 | ^^^^^^^^^^^^^^^ |
| 283 | |
| 284 | Several different structures are used to store offloading information. The |
| 285 | :ref:`device image structure <table-device_image_structure>` stores a single |
| 286 | linked device image and its associated offloading entries. The offloading |
| 287 | entries are stored using the ``__start_omp_offloading_entries`` and |
| 288 | ``__stop_omp_offloading_entries`` symbols generated by the linker using the |
Aaron Ballman | 16cea04 | 2023-01-24 13:21:53 | [diff] [blame] | 289 | :ref:`table-tgt_offload_entry_structure`. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 290 | |
| 291 | .. table:: __tgt_device_image Structure |
| 292 | :name: table-device_image_structure |
| 293 | |
| 294 | +----------------------+--------------+----------------------------------------+ |
| 295 | | Type | Identifier | Description | |
| 296 | +======================+==============+========================================+ |
| 297 | | void* | ImageStart | Pointer to the target code start | |
| 298 | +----------------------+--------------+----------------------------------------+ |
| 299 | | void* | ImageEnd | Pointer to the target code end | |
| 300 | +----------------------+--------------+----------------------------------------+ |
| 301 | | __tgt_offload_entry* | EntriesBegin | Begin of table with all target entries | |
| 302 | +----------------------+--------------+----------------------------------------+ |
| 303 | | __tgt_offload_entry* | EntriesEnd | End of table (non inclusive) | |
| 304 | +----------------------+--------------+----------------------------------------+ |
| 305 | |
| 306 | The target :ref:`target binary descriptor <table-target_binary_descriptor>` is |
| 307 | used to store all binary images and offloading entries in an array. |
| 308 | |
| 309 | .. table:: __tgt_bin_desc Structure |
| 310 | :name: table-target_binary_descriptor |
| 311 | |
| 312 | +----------------------+------------------+------------------------------------------+ |
| 313 | | Type | Identifier | Description | |
| 314 | +======================+==================+==========================================+ |
| 315 | | int32_t | NumDeviceImages | Number of device types supported | |
| 316 | +----------------------+------------------+------------------------------------------+ |
| 317 | | __tgt_device_image* | DeviceImages | Array of device images (1 per dev. type) | |
| 318 | +----------------------+------------------+------------------------------------------+ |
| 319 | | __tgt_offload_entry* | HostEntriesBegin | Begin of table with all host entries | |
| 320 | +----------------------+------------------+------------------------------------------+ |
| 321 | | __tgt_offload_entry* | HostEntriesEnd | End of table (non inclusive) | |
| 322 | +----------------------+------------------+------------------------------------------+ |
| 323 | |
| 324 | Global Variables |
| 325 | ---------------- |
| 326 | |
| 327 | :ref:`table-global_variables` lists various global variables, along with their |
| 328 | type and their explicit ELF sections, which are used to store device images and |
| 329 | related symbols. |
| 330 | |
| 331 | .. table:: Global Variables |
| 332 | :name: table-global_variables |
| 333 | |
| 334 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 335 | | Variable | Type | ELF Section | Description | |
| 336 | +================================+=====================+=========================+=========================================================+ |
| 337 | | __start_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | Begin symbol for the offload entries table. | |
| 338 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 339 | | __stop_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | End symbol for the offload entries table. | |
| 340 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 341 | | __dummy.omp_offloading.entry | __tgt_offload_entry | .omp_offloading_entries | Dummy zero-sized object in the offload entries | |
| 342 | | | | | section to force linker to define begin/end | |
| 343 | | | | | symbols defined above. | |
| 344 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 345 | | .omp_offloading.device_image | __tgt_device_image | .omp_offloading_entries | ELF device code object of the first image. | |
| 346 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 347 | | .omp_offloading.device_image.N | __tgt_device_image | .omp_offloading_entries | ELF device code object of the (N+1)th image. | |
| 348 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 349 | | .omp_offloading.device_images | __tgt_device_image | .omp_offloading_entries | Array of images. | |
| 350 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 351 | | .omp_offloading.descriptor | __tgt_bin_desc | .omp_offloading_entries | Binary descriptor object (see :ref:`binary_descriptor`) | |
| 352 | +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| 353 | |
| 354 | .. _binary_descriptor: |
| 355 | |
| 356 | Binary Descriptor for Device Images |
| 357 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 358 | |
| 359 | This object is passed to the offloading runtime at program startup and it |
| 360 | describes all device images available in the executable or shared library. It |
| 361 | is defined as follows: |
| 362 | |
| 363 | .. code-block:: c |
| 364 | |
| 365 | __attribute__((visibility("hidden"))) |
| 366 | extern __tgt_offload_entry *__start_omp_offloading_entries; |
| 367 | __attribute__((visibility("hidden"))) |
| 368 | extern __tgt_offload_entry *__stop_omp_offloading_entries; |
| 369 | static const char Image0[] = { <Bufs.front() contents> }; |
| 370 | ... |
| 371 | static const char ImageN[] = { <Bufs.back() contents> }; |
| 372 | static const __tgt_device_image Images[] = { |
| 373 | { |
| 374 | Image0, /*ImageStart*/ |
| 375 | Image0 + sizeof(Image0), /*ImageEnd*/ |
| 376 | __start_omp_offloading_entries, /*EntriesBegin*/ |
| 377 | __stop_omp_offloading_entries /*EntriesEnd*/ |
| 378 | }, |
| 379 | ... |
| 380 | { |
| 381 | ImageN, /*ImageStart*/ |
| 382 | ImageN + sizeof(ImageN), /*ImageEnd*/ |
| 383 | __start_omp_offloading_entries, /*EntriesBegin*/ |
| 384 | __stop_omp_offloading_entries /*EntriesEnd*/ |
| 385 | } |
| 386 | }; |
| 387 | static const __tgt_bin_desc BinDesc = { |
| 388 | sizeof(Images) / sizeof(Images[0]), /*NumDeviceImages*/ |
| 389 | Images, /*DeviceImages*/ |
| 390 | __start_omp_offloading_entries, /*HostEntriesBegin*/ |
| 391 | __stop_omp_offloading_entries /*HostEntriesEnd*/ |
| 392 | }; |
| 393 | |
Joseph Huber | 5c9ee35 | 2022-02-07 20:23:50 | [diff] [blame] | 394 | |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 395 | Global Constructor and Destructor |
| 396 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 397 | |
| 398 | The global constructor (``.omp_offloading.descriptor_reg()``) registers the |
| 399 | device images with the runtime by calling the ``__tgt_register_lib()`` runtime |
| 400 | function. The constructor is explicitly defined in ``.text.startup`` section and |
| 401 | is run once when the program starts. Similarly, the global destructor |
| 402 | (``.omp_offloading.descriptor_unreg()``) calls ``__tgt_unregister_lib()`` for |
| 403 | the destructor and is also defined in ``.text.startup`` section and run when the |
| 404 | program exits. |
| 405 | |
| 406 | Offloading Example |
| 407 | ------------------ |
| 408 | |
| 409 | This section contains a simple example of generating offloading code using |
| 410 | OpenMP offloading. We will use a simple ``ZAXPY`` BLAS routine. |
| 411 | |
| 412 | .. code-block:: c++ |
| 413 | |
| 414 | #include <complex> |
| 415 | |
| 416 | using complex = std::complex<double>; |
| 417 | |
| 418 | void zaxpy(complex *X, complex *Y, complex D, std::size_t N) { |
| 419 | #pragma omp target teams distribute parallel for |
| 420 | for (std::size_t i = 0; i < N; ++i) |
| 421 | Y[i] = D * X[i] + Y[i]; |
| 422 | } |
| 423 | |
| 424 | int main() { |
| 425 | const std::size_t N = 1024; |
| 426 | complex X[N], Y[N], D; |
| 427 | #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N]) |
| 428 | zaxpy(X, Y, D, N); |
| 429 | } |
| 430 | |
| 431 | This code is compiled using the following Clang flags. |
| 432 | |
| 433 | .. code-block:: console |
| 434 | |
| 435 | $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 zaxpy.cpp -c |
| 436 | |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 437 | The output section in the object file can be seen using the ``readelf`` utility. |
| 438 | The ``.llvm.offloading`` section has the ``SHF_EXCLUDE`` flag so it will be |
| 439 | removed from the final executable or shared library by the linker. |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 440 | |
| 441 | .. code-block:: text |
| 442 | |
| 443 | $ llvm-readelf -WS zaxpy.o |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 444 | Section Headers: |
| 445 | [Nr] Name Type Address Off Size ES Flg Lk Inf Al |
| 446 | [11] omp_offloading_entries PROGBITS 0000000000000000 0001f0 000040 00 A 0 0 1 |
| 447 | [12] .llvm.offloading PROGBITS 0000000000000000 000260 030950 00 E 0 0 8 |
| 448 | |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 449 | |
| 450 | Compiling this file again will invoke the ``clang-linker-wrapper`` utility to |
| 451 | extract and link the device code stored at the section named |
Joseph Huber | 15e62062 | 2022-04-18 19:01:55 | [diff] [blame] | 452 | ``.llvm.offloading`` and then use entries stored in |
Joseph Huber | 28ab594 | 2022-02-02 18:06:55 | [diff] [blame] | 453 | the section named ``omp_offloading_entries`` to create the symbols necessary for |
| 454 | ``libomptarget`` to register the device image and call the entry function. |
| 455 | |
| 456 | .. code-block:: console |
| 457 | |
| 458 | $ clang++ -fopenmp -fopenmp-targets=nvptx64 zaxpy.o -o zaxpy |
| 459 | $ ./zaxpy |
| 460 | |
| 461 | We can see the steps created by clang to generate the offloading code using the |
| 462 | ``-ccc-print-phases`` option in Clang. This matches the description in |
| 463 | :ref:`Offloading Overview`. |
| 464 | |
| 465 | .. code-block:: console |
| 466 | |
| 467 | $ clang++ -fopenmp -fopenmp-targets=nvptx64 -ccc-print-phases zaxpy.cpp |
| 468 | # "x86_64-unknown-linux-gnu" - "clang", inputs: ["zaxpy.cpp"], output: "/tmp/zaxpy-host.bc" |
| 469 | # "nvptx64-nvidia-cuda" - "clang", inputs: ["zaxpy.cpp", "/tmp/zaxpy-e6a41b.bc"], output: "/tmp/zaxpy-07f434.s" |
| 470 | # "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/tmp/zaxpy-07f434.s"], output: "/tmp/zaxpy-0af7b7.o" |
| 471 | # "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/zaxpy-e6a41b.bc", "/tmp/zaxpy-0af7b7.o"], output: "/tmp/zaxpy-416cad.o" |
| 472 | # "x86_64-unknown-linux-gnu" - "Offload::Linker", inputs: ["/tmp/zaxpy-416cad.o"], output: "a.out" |
Joseph Huber | 5c84054 | 2024-02-07 14:20:07 | [diff] [blame] | 473 | |
| 474 | Relocatable Linking |
| 475 | ------------------- |
| 476 | |
Joseph Huber | 347ab99 | 2024-02-07 19:03:14 | [diff] [blame] | 477 | The offloading compilation pipeline normally will defer the final device linking |
| 478 | and runtime registration until the ``clang-linker-wrapper`` is run to create the |
| 479 | executable. This is the standard behaviour when compiling for OpenMP offloading |
| 480 | or CUDA and HIP in ``-fgpu-rdc`` mode. However, there are some cases where the |
| 481 | user may wish to perform this device handling prematurely. This is described in |
Joseph Huber | 5c84054 | 2024-02-07 14:20:07 | [diff] [blame] | 482 | the :doc:`linker wrapper documentation<ClangLinkerWrapper>`. |
| 483 | |
Joseph Huber | 347ab99 | 2024-02-07 19:03:14 | [diff] [blame] | 484 | Effectively, this allows the user to handle offloading specific linking ahead of |
| 485 | time when shipping objects or static libraries. This can be thought of as |
| 486 | performing a standard ``-fno-gpu-rdc`` compilation on a subset of object files. |
| 487 | This can be useful to reduce link time, prevent users from interacting with the |
Joseph Huber | 5c84054 | 2024-02-07 14:20:07 | [diff] [blame] | 488 | library's device code, or for shipping libraries to incompatible compilers. |
| 489 | |
Joseph Huber | 347ab99 | 2024-02-07 19:03:14 | [diff] [blame] | 490 | Normally, if a relocatable link is done using ``clang -r`` it will simply merge |
| 491 | the ``.llvm.offloading`` sections which will then be linked later when the |
| 492 | executable is created. However, if the ``-r`` flag is used with the offloading |
| 493 | toolchain, it will perform the device linking and registration phases and then |
Joseph Huber | 5c84054 | 2024-02-07 14:20:07 | [diff] [blame] | 494 | merge the registration code into the final relocatable object file. |
| 495 | |
Joseph Huber | 347ab99 | 2024-02-07 19:03:14 | [diff] [blame] | 496 | The following example shows how using the relocatable link with the offloading |
| 497 | pipeline can create a static library with offloading code that can be |
Joseph Huber | 5c84054 | 2024-02-07 14:20:07 | [diff] [blame] | 498 | redistributed without requiring any additional handling. |
| 499 | |
| 500 | .. code-block:: console |
| 501 | |
| 502 | $ clang++ -fopenmp -fopenmp-targets=nvptx64 foo.cpp -c |
| 503 | $ clang++ -lomptarget.devicertl --offload-link -r foo.o -o merged.o |
| 504 | $ llvm-ar rcs libfoo.a merged.o |
| 505 | # g++ app.cpp -L. -lfoo |
| 506 | |