| # GPU Bot Details |
| |
| This page describes in detail how the GPU bots are set up, which files affect |
| their configuration, and how to both modify their behavior and add new bots. |
| |
| [TOC] |
| |
| ## Overview of the GPU bots' setup |
| |
| Unlike the majority of the project's test machines, Chromium's GPU bots are |
| physical pieces of hardware. When end users run the Chrome browser, they |
| are almost surely running it on a physical piece of hardware with a real |
| graphics processor. Some portions of the code base simply cannot be exercised |
| by running the browser in a virtual machine, or on a software |
| implementation of the underlying graphics libraries. The GPU bots were |
| developed and deployed in order to cover these code paths, and avoid |
| regressions that are otherwise inevitable in a project the size of the Chromium |
| browser. |
| |
| The GPU bots are used on the [chromium.gpu] and [chromium.gpu.fyi] |
| waterfalls, and various tryservers, as described in [Using the GPU Bots]. |
| |
| [chromium.gpu]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console |
| [chromium.gpu.fyi]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console |
| [Using the GPU Bots]: gpu_testing.md#Using-the-GPU-Bots |
| |
| All of the physical hardware for the bots lives in the Swarming pool, and most |
| of it in the chromium.tests.gpu Swarming pool. The waterfall bots are simply |
| virtual machines which spawn Swarming tasks with the appropriate tags to get |
| them to run on the desired GPU and operating system type. So, for example, the |
| [Win10 x64 Release (NVIDIA)] bot is actually a virtual machine which spawns all |
| of its jobs with the Swarming parameters: |
| |
| [Win10 x64 Release (NVIDIA)]: https://ci.chromium.org/p/chromium/builders/ci/Win10%20x64%20Release%20%28NVIDIA%29 |
| |
| ```json |
| { |
| "gpu": "10de:2184", |
| "os": "Windows-10", |
| "pool": "chromium.tests.gpu" |
| } |
| ``` |
| |
| Since the GPUs in the Swarming pool are mostly homogeneous, this is sufficient |
| to target the pool of Windows 10-like NVIDIA machines. (There are a few Windows |
| 7-like NVIDIA bots in the pool, which necessitates the OS specifier.) |
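
As a rough illustration of why the `os` dimension matters, the sketch below models dimension matching in plain Python. The bot names and inventory are hypothetical, and the matching rule (every requested key/value pair must appear among the values a bot advertises) is a simplification of what Swarming does. The Win7-like OS string is the one mentioned later on this page.

```python
# Hypothetical inventory: each bot advertises a set of values per dimension.
BOTS = {
    "win10-nvidia-1": {
        "gpu": {"10de", "10de:2184"},
        "os": {"Windows", "Windows-10"},
        "pool": {"chromium.tests.gpu"},
    },
    "win10-nvidia-2": {
        "gpu": {"10de", "10de:2184"},
        "os": {"Windows", "Windows-10"},
        "pool": {"chromium.tests.gpu"},
    },
    "win7-nvidia-1": {
        "gpu": {"10de", "10de:2184"},
        "os": {"Windows", "Windows-2008ServerR2-SP1"},
        "pool": {"chromium.tests.gpu"},
    },
}


def matching_bots(request, bots=BOTS):
    """Return the names of bots whose dimensions satisfy every requested pair."""
    return sorted(
        name
        for name, dims in bots.items()
        if all(value in dims.get(key, set()) for key, value in request.items())
    )


if __name__ == "__main__":
    full = {"gpu": "10de:2184", "os": "Windows-10", "pool": "chromium.tests.gpu"}
    no_os = {"gpu": "10de:2184", "pool": "chromium.tests.gpu"}
    print(matching_bots(full))   # only the Windows 10 machines
    print(matching_bots(no_os))  # also picks up the Win7-like machine
```

In this model, dropping the `os` key makes the request match the Win7-like machine as well, which is the collision the OS specifier is there to prevent.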
| |
| Details about the bots can be found on [chromium-swarm.appspot.com] and by |
| using `src/tools/luci-go/swarming`, for example `swarming bots`. |
| If you are authenticated with @google.com credentials you will be able to make |
| queries of the bots and see, for example, which GPUs are available. |
| |
| [chromium-swarm.appspot.com]: https://chromium-swarm.appspot.com/ |
| |
| The waterfall bots run tests on a single GPU type in order to make it easier to |
| see regressions or flakiness that affect only a certain type of GPU. |
| |
| The tryservers like `win-rel` which include GPU tests, on the other hand, run |
| tests on more than one GPU type. As of this writing, the Windows tryservers run |
| tests on NVIDIA and Intel GPUs; the Mac tryservers run tests on Intel and AMD |
| GPUs. These tryservers' tests are specified simply by *mirroring* |
| how one or more waterfall bots work. This is an inherent property of the |
| [`chromium_trybot` recipe][chromium_trybot.py], which was designed to eliminate |
| differences in behavior between the tryservers and waterfall bots. Since the |
| tryservers mirror waterfall bots, if the waterfall bot is working, the |
| tryserver should almost certainly be working as well. |
| |
| [chromium_trybot.py]: https://chromium.googlesource.com/chromium/tools/build/+/main/recipes/recipes/chromium_trybot.py |
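
The mirroring idea can be pictured with a small Python model. This is only a sketch under stated assumptions: the Intel bot name and the test suite lists are made up for illustration, and the real recipe does considerably more than concatenate test lists.

```python
# Hypothetical CI testers and the suites each one runs (illustrative only).
CI_TESTS = {
    "Win10 x64 Release (NVIDIA)": ["gl_tests", "example_gpu_suite"],
    "Win10 x64 Release (Intel)": ["gl_tests"],  # hypothetical bot name
}


def trybot_tests(mirrors, ci_tests=CI_TESTS):
    """Return (ci_bot, suite) pairs that a trybot mirroring `mirrors` would run."""
    return [(bot, suite) for bot in mirrors for suite in ci_tests[bot]]


if __name__ == "__main__":
    for bot, suite in trybot_tests(list(CI_TESTS)):
        print(f"{suite} on {bot}")
```

Because the trybot's test list is derived from the waterfall bots rather than maintained separately, adding a CI tester to the mirror list is enough to add its GPU type to the tryserver.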
| |
| There are some GPU configurations on the waterfall backed by only one machine, |
| or a very small number of machines in the Swarming pool. A few examples are: |
| |
| <!-- XXX: update this list --> |
| * [Mac Pro FYI Release (AMD)](https://ci.chromium.org/p/chromium/builders/ci/Mac%20Pro%20FYI%20Release%20(AMD)) |
| * [Linux FYI Release (AMD RX 5500 XT)](https://ci.chromium.org/p/chromium/builders/ci/Linux%20FYI%20Release%20(AMD%20RX%205500%20XT)) |
| |
| There are several reasons to continue to support running tests on a |
| specific machine: it might be too expensive to deploy the required multiple |
| copies of said hardware, the hardware pool may have naturally died over time, or |
| the configuration might not be reliable enough to begin scaling it up. |
| |
| ## Adding a new isolated test to the bots |
| |
| Adding a new test step to the bots requires that the test run via an isolate. |
| Isolates describe both the binary and data dependencies of an executable, and |
| are the underpinning of how the Swarming system works. See the [LUCI] |
| documentation for background on [Isolates] and [Swarming]. Note that with the |
| transition towards less Chromium-specific tools, you may see terms such as |
| "CAS inputs" instead of "isolate". These newer systems are functionally |
| identical to the older ones from a user's perspective, so the terms can be |
| safely interchanged. |
| |
| [LUCI]: https://github.com/luci/luci-py |
| [Isolates]: https://github.com/luci/luci-py/blob/master/appengine/isolate/doc/README.md |
| [Swarming]: https://github.com/luci/luci-py/blob/master/appengine/swarming/doc/README.md |
| |
| ### Adding a new isolate |
| |
| 1. Define your target using the `template("test")` template in |
| [`src/testing/test.gni`][testing/test.gni]. See `test("gl_tests")` in |
| [`src/gpu/BUILD.gn`][gpu/BUILD.gn] for an example. For a more complex |
| example which invokes a series of scripts which finally launches the |
| browser, see `telemetry_gpu_integration_test` in |
| [`chrome/test/BUILD.gn`][chrome/test/BUILD.gn]. |
| 2. Add an entry to |
| [`src/testing/buildbot/gn_isolate_map.pyl`][gn_isolate_map.pyl] that refers |
| to your target. Find a similar target to yours in order to determine the |
| `type`. The type is referenced in [`src/tools/mb/mb.py`][mb.py]. |
| |
| [testing/test.gni]: https://chromium.googlesource.com/chromium/src/+/main/testing/test.gni |
| [gpu/BUILD.gn]: https://chromium.googlesource.com/chromium/src/+/main/gpu/BUILD.gn |
| [chrome/test/BUILD.gn]: https://chromium.googlesource.com/chromium/src/+/main/chrome/test/BUILD.gn |
| [gn_isolate_map.pyl]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/gn_isolate_map.pyl |
| [mb.py]: https://chromium.googlesource.com/chromium/src/+/main/tools/mb/mb.py |
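
For orientation, an entry in `gn_isolate_map.pyl` is a Python literal keyed by target name. The sketch below parses a hypothetical entry with `ast.literal_eval` and checks that it has the two fields discussed above. The target name `my_gpu_tests`, its label, and the chosen `type` value are assumptions for illustration; copy the real values from a similar existing target.

```python
import ast

# Hypothetical entry; the label should point at the GN target from step 1.
PYL_FRAGMENT = """
{
  "my_gpu_tests": {
    "label": "//gpu:my_gpu_tests",
    "type": "console_test_launcher",
  },
}
"""


def load_isolate_map(text):
    """Parse a .pyl fragment and verify each entry has a label and a type."""
    entries = ast.literal_eval(text)
    for name, entry in entries.items():
        missing = {"label", "type"} - set(entry)
        if missing:
            raise ValueError(f"{name} is missing {sorted(missing)}")
    return entries


if __name__ == "__main__":
    for name, entry in load_isolate_map(PYL_FRAGMENT).items():
        print(name, "->", entry["label"], f"({entry['type']})")
```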
| |
| At this point you can build and upload your isolate to the isolate server. |
| |
| See [Isolated Testing for SWEs] for the most up-to-date instructions. The |
| instructions here are a copy that shows how to run an isolate that's been uploaded |
| to the isolate server on your local machine rather than on Swarming. |
| |
| [Isolated Testing for SWEs]: https://www.chromium.org/developers/testing/isolated-testing/for-swes |
| |
| If `cd`'d into `src/`: |
| |
| 1. `./tools/mb/mb.py isolate //out/Release [target name]` |
| * For example: `./tools/mb/mb.py isolate //out/Release angle_end2end_tests` |
| 1. `./tools/luci-go/isolate batcharchive -cas-instance chromium-swarm out/Release/[target name].isolated.gen.json` |
| * For example: `./tools/luci-go/isolate batcharchive -cas-instance chromium-swarm out/Release/angle_end2end_tests.isolated.gen.json` |
| See the section below on [isolate server credentials](#Isolate-server-credentials). |
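
The two commands above differ only in the target name and output directory, so they are easy to generate. The helper below is a small convenience sketch, not part of the Chromium tooling: the function name is hypothetical, and it only builds the argument lists shown above without running anything.

```python
import shlex


def isolate_commands(target, out_dir="out/Release"):
    """Return the two command lines from the steps above as argv lists."""
    return [
        ["./tools/mb/mb.py", "isolate", f"//{out_dir}", target],
        [
            "./tools/luci-go/isolate",
            "batcharchive",
            "-cas-instance",
            "chromium-swarm",
            f"{out_dir}/{target}.isolated.gen.json",
        ],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run from src/.
    for argv in isolate_commands("angle_end2end_tests"):
        print(shlex.join(argv))
```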
| |
| ### Adding your new isolate to the tests that are run on the bots |
| |
| See [Adding new steps to the GPU bots] for details on this process. |
| |
| [Adding new steps to the GPU bots]: gpu_testing.md#Adding-new-steps-to-the-GPU-Bots |
| |
| ## Relevant files that control the operation of the GPU bots |
| |
| In the [`chromium/src`][chromium/src] workspace: |
| |
| * [`src/testing/buildbot`][src/testing/buildbot]: |
| * [`chromium.gpu.json`][chromium.gpu.json] and |
| [`chromium.gpu.fyi.json`][chromium.gpu.fyi.json] define which steps are |
| run on which bots. These files are autogenerated. Don't modify them |
| directly! |
| * [`waterfalls.pyl`][waterfalls.pyl], |
| [`test_suites.pyl`][test_suites.pyl], [`mixins.pyl`][mixins.pyl], |
| [`test_suite_exceptions.pyl`][test_suite_exceptions.pyl], and |
| [`buildbot_json_magic_substitutions.py`][buildbot_substitutions.py] |
| define the configuration for the autogenerated json files above. |
| Run [`generate_buildbot_json.py`][generate_buildbot_json.py] to |
| generate the json files after you modify these pyl files. |
| * [`generate_buildbot_json.py`][generate_buildbot_json.py] |
| * The generator script for all the waterfalls, including |
| `chromium.gpu.json` and `chromium.gpu.fyi.json`. |
| * See the [README for generate_buildbot_json.py] for documentation |
| on this script and the descriptions of the waterfalls and test |
| suites. |
| * When modifying this script, don't forget to also run it, to |
| regenerate the JSON files. Don't worry; the presubmit step will |
| catch this if you forget. |
| * See [Adding new steps to the GPU bots] for more details. |
| * [`gn_isolate_map.pyl`][gn_isolate_map.pyl] defines all of the isolates' |
| behavior in the GN build. |
| * [`src/tools/mb/mb_config.pyl`][mb_config.pyl] |
| * Defines the GN arguments for all of the bots. |
| * [`src/infra/config`][src/infra/config]: |
| * Definitions of how bots are organized on the waterfall, |
| how builds are triggered, which VMs or machines are used for the |
| builder itself, i.e. for compilation and scheduling swarmed tasks |
| on GPU hardware, and underlying build configuration details, including |
| which CI bots are mirrored on which trybots. The build config/mirroring |
| information was previously in the [`tools/build`][tools/build] repo, but |
| all GPU-related configurations have been migrated to be fully src-side. |
| See |
| [README.md](https://chromium.googlesource.com/chromium/src/+/main/infra/config/README.md) |
| in this directory for up to date information. |
| |
| [chromium/src]: https://chromium.googlesource.com/chromium/src/ |
| [src/testing/buildbot]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot |
| [src/infra/config]: https://chromium.googlesource.com/chromium/src/+/main/infra/config |
| [chromium.gpu.json]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/chromium.gpu.json |
| [chromium.gpu.fyi.json]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/chromium.gpu.fyi.json |
| [mb_config.pyl]: https://chromium.googlesource.com/chromium/src/+/main/tools/mb/mb_config.pyl |
| [generate_buildbot_json.py]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/generate_buildbot_json.py |
| [mixins.pyl]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/mixins.pyl |
| [waterfalls.pyl]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/waterfalls.pyl |
| [test_suites.pyl]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/test_suites.pyl |
| [test_suite_exceptions.pyl]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/test_suite_exceptions.pyl |
| [buildbot_substitutions.py]: https://chromium.googlesource.com/chromium/src/+/main/testing/buildbot/buildbot_json_magic_substitutions.py |
| [tools/build]: https://chromium.googlesource.com/chromium/tools/build/ |
| [README for generate_buildbot_json.py]: ../../testing/buildbot/README.md |
| |
| In the [`infradata/config`][infradata/config] workspace (Google internal only, |
| sorry): |
| |
| * [`gpu.star`][gpu.star] |
| * Defines a `chromium.tests.gpu` Swarming pool which contains all of the |
| specialized hardware, except some hardware shared with Chromium: |
| for example, the Windows and Linux NVIDIA |
| bots, the Windows AMD bots, and the MacBook Pros with NVIDIA and AMD |
| GPUs. New GPU hardware should be added to this pool. |
| * Also defines the GCEs, Mac VMs and Mac machines used for CI builders |
| on GPU and GPU.FYI waterfalls and trybots. |
| * [`pools.cfg`][pools.cfg] |
| * Defines the Swarming pools for GCEs and Mac VMs used for manually |
| triggered trybots. |
| |
| [infradata/config]: https://chrome-internal.googlesource.com/infradata/config |
| [gpu.star]: https://chrome-internal.googlesource.com/infradata/config/+/main/configs/chromium-swarm/starlark/bots/chromium/gpu.star |
| [chromium.star]: https://chrome-internal.googlesource.com/infradata/config/+/main/configs/chromium-swarm/starlark/bots/chromium/chromium.star |
| [pools.cfg]: https://chrome-internal.googlesource.com/infradata/config/+/main/configs/chromium-swarm/pools.cfg |
| [main.star]: https://chrome-internal.googlesource.com/infradata/config/+/main/main.star |
| [vms.cfg]: https://chrome-internal.googlesource.com/infradata/config/+/main/configs/gce-provider/vms.cfg |
| |
| ## Walkthroughs of various maintenance scenarios |
| |
| This section describes various common scenarios that might arise when |
| maintaining the GPU bots, and how they'd be addressed. |
| |
| ### How to add a new test or an entire new step to the bots |
| |
| This is described in [Adding new tests to the GPU bots]. |
| |
| [Adding new tests to the GPU bots]: https://chromium.googlesource.com/chromium/src/+/main/docs/gpu/gpu_testing.md#Adding-New-Tests-to-the-GPU-Bots |
| |
| ### How to set up new virtual machine instances |
| |
| The tests use virtual machines to build binaries and to trigger tests on |
| physical hardware. VMs don't run any tests themselves. There are 3 types of |
| bots: |
| |
| * Builders - these bots build test binaries, upload them to storage and trigger |
| tester bots (see below). Builds must be done on the same OS on which the |
| tests will run, except for Android tests, which are built on Linux. |
| * Testers - these bots trigger tests to execute in Swarming and merge results |
| from multiple shards. 2-core Linux GCEs are sufficient for this task. |
| * Builder/testers - these are the combination of the above and have the same OS |
| constraints as builders. All trybots are of this type, while for CI bots |
| it is optional. |
| |
| The process is: |
| |
| 1. Follow [go/request-chrome-resources](go/request-chrome-resources) to get |
| approval for the VMs. Use the `GPU` project resource group. |
| See this [example ticket](http://crbug.com/1012805). |
| You'll need to determine how many VMs are required, which OSes, how many |
| cores and in which swarming pools they will be (see below for different |
| scenarios). |
| * If setting up a new GPU hardware pool, some VMs will also be needed |
| for manual trybots, usually 2 VMs as of this writing. |
| * Additional action is needed for Mac VMs: the GPU resource owner will |
| assign the bug to Labs to deploy them. See this |
| [example ticket](http://crbug.com/964355). |
| 1. Once the GCE resource request is approved / Mac VMs are deployed, the VMs need |
| to be added to the right Swarming pools in a CL in the |
| [`infradata/config`][infradata/config] (Google internal) workspace. |
| 1. GCEs for Windows CI builders and builder/testers should be added to |
| `luci-chromium-gpu-ci-win10-8` group in [`gpu.star`][gpu.star]. |
| 1. GCEs for Linux and Android CI builders and builder/testers should be added to |
| `luci-chromium-gpu-ci-xenial-8` group in [`gpu.star`][gpu.star]. |
| 1. VMs for Mac CI builders and builder/testers should be added to |
| `builderfull_gpu_ci_bots` group in [`gpu.star`][gpu.star]. |
| [Example](https://chrome-internal-review.googlesource.com/c/infradata/config/+/1166889). |
| 1. GCEs for CI testers for all OSes should be added to |
| `luci-chromium-gpu-ci-xenial-2` group in [`gpu.star`][gpu.star]. |
| [Example](https://chrome-internal-review.googlesource.com/c/infradata/config/+/2016410). |
| 1. GCEs and VMs for CQ and optional CQ GPU trybots should be added to |
| a corresponding `gpu_try_bots` group in [`gpu.star`][gpu.star]. |
| [Example](https://chrome-internal-review.googlesource.com/c/infradata/config/+/1561384). |
| These trybots are "builderful", i.e. these GCEs can't be shared among |
| different bots. This is done in order to limit the number of concurrent |
| builds on these bots (until [crbug.com/949379](crbug.com/949379) is |
| fixed) to prevent oversubscribing GPU hardware. |
| `win_optional_gpu_tests_rel` is an exception: its GCEs come from |
| `luci-chromium-try-win10-*-8` groups in |
| [`chromium.star`][chromium.star], see |
| [CL](https://chrome-internal-review.googlesource.com/c/infradata/config/+/1708723). |
| This can cause oversubscription of Windows GPU hardware. However, |
| Chrome Infra insisted on making this bot builderless due to frequent |
| interruptions they get from limiting the number of concurrent builds on |
| it; see the discussion in |
| [CL](https://chromium-review.googlesource.com/c/chromium/src/+/1775098). |
| 1. GCEs and VMs for manual GPU trybots should be added to a corresponding |
| pool in "Manually-triggered GPU trybots" in [`gpu.star`][gpu.star]. |
| If adding a new pool, it should also be added to |
| [`pools.cfg`][pools.cfg]. |
| [Example](https://chrome-internal-review.googlesource.com/c/infradata/config/+/2433332). |
| This is a different mechanism to limit the load on GPU hardware, |
| by having a small pool of GCEs which corresponds to some GPU hardware |
| resource, and all trybots that target this GPU hardware compete for |
| GCEs from this small pool. |
| 1. Run [`main.star`][main.star] to regenerate |
| `configs/chromium-swarm/bots.cfg` and `configs/gce-provider/vms.cfg`. |
| Double-check your work there. |
| Note that previously [`vms.cfg`][vms.cfg] had to be edited manually. |
| Part of the difficulty was in choosing a zone. This should soon no |
| longer be necessary per [crbug.com/942301](http://crbug.com/942301), |
| but consult with the Chrome Infra team to find out which of the |
| [zones](https://cloud.google.com/compute/docs/regions-zones/) has |
| available capacity. This can also be checked on the Viceroy |
| [dashboard](https://viceroy.corp.google.com/chrome_infra/Quota/chrome?duration=7d). |
| 1. Get this reviewed and landed. This step associates the VM or pool of VMs |
| with the bot's name on the waterfall for "builderful" bots or increases |
| swarmed pool capacity for "builderless" bots. |
| Note: CR+1 is not sticky in this repo, so you'll have to ping for |
| re-review after every change, such as a rebase. |
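
To see why a small shared pool of GCEs limits load on the GPU hardware, consider the toy scheduling model below. It is purely illustrative: it assumes all builds are queued at time zero, that each build occupies exactly one VM for a fixed number of minutes, and that a queued build takes whichever VM frees up first. None of these names or numbers come from the real configuration.

```python
import heapq


def simulate(pool_size, build_minutes):
    """Greedy schedule: each queued build takes the next VM that becomes free.

    Returns a list of (start, end) times, one per build, in submission order.
    """
    free_at = [0] * pool_size  # min-heap of the time each VM becomes free
    heapq.heapify(free_at)
    schedule = []
    for minutes in build_minutes:
        start = heapq.heappop(free_at)
        end = start + minutes
        heapq.heappush(free_at, end)
        schedule.append((start, end))
    return schedule


def peak_concurrency(schedule):
    """Return the largest number of builds running at the same time."""
    # Ends sort before starts at the same timestamp because -1 < 1.
    events = sorted(
        [(start, 1) for start, _ in schedule] + [(end, -1) for _, end in schedule]
    )
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


if __name__ == "__main__":
    schedule = simulate(pool_size=2, build_minutes=[30, 30, 30, 30])
    print(schedule)
    print("peak concurrent builds:", peak_concurrency(schedule))
```

However many trybots submit builds, the number running at once never exceeds the pool size, so the number of Swarming tasks hitting the matching GPU hardware stays bounded.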
| |
| ### How to add a new tester bot to the chromium.gpu.fyi waterfall |
| |
| When deploying a new GPU configuration, it should be added to the |
| chromium.gpu.fyi waterfall first. The chromium.gpu waterfall should be reserved |
| for those GPUs which are tested on the commit queue. (Some of the bots violate |
| this rule – namely, the Debug bots – though we should strive to eliminate these |
| differences.) Once the new configuration is ready to be fully deployed on |
| tryservers, bots can be added to the chromium.gpu waterfall, and the tryservers |
| changed to mirror them. |
| |
| In order to add Release and Debug waterfall bots for a new configuration, |
| experience has shown that at least 4 physical machines are needed in the |
| swarming pool. The reason is that the tests all run in parallel on the Swarming |
| cluster, so the load induced on the swarming bots is higher than it would be |
| if the tests were run strictly serially. |
| |
| With these prerequisites, these are the steps to add a new (swarmed) tester bot. |
| (Actually, a pair of bots: Release and Debug. If deploying just one or the |
| other, ignore the other configuration.) These instructions assume that you are |
| reusing one of the existing builders, like [`GPU FYI Win Builder`][GPU FYI Win |
| Builder]. |
| |
| 1. Work with the Chrome Infrastructure Labs team to get the (minimum 4) |
| physical machines added to the Swarming pool. Use |
| [chromium-swarm.appspot.com] or `src/tools/luci-go/swarming bots` |
| to determine the PCI IDs of the GPUs in the bots. (These instructions will |
| need to be updated for Android bots which don't have PCI buses.) |
| |
| 1. Make sure to add these new machines to the chromium.tests.gpu Swarming |
| pool by creating a CL against [`gpu.star`][gpu.star] in the |
| [`infradata/config`][infradata/config] (Google internal) workspace. |
| Configure Git's `user.email` to your @google.com address if necessary. Here is one |
| [example CL](https://chrome-internal-review.googlesource.com/913528) |
| and a |
| [second example](https://chrome-internal-review.googlesource.com/1111456). |
| |
| 1. Run [`main.star`][main.star] to regenerate |
| `configs/chromium-swarm/bots.cfg`. Double-check your work there. |
| |
| 1. Allocate new virtual machines for the bots as described in [How to set up |
| new virtual machine |
| instances](#How-to-set-up-new-virtual-machine-instances). |
| |
| 1. Create a CL in the Chromium workspace which does the following. Here's an |
| [example CL](https://chromium-review.googlesource.com/c/chromium/src/+/1752291). |
| 1. Adds the new machines to [`waterfalls.pyl`][waterfalls.pyl] directly or |
| to [`mixins.pyl`][mixins.pyl], referencing the new mixin in |
| [`waterfalls.pyl`][waterfalls.pyl]. |
| 1. The swarming dimensions are crucial. These must match the GPU and |
| OS type of the physical hardware in the Swarming pool. This is what |
| causes the VMs to spawn their tests on the correct hardware. Make |
| sure to use the chromium.tests.gpu pool, and that the new machines |
| were specifically added to that pool. |
| 1. Make triply sure that there are no collisions between the new |
| hardware you're adding and hardware already in the Swarming pool. |
| For example, it used to be the case that all of the Windows NVIDIA |
| bots ran the same OS version. Later, the Windows 8 flavor bots were |
| added. In order to avoid accidentally running tests on Windows 8 |
| when Windows 7 was intended, the OS in the swarming dimensions of |
| the Win7 bots had to be changed from `win` to |
| `Windows-2008ServerR2-SP1` (the Win7-like flavor running in our |
| data center). Similarly, the Win8 bots had to have a very precise |
| OS description (`Windows-2012ServerR2-SP0`). |
| 1. If you're deploying a new bot that's similar to another existing |
| configuration, please search around in |
| [`test_suite_exceptions.pyl`][test_suite_exceptions.pyl] for |
| references to the other bot's name and see if your new bot needs |
| to be added to any exclusion lists. For example, some of the tests |
| don't run on certain Win bots because of missing OpenGL extensions. |
| 1. Run [`generate_buildbot_json.py`][generate_buildbot_json.py] to |
| regenerate `src/testing/buildbot/chromium.gpu.fyi.json`. |
| 1. Updates [`chromium.gpu.star`][chromium.gpu.star] or |
| [`chromium.gpu.fyi.star`][chromium.gpu.fyi.star] and its related |
| generated files [`cr-buildbucket.cfg`][cr-buildbucket.cfg], |
| [`luci-scheduler.cfg`][luci-scheduler.cfg], and |
| [`luci-milo.cfg`][luci-milo.cfg]: |
| * Use the appropriate definition for the type of the bot being added, |
| for example, `ci.gpu_fyi_thin_tester()` should be used for all CI |
| tester bots on GPU FYI waterfall. |
| * Make sure to set `triggered_by` property to the builder which |
| triggers the testers (like `'GPU FYI Win Builder'`). |
| * Include a `ci.console_view_entry` for the builder's |
| `console_view_entry` argument. Look at the short names and |
| categories to try and come up with a reasonable organization. |
| * Make sure to set the `serialize_tests` property to `True` in the |
| builder config. This is specified for waterfall bots but not trybots |
| and helps avoid overloading the physical hardware. Additionally, |
| if the bot is configured as a split builder/tester pair, ensure that |
| the tester's builder config matches the parent builder and the |
| tester is marked as being triggered by the parent builder. |
| 1. Run `main.star` in [`src/infra/config`][src/infra/config] to update the |
| generated files. Double-check your work there. |
| 1. If you were adding a new builder, you would need to also add the new |
| machine to [`src/tools/mb/mb_config.pyl`][mb_config.pyl]. |
| |
| 1. If the number of physical machines for the new bot permits, you should also |
| add a manually-triggered trybot at the same time that the CI bot is added. |
| This is described in [How to add a new manually-triggered trybot]. |
| |
| While the above instructions assume that an existing parent builder will |
| be used, a new one can be set up by doing the same steps, but also adding the |
| new parent builder at the same time. There are plenty of existing parent |
| builder/child tester bots that you can use as a reference. |
| |
| [How to add a new manually-triggered trybot]: https://chromium.googlesource.com/chromium/src/+/main/docs/gpu/gpu_testing_bot_details.md#How-to-add-a-new-manually_triggered-trybot |
| |
| [chromium.gpu.star]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/subprojects/ci/chromium.gpu.star |
| [chromium.gpu.fyi.star]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/subprojects/ci/chromium.gpu.fyi.star |
| [cr-buildbucket.cfg]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/generated/cr-buildbucket.cfg |
| [luci-scheduler.cfg]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/generated/luci-scheduler.cfg |
| [luci-milo.cfg]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/generated/luci-milo.cfg |
| [GPU FYI Win Builder]: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/GPU%20FYI%20Win%20Builder |
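
The relationship between a mixin and the machine entry that references it can be sketched as follows. This is a simplified model and not the actual logic of `generate_buildbot_json.py`: the mixin name, the machine name, and the dimension values are placeholders, and the real `waterfalls.pyl` nests machines inside a waterfall entry with more fields.

```python
import ast

# Hypothetical mixin carrying the Swarming dimensions for the new hardware.
MIXINS_PYL = """
{
  'coolnewgpu_win10_stable': {
    'swarming': {
      'dimensions': {
        'gpu': '10de:2184',
        'os': 'Windows-10',
        'pool': 'chromium.tests.gpu',
      },
    },
  },
}
"""

# Hypothetical, heavily trimmed machine entry that references the mixin.
MACHINES_PYL = """
{
  'Win10 FYI x64 Release (CoolNewGPUType)': {
    'mixins': ['coolnewgpu_win10_stable'],
  },
}
"""


def resolved_dimensions(machine_name):
    """Merge the dimensions of every mixin a machine lists, in order."""
    mixins = ast.literal_eval(MIXINS_PYL)
    machines = ast.literal_eval(MACHINES_PYL)
    dimensions = {}
    for mixin_name in machines[machine_name]["mixins"]:
        dimensions.update(mixins[mixin_name]["swarming"]["dimensions"])
    return dimensions


if __name__ == "__main__":
    print(resolved_dimensions("Win10 FYI x64 Release (CoolNewGPUType)"))
```

Checking the resolved dimensions against the physical machines in the pool before landing the CL is a cheap way to catch the mismatches and collisions described in the steps above.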
| |
| ### How to remove an existing bot from the chromium.gpu.fyi waterfall |
| |
| Basically, one needs to follow the steps in |
| [How to add a new tester bot to the chromium.gpu.fyi waterfall](#how-to-add-a-new-tester-bot-to-the-chromium_gpu_fyi-waterfall) |
| in reverse. |
| To prevent bot failures during the deletion process, pause the bot on |
| https://luci-scheduler.appspot.com/. |
| |
| ### How to start running tests on a new GPU type on an existing try bot |
| |
| Let's say that you want to cause the `win-rel` try bot to run tests on |
| CoolNewGPUType in addition to the types it currently runs (as of this writing |
| only NVIDIA). To do this: |
| |
| 1. Make sure there is enough hardware capacity using the available tools to |
| report utilization of the Swarming pool. |
| 1. Deploy Release and Debug testers on the `chromium.gpu` waterfall, following |
| the instructions for the `chromium.gpu.fyi` waterfall above. Make sure |
| the flakiness on the new bots is comparable to existing `chromium.gpu` bots |
| before proceeding. |
| 1. Create a CL in the [`chromium/src`][chromium/src] workspace that adds the |
| new Release tester to `win-rel`'s `mirrors` list. Rerun |
| `infra/config/main.star`. |
| 1. Once the above CL lands, the commit queue will **immediately** start |
| running tests on the CoolNewGPUType configuration. Be vigilant and make |
| sure that tryjobs are green. If they are red for any reason, revert the CL |
| and figure out offline what went wrong. |
| |
| ### How to add a new manually-triggered trybot |
| |
| Manually-triggered trybots are needed for investigating failures on a GPU type |
| which doesn't have a corresponding CQ trybot (due to lack of GPU resources). |
| Even for GPU types that have CQ trybots, it is convenient to have |
| manually-triggered trybots as well, since the CQ trybot often runs on more than |
| one GPU type, or some test suites which run on CI bot can be disabled on CQ |
| trybot (when the CQ bot has |
| [no CI equivalent](https://chromium.googlesource.com/chromium/src/+/main/docs/gpu/gpu_testing_bot_details.md#how-to-add-a-new-try-bot-that-runs-a-subset-of-tests-or-extra-tests)). |
| Thus, all CI bots in `chromium.gpu` and `chromium.gpu.fyi` have corresponding |
| manually-triggered trybots, except a few which don't have enough hardware |
| to support it. A manually-triggered trybot should be added at the same time |
| a CI bot is added. |
| |
| Here are the steps to set up a new trybot which runs tests just on one |
| particular GPU type. Let's consider that we are adding a manually-triggered |
| trybot for the Win7 NVIDIA GPUs in Release mode. We will call the new bot |
| `gpu-fyi-try-win7-nvidia-rel-64`. |
| |
| 1. If a manually-triggered trybot already exists which runs tests on |
| the same group of machines (i.e. same GPU, OS and driver), the new trybot |
| will have to share the VMs with it. Otherwise, create a new pool of VMs for |
| the new hardware and allocate the VMs as described in |
| [How to set up new virtual machine instances](#How-to-set-up-new-virtual-machine-instances), |
| following the "Manually-triggered GPU trybots" instructions. |
| |
| 1. Create a CL in the Chromium workspace which does the following. Here's a |
| [reference CL](https://chromium-review.googlesource.com/c/chromium/src/+/2191276) |
| exemplifying the new "GCE pool per GPU hardware pool" way. |
| 1. Updates [`gpu.try.star`][gpu.try.star] and its related generated file |
| [`cr-buildbucket.cfg`][cr-buildbucket.cfg]: |
| * Add the new trybot with the right `builder` definition and VM pool. |
| For `gpu-fyi-try-win7-nvidia-rel-64` this would be |
| `gpu_win_builder()` and `luci.chromium.gpu.win7.nvidia.try`. |
| * Add the relevant CI bots to the new trybot's `mirrors` list. If the |
| CI tester has a parent builder, the parent should be in the list as |
| well. |
| 1. Run `main.star` in [`src/infra/config`][src/infra/config] to update the |
| generated files. Double-check your work there. |
| 1. Adds the new trybot to [`src/tools/mb/mb_config.pyl`][mb_config.pyl] |
| and [`src/tools/mb/mb_config_buckets.pyl`][mb_config_buckets.pyl]. |
| Use the same mixin as the builder for the CI bot this trybot |
| mirrors; in the case of `gpu-fyi-try-win7-nvidia-rel-64` this is |
| `GPU FYI Win x64 Builder` and thus `gpu_fyi_tests_release_trybot`. |
| 1. Get this CL reviewed and landed. |
| |
| At this point the new trybot should automatically show up in the |
| "Choose tryjobs" pop-up in the Gerrit UI, under the |
| `luci.chromium.try` heading, because it was deployed via LUCI. It |
| should be possible to send a CL to it. |
| |
| (It should not be necessary to modify buildbucket.config as is |
| mentioned at the bottom of the "Choose tryjobs" pop-up. Contact the |
| chrome-infra team if this doesn't work as expected.) |
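
The "same mixin as the CI builder" rule above can be sanity-checked before uploading the CL. The following is a minimal sketch, not part of the Chromium tooling: it assumes `mb_config.pyl` is a Python literal with a top-level `builder_groups` dictionary mapping group name to builder name to config name, and the group names `chromium.gpu.fyi` and `tryserver.chromium.win` in the sample are assumptions chosen for illustration. The helper names are hypothetical.

```python
import ast


def mb_config_name(mb_config, group, builder):
    """Look up the config name recorded for a builder (assumed layout)."""
    return mb_config["builder_groups"][group][builder]


def trybot_matches_ci(mb_config_text, try_entry, ci_entry):
    """Return True if the trybot and the CI builder share one config name.

    try_entry and ci_entry are (group, builder) tuples.
    """
    mb_config = ast.literal_eval(mb_config_text.strip())
    return (mb_config_name(mb_config, *try_entry)
            == mb_config_name(mb_config, *ci_entry))


# Hypothetical excerpt in the assumed mb_config.pyl shape.
SAMPLE_MB_CONFIG = """{
  'builder_groups': {
    'chromium.gpu.fyi': {
      'GPU FYI Win x64 Builder': 'gpu_fyi_tests_release_trybot',
    },
    'tryserver.chromium.win': {
      'gpu-fyi-try-win7-nvidia-rel-64': 'gpu_fyi_tests_release_trybot',
    },
  },
}"""

if __name__ == "__main__":
    ok = trybot_matches_ci(
        SAMPLE_MB_CONFIG,
        ("tryserver.chromium.win", "gpu-fyi-try-win7-nvidia-rel-64"),
        ("chromium.gpu.fyi", "GPU FYI Win x64 Builder"),
    )
    print("trybot matches CI builder:", ok)
```

If the real file uses a different top-level key or group names, only `mb_config_name` and the tuples passed in need to change.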
| |
| [gpu.try.star]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/subprojects/gpu.try.star |
| [luci.chromium.try.star]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/consoles/luci.chromium.try.star |
| [tryserver.chromium.win.star]: https://chromium.googlesource.com/chromium/src/+/main/infra/config/consoles/tryserver.chromium.win.star |
| |
| |
| ### How to add a new try bot that runs a subset of tests or extra tests |
| |
| Several projects (ANGLE, Dawn) run custom tests using the Chromium recipes. They |
| use trybot configs that run subsets of the Chromium tests or additional slower |
| tests that can't be run on the main CQ. |
| |
| These trybots are a little different because they do not mirror any waterfall |
| bots. |
| |
| Let's say the `android_optional_gpu_tests_rel` bot did not exist yet and you |
| wanted to add it. The process is similar to adding a CI bot, but modifies |
| slightly different files. |
| |
| 1. Allocate new virtual machines for the bots as described in |
| [How to set up new virtual machine instances](#How-to-set-up-new-virtual-machine-instances). |
| 1. Make sure there is enough hardware capacity using the available tools to |
| report utilization of the Swarming pool. |
| 1. Create a CL in the Chromium workspace that does the following. |
| 1. Add your new bot `android_optional_gpu_tests_rel` to the |
| tryserver.chromium.android waterfall in |
| [`waterfalls.pyl`][waterfalls.pyl]. |
| [Here][android_optional_waterfalls_pyl] is an explicit example using the |
| real bot. |
| 1. Re-run |
| [`src/testing/buildbot/generate_buildbot_json.py`][generate_buildbot_json.py] |
| to regenerate the JSON files. |
| 1. Add the bot definition to the relevant tryserver `.star` file. In this |
| example, that would be `tryserver.chromium.android.star`. |
| [Here][android_optional_tryserver_star] is an explicit example using the |
| real bot. |
| 1. Run `main.star` in [`src/infra/config`][src/infra/config] to update the |
| generated files: [`luci-milo.cfg`][luci-milo.cfg], |
| [`luci-scheduler.cfg`][luci-scheduler.cfg], |
| [`cr-buildbucket.cfg`][cr-buildbucket.cfg]. Double-check your work |
| there. |
| 1. Update [`src/tools/mb/mb_config.pyl`][mb_config.pyl] |
| to include `android_optional_gpu_tests_rel`. |
| 1. After your CL lands you should be able to find and run |
| `android_optional_gpu_tests_rel` on CLs using Choose Trybots in Gerrit. |
| |
| [android_optional_waterfalls_pyl]: https://chromium.googlesource.com/chromium/src/+/024e15a6bb8b2e74ba3a5782831be6a1c11ddf43/testing/buildbot/waterfalls.pyl#6665 |
| [android_optional_tryserver_star]: https://chromium.googlesource.com/chromium/src/+/b05a42bd3d0e84d55392ae984a69946e56203c71/infra/config/subprojects/chromium/try/tryserver.chromium.android.star#703 |
| |
| |
| ### How to test and deploy a driver and/or OS update |
| |
| Let's say that you want to roll out an update to the graphics drivers or the OS |
| on one of the configurations like the Linux NVIDIA bots. In order to verify |
| that the new driver or OS won't destabilize Chromium's commit queue, |
| it's necessary to run the new driver or OS on one of the waterfalls for a day |
| or two to make sure the tests are reliably green before rolling out the driver |
| or OS update. To do this: |
| |
| 1. Make sure that all of the current Swarming jobs for this OS and GPU |
| configuration are targeted at the "stable" version of the driver and the OS |
| in [`waterfalls.pyl`][waterfalls.pyl] and [`mixins.pyl`][mixins.pyl]. |
| 1. File a `Build Infrastructure` bug, component `Infra>Labs`, to have ~4 of |
| the physical machines already in the Swarming pool upgraded to the new |
| version of the driver or the OS. |
| 1. If an "experimental" version of this bot doesn't yet exist, follow the |
| instructions above for [How to add a new tester bot to the chromium.gpu.fyi |
| waterfall](#How-to-add-a-new-tester-bot-to-the-chromium_gpu_fyi-waterfall) |
| to deploy one. However, you do not need to request additional GCE resources |
| since there should be enough spare capacity in the GPU builderless pool to |
| handle them. Additionally, ensure that the bot definition in |
| [`chromium.gpu.fyi.star`][chromium.gpu.fyi.star] includes a `list_view` |
| argument specifying `chromium.gpu.experimental`. |
| 1. If an "experimental" version does already exist, re-add it to its default |
| console in [`chromium.gpu.fyi.star`][chromium.gpu.fyi.star] by uncommenting |
| its `console_view_entry` argument and unpause it in the [luci scheduler]. |
| 1. Have this experimental bot target the new version of the driver or the OS |
| in [`waterfalls.pyl`][waterfalls.pyl] and [`mixins.pyl`][mixins.pyl]. |
| [Sample CL][sample driver cl]. |
| 1. Hopefully, the new machine will pass the pixel tests. If it doesn't, then |
| it'll be necessary to follow the instructions on |
| [updating Gold baselines (step #4)][updating gold baselines]. |
| 1. Watch the new machine for a day or two to make sure it's stable. |
| 1. When it is, add the experimental driver/OS to the `_stable` mixin using the |
| swarming OR operator `|`. For example: |
| |
| ``` |
| 'win10_intel_hd_630_stable': { |
| 'swarming': { |
| 'dimensions': { |
| 'gpu': '8086:5912-26.20.100.7870|8086:5912-26.20.100.8141', |
| 'os': 'Windows-10', |
| 'pool': 'chromium.tests.gpu', |
| }, |
| }, |
| } |
| ``` |
| |
| This will cause tests triggered using the `_stable` mixin to run on either |
| the old stable dimension or the experimental/new stable dimension. |
| |
| **NOTE** There is a hard cap of 8 combinations in swarming, so you can only |
| use the OR operator in up to 3 dimensions if each dimension only has two |
| options. More than two options per dimension is allowed as long as the total |
| number of combinations is 8 or less. |
| 1. After it lands, ask the Chrome Infrastructure Labs team to roll out the |
| driver update across all of the similarly configured bots in the swarming |
| pool. |
| 1. If necessary, update pixel test expectations and remove the suppressions |
| added above. |
| 1. Remove the old driver or OS version from the `_stable` mixin, leaving just |
| the new stable version. |
| 1. Clean up the "experimental" version of the bot by pausing it in the |
| [luci scheduler] and commenting out its `console_view_entry` argument in |
| [`chromium.gpu.fyi.star`][chromium.gpu.fyi.star]. |
| |
| Note that we leave the experimental bot in place. We could reclaim it, but it |
| seems worthwhile to continuously test the "next" version of graphics drivers as |
| well as the current stable ones. |
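
The 8-combination cap on OR'd dimensions mentioned in the note above can be checked mechanically before a mixin change is uploaded. The following is a minimal sketch, assuming that each `|` in a dimension value separates one alternative and that the total is the product of the per-dimension option counts. The function and constant names are hypothetical and are not part of the Swarming or Chromium tooling.

```python
from math import prod

# Hard cap quoted in the note above.
SWARMING_MAX_COMBINATIONS = 8


def count_dimension_combinations(dimensions):
    """Count the concrete dimension sets that an OR'd mixin expands to."""
    return prod(len(value.split("|")) for value in dimensions.values())


def within_swarming_cap(dimensions):
    """Return True if the expanded combinations fit under the cap."""
    return count_dimension_combinations(dimensions) <= SWARMING_MAX_COMBINATIONS


# Dimensions copied from the win10_intel_hd_630_stable example above.
STABLE_DIMENSIONS = {
    "gpu": "8086:5912-26.20.100.7870|8086:5912-26.20.100.8141",
    "os": "Windows-10",
    "pool": "chromium.tests.gpu",
}

if __name__ == "__main__":
    print(count_dimension_combinations(STABLE_DIMENSIONS))
    print(within_swarming_cap(STABLE_DIMENSIONS))
```

With two GPU alternatives and a single OS and pool, the example mixin expands to two combinations, well under the cap. Three dimensions with two options each would reach exactly eight.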
| |
| [sample driver cl]: https://chromium-review.googlesource.com/c/chromium/src/+/1726875 |
| [updating gold baselines]: http://go/gpu-pixel-wrangler-info#how-to-keep-the-bots-green |
| [luci scheduler]: https://luci-scheduler.appspot.com/ |
| |
| ## Credentials for various servers |
| |
| Working with the GPU bots requires credentials to various services: the isolate |
| server, the swarming server, and cloud storage. |
| |
| ### Isolate server credentials |
| |
| To upload and download isolates you must first authenticate to the isolate |
| server. From a Chromium checkout, run: |
| |
| * `./src/tools/luci-go/isolate login` |
| |
| This will open a web browser to complete the authentication flow. A @google.com |
| email address is required in order to properly authenticate. |
| |
| To test your authentication, find a hash for a recent isolate. Consult the |
| instructions on [Running Binaries from the Bots Locally] to find a random hash |
| from a target like `gl_tests`. Then run the following: |
| |
| [Running Binaries from the Bots Locally]: https://www.chromium.org/developers/testing/gpu-testing#TOC-Running-Binaries-from-the-Bots-Locally |
| |
| ### Swarming server credentials |
| |
| The swarming server uses the same `auth.py` script as the isolate server. You |
| will need to authenticate if you want to manually download the results of |
| previous swarming jobs, trigger your own jobs, or run `swarming.py reproduce` |
| to re-run a remote job on your local workstation. Follow the instructions |
| above, replacing the service with `https://chromium-swarm.appspot.com`. |
| |
| ### Cloud storage credentials |
| |
| Authentication to Google Cloud Storage is needed for a couple of reasons: |
| uploading pixel test results to the cloud, and potentially uploading and |
| downloading builds as well, at least in Debug mode. Use the copy of gsutil in |
| `depot_tools/third_party/gsutil/gsutil`, and follow the [Google Cloud Storage |
| instructions] to authenticate. You must use your @google.com email address and |
| be a member of the Chrome GPU team in order to receive read-write access to the |
| appropriate cloud storage buckets. Roughly: |
| |
| 1. Run `gsutil config` |
| 2. Copy/paste the URL into your browser |
| 3. Log in with your @google.com account |
| 4. Allow the app to access the information it requests |
| 5. Copy-paste the resulting key back into your Terminal |
| 6. Press "enter" when prompted for a project-id (i.e., leave it empty) |
| |
| At this point you should be able to write to the cloud storage bucket. |
| |
| Navigate to |
| <https://console.developers.google.com/storage/chromium-gpu-archive> to view |
| the contents of the cloud storage bucket. |
| |
| [Google Cloud Storage instructions]: https://developers.google.com/storage/docs/gsutil |