# GPU Testing

This set of pages documents the setup and operation of the GPU bots and try
servers, which verify the correctness of Chrome's graphically accelerated
rendering pipeline.

[TOC]

## Overview

The GPU bots run a different set of tests than the majority of the Chromium
test machines. The GPU bots specifically focus on tests which exercise the
graphics processor, and whose results are likely to vary between graphics card
vendors.

Most of the tests on the GPU bots are run via the [Telemetry framework].
Telemetry was originally conceived as a performance testing framework, but has
proven valuable for correctness testing as well. Telemetry directs the browser
to perform various operations, like page navigation and test execution, from
external scripts written in Python. The GPU bots launch the full Chromium
browser via Telemetry for the majority of the tests. Using the full browser to
execute tests, rather than smaller test harnesses, has yielded several
advantages: testing what is shipped, improved reliability, and improved
performance.

[Telemetry framework]: https://github.com/catapult-project/catapult/tree/master/telemetry

A subset of the tests, called "pixel tests", grab screen snapshots of the web
page in order to validate Chromium's rendering architecture end-to-end. Where
necessary, GPU-specific results are maintained for these tests. Some of these
tests verify just a few pixels, using handwritten code, in order to use the
same validation for all brands of GPUs.

The GPU bots use the Chrome infrastructure team's [recipe framework], and
specifically the [`chromium`][recipes/chromium] and
[`chromium_trybot`][recipes/chromium_trybot] recipes, to describe what tests to
execute. Compared to the legacy master-side buildbot scripts, recipes make it
easy to add new steps to the bots, change the bots' configuration, and run the
tests locally in the same way that they are run on the bots. Additionally, the
`chromium` and `chromium_trybot` recipes make it possible to send try jobs which
add new steps to the bots. This single capability is a huge step forward from
the previous configuration, where new steps were added blindly and could cause
failures on the tryservers. For more details about the configuration of the
bots, see the [GPU bot details].

[recipe framework]: https://chromium.googlesource.com/external/github.com/luci/recipes-py/+/master/doc/user_guide.md
[recipes/chromium]: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipes/chromium.py
[recipes/chromium_trybot]: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipes/chromium_trybot.py
[GPU bot details]: gpu_testing_bot_details.md

The physical hardware for the GPU bots lives in the Swarming pool\*. The
Swarming infrastructure ([new docs][new-testing-infra], [older but currently
more complete docs][isolated-testing-infra]) provides many benefits:

* Increased parallelism for the tests; all steps for a given tryjob or
  waterfall build run in parallel.
* Simpler scaling: just add more hardware to get more capacity. No manual
  configuration or distribution of hardware is needed.
* Easier to run certain tests only on certain operating systems or types of
  GPUs.
* Easier to add new operating systems or types of GPUs.
* Clearer description of the binary and data dependencies of the tests. If
  they run successfully locally, they'll run successfully on the bots.

(\* All but a few one-off GPU bots are in the Swarming pool. The exceptions to
the rule are described in the [GPU bot details].)

The bots on the [chromium.gpu.fyi] waterfall are configured to always test
top-of-tree ANGLE. This setup is done with a few lines of code in the
[tools/build workspace]; search the code for "angle".

These aspects of the bots are described in more detail below, and in linked
pages. There is a [presentation][bots-presentation] which gives a brief
overview of this documentation and links back to various portions.

<!-- XXX: broken link -->
[new-testing-infra]: https://github.com/luci/luci-py/wiki
[isolated-testing-infra]: https://www.chromium.org/developers/testing/isolated-testing/infrastructure
[chromium.gpu]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console
[chromium.gpu.fyi]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console
[tools/build workspace]: https://code.google.com/p/chromium/codesearch#chromium/build/scripts/slave/recipe_modules/chromium_tests/chromium_gpu_fyi.py
[bots-presentation]: https://docs.google.com/presentation/d/1BC6T7pndSqPFnituR7ceG7fMY7WaGqYHhx5i9ECa8EI/edit?usp=sharing

## Fleet Status

Please see the [GPU Pixel Wrangling instructions] for links to dashboards
showing the status of various bots in the GPU fleet.

[GPU Pixel Wrangling instructions]: pixel_wrangling.md#Fleet-Status

## Using the GPU Bots

Most Chromium developers interact with the GPU bots in two ways:

1. Observing the bots on the waterfalls.
2. Sending try jobs to them.

The GPU bots are grouped on the [chromium.gpu] and [chromium.gpu.fyi]
waterfalls. Their current status can be easily observed there.

To send try jobs, you must first upload your CL to the codereview server. Then,
either click the "CQ dry run" link or run the following from the command line
to send your job to the default set of try servers:

```sh
git cl try
```

The GPU tests are part of the default set for Chromium CLs, and are run as part
of the following tryservers' jobs:

* [linux_chromium_rel_ng] on the [tryserver.chromium.linux] waterfall
* [mac_chromium_rel_ng] on the [tryserver.chromium.mac] waterfall
* [win_chromium_rel_ng] on the [tryserver.chromium.win] waterfall

[linux_chromium_rel_ng]: http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng?numbuilds=100
[mac_chromium_rel_ng]: http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng?numbuilds=100
[win_chromium_rel_ng]: http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng?numbuilds=100
[tryserver.chromium.linux]: http://build.chromium.org/p/tryserver.chromium.linux/waterfall?numbuilds=100
[tryserver.chromium.mac]: http://build.chromium.org/p/tryserver.chromium.mac/waterfall?numbuilds=100
[tryserver.chromium.win]: http://build.chromium.org/p/tryserver.chromium.win/waterfall?numbuilds=100

Scan down through the steps looking for the text "GPU"; that identifies the
tests run on the GPU bots. For each test, the "trigger" step can be ignored; the
step further down with the same test name contains the results.

It's usually not necessary to explicitly send try jobs just for verifying GPU
tests. If you want to, you must invoke `git cl try` separately for each
tryserver master you want to reference, for example:

```sh
git cl try -b linux_chromium_rel_ng
git cl try -b mac_chromium_rel_ng
git cl try -b win_chromium_rel_ng
```

Alternatively, the Gerrit UI can be used to send a patch set to these try
servers.

Three optional tryservers are also available which run additional tests. As of
this writing, they run longer-running tests that can't run against all Chromium
CLs due to lack of hardware capacity. They are automatically included in the
set of tryservers for code changes to certain sub-directories.

* [linux_optional_gpu_tests_rel] on the [luci.chromium.try] waterfall
* [mac_optional_gpu_tests_rel] on the [luci.chromium.try] waterfall
* [win_optional_gpu_tests_rel] on the [luci.chromium.try] waterfall

[linux_optional_gpu_tests_rel]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_optional_gpu_tests_rel
[mac_optional_gpu_tests_rel]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_optional_gpu_tests_rel
[win_optional_gpu_tests_rel]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel

Tryservers for the [ANGLE project] are also present on the
[tryserver.chromium.angle] waterfall. These are invoked from the Gerrit user
interface. They are configured similarly to the tryservers for regular Chromium
patches, and run the same tests that are run on the [chromium.gpu.fyi]
waterfall, in the same way (e.g., against ToT ANGLE).

If you find it necessary to try patches against sub-repositories other than
Chromium (`src/`) and ANGLE (`src/third_party/angle/`), please
[file a bug](http://crbug.com/new) with component Internals\>GPU\>Testing.

[ANGLE project]: https://chromium.googlesource.com/angle/angle/+/master/README.md
[tryserver.chromium.angle]: https://build.chromium.org/p/tryserver.chromium.angle/waterfall
[file a bug]: http://crbug.com/new

## Running the GPU Tests Locally

All of the GPU tests running on the bots can be run locally from a Chromium
build. Many of the tests are simple executables:

* `angle_unittests`
* `content_gl_tests`
* `gl_tests`
* `gl_unittests`
* `tab_capture_end2end_tests`

Some run only on the chromium.gpu.fyi waterfall, either because there isn't
enough machine capacity at the moment, or because they're closed-source tests
which aren't allowed to run on the regular Chromium waterfalls:

* `angle_deqp_gles2_tests`
* `angle_deqp_gles3_tests`
* `angle_end2end_tests`
* `audio_unittests`

The remaining GPU tests are run via Telemetry. In order to run them, just
build the `chrome` target and then
invoke `src/content/test/gpu/run_gpu_integration_test.py` with the appropriate
argument. The tests this script can invoke are
in `src/content/test/gpu/gpu_tests/`. For example:

* `run_gpu_integration_test.py context_lost --browser=release`
* `run_gpu_integration_test.py pixel --browser=release`
* `run_gpu_integration_test.py webgl_conformance --browser=release --webgl-conformance-version=1.0.2`
* `run_gpu_integration_test.py maps --browser=release`
* `run_gpu_integration_test.py screenshot_sync --browser=release`
* `run_gpu_integration_test.py trace_test --browser=release`
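
To run several of these suites in one go and see which ones failed, a small
wrapper like the following can help. This is a sketch, not part of the
harness: it assumes you run it from `src/content/test/gpu` after building
`chrome`, and the function name `run_gpu_suites` and the `RUNNER` override are
hypothetical conveniences. It passes only `--browser=release`, so suites that
take extra arguments (such as `--webgl-conformance-version`) aren't covered.

```sh
#!/bin/sh
# Run several Telemetry-based GPU suites against a local release build and
# print the name of each suite that exits with a non-zero status.
# Assumes the current directory is src/content/test/gpu. RUNNER is a
# hypothetical hook of this sketch so the harness can be swapped out.
RUNNER=${RUNNER:-./run_gpu_integration_test.py}

run_gpu_suites() {
  for suite in "$@"; do
    # Discard the (very verbose) harness output; only report failures.
    if ! "$RUNNER" "$suite" --browser=release >/dev/null 2>&1; then
      echo "$suite"
    fi
  done
}
```

For example, `run_gpu_suites context_lost pixel screenshot_sync` prints nothing
if all three suites pass; rerun any suite it lists by hand to see its full
output.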

**Note:** If you are on Linux and see this test harness exit immediately with
`**Non zero exit code**`, it's probably because some incompatible Python
packages are installed. Please uninstall the `python-egenix-mxdatetime` and
`python-logilab-common` packages in this case; see
[Issue 716241](http://crbug.com/716241).

You can also run a subset of tests with this harness:

* `run_gpu_integration_test.py webgl_conformance --browser=release
  --test-filter=conformance_attribs`

Figuring out the exact command line that was used to invoke the test on the
bots can be a little tricky. The bots all\* run their tests via Swarming and
isolates, meaning that the invocation of a step like `[trigger]
webgl_conformance_tests on NVIDIA GPU...` will look like:

* `python -u
  'E:\b\build\slave\Win7_Release__NVIDIA_\build\src\tools\swarming_client\swarming.py'
  trigger --swarming https://chromium-swarm.appspot.com
  --isolate-server https://isolateserver.appspot.com
  --priority 25 --shards 1 --task-name 'webgl_conformance_tests on NVIDIA GPU...'`

You can figure out the additional command line arguments that were passed to
each test on the bots by examining the trigger step and searching for the
argument separator (<code> -- </code>). For a recent invocation of
`webgl_conformance_tests`, this looked like:

* `webgl_conformance --show-stdout '--browser=release' -v
  '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc'
  '--isolated-script-test-output=${ISOLATED_OUTDIR}/output.json'`

You can leave off the `--isolated-script-test-output` argument, which leaves a
full command line of:

* `run_gpu_integration_test.py
  webgl_conformance --show-stdout '--browser=release' -v
  '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc'`
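
If you have copied a trigger step's command line into a shell variable, a small
helper can pull out everything after the <code> -- </code> separator and drop
the `--isolated-script-test-output` argument. This is a sketch: the function
name `local_test_args` is hypothetical, and it assumes the separator appears
only once and that the output flag's value contains no spaces. If no separator
is present, the whole string is returned unchanged.

```sh
#!/bin/sh
# Hypothetical helper: given a trigger step's full command line as one string,
# print the test arguments after the first " -- " separator, minus the
# --isolated-script-test-output argument, which is only useful on the bots.
local_test_args() {
  # Strip the shortest prefix ending in " -- " (the Swarming trigger part).
  args="${1#* -- }"
  # Remove the output flag whether or not it is wrapped in single quotes.
  printf '%s\n' "$args" |
    sed "s/ *'\{0,1\}--isolated-script-test-output=[^ ']*'\{0,1\}//"
}
```

The output keeps the original shell quoting, so paste it after
`run_gpu_integration_test.py` on a command line rather than passing it as one
quoted argument.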

The Maps test requires you to authenticate to cloud storage in order to access
the Web Page Replay archive containing the test. See [Cloud Storage Credentials]
for documentation on setting this up.

[Cloud Storage Credentials]: gpu_testing_bot_details.md#Cloud-storage-credentials

Pixel tests use reference images from cloud storage. The bots pass the
`--upload-refimg-to-cloud-storage` argument, but to run the tests locally you
need to pass `--download-refimg-from-cloud-storage`, as well as other arguments
the bots use, like `--refimg-cloud-storage-bucket` and `--os-type`.

Sample command line for Android:

* `run_gpu_integration_test.py pixel --show-stdout --browser=android-chromium
  -v --passthrough --extra-browser-args='--enable-logging=stderr
  --js-flags=--expose-gc' --refimg-cloud-storage-bucket
  chromium-gpu-archive/reference-images --os-type android
  --download-refimg-from-cloud-storage`

<!-- XXX: update this section; these isolates don't exist anymore -->
You can find the isolates for the various tests in
[src/chrome/](http://src.chromium.org/viewvc/chrome/trunk/src/chrome/):

* [angle_unittests.isolate](https://chromium.googlesource.com/chromium/src/+/master/chrome/angle_unittests.isolate)
* [content_gl_tests.isolate](https://chromium.googlesource.com/chromium/src/+/master/content/content_gl_tests.isolate)
* [gl_tests.isolate](https://chromium.googlesource.com/chromium/src/+/master/chrome/gl_tests.isolate)
* [gles2_conform_test.isolate](https://chromium.googlesource.com/chromium/src/+/master/chrome/gles2_conform_test.isolate)
* [tab_capture_end2end_tests.isolate](https://chromium.googlesource.com/chromium/src/+/master/chrome/tab_capture_end2end_tests.isolate)
* [telemetry_gpu_test.isolate](https://chromium.googlesource.com/chromium/src/+/master/chrome/telemetry_gpu_test.isolate)

The isolates contain the full or partial command line for invoking the target.
The complete command line for any test can be deduced from the contents of the
isolate plus the stdio output from the test's run on the bot.

Note that for the GN build, the isolates are simply described by build targets,
and [gn_isolate_map.pyl] describes the mapping between isolate name and build
target, as well as the command line used to invoke the isolate. Once all
platforms have switched to GN, the .isolate files will become obsolete and will
be removed.

(\* A few of the one-off GPU configurations on the chromium.gpu.fyi waterfall
run their tests locally rather than via Swarming, in order to decrease the
number of physical machines needed.)

[gn_isolate_map.pyl]: https://chromium.googlesource.com/chromium/src/+/master/testing/buildbot/gn_isolate_map.pyl

## Running Binaries from the Bots Locally

Any binary run remotely on a bot can also be run locally, assuming the local
machine loosely matches the architecture and OS of the bot.

The easiest way to do this is to find the ID of the Swarming task and use
`swarming.py reproduce` to re-run it:

* `./src/tools/swarming_client/swarming.py reproduce -S https://chromium-swarm.appspot.com [task ID]`

The task ID can be found in the stdio for the "trigger" step for the test. For
example, look at a recent build from the [Mac Release (Intel)] bot, and
look at the `gl_unittests` step. You will see something like:

[Mac Release (Intel)]: https://ci.chromium.org/buildbot/chromium.gpu/Mac%20Release%20%28Intel%29/

```
Triggered task: gl_unittests on Intel GPU on Mac/Mac-10.12.6/[TRUNCATED_ISOLATE_HASH]/Mac Release (Intel)/83664
To collect results, use:
swarming.py collect -S https://chromium-swarm.appspot.com --json /var/folders/[PATH_TO_TEMP_FILE].json
Or visit:
https://chromium-swarm.appspot.com/user/task/[TASK_ID]
```
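
Since the task ID is the last path component of the "Or visit:" URL, it can be
extracted mechanically. A sketch, assuming you've saved the trigger step's stdio
to a file and that the URL has the `/user/task/[TASK_ID]` form shown above (the
function name `reproduce_cmd` is hypothetical):

```sh
#!/bin/sh
# Hypothetical helper: read trigger-step stdio on stdin, find the first
# ".../user/task/<TASK_ID>" URL, and print a swarming.py reproduce command.
reproduce_cmd() {
  task_id=$(sed -n 's#.*/user/task/\([^/[:space:]]*\).*#\1#p' | head -n 1)
  if [ -z "$task_id" ]; then
    echo "no task URL found" >&2
    return 1
  fi
  echo "./src/tools/swarming_client/swarming.py reproduce" \
    "-S https://chromium-swarm.appspot.com $task_id"
}
```

For example, `reproduce_cmd < trigger_stdio.txt` (a hypothetical file name)
prints the command rather than running it, so you can check that it picked up
the task ID and not the isolate hash before running it yourself.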

There is a difference between the isolate's hash and Swarming's task ID. Make
sure you use the task ID and not the isolate's hash.

As of this writing, there seems to be a
[bug](https://github.com/luci/luci-py/issues/250)
when attempting to re-run the Telemetry-based GPU tests in this way. For the
time being, this can be worked around by instead downloading the contents of
the isolate. To do so, look more deeply into the trigger step's log:

* <code>python -u
  /b/build/slave/Mac_10_10_Release__Intel_/build/src/tools/swarming_client/swarming.py
  trigger [...more args...] --tag data:[ISOLATE_HASH] [...more args...]
  [ISOLATE_HASH] -- **[...TEST_ARGS...]**</code>

As of this writing, the isolate hash appears twice in the command line. To
download the isolate's contents into the directory `foo`, run the following
(this command also appears in the "Help" section of the page for the isolate's
task, but it's unclear whether that page is accessible only to Google employees
or to all members of the chromium.org organization):

* `python isolateserver.py download -I https://isolateserver.appspot.com
  --namespace default-gzip -s [ISOLATE_HASH] --target foo`
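
Because the hash follows `--tag data:` in the trigger command, the download
command can be assembled from a copied command line. This sketch (the function
name `isolate_download_cmd` is hypothetical) only prints the command shown above
with the hash filled in; it assumes the hash contains no spaces and defaults the
target directory to `foo`:

```sh
#!/bin/sh
# Hypothetical helper: print the isolateserver.py download command for the
# isolate referenced by "--tag data:<HASH>" in a copied trigger command line.
# Usage: isolate_download_cmd "<trigger command line>" [target_dir]
isolate_download_cmd() {
  hash=$(printf '%s\n' "$1" | sed -n 's/.*--tag data:\([^ ]*\).*/\1/p')
  if [ -z "$hash" ]; then
    echo "no '--tag data:<HASH>' argument found" >&2
    return 1
  fi
  echo "python isolateserver.py download -I https://isolateserver.appspot.com" \
    "--namespace default-gzip -s $hash --target ${2:-foo}"
}
```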

`isolateserver.py` will tell you the approximate command line to use. You
should concatenate the `TEST_ARGS` that follow the <code> -- </code> separator
in the trigger command above with `isolateserver.py`'s recommendation. The
`ISOLATED_OUTDIR` variable can be safely replaced with `/tmp`.
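
The `/tmp` substitution is straightforward with `sed`. A minimal sketch (the
function name `localize_outdir` is hypothetical; it assumes the replacement
directory contains no `#` or `&`, which `sed` would treat specially):

```sh
#!/bin/sh
# Hypothetical helper: replace every literal ${ISOLATED_OUTDIR} in a copied
# argument string with a local directory (default /tmp).
localize_outdir() {
  # Single quotes keep the shell from expanding the placeholder; \$ makes
  # sed match the dollar sign literally.
  printf '%s\n' "$1" | sed 's#\${ISOLATED_OUTDIR}#'"${2:-/tmp}"'#g'
}
```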

Note that `isolateserver.py` downloads a large number of files (everything
needed to run the test) and may take a while. It is also possible to use
`run_isolated.py` to achieve the same result, but as of this writing there
were problems doing so, so that procedure is not documented here.

Before attempting to download an isolate, you must ensure you have permission
to access the isolate server. Full instructions can be [found
here][isolate-server-credentials]. For most cases, you can simply run:

* `./src/tools/swarming_client/auth.py login
  --service=https://isolateserver.appspot.com`

The login step above requires that you log in with your @google.com
credentials. It's not known at the present time whether this works with
@chromium.org accounts. Email kbr@ if you try this and find it doesn't work.

[isolate-server-credentials]: gpu_testing_bot_details.md#Isolate-server-credentials

## Running Locally Built Binaries on the GPU Bots

See the [Swarming documentation] for instructions on how to upload your
binaries to the isolate server and trigger execution on Swarming.

[Swarming documentation]: https://www.chromium.org/developers/testing/isolated-testing/for-swes#TOC-Run-a-test-built-locally-on-Swarming

## Adding New Tests to the GPU Bots

The goal of the GPU bots is to avoid regressions in Chrome's rendering stack.
To that end, let's add as many tests as possible that will help catch
regressions in the product. If you see a crazy bug in Chrome's rendering which
would be easy to catch with a pixel test running in Chrome and hard to catch in
any of the other test harnesses, please invest the time to add a test!

There are a couple of different ways to add new tests to the bots:

1. Adding a new test to one of the existing harnesses.
2. Adding an entire new test step to the bots.

### Adding a new test to one of the existing test harnesses

Adding new tests to the GTest-based harnesses is straightforward and
essentially requires no explanation.

As of this writing it isn't as easy as desired to add a new test to one of the
Telemetry-based harnesses. See [Issue 352807](http://crbug.com/352807). Let's
collectively work to address that issue. It would be great to reduce the number
of steps on the GPU bots, or at least to avoid significantly increasing the
number of steps on the bots. The WebGL conformance tests should probably remain
a separate step, but some of the smaller Telemetry-based tests
(`context_lost_tests`, `memory_test`, etc.) should probably be combined into a
single step.

If you are adding a new test to one of the existing test suites (e.g.,
`pixel_test`), all you need to do is make sure that your new test runs
correctly via isolates. See the documentation from the GPU bot details on
[adding new isolated tests][new-isolates] for the `GYP_DEFINES` and
authentication needed to upload isolates to the isolate server. Most likely the
new test will be Telemetry-based, and included in the `telemetry_gpu_test_run`
isolate. You can then invoke it via:

* `./src/tools/swarming_client/run_isolated.py -s [HASH]
  -I https://isolateserver.appspot.com -- [TEST_NAME] [TEST_ARGUMENTS]`

[new-isolates]: gpu_testing_bot_details.md#Adding-a-new-isolated-test-to-the-bots

### Adding new steps to the GPU Bots

The tests that are run by the GPU bots are described by a couple of JSON files
in the Chromium workspace:

* [`chromium.gpu.json`](https://chromium.googlesource.com/chromium/src/+/master/testing/buildbot/chromium.gpu.json)
* [`chromium.gpu.fyi.json`](https://chromium.googlesource.com/chromium/src/+/master/testing/buildbot/chromium.gpu.fyi.json)

These files are autogenerated by the following script:

* [`generate_buildbot_json.py`](https://chromium.googlesource.com/chromium/src/+/master/testing/buildbot/generate_buildbot_json.py)

This script is documented in
[`testing/buildbot/README.md`](https://chromium.googlesource.com/chromium/src/+/master/testing/buildbot/README.md). The
JSON files are parsed by the chromium and chromium_trybot recipes, and describe
two basic types of tests:

* GTests: those which use the Googletest and Chromium's `base/test/launcher/`
  frameworks.
* Isolated scripts: tests whose initial entry point is a Python script which
  follows a simple convention of command line argument parsing.

However, the majority of the GPU tests are:

* Telemetry-based tests: isolated script tests which are built on the
  Telemetry framework and which launch the entire browser.

A prerequisite of adding a new test to the bots is that the test [runs via
isolates][new-isolates]. Once that is done, modify `test_suites.pyl` to add the
test to the appropriate set of bots. Be careful when adding large new test steps
to all of the bots, because the GPU bots are a limited resource and do not
currently have the capacity to absorb large new test suites. It is safer to get
new tests running on the chromium.gpu.fyi waterfall first, and expand from there
to the chromium.gpu waterfall (which will also make them run against every
Chromium CL by virtue of the `linux_chromium_rel_ng`, `mac_chromium_rel_ng`,
`win7_chromium_rel_ng` and `android-marshmallow-arm64-rel` tryservers' mirroring
of the bots on this waterfall, so be careful!).

Tryjobs which add new test steps to the chromium.gpu.json file will run those
new steps during the tryjob, which helps ensure that the new test won't break
once it starts running on the waterfall.

Tryjobs which modify chromium.gpu.fyi.json can be sent to the
`win_optional_gpu_tests_rel`, `mac_optional_gpu_tests_rel` and
`linux_optional_gpu_tests_rel` tryservers to help ensure that they won't
break the FYI bots.
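
Following the one-invocation-per-bot pattern used for the default tryservers,
sending such a tryjob to all three optional tryservers could look like the
sketch below. It assumes `git cl try -b` resolves these LUCI builder names on
its own (if it doesn't in your checkout, see `git cl try --help` for how to
specify the bucket). The function name and `DRY_RUN` are conventions of this
sketch, not `git cl` options.

```sh
#!/bin/sh
# Send the current CL to each optional GPU tryserver, one `git cl try`
# invocation per bot. DRY_RUN=1 (a convention of this sketch, not a git-cl
# option) only prints the commands.
send_optional_gpu_try_jobs() {
  for bot in win_optional_gpu_tests_rel mac_optional_gpu_tests_rel \
             linux_optional_gpu_tests_rel; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "git cl try -b $bot"
    else
      git cl try -b "$bot" || return 1
    fi
  done
}
```

Running `DRY_RUN=1 send_optional_gpu_try_jobs` first is a cheap way to confirm
the builder names before actually sending the jobs.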
| 447 | |
| 448 | ## Updating and Adding New Pixel Tests to the GPU Bots |
| 449 | |
| 450 | Adding new pixel tests which require reference images is a slightly more |
| 451 | complex process than adding other kinds of tests which can validate their own |
| 452 | correctness. There are a few reasons for this. |
| 453 | |
| 454 | * Reference image based pixel tests require different golden images for |
| 455 | different combinations of operating system, GPU, driver version, OS |
| 456 | version, and occasionally other variables. |
| 457 | * The reference images must be generated by the main waterfall. The try |
| 458 | servers are not allowed to produce new reference images, only consume them. |
| 459 | The reason for this is that a patch sent to the try servers might cause an |
| 460 | incorrect reference image to be generated. For this reason, the main |
| 461 | waterfall bots upload reference images to cloud storage, and the try |
| 462 | servers download them and verify their results against them. |
| 463 | * The try servers will fail if they run a pixel test requiring a reference |
| 464 | image that doesn't exist in cloud storage. This is deliberate, but needs |
| 465 | more thought; see [Issue 349262](http://crbug.com/349262). |
| 466 | |
| 467 | If a reference image based pixel test's result is going to change because of a |
| 468 | change in ANGLE or Blink (for example), updating the reference images is a |
| 469 | slightly tricky process. Here's how to do it: |
| 470 | |
| 471 | * Mark the pixel test as failing in the [pixel tests]' [test expectations] |
| 472 | * Commit the change to ANGLE, Blink, etc. which will change the test's |
| 473 | results |
| 474 | * Note that without the failure expectation, this commit would turn some bots |
| 475 | red; a Blink change will turn the GPU bots on the chromium.webkit waterfall |
| 476 | red, and an ANGLE change will turn the chromium.gpu.fyi bots red |
| 477 | * Wait for Blink/ANGLE/etc. to roll |
| 478 | * Commit a change incrementing the revision number associated with the test |
| 479 | in the [test pages] |
| 480 | * Commit a second change removing the failure expectation, once all of the |
| 481 | bots on the main waterfall have generated new reference images. This change |
| 482 | should go through the commit queue cleanly. |
| 483 | |
| 484 | [pixel tests]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/pixel_test_pages.py |
| 485 | [test expectations]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/pixel_expectations.py |
| 486 | [test pages]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/pixel_test_pages.py |
| 487 | |
| 488 | When adding a brand new pixel test that uses a reference image, the steps are |
| 489 | similar, but simpler: |
| 490 | |
| 491 | * Mark the test as failing in the same commit which introduces the new test |
| 492 | * Wait for the reference images to be produced by all of the GPU bots on the |
| 493 | waterfalls (see [chromium-gpu-archive/reference-images]) |
| 494 | * Commit a change un-marking the test as failing |

When making a Chromium-side change that changes the pixel tests' results:

* In your CL, both mark the pixel test as failing in the [test expectations]
  and increment the test's revision number in the [test pages] (see above)
* After your CL lands, land another CL removing the failure expectations. If
  this second CL goes through the commit queue cleanly, you know reference
  images were generated properly.

In general, when adding a new pixel test, it's better to spot-check a few
pixels in the rendered image than to use a reference image per platform, since
the same validation then works on every GPU and there are no per-platform
reference images to regenerate. The [GPU rasterization test] is a good example
of a test that performs such spot checks.
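
A spot check asserts only a handful of known colors at known locations, with a small per-channel tolerance, so that one expectation works on every GPU. Below is a minimal sketch. It assumes the captured screenshot is available as rows of RGB tuples. `ExpectedColor` and `spot_check` are hypothetical names, not the GPU rasterization test's actual code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


@dataclass
class ExpectedColor:
    x: int
    y: int
    color: RGB
    tolerance: int = 0  # maximum allowed difference per color channel


def spot_check(image: Sequence[Sequence[RGB]], expectations: List[ExpectedColor]) -> List[str]:
    """Compare a few pixels against expected colors; an empty result means the check passed."""
    errors = []
    for exp in expectations:
        actual = image[exp.y][exp.x]  # rows first, so index as [y][x]
        if any(abs(a - e) > exp.tolerance for a, e in zip(actual, exp.color)):
            errors.append(f"({exp.x}, {exp.y}): expected {exp.color}, got {actual}")
    return errors


# A 4x4 all-red capture with one slightly off-red pixel at (3, 3).
capture = [[(255, 0, 0)] * 4 for _ in range(4)]
capture[3][3] = (250, 2, 0)
print(spot_check(capture, [ExpectedColor(0, 0, (255, 0, 0)),
                           ExpectedColor(3, 3, (255, 0, 0), tolerance=8)]))  # []
```

The tolerance absorbs small differences between GPUs without requiring a separate reference image per platform.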

[cloud storage bucket]: https://console.developers.google.com/storage/chromium-gpu-archive/reference-images
<!-- XXX: old link -->
[GPU rasterization test]: http://src.chromium.org/viewvc/chrome/trunk/src/content/test/gpu/gpu_tests/gpu_rasterization.py

## Stamping out Flakiness

It's critically important to aggressively investigate and eliminate the root
cause of any flakiness seen on the GPU bots. The bots have been known to run
reliably for days at a time, and any flaky failures that are tolerated on the
bots translate directly into instability of the browser experienced by
customers. Critical bugs in subsystems like WebGL, affecting high-profile
products like Google Maps, have escaped notice in the past because the bots
were unreliable. After much rework, the GPU bots are now among the most
reliable automated test machines in the Chromium project. Let's keep them that
way.

Flakiness affecting the GPU tests can come from highly unexpected sources.
Here are some examples:

* Intermittent pixel_test failures on Linux where the captured pixels were
  black, caused by Display Power Management Signaling (DPMS) kicking in. The
  X server's built-in screen saver was disabled on the GPU bots in response.
* GNOME dbus-related deadlocks causing intermittent timeouts ([Issue
  309093](http://crbug.com/309093) and related bugs).
* Windows Audio system changes causing intermittent assertion failures in the
  browser ([Issue 310838](http://crbug.com/310838)).
* Enabling assertion failures in the C++ standard library on Linux causing
  random assertion failures ([Issue 328249](http://crbug.com/328249)).
* V8 bugs causing random crashes of the Maps pixel test (V8 issues
  [3022](https://code.google.com/p/v8/issues/detail?id=3022),
  [3174](https://code.google.com/p/v8/issues/detail?id=3174)).
* TLS changes causing random browser process crashes ([Issue
  264406](http://crbug.com/264406)).
* Isolated test execution flakiness caused by failures to reliably clean up
  temporary directories ([Issue 340415](http://crbug.com/340415)).
* The Telemetry-based WebGL conformance suite caught a bug in the memory
  allocator on Android not caught by any other bot ([Issue
  347919](http://crbug.com/347919)).
* context_lost test failures caused by the compositor's retry logic ([Issue
  356453](http://crbug.com/356453)).
* Multiple bugs in Chromium's support for lost contexts causing flakiness of
  the context_lost tests ([Issue 365904](http://crbug.com/365904)).
* Maps test timeouts caused by Content Security Policy changes in Blink
  ([Issue 395914](http://crbug.com/395914)).
* Weak pointer assertion failures in various webgl\_conformance\_tests caused
  by changes to the media pipeline ([Issue 399417](http://crbug.com/399417)).
* A change to a default WebSocket timeout in Telemetry causing intermittent
  failures to run all WebGL conformance tests on the Mac bots ([Issue
  403981](http://crbug.com/403981)).
* Chrome leaking suspended sub-processes on Windows, apparently a preexisting
  race condition that suddenly showed up ([Issue
  424024](http://crbug.com/424024)).
* Changes to Chrome's cross-context synchronization primitives causing the
  wrong tiles to be rendered ([Issue 584381](http://crbug.com/584381)).
* A bug in V8's handling of array literals causing flaky failures of
  texture-related WebGL 2.0 tests ([Issue 606021](http://crbug.com/606021)).
* Assertion failures in sync point management related to lost contexts that
  exposed a real correctness bug ([Issue 606112](http://crbug.com/606112)).
* A bug in glibc's `sem_post`/`sem_wait` primitives breaking V8's parallel
  garbage collection ([Issue 609249](http://crbug.com/609249)).
* A change to Blink's memory purging primitive that caused intermittent
  timeouts of WebGL conformance tests on all platforms ([Issue
  840988](http://crbug.com/840988)).

If you notice flaky test failures on either the GPU waterfalls or the try
servers, please file bugs right away with the component Internals>GPU>Testing
and include links to the failing builds and copies of the logs, since the logs
expire after a few days. [GPU pixel wranglers] should give the highest priority
to eliminating flakiness on the tree.
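
Before filing, a useful first triage step is to check whether the same test both passed and failed at the same revision on the same bot. That pattern suggests flakiness rather than a deterministic regression. Below is a minimal sketch of that triage. It assumes you have already collected per-build results as `(bot, revision, test, passed)` records. That record format and the bot and test names are hypothetical; in practice the data would come from the build pages you link in the bug.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record format: (bot, revision, test name, passed)
Result = Tuple[str, str, str, bool]


def find_flaky_tests(results: Iterable[Result]) -> List[Tuple[str, str, str]]:
    """Return (bot, revision, test) triples that both passed and failed on the same code."""
    outcomes: Dict[Tuple[str, str, str], Set[bool]] = defaultdict(set)
    for bot, revision, test, passed in results:
        outcomes[(bot, revision, test)].add(passed)
    # Seeing both True and False for one key means the outcome was not deterministic.
    return sorted(key for key, seen in outcomes.items() if len(seen) == 2)


history = [
    ("linux_nvidia", "r100", "context_lost", True),
    ("linux_nvidia", "r100", "context_lost", False),
    ("mac_intel", "r100", "pixel", False),
    ("mac_intel", "r100", "pixel", False),
]
print(find_flaky_tests(history))  # [('linux_nvidia', 'r100', 'context_lost')]
```

A consistent failure like the `mac_intel` one above is not reported as flaky. It still deserves a bug, but it points toward a deterministic regression instead.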

[GPU pixel wranglers]: pixel_wrangling.md