Kai Ninomiya | a6429fb3 | 2018-03-30 01:30:56 | 1 | # GPU Bots & Pixel Wrangling |
| 2 | |
| 3 |  |
| 4 | |
| 5 | (December 2017: presentation on GPU bots and pixel wrangling: see [slides].) |
| 6 | |
| 7 | GPU Pixel Wrangling is the process of keeping various GPU bots green. On the |
| 8 | GPU bots, tests run on physical hardware with real GPUs, not in VMs like the |
| 9 | majority of the bots on the Chromium waterfall. |
| 10 | |
| 11 | [slides]: https://docs.google.com/presentation/d/1sZjyNe2apUhwr5sinRfPs7eTzH-3zO0VQ-Cj-8DlEDQ/edit?usp=sharing |
| 12 | |
| 13 | [TOC] |
| 14 | |
| 15 | ## Fleet Status |
| 16 | |
| 17 | The following links (sorry, Google employees only) show the status of various |
| 18 | GPU bots in the fleet. |
| 19 | |
| 20 | Primary configurations: |
| 21 | |
| 22 | * [Windows 10 Quadro P400 Pool](http://shortn/_dmtaFfY2Jq) |
| 23 | * [Windows 10 Intel HD 630 Pool](http://shortn/_QsoGIGIFYd) |
| 24 | * [Linux Quadro P400 Pool](http://shortn/_fNgNs1uROQ) |
| 25 | * [Linux Intel HD 630 Pool](http://shortn/_dqEGjCGMHT) |
| 26 | * [Mac AMD Retina 10.12.6 GPU Pool](http://shortn/_BcrVmfRoSo) |
| 27 | * [Mac Mini Chrome Pool](http://shortn/_Ru8NESapPM) |
| 28 | * [Android Nexus 5X Chrome Pool](http://shortn/_G3j7AVmuNR) |
| 29 | |
| 30 | Secondary configurations: |
| 31 | |
| 32 | * [Windows 7 Quadro P400 Pool](http://shortn/_cuxSKC15UX) |
| 33 | * [Windows AMD R7 240 GPU Pool](http://shortn/_XET7RTMHQm) |
| 34 | * [Mac NVIDIA Retina 10.12.6 GPU Pool](http://shortn/_jQWG7W71Ek) |
| 35 | |
| 36 | ## GPU Bots' Waterfalls |
| 37 | |
| 38 | The waterfalls work much like any other; see the [Tour of the Chromium Buildbot |
| 39 | Waterfall] for a more detailed explanation of how this is laid out. We have |
| 40 | more subtle configurations because the GPU matters, not just the OS and Release |
| 41 | vs. Debug. Hence we have Windows NVIDIA Release bots, Mac Intel Debug bots, and |
| 42 | so on. The waterfalls we’re interested in are: |
| 43 | |
| 44 | * [Chromium GPU] |
| 45 | * Various operating systems, configurations, GPUs, etc. |
| 46 | * [Chromium GPU FYI] |
| 47 | * These bots run less-standard configurations like Windows with AMD GPUs, |
| 48 | Linux with Intel GPUs, etc. |
| 49 | * These bots build with top of tree ANGLE rather than the `DEPS` version. |
| 50 | * The [ANGLE tryservers] help ensure that these bots stay green. However, |
| 51 | it is possible that due to ANGLE changes these bots may be red while |
| 52 | the chromium.gpu bots are green. |
| 53 | * The [ANGLE Wrangler] is on-call to help resolve ANGLE-related breakage |
| 54 | on this waterfall. |
| 55 | * To determine if a different ANGLE revision was used between two builds, |
| 56 | compare the `got_angle_revision` buildbot property on the GPU builders |
| 57 | or `parent_got_angle_revision` on the testers. This revision can be |
| 58 | used to do a `git log` in the `third_party/angle` repository. |
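
As a minimal sketch of the revision comparison described above, the helper below builds a `git log` command for the range between two ANGLE revisions. The function name and the placeholder hashes are hypothetical; the only inputs assumed are the two `got_angle_revision` (or `parent_got_angle_revision`) values copied from the builds being compared.

```python
import shlex


def angle_log_command(old_revision, new_revision, angle_dir="third_party/angle"):
    """Build a `git log` command listing ANGLE commits between two builds.

    old_revision and new_revision are the `got_angle_revision` (or
    `parent_got_angle_revision`) values copied from the two builds.
    """
    if not old_revision or not new_revision:
        raise ValueError("both ANGLE revisions are required")
    return [
        "git", "-C", angle_dir, "log", "--oneline",
        f"{old_revision}..{new_revision}",
    ]


if __name__ == "__main__":
    # Placeholder hashes for illustration only; substitute the real values.
    print(shlex.join(angle_log_command("aaaaaaa", "bbbbbbb")))
```

Run the printed command from the root of a Chromium checkout. If the two revisions are identical, the range is empty and ANGLE is unlikely to be the cause of the difference between the builds.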
| 59 | |
| 60 | <!-- TODO(kainino): update link when the page is migrated --> |
| 61 | [Tour of the Chromium Buildbot Waterfall]: http://www.chromium.org/developers/testing/chromium-build-infrastructure/tour-of-the-chromium-buildbot |
| 62 | [Chromium GPU]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console?reload=120 |
| 63 | [Chromium GPU FYI]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console?reload=120 |
| 64 | [ANGLE tryservers]: https://build.chromium.org/p/tryserver.chromium.angle/waterfall |
| 65 | <!-- TODO(kainino): update link when the page is migrated --> |
| 66 | [ANGLE Wrangler]: https://sites.google.com/a/chromium.org/dev/developers/how-tos/angle-wrangling |
| 67 | |
| 68 | ## Test Suites |
| 69 | |
| 70 | The bots run several test suites. The majority of them have been migrated to |
| 71 | the Telemetry harness, and are run within the full browser, in order to better |
| 72 | test the code that is actually shipped. As of this writing, the tests include: |
| 73 | |
| 74 | * Tests using the Telemetry harness: |
| 75 | * The WebGL conformance tests: `webgl_conformance_integration_test.py` |
| 76 | * A Google Maps test: `maps_integration_test.py` |
| 77 | * Context loss tests: `context_lost_integration_test.py` |
| 78 | * Depth capture tests: `depth_capture_integration_test.py` |
| 79 | * GPU process launch tests: `gpu_process_integration_test.py` |
| 80 | * Hardware acceleration validation tests: |
| 81 | `hardware_accelerated_feature_integration_test.py` |
| 82 | * Pixel tests validating the end-to-end rendering pipeline: |
| 83 | `pixel_integration_test.py` |
| 84 | * Stress tests of the screenshot functionality other tests use: |
| 85 | `screenshot_sync_integration_test.py` |
| 86 | * `angle_unittests`: see `src/gpu/gpu.gyp` |
| 87 | * drawElements tests (on the chromium.gpu.fyi waterfall): see |
| 88 | `src/third_party/angle/src/tests/BUILD.gn` |
| 89 | * `gles2_conform_test` (requires internal sources): see |
| 90 | `src/gpu/gles2_conform_support/gles2_conform_test.gyp` |
| 91 | * `gl_tests`: see `src/gpu/BUILD.gn` |
| 92 | * `gl_unittests`: see `src/ui/gl/BUILD.gn` |
| 93 | |
| 94 | And more. See `src/content/test/gpu/generate_buildbot_json.py` for the |
| 95 | complete description of bots and tests. |
| 96 | |
| 97 | Additionally, the Release bots run: |
| 98 | |
| 99 | * `tab_capture_end2end_tests`: see |
| 100 | `src/chrome/browser/extensions/api/tab_capture/tab_capture_apitest.cc` and |
| 101 | `src/chrome/browser/extensions/api/cast_streaming/cast_streaming_apitest.cc` |
| 102 | |
| 103 | ### More Details |
| 104 | |
| 105 | More details about the bots' setup can be found on the [GPU Testing] page. |
| 106 | |
| 107 | [GPU Testing]: https://sites.google.com/a/chromium.org/dev/developers/testing/gpu-testing |
| 108 | |
| 109 | ## Wrangling |
| 110 | |
| 111 | ### Prerequisites |
| 112 | |
| 113 | 1. Ideally a wrangler should be a Chromium committer. If you're on the GPU |
| 114 | pixel wrangling rotation, there will be an email notifying you of the upcoming |
| 115 | shift, and a calendar appointment. |
| 116 | * If you aren't a committer, don't panic. It's still best for everyone on |
| 117 | the team to become acquainted with the procedures of maintaining the |
| 118 | GPU bots. |
| 119 | * In this case you'll upload CLs to Gerrit to perform reverts (optionally |
| 120 | using the new "Revert" button in the UI), and might consider using |
| 121 | `TBR=` to speed through trivial and urgent CLs. In general, try to send |
| 122 | all CLs through the commit queue. |
| 123 | * Contact bajones, kainino, kbr, vmiura, zmo, or another member of the |
| 124 | Chrome GPU team who's already a committer for help landing patches or |
| 125 | reverts during your shift. |
| 126 | 2. Apply for [access to the bots]. |
| 127 | |
| 128 | [access to the bots]: https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1 |
| 129 | |
| 130 | ### How to Keep the Bots Green |
| 131 | |
| 132 | 1. Watch for redness on the tree. |
| 133 | 1. [Sheriff-O-Matic now has support for the chromium.gpu.fyi waterfall]! |
| 134 | 1. The chromium.gpu bots are covered under Sheriff-O-Matic's [Chromium |
| 135 | tab]. As pixel wrangler, ignore any non-GPU test failures in this tab. |
| 136 | 1. The bots are expected to be green all the time. Flakiness on these bots |
| 137 | is neither expected nor acceptable. |
| 138 | 1. If a bot goes consistently red, it's necessary to figure out whether a |
| 139 | recent CL caused it, or whether it's a problem with the bot or |
| 140 | infrastructure. |
| 141 | 1. If it looks like a problem with the bot (deep problems like failing to |
| 142 | check out the sources, the isolate server failing, etc.) notify the |
| 143 | Chromium troopers and file a P1 bug with labels: Infra\>Labs, |
| 144 | Infra\>Troopers and Internals\>GPU\>Testing. See the general [tree |
| 145 | sheriffing page] for more details. |
| 146 | 1. Otherwise, examine the builds just before and after the redness was |
| 147 | introduced, and compare the revisions they contain to establish the |
| 148 | regression range. |
| 149 | 1. **File a bug** capturing the regression range and excerpts of any |
| 150 | associated logs. Regressions should be marked P1. CC engineers who you |
| 151 | think may be able to help triage the issue. Keep in mind that the logs |
| 152 | on the bots expire after a few days, so make sure to add copies of |
| 153 | relevant logs to the bug report. |
| 154 | 1. Use the `Hotlist=PixelWrangler` label to mark bugs that require the |
| 155 | pixel wrangler's attention, so it's easy to find relevant bugs when |
| 156 | handing off shifts. |
| 157 | 1. Study the regression range carefully. Use drover to revert any CLs |
| 158 | which break the chromium.gpu bots. Use your judgment about |
| 159 | chromium.gpu.fyi, since not all bots are covered by trybots. In the |
| 160 | revert message, provide a clear description of what broke, links to |
| 161 | failing builds, and excerpts of the failure logs, because the build |
| 162 | logs expire after a few days. |
| 163 | 1. Make sure the bots are running jobs. |
| 164 | 1. Keep an eye on the console views of the various bots. |
| 165 | 1. Make sure the bots are all actively processing jobs. If they go offline |
| 166 | for a long period of time, the "summary bubble" at the top may still be |
| 167 | green, but the column in the console view will be gray. |
| 168 | 1. Email the Chromium troopers if you find a bot that's not processing |
| 169 | jobs. |
| 170 | 1. Make sure the GPU try servers are in good health. |
| 171 | 1. The GPU try servers are no longer distinct bots on a separate |
| 172 | waterfall, but instead run as part of the regular tryjobs on the |
| 173 | Chromium waterfalls. The GPU tests run as part of the following |
| 174 | tryservers' jobs: |
| 175 | 1. <code>[linux_chromium_rel_ng]</code> on the [luci.chromium.try] |
| 176 | waterfall |
| 177 | <!-- TODO(kainino): update link to luci.chromium.try --> |
| 178 | 1. <code>[mac_chromium_rel_ng]</code> on the [tryserver.chromium.mac] |
| 179 | waterfall |
| 180 | <!-- TODO(kainino): update link to luci.chromium.try --> |
| 181 | 1. <code>[win7_chromium_rel_ng]</code> on the [tryserver.chromium.win] |
| 182 | waterfall |
| 183 | 1. The best tool to use to quickly find flakiness on the tryservers is the |
| 184 | new [Chromium Try Flakes] tool. Look for the names of GPU tests (like |
| 185 | maps_pixel_test) as well as the test machines (e.g. |
| 186 | mac_chromium_rel_ng). If you see a flaky test, file a bug like [this |
| 187 | one](http://crbug.com/444430). Also look for compile flakes that may |
| 188 | indicate that a bot needs to be clobbered. Contact the Chromium |
| 189 | sheriffs or troopers if so. |
| 190 | 1. Glance at these trybots from time to time and see if any GPU tests are |
| 191 | failing frequently. **Note** that test failures are **expected** on |
| 192 | these bots: individuals' patches may fail to apply, fail to compile, or |
| 193 | break various tests. Look specifically for patterns in the failures. It |
| 194 | isn't necessary to spend a lot of time investigating each individual |
| 195 | failure. (Use the "Show: 200" link at the bottom of the page to see |
| 196 | more history.) |
| 197 | 1. If the same set of tests are failing repeatedly, look at the individual |
| 198 | runs. Examine the swarming results and see whether they're all running |
| 199 | on the same machine. (This is the "Bot assigned to task" when clicking |
| 200 | any of the test's shards in the build logs.) If they are, something |
| 201 | might be wrong with the hardware. Use the [Swarming Server Stats] tool |
| 202 | to drill down into the specific builder. |
| 203 | 1. If you see the same test failing in a flaky manner across multiple |
| 204 | machines and multiple CLs, it's crucial to investigate why it's |
| 205 | happening. [crbug.com/395914](http://crbug.com/395914) was one example |
| 206 | of an innocent-looking Blink change which made it through the commit |
| 207 | queue and introduced widespread flakiness in a range of GPU tests. The |
| 208 | failures were also most visible on the try servers as opposed to the |
| 209 | main waterfalls. |
| 210 | 1. Check if any pixel test failures are actual failures or need to be |
| 211 | rebaselined. |
| 212 | 1. For a given build failing the pixel tests, click the "stdio" link of |
| 213 | the "pixel" step. |
| 214 | 1. The output will contain a link of the form |
| 215 | <http://chromium-browser-gpu-tests.commondatastorage.googleapis.com/view_test_results.html?242523_Linux_Release_Intel__telemetry> |
| 216 | 1. Visit the link to see whether the generated or reference images look |
| 217 | incorrect. |
| 218 | 1. All of the reference images for all of the bots are stored in cloud |
| 219 | storage under [chromium-gpu-archive/reference-images]. They are indexed |
| 220 | by version number, OS, GPU vendor, GPU device, and whether or not |
| 221 | antialiasing is enabled in that configuration. You can download the |
| 222 | reference images individually to examine them in detail. |
| 223 | 1. Rebaseline pixel test reference images if necessary. |
| 224 | 1. Follow the [instructions on the GPU testing page]. |
| 225 | 1. Alternatively, if absolutely necessary, you can use the [Chrome |
| 226 | Internal GPU Pixel Wrangling Instructions] to delete just the broken |
| 227 | reference images for a particular configuration. |
| 228 | 1. Update Telemetry-based test expectations if necessary. |
| 229 | 1. Most of the GPU tests are run inside a full Chromium browser, launched |
| 230 | by Telemetry, rather than a Gtest harness. The tests and their |
| 231 | expectations are contained in [src/content/test/gpu/gpu_tests/]. See, |
| 232 | for example, <code>[webgl_conformance_expectations.py]</code>, |
| 233 | <code>[gpu_process_expectations.py]</code> and |
| 234 | <code>[pixel_expectations.py]</code>. |
| 235 | 1. See the header of the file for a list of modifiers to specify a bot |
| 236 | configuration. It is possible to specify OS (down to a specific |
| 237 | version, say, Windows 7 or Mountain Lion), GPU vendor |
| 238 | (NVIDIA/AMD/Intel), and a specific GPU device. |
| 239 | 1. The key is to maintain the highest coverage: if you have to disable a |
| 240 | test, disable it only on the specific configurations it's failing. Note |
| 241 | that it is not possible to discern between Debug and Release |
| 242 | configurations. |
| 243 | 1. Mark tests failing or skipped, which will suppress flaky failures, only |
| 244 | as a last resort. It is only really necessary to suppress failures that |
| 245 | are showing up on the GPU tryservers, since failing tests no longer |
| 246 | close the Chromium tree. |
| 247 | 1. Please read the section on [stamping out flakiness] for motivation on |
| 248 | how important it is to eliminate flakiness rather than hiding it. |
| 249 | 1. For the remaining Gtest-style tests, use the [`DISABLED_` |
| 250 | modifier][gtest-DISABLED] to suppress any failures if necessary. |
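
The step above about checking whether repeated failures all ran on the same machine can be sketched as a small tally. This is only an illustration: the function name, thresholds, and bot IDs are hypothetical, and the input is assumed to be the "Bot assigned to task" values copied by hand from the failing shards.

```python
from collections import Counter


def suspect_bots(failed_shards, min_failures=3, min_share=0.5):
    """Return bot IDs that account for a large share of the failed shards.

    failed_shards: iterable of (build_id, bot_id) pairs, where bot_id is the
    "Bot assigned to task" value copied from each failing shard.
    """
    counts = Counter(bot_id for _, bot_id in failed_shards)
    total = sum(counts.values())
    return sorted(
        bot for bot, n in counts.items()
        if n >= min_failures and n / total >= min_share
    )


if __name__ == "__main__":
    # Hypothetical data: three of four failures landed on the same bot.
    failures = [(101, "bot-a"), (102, "bot-a"), (103, "bot-a"), (104, "bot-b")]
    print(suspect_bots(failures))
```

A non-empty result suggests a hardware or machine-specific problem worth raising with the troopers. An empty result, with failures spread across many bots, points toward a flaky test or a bad CL instead.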
| 251 | |
| 252 | [Sheriff-O-Matic now has support for the chromium.gpu.fyi waterfall]: https://sheriff-o-matic.appspot.com/chromium.gpu.fyi |
| 253 | [Chromium tab]: https://sheriff-o-matic.appspot.com/chromium |
| 254 | [tree sheriffing page]: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs |
| 255 | [linux_chromium_rel_ng]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng |
| 256 | [luci.chromium.try]: https://ci.chromium.org/p/chromium/g/luci.chromium.try/builders |
| 257 | [mac_chromium_rel_ng]: https://ci.chromium.org/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/ |
| 258 | [tryserver.chromium.mac]: https://ci.chromium.org/p/chromium/g/tryserver.chromium.mac/builders |
| 259 | [win7_chromium_rel_ng]: https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/ |
| 260 | [tryserver.chromium.win]: https://ci.chromium.org/p/chromium/g/tryserver.chromium.win/builders |
| 261 | [Chromium Try Flakes]: http://chromium-try-flakes.appspot.com/ |
| 262 | <!-- TODO(kainino): link doesn't work, but is still included from chromium-swarm homepage so not removing it now --> |
| 263 | [Swarming Server Stats]: https://chromium-swarm.appspot.com/stats |
| 264 | [chromium-gpu-archive/reference-images]: https://console.developers.google.com/storage/chromium-gpu-archive/reference-images |
| 265 | [instructions on the GPU testing page]: https://sites.google.com/a/chromium.org/dev/developers/testing/gpu-testing#TOC-Updating-and-Adding-New-Pixel-Tests-to-the-GPU-Bots |
| 266 | [Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions |
| 267 | [src/content/test/gpu/gpu_tests/]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/ |
| 268 | [webgl_conformance_expectations.py]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/webgl_conformance_expectations.py |
| 269 | [gpu_process_expectations.py]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/gpu_process_expectations.py |
| 270 | [pixel_expectations.py]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/pixel_expectations.py |
| 271 | [stamping out flakiness]: gpu_testing.md#Stamping-out-Flakiness |
| 272 | [gtest-DISABLED]: https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#temporarily-disabling-tests |
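
The advice above to suppress a test only on the configurations where it fails can be illustrated with a toy model. Everything in this sketch is hypothetical: the `Expectation` class, the tag names, and the test name are invented for illustration and do not reproduce the real expectations format, which is documented in the header of each expectations file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """Hypothetical stand-in for one entry in an expectations file."""
    test: str
    conditions: frozenset  # e.g. {"win7", "amd"}; empty means "everywhere"


def is_suppressed(test, bot_tags, expectations):
    """True if some expectation for `test` applies to a bot with `bot_tags`."""
    tags = set(bot_tags)
    return any(
        e.test == test and e.conditions <= tags
        for e in expectations
    )


if __name__ == "__main__":
    # A narrow suppression: only Windows 7 bots with AMD GPUs are affected.
    narrow = [Expectation("example/test.html", frozenset({"win7", "amd"}))]
    print(is_suppressed("example/test.html", {"win7", "amd"}, narrow))
    print(is_suppressed("example/test.html", {"linux", "amd"}, narrow))
```

The narrow entry keeps the test running, and able to catch regressions, on every other configuration. An entry with an empty condition set would hide the failure everywhere, which is the loss of coverage the guidance warns against.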
| 273 | |
| 274 | ### When Bots Misbehave (SSHing into a bot) |
| 275 | |
| 276 | 1. See the [Chrome Internal GPU Pixel Wrangling Instructions] for information |
| 277 | on ssh'ing in to the GPU bots. |
| 278 | |
| 279 | [Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions |
| 280 | |
| 281 | ### Reproducing WebGL conformance test failures locally |
| 282 | |
| 283 | 1. From the buildbot build output page, click on the failed shard to get to |
| 284 | the swarming task page. Scroll to the bottom of the left panel for a |
| 285 | command to run the task locally. This will automatically download the build |
| 286 | and any other inputs needed. |
| 287 | 2. Alternatively, to run the test on a local build, pass the arguments |
| 288 | `--browser=exact --browser-executable=/path/to/binary` to |
| 289 | `content/test/gpu/run_gpu_integration_test.py`. |
| 290 | Also see the [telemetry documentation]. |
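
A rough sketch of assembling the local-build invocation from step 2 follows. The two `--browser` flags and the script path come from the text above. The positional suite name (`webgl_conformance`), the `python` interpreter name, and the helper function itself are assumptions, so check the script's `--help` output before relying on them.

```python
def local_repro_command(browser_path, suite="webgl_conformance", extra_args=()):
    """Build the argv for running a GPU integration suite on a local build.

    The suite name is passed as a positional argument here, which is an
    assumption about the script's interface rather than something the
    surrounding text states.
    """
    return [
        "python",
        "content/test/gpu/run_gpu_integration_test.py",
        suite,
        "--browser=exact",
        f"--browser-executable={browser_path}",
        *extra_args,
    ]


if __name__ == "__main__":
    # /path/to/binary is a placeholder for the locally built browser.
    print(" ".join(local_repro_command("/path/to/binary")))
```

Run the resulting command from the `src` directory of the checkout so that the relative script path resolves.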
| 291 | |
| 292 | [telemetry documentation]: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/docs/run_benchmarks_locally.md |
| 293 | |
| 294 | ## Extending the GPU Pixel Wrangling Rotation |
| 295 | |
| 296 | See the [Chrome Internal GPU Pixel Wrangling Instructions] for information on extending the rotation. |
| 297 | |
| 298 | [Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions |