Blame - docs/gpu/pixel_wrangling.md - chromium/src.git

Kai Ninomiya	a6429fb3	2018-03-30 01:30:56	[diff] [blame]	1	# GPU Bots & Pixel Wrangling
				2
				3	![](images/wrangler.png)
				4
				5	(December 2017: presentation on GPU bots and pixel wrangling: see [slides].)
				6
				7	GPU Pixel Wrangling is the process of keeping various GPU bots green. On the
				8	GPU bots, tests run on physical hardware with real GPUs, not in VMs like the
				9	majority of the bots on the Chromium waterfall.
				10
				11	[slides]: https://docs.google.com/presentation/d/1sZjyNe2apUhwr5sinRfPs7eTzH-3zO0VQ-Cj-8DlEDQ/edit?usp=sharing
				12
				13	[TOC]
				14
				15	## Fleet Status
				16
				17	The following links (sorry, Google employees only) show the status of various
				18	GPU bots in the fleet.
				19
				20	Primary configurations:
				21
				22	* [Windows 10 Quadro P400 Pool](http://shortn/_dmtaFfY2Jq)
				23	* [Windows 10 Intel HD 630 Pool](http://shortn/_QsoGIGIFYd)
				24	* [Linux Quadro P400 Pool](http://shortn/_fNgNs1uROQ)
				25	* [Linux Intel HD 630 Pool](http://shortn/_dqEGjCGMHT)
				26	* [Mac AMD Retina 10.12.6 GPU Pool](http://shortn/_BcrVmfRoSo)
				27	* [Mac Mini Chrome Pool](http://shortn/_Ru8NESapPM)
				28	* [Android Nexus 5X Chrome Pool](http://shortn/_G3j7AVmuNR)
				29
				30	Secondary configurations:
				31
				32	* [Windows 7 Quadro P400 Pool](http://shortn/_cuxSKC15UX)
				33	* [Windows AMD R7 240 GPU Pool](http://shortn/_XET7RTMHQm)
				34	* [Mac NVIDIA Retina 10.12.6 GPU Pool](http://shortn/_jQWG7W71Ek)
				35
				36	## GPU Bots' Waterfalls
				37
				38	The waterfalls work much like any other; see the [Tour of the Chromium Buildbot
				39	Waterfall] for a more detailed explanation of how this is laid out. We have
				40	finer-grained configurations because the GPU matters, not just the OS and Release
				41	vs. Debug. Hence we have Windows NVIDIA Release bots, Mac Intel Debug bots, and
				42	so on. The waterfalls we’re interested in are:
				43
				44	* [Chromium GPU]
				45	* Various operating systems, configurations, GPUs, etc.
				46	* [Chromium GPU FYI]
				47	* These bots run less-standard configurations like Windows with AMD GPUs,
				48	Linux with Intel GPUs, etc.
				49	* These bots build with top of tree ANGLE rather than the `DEPS` version.
				50	* The [ANGLE tryservers] help ensure that these bots stay green. However,
				51	it is possible that due to ANGLE changes these bots may be red while
				52	the chromium.gpu bots are green.
				53	* The [ANGLE Wrangler] is on-call to help resolve ANGLE-related breakage
				54	on this waterfall.
				55	* To determine if a different ANGLE revision was used between two builds,
				56	compare the `got_angle_revision` buildbot property on the GPU builders
				57	or `parent_got_angle_revision` on the testers. This revision can be
				58	used to do a `git log` in the `third_party/angle` repository.
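
The ANGLE revision comparison described above can be sketched in Python. This is a minimal illustration, not part of the bots' tooling: the helper names `angle_log_command` and `angle_changes` are hypothetical, and the two revisions are assumed to be the `got_angle_revision` values copied by hand from the two builds being compared.

```python
import subprocess


def angle_log_command(old_rev, new_rev):
    """Build a `git log` command listing commits after old_rev up to new_rev."""
    if not old_rev or not new_rev:
        raise ValueError("both revisions are required")
    # `A..B` selects commits reachable from B but not from A.
    return ["git", "log", "--oneline", f"{old_rev}..{new_rev}"]


def angle_changes(old_rev, new_rev, angle_dir="third_party/angle"):
    """Run the command inside the ANGLE checkout and return its output."""
    result = subprocess.run(
        angle_log_command(old_rev, new_rev),
        cwd=angle_dir, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `angle_changes(old, new)` from the `src` directory of a checkout would print one line per ANGLE commit that landed between the two builds.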
				59
				60	<!-- TODO(kainino): update link when the page is migrated -->
				61	[Tour of the Chromium Buildbot Waterfall]: http://www.chromium.org/developers/testing/chromium-build-infrastructure/tour-of-the-chromium-buildbot
				62	[Chromium GPU]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console?reload=120
				63	[Chromium GPU FYI]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console?reload=120
				64	[ANGLE tryservers]: https://build.chromium.org/p/tryserver.chromium.angle/waterfall
				65	<!-- TODO(kainino): update link when the page is migrated -->
				66	[ANGLE Wrangler]: https://sites.google.com/a/chromium.org/dev/developers/how-tos/angle-wrangling
				67
				68	## Test Suites
				69
				70	The bots run several test suites. The majority of them have been migrated to
				71	the Telemetry harness, and are run within the full browser, in order to better
				72	test the code that is actually shipped. As of this writing, the tests include:
				73
				74	* Tests using the Telemetry harness:
				75	* The WebGL conformance tests: `webgl_conformance_integration_test.py`
				76	* A Google Maps test: `maps_integration_test.py`
				77	* Context loss tests: `context_lost_integration_test.py`
				78	* Depth capture tests: `depth_capture_integration_test.py`
				79	* GPU process launch tests: `gpu_process_integration_test.py`
				80	* Hardware acceleration validation tests:
				81	`hardware_accelerated_feature_integration_test.py`
				82	* Pixel tests validating the end-to-end rendering pipeline:
				83	`pixel_integration_test.py`
				84	* Stress tests of the screenshot functionality other tests use:
				85	`screenshot_sync_integration_test.py`
				86	* `angle_unittests`: see `src/gpu/gpu.gyp`
				87	* drawElements tests (on the chromium.gpu.fyi waterfall): see
				88	`src/third_party/angle/src/tests/BUILD.gn`
				89	* `gles2_conform_test` (requires internal sources): see
				90	`src/gpu/gles2_conform_support/gles2_conform_test.gyp`
				91	* `gl_tests`: see `src/gpu/BUILD.gn`
				92	* `gl_unittests`: see `src/ui/gl/BUILD.gn`
				93
				94	And more. See `src/content/test/gpu/generate_buildbot_json.py` for the
				95	complete description of bots and tests.
				96
				97	Additionally, the Release bots run:
				98
				99	* `tab_capture_end2end_tests`: see
				100	`src/chrome/browser/extensions/api/tab_capture/tab_capture_apitest.cc` and
				101	`src/chrome/browser/extensions/api/cast_streaming/cast_streaming_apitest.cc`
				102
				103	### More Details
				104
				105	More details about the bots' setup can be found on the [GPU Testing] page.
				106
				107	[GPU Testing]: https://sites.google.com/a/chromium.org/dev/developers/testing/gpu-testing
				108
				109	## Wrangling
				110
				111	### Prerequisites
				112
				113	1. Ideally a wrangler should be a Chromium committer. If you're on the GPU
				114	pixel wrangling rotation, there will be an email notifying you of the upcoming
				115	shift, and a calendar appointment.
				116	* If you aren't a committer, don't panic. It's still best for everyone on
				117	the team to become acquainted with the procedures of maintaining the
				118	GPU bots.
				119	* In this case you'll upload CLs to Gerrit to perform reverts (optionally
				120	using the new "Revert" button in the UI), and might consider using
				121	`TBR=` to speed through trivial and urgent CLs. In general, try to send
				122	all CLs through the commit queue.
				123	* Contact bajones, kainino, kbr, vmiura, zmo, or another member of the
				124	Chrome GPU team who's already a committer for help landing patches or
				125	reverts during your shift.
				126	2. Apply for [access to the bots].
				127
				128	[access to the bots]: https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1
				129
				130	### How to Keep the Bots Green
				131
				132	1. Watch for redness on the tree.
				133	1. [Sheriff-O-Matic now has support for the chromium.gpu.fyi waterfall]!
				134	1. The chromium.gpu bots are covered under Sheriff-O-Matic's [Chromium
				135	tab]. As pixel wrangler, ignore any non-GPU test failures in this tab.
				136	1. The bots are expected to be green all the time. Flakiness on these bots
				137	is neither expected nor acceptable.
				138	1. If a bot goes consistently red, it's necessary to figure out whether a
				139	recent CL caused it, or whether it's a problem with the bot or
				140	infrastructure.
				141	1. If it looks like a problem with the bot (deep problems like failing to
				142	check out the sources, the isolate server failing, etc.) notify the
				143	Chromium troopers and file a P1 bug with labels: Infra\>Labs,
				144	Infra\>Troopers and Internals\>GPU\>Testing. See the general [tree
				145	sheriffing page] for more details.
				146	1. Otherwise, examine the builds just before and after the redness was
				147	introduced, and compare the revisions they ran at to establish the
				148	regression range.
				149	1. File a bug capturing the regression range and excerpts of any
				150	associated logs. Regressions should be marked P1. CC engineers who you
				151	think may be able to help triage the issue. Keep in mind that the logs
				152	on the bots expire after a few days, so make sure to add copies of
				153	relevant logs to the bug report.
				154	1. Use the `Hotlist=PixelWrangler` label to mark bugs that require the
				155	pixel wrangler's attention, so it's easy to find relevant bugs when
				156	handing off shifts.
				157	1. Study the regression range carefully. Use drover to revert any CLs
				158	which break the chromium.gpu bots. Use your judgment about
				159	chromium.gpu.fyi, since not all bots are covered by trybots. In the
				160	revert message, provide a clear description of what broke, links to
				161	failing builds, and excerpts of the failure logs, because the build
				162	logs expire after a few days.
				163	1. Make sure the bots are running jobs.
				164	1. Keep an eye on the console views of the various bots.
				165	1. Make sure the bots are all actively processing jobs. If they go offline
				166	for a long period of time, the "summary bubble" at the top may still be
				167	green, but the column in the console view will be gray.
				168	1. Email the Chromium troopers if you find a bot that's not processing
				169	jobs.
				170	1. Make sure the GPU try servers are in good health.
				171	1. The GPU try servers are no longer distinct bots on a separate
				172	waterfall, but instead run as part of the regular tryjobs on the
				173	Chromium waterfalls. The GPU tests run as part of the following
				174	tryservers' jobs:
				175	1. <code>[linux_chromium_rel_ng]</code> on the [luci.chromium.try]
				176	waterfall
				177	<!-- TODO(kainino): update link to luci.chromium.try -->
				178	1. <code>[mac_chromium_rel_ng]</code> on the [tryserver.chromium.mac]
				179	waterfall
				180	<!-- TODO(kainino): update link to luci.chromium.try -->
				181	1. <code>[win7_chromium_rel_ng]</code> on the [tryserver.chromium.win]
				182	waterfall
				183	1. The best tool to use to quickly find flakiness on the tryservers is the
				184	new [Chromium Try Flakes] tool. Look for the names of GPU tests (like
				185	maps_pixel_test) as well as the test machines (e.g.
				186	mac_chromium_rel_ng). If you see a flaky test, file a bug like [this
				187	one](http://crbug.com/444430). Also look for compile flakes that may
				188	indicate that a bot needs to be clobbered. Contact the Chromium
				189	sheriffs or troopers if so.
				190	1. Glance at these trybots from time to time and see if any GPU tests are
				191	failing frequently. Note that test failures are expected on
				192	these bots: individuals' patches may fail to apply, fail to compile, or
				193	break various tests. Look specifically for patterns in the failures. It
				194	isn't necessary to spend a lot of time investigating each individual
				195	failure. (Use the "Show: 200" link at the bottom of the page to see
				196	more history.)
				197	1. If the same set of tests is failing repeatedly, look at the individual
				198	runs. Examine the swarming results and see whether they're all running
				199	on the same machine. (This is the "Bot assigned to task" when clicking
				200	any of the test's shards in the build logs.) If they are, something
				201	might be wrong with the hardware. Use the [Swarming Server Stats] tool
				202	to drill down into the specific builder.
				203	1. If you see the same test failing in a flaky manner across multiple
				204	machines and multiple CLs, it's crucial to investigate why it's
				205	happening. [crbug.com/395914](http://crbug.com/395914) was one example
				206	of an innocent-looking Blink change which made it through the commit
				207	queue and introduced widespread flakiness in a range of GPU tests. The
				208	failures were also most visible on the try servers as opposed to the
				209	main waterfalls.
				210	1. Check if any pixel test failures are actual failures or need to be
				211	rebaselined.
				212	1. For a given build failing the pixel tests, click the "stdio" link of
				213	the "pixel" step.
				214	1. The output will contain a link of the form
				215	<http://chromium-browser-gpu-tests.commondatastorage.googleapis.com/view_test_results.html?242523_Linux_Release_Intel__telemetry>
				216	1. Visit the link to see whether the generated or reference images look
				217	incorrect.
				218	1. All of the reference images for all of the bots are stored in cloud
				219	storage under [chromium-gpu-archive/reference-images]. They are indexed
				220	by version number, OS, GPU vendor, GPU device, and whether or not
				221	antialiasing is enabled in that configuration. You can download the
				222	reference images individually to examine them in detail.
				223	1. Rebaseline pixel test reference images if necessary.
				224	1. Follow the [instructions on the GPU testing page].
				225	1. Alternatively, if absolutely necessary, you can use the [Chrome
				226	Internal GPU Pixel Wrangling Instructions] to delete just the broken
				227	reference images for a particular configuration.
				228	1. Update Telemetry-based test expectations if necessary.
				229	1. Most of the GPU tests are run inside a full Chromium browser, launched
				230	by Telemetry, rather than a Gtest harness. The tests and their
				231	expectations are contained in [src/content/test/gpu/gpu_tests/]. See
				232	for example <code>[webgl_conformance_expectations.py]</code>,
				233	<code>[gpu_process_expectations.py]</code> and
				234	<code>[pixel_expectations.py]</code>.
				235	1. See the header of the file for a list of modifiers to specify a bot
				236	configuration. It is possible to specify OS (down to a specific
				237	version, say, Windows 7 or Mountain Lion), GPU vendor
				238	(NVIDIA/AMD/Intel), and a specific GPU device.
				239	1. The key is to maintain the highest coverage: if you have to disable a
				240	test, disable it only on the specific configurations it's failing. Note
				241	that it is not possible to discern between Debug and Release
				242	configurations.
				243	1. Mark tests failing or skipped, which will suppress flaky failures, only
				244	as a last resort. It is only really necessary to suppress failures that
				245	are showing up on the GPU tryservers, since failing tests no longer
				246	close the Chromium tree.
				247	1. Please read the section on [stamping out flakiness] for motivation on
				248	how important it is to eliminate flakiness rather than hiding it.
				249	1. For the remaining Gtest-style tests, use the [`DISABLED_`
				250	modifier][gtest-DISABLED] to suppress any failures if necessary.
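
The step above about examining the builds before and after the redness can be sketched as a small helper. This is only an illustration under stated assumptions: the `Build` record and its fields are hypothetical, and the build history is assumed to have been copied by hand from the console view rather than fetched from any real API.

```python
from typing import NamedTuple


class Build(NamedTuple):
    number: int    # build number on the bot (hypothetical field)
    revision: str  # revision the build ran at
    green: bool    # True if the build passed


def regression_range(builds):
    """Return (last_good_revision, first_bad_revision) for the current red
    streak, or None if the newest build is green or no green build exists."""
    ordered = sorted(builds, key=lambda b: b.number)
    if not ordered or ordered[-1].green:
        return None
    first_bad = None
    # Walk back from the newest build to find where the red streak begins.
    for build in reversed(ordered):
        if build.green:
            return (build.revision, first_bad.revision)
        first_bad = build
    return None
```

The returned pair is the range worth recording in the bug report, together with log excerpts, since the build logs expire after a few days.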
				251
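The check for repeated failures landing on a single machine, described in the tryserver steps above, can be sketched as follows. The function name, the threshold, and the bot names are hypothetical; the input is assumed to be a list of `(test_name, bot_id)` pairs collected by hand from the "Bot assigned to task" field of each failing shard.

```python
def common_failing_bot(failures, min_failures=2):
    """Return the bot id if every recorded failure ran on the same bot.

    Returns None when there are too few failures to draw a conclusion or
    when the failures are spread across more than one bot.
    """
    bots = {bot for _, bot in failures}
    if len(failures) >= min_failures and len(bots) == 1:
        return next(iter(bots))
    return None
```

A non-`None` result suggests looking at the hardware first; failures spread over several machines and CLs point toward a flaky test or a bad change instead.
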
				252	[Sheriff-O-Matic now has support for the chromium.gpu.fyi waterfall]: https://sheriff-o-matic.appspot.com/chromium.gpu.fyi
				253	[Chromium tab]: https://sheriff-o-matic.appspot.com/chromium
				254	[tree sheriffing page]: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs
				255	[linux_chromium_rel_ng]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng
				256	[luci.chromium.try]: https://ci.chromium.org/p/chromium/g/luci.chromium.try/builders
				257	[mac_chromium_rel_ng]: https://ci.chromium.org/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/
				258	[tryserver.chromium.mac]: https://ci.chromium.org/p/chromium/g/tryserver.chromium.mac/builders
				259	[win7_chromium_rel_ng]: https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/
				260	[tryserver.chromium.win]: https://ci.chromium.org/p/chromium/g/tryserver.chromium.win/builders
				261	[Chromium Try Flakes]: http://chromium-try-flakes.appspot.com/
				262	<!-- TODO(kainino): link doesn't work, but is still included from chromium-swarm homepage so not removing it now -->
				263	[Swarming Server Stats]: https://chromium-swarm.appspot.com/stats
				264	[chromium-gpu-archive/reference-images]: https://console.developers.google.com/storage/chromium-gpu-archive/reference-images
				265	[instructions on the GPU testing page]: https://sites.google.com/a/chromium.org/dev/developers/testing/gpu-testing#TOC-Updating-and-Adding-New-Pixel-Tests-to-the-GPU-Bots
				266	[Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions
				267	[src/content/test/gpu/gpu_tests/]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/
				268	[webgl_conformance_expectations.py]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/webgl_conformance_expectations.py
				269	[gpu_process_expectations.py]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/gpu_process_expectations.py
				270	[pixel_expectations.py]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/pixel_expectations.py
				271	[stamping out flakiness]: gpu_testing.md#Stamping-out-Flakiness
				272	[gtest-DISABLED]: https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#temporarily-disabling-tests
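
The advice above to suppress a test only on the specific configurations where it fails can be illustrated with a simplified model. This is not the real expectations API from the files in `gpu_tests/`; the function, the dictionary keys, and the values are hypothetical. It only shows why a narrower condition keeps coverage on every other bot.

```python
def expectation_applies(conditions, bot):
    """Return True if the bot matches every condition category.

    `conditions` maps a category (e.g. "os", "gpu_vendor") to the set of
    values the suppression covers; a category that is absent matches any bot.
    There is deliberately no Debug/Release category, since the expectations
    cannot distinguish between those configurations.
    """
    return all(bot.get(category) in allowed
               for category, allowed in conditions.items())


win7_nvidia_bot = {"os": "win7", "gpu_vendor": "nvidia"}
linux_intel_bot = {"os": "linux", "gpu_vendor": "intel"}

# A narrow suppression only affects the failing configuration.
narrow = {"os": {"win7"}, "gpu_vendor": {"nvidia"}}
# An empty condition set suppresses the test everywhere and loses coverage.
broad = {}
```

With these definitions, `narrow` applies to `win7_nvidia_bot` but not to `linux_intel_bot`, while `broad` applies to both.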
				273
				274	### When Bots Misbehave (SSHing into a bot)
				275
				276	1. See the [Chrome Internal GPU Pixel Wrangling Instructions] for information
				277	on ssh'ing in to the GPU bots.
				278
				279	[Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions
				280
				281	### Reproducing WebGL conformance test failures locally
				282
				283	1. From the buildbot build output page, click on the failed shard to get to
				284	the swarming task page. Scroll to the bottom of the left panel for a
				285	command to run the task locally. This will automatically download the build
				286	and any other inputs needed.
				287	2. Alternatively, to run the test on a local build, pass the arguments
				288	`--browser=exact --browser-executable=/path/to/binary` to
				289	`content/test/gpu/run_gpu_integration_test.py`.
				290	Also see the [telemetry documentation].
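
The second option above, running against a local build, can be sketched as a small command builder. The flags and script path come from the text; the suite name `webgl_conformance` passed as the first argument and the helper name `local_test_command` are assumptions made for illustration, so check the script's `--help` output before relying on them.

```python
import sys


def local_test_command(browser_executable,
                       suite="webgl_conformance",
                       script="content/test/gpu/run_gpu_integration_test.py"):
    """Build the argv for running a GPU integration suite on a local binary."""
    if not browser_executable:
        raise ValueError("a path to the browser binary is required")
    return [
        sys.executable,
        script,
        suite,  # assumed positional suite name
        "--browser=exact",
        f"--browser-executable={browser_executable}",
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the `src` directory of a checkout.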
				291
				292	[telemetry documentation]: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/docs/run_benchmarks_locally.md
				293
				294	## Extending the GPU Pixel Wrangling Rotation
				295
				296	See the [Chrome Internal GPU Pixel Wrangling Instructions] for information on extending the rotation.
				297
				298	[Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions