Blame - docs/gpu/pixel_wrangling.md - chromium/src.git

# GPU Bots & Pixel Wrangling

![](images/wrangler.png)

(December 2017: presentation on GPU bots and pixel wrangling: see [slides].)

GPU Pixel Wrangling is the process of keeping various GPU bots green. On the
GPU bots, tests run on physical hardware with real GPUs, not in VMs like the
majority of the bots on the Chromium waterfall.

[slides]: https://docs.google.com/presentation/d/1sZjyNe2apUhwr5sinRfPs7eTzH-3zO0VQ-Cj-8DlEDQ/edit?usp=sharing

[TOC]

## Fleet Status

* [Chrome GPU Fleet Status](http://vi/chrome-infra/Projects/gpu)

(Sorry, this link is Google internal only.)

These graphs show 1 day of activity by default. The drop-down boxes at the top
allow viewing of longer durations.

See [this CL](http://cl/238562533) for an example of how to update these graphs.

## GPU Bots' Waterfalls

The waterfalls work much like any other; see the [Tour of the Chromium Buildbot
Waterfall] for a more detailed explanation of how this is laid out. We have
more subtle configurations because the GPU matters, not just the OS and Release
vs. Debug. Hence we have Windows Nvidia Release bots, Mac Intel Debug bots, and
so on. The waterfalls we’re interested in are:

* [Chromium GPU]
    * Various operating systems, configurations, GPUs, etc.
* [Chromium GPU FYI]
    * These bots run less-standard configurations like Windows with AMD GPUs,
      Linux with Intel GPUs, etc.
    * These bots build with top of tree ANGLE rather than the `DEPS` version.
    * The [ANGLE tryservers] help ensure that these bots stay green. However,
      ANGLE changes may still turn these bots red while the chromium.gpu
      bots are green.
    * The [ANGLE Wrangler] is on-call to help resolve ANGLE-related breakage
      on this waterfall.
    * To determine if a different ANGLE revision was used between two builds,
      compare the `got_angle_revision` buildbot property on the GPU builders
      or `parent_got_angle_revision` on the testers. This revision can be
      used to do a `git log` in the `third_party/angle` repository.
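
As a rough illustration of the last bullet, the two revisions can be turned
into a `git log` range. The sketch below only builds the command line; the
helper name `angle_log_command`, the default checkout path, and the placeholder
revisions are illustrative assumptions, not part of the bots' tooling.

```python
from typing import List


def angle_log_command(old_rev: str, new_rev: str,
                      angle_dir: str = "third_party/angle") -> List[str]:
    """Build a `git log` command listing commits after old_rev up to new_rev.

    old_rev and new_rev stand for the `got_angle_revision` values read by hand
    from two builds; this hypothetical helper does not query the bots.
    """
    if not old_rev or not new_rev:
        raise ValueError("both revisions are required")
    # `git -C <dir>` runs git as if it had been started in <dir>.
    return ["git", "-C", angle_dir, "log", "--oneline", f"{old_rev}..{new_rev}"]


if __name__ == "__main__":
    # Placeholder revisions; substitute the values from the two builds.
    print(" ".join(angle_log_command("OLD_REVISION", "NEW_REVISION")))
```

If the command prints nothing, both builds used the same ANGLE revision (or the
revisions were passed in the wrong order).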

<!-- TODO(kainino): update link when the page is migrated -->
[Tour of the Chromium Buildbot Waterfall]: http://www.chromium.org/developers/testing/chromium-build-infrastructure/tour-of-the-chromium-buildbot
[Chromium GPU]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console?reload=120
[Chromium GPU FYI]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console?reload=120
[ANGLE tryservers]: https://build.chromium.org/p/tryserver.chromium.angle/waterfall
[ANGLE Wrangler]: https://chromium.googlesource.com/angle/angle/+/master/infra/ANGLEWrangling.md

## Test Suites

The bots run several test suites. The majority of them have been migrated to
the Telemetry harness, and are run within the full browser, in order to better
test the code that is actually shipped. As of this writing, the tests include:

* Tests using the Telemetry harness:
    * The WebGL conformance tests: `webgl_conformance_integration_test.py`
    * A Google Maps test: `maps_integration_test.py`
    * Context loss tests: `context_lost_integration_test.py`
    * Depth capture tests: `depth_capture_integration_test.py`
    * GPU process launch tests: `gpu_process_integration_test.py`
    * Hardware acceleration validation tests:
      `hardware_accelerated_feature_integration_test.py`
    * Pixel tests validating the end-to-end rendering pipeline:
      `pixel_integration_test.py`
    * Stress tests of the screenshot functionality other tests use:
      `screenshot_sync_integration_test.py`
* `angle_unittests`: see `src/third_party/angle/src/tests/BUILD.gn`
* drawElements tests (on the chromium.gpu.fyi waterfall): see
  `src/third_party/angle/src/tests/BUILD.gn`
* `gles2_conform_test` (requires internal sources): see
  `src/gpu/gles2_conform_support/BUILD.gn`
* `gl_tests`: see `src/gpu/BUILD.gn`
* `gl_unittests`: see `src/ui/gl/BUILD.gn`
* `rendering_representative_perf_tests` (on the chromium.gpu.fyi waterfall):
  see `src/chrome/test/BUILD.gn`

And more. See
[`src/testing/buildbot/README.md`](../../testing/buildbot/README.md)
and the GPU sections of `test_suites.pyl` and `waterfalls.pyl` for the
complete description of bots and tests.

Additionally, the Release bots run:

* `tab_capture_end2end_tests`: see
  `src/chrome/browser/extensions/api/tab_capture/tab_capture_apitest.cc` and
  `src/chrome/browser/extensions/api/cast_streaming/cast_streaming_apitest.cc`

### More Details

More details about the bots' setup can be found on the [GPU Testing] page.

[GPU Testing]: https://sites.google.com/a/chromium.org/dev/developers/testing/gpu-testing

## Wrangling

### Prerequisites

1. Ideally a wrangler should be a Chromium committer. If you're on the GPU
   pixel wrangling rotation, there will be an email notifying you of the upcoming
   shift, and a calendar appointment.
    * If you aren't a committer, don't panic. It's still best for everyone on
      the team to become acquainted with the procedures of maintaining the
      GPU bots.
    * In this case you'll upload CLs to Gerrit to perform reverts (optionally
      using the new "Revert" button in the UI), and might consider using
      `TBR=` to speed through trivial and urgent CLs. In general, try to send
      all CLs through the commit queue.
    * Contact bajones, kainino, kbr, vmiura, zmo, or another member of the
      Chrome GPU team who's already a committer for help landing patches or
      reverts during your shift.
1. Apply for [access to the bots].
1. You may want to install the [Flake linker] extension, which adds several useful features to the bot build log pages.
    * Links to Chromium flakiness dashboard from build result pages, so you can see all failures for a single test across the fleet.
    * Automatically hides green build steps so you can see the failure immediately.
    * Turns build log links into deep links directly to the failure line in the log.

[access to the bots]: https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1
[Flake linker]: https://chrome.google.com/webstore/detail/flake-linker/boamnmbgmfnobomddmenbaicodgglkhc

### How to Keep the Bots Green

1. Watch for redness on the tree.
    1. [Sheriff-O-Matic now has support for the chromium.gpu.fyi waterfall]!
    1. The chromium.gpu bots are covered under Sheriff-O-Matic's [Chromium
       tab]. As pixel wrangler, ignore any non-GPU test failures in this tab.
    1. The bots are expected to be green all the time. Flakiness on these bots
       is neither expected nor acceptable.
    1. If a bot goes consistently red, it's necessary to figure out whether a
       recent CL caused it, or whether it's a problem with the bot or
       infrastructure.
    1. If it looks like a problem with the bot (deep problems like failing to
       check out the sources, the isolate server failing, etc.) notify the
       Chromium troopers and file a P1 bug with labels: Infra\>Labs,
       Infra\>Troopers and Internals\>GPU\>Testing. See the general [tree
       sheriffing page] for more details.
    1. Otherwise, examine the builds just before and after the redness was
       introduced, and compare the revisions they contain.
    1. File a bug capturing the regression range and excerpts of any
       associated logs. Regressions should be marked P1. CC engineers who you
       think may be able to help triage the issue. Keep in mind that the logs
       on the bots expire after a few days, so make sure to add copies of
       relevant logs to the bug report.
    1. Use the `Hotlist=PixelWrangler` label to mark bugs that require the
       pixel wrangler's attention, so it's easy to find relevant bugs when
       handing off shifts.
    1. Study the regression range carefully. Use drover to revert any CLs
       which break the chromium.gpu bots. Use your judgment about
       chromium.gpu.fyi, since not all bots are covered by trybots. In the
       revert message, provide a clear description of what broke, links to
       failing builds, and excerpts of the failure logs, because the build
       logs expire after a few days.
1. Make sure the bots are running jobs.
    1. Keep an eye on the console views of the various bots.
    1. Make sure the bots are all actively processing jobs. If they go offline
       for a long period of time, the "summary bubble" at the top may still be
       green, but the column in the console view will be gray.
    1. Email the Chromium troopers if you find a bot that's not processing
       jobs.
1. Make sure the GPU try servers are in good health.
    1. The GPU try servers are no longer distinct bots on a separate
       waterfall, but instead run as part of the regular tryjobs on the
       Chromium waterfalls. The GPU tests run as part of the following
       tryservers' jobs:
        1. [`linux-rel`][linux-rel] on the [luci.chromium.try] waterfall
        1. [`mac-rel`][mac-rel] on the [luci.chromium.try] waterfall
        1. [`win7-rel`][win7-rel] on the [luci.chromium.try] waterfall
    1. The best tool to use to quickly find flakiness on the tryservers is the
       new [Chromium Try Flakes] tool. Look for the names of GPU tests (like
       `maps_pixel_test`) as well as the test machines (e.g. `mac-rel`). If you
       see a flaky test, file a bug like [this one](http://crbug.com/444430).
       Also look for compile flakes that may indicate that a bot needs to be
       clobbered. Contact the Chromium sheriffs or troopers if so.
    1. Glance at these trybots from time to time and see if any GPU tests are
       failing frequently. Note that test failures are expected on
       these bots: individuals' patches may fail to apply, fail to compile, or
       break various tests. Look specifically for patterns in the failures. It
       isn't necessary to spend a lot of time investigating each individual
       failure. (Use the "Show: 200" link at the bottom of the page to see
       more history.)
    1. If the same set of tests is failing repeatedly, look at the individual
       runs. Examine the swarming results and see whether they're all running
       on the same machine. (This is the "Bot assigned to task" when clicking
       any of the test's shards in the build logs.) If they are, something
       might be wrong with the hardware. Use the [Swarming Server Stats] tool
       to drill down into the specific builder.
    1. If you see the same test failing in a flaky manner across multiple
       machines and multiple CLs, it's crucial to investigate why it's
       happening. [crbug.com/395914](http://crbug.com/395914) was one example
       of an innocent-looking Blink change which made it through the commit
       queue and introduced widespread flakiness in a range of GPU tests. The
       failures were also most visible on the try servers as opposed to the
       main waterfalls.
1. Check if any pixel test failures are actual failures or need to be
   rebaselined.
    1. For a given build failing the pixel tests, look for either:
        1. One or more links named `gold_triage_link for <test name>`. This will
           be the case if there are fewer than 10 links. If the test was run on
           a trybot, the link will instead be named
           `triage_link_for_entire_cl for <test name>` (the odd naming comes
           from how the recipe processes and displays links).
        1. A single link named
           `Too many artifacts produced to link individually, click for links`.
           This will be the case if there are 10 or more links.
    1. In either case, follow the link(s) to the triage page for the image the
       failing test produced.
        1. If the test was run on a trybot, all the links will point to the same
           page, which will be the triage page for every untriaged image
           produced by the CL being tested.
    1. Ensure you are signed in to the Gold server the links take you to (both
       @google.com and @chromium.org accounts work).
    1. Triage images on those pages (typically by approving them, but you can
       mark them as negative if it is an image that should not be produced). In
       the case of a negative image, a bug should be filed on
       [crbug](https://crbug.com) to investigate and fix the cause of that
       particular image being produced, as future occurrences of it will cause
       the test to fail. Such bugs should include the `Internals>GPU>Testing`
       component and whatever component is suitable for the type of failing
       test (likely `Blink>WebGL` or `Blink>Canvas`). The test should also be
       marked as failing or skipped (see the item below on updating the
       Telemetry-based test expectations) so that the test failure doesn't show
       up as a builder failure. If the failure is consistent, prefer skipping
       the test over marking it as failing so that the failure links don't
       pile up. If the failure occurs on the trybots, include the change to
       the expectations in your CL.
    1. Additional, less common triage steps for the pixel tests can be found in
       [this section][gold less common failures] of the GPU Gold documentation.
1. Update Telemetry-based test expectations if necessary.
    1. Most of the GPU tests are run inside a full Chromium browser, launched
       by Telemetry, rather than a Gtest harness. The tests and their
       expectations are contained in [src/content/test/gpu/gpu_tests/test_expectations]. See
       for example <code>[webgl_conformance_expectations.txt]</code>,
       <code>[gpu_process_expectations.txt]</code>,
       <code>[pixel_expectations.txt]</code> and
       [rendering_representative_perf_tests].
    1. See the header of the file for a list of modifiers to specify a bot
       configuration. It is possible to specify OS (down to a specific
       version, say, Windows 7 or Mountain Lion), GPU vendor
       (NVIDIA/AMD/Intel), and a specific GPU device.
    1. The key is to maintain the highest coverage: if you have to disable a
       test, disable it only on the specific configurations it's failing. Note
       that it is not possible to discern between Debug and Release
       configurations.
    1. Mark tests failing or skipped, which will suppress flaky failures, only
       as a last resort. It is only really necessary to suppress failures that
       are showing up on the GPU tryservers, since failing tests no longer
       close the Chromium tree.
    1. Please read the section on [stamping out flakiness] for motivation on
       how important it is to eliminate flakiness rather than hiding it.
    1. For the remaining Gtest-style tests, use the [`DISABLED_`
       modifier][gtest-DISABLED] to suppress any failures if necessary.
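
The regression-range step in the list above can be sketched in a few lines.
This is a minimal illustration, assuming the wrangler has copied build results
by hand from the console view into a list; the `Build` tuple and the
`regression_range` function are hypothetical names, not an existing Chromium
tool.

```python
from typing import List, Optional, Tuple

# Each build is (build_number, revision, passed). All three fields are
# illustrative and would be copied by hand from the console view.
Build = Tuple[int, str, bool]


def regression_range(builds: List[Build]) -> Optional[Tuple[str, str]]:
    """Return (last_good_revision, first_bad_revision) for the latest breakage.

    Builds must be ordered oldest to newest. Returns None if the newest build
    passed, or if no passing build precedes the trailing run of failures.
    """
    if not builds or builds[-1][2]:
        return None
    first_bad = builds[-1]
    # Walk backwards through the trailing run of failures.
    for build in reversed(builds[:-1]):
        if build[2]:
            return (build[1], first_bad[1])
        first_bad = build
    return None


if __name__ == "__main__":
    history = [(1, "r100", True), (2, "r105", True),
               (3, "r110", False), (4, "r115", False)]
    print(regression_range(history))
```

The CLs to study are those after the last good revision, up to and including
the first bad one; quoting that pair in the bug keeps the range available after
the bot logs expire.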

[Sheriff-O-Matic now has support for the chromium.gpu.fyi waterfall]: https://sheriff-o-matic.appspot.com/chromium.gpu.fyi
[Chromium tab]: https://sheriff-o-matic.appspot.com/chromium
[tree sheriffing page]: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs
[linux-rel]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-rel
[luci.chromium.try]: https://ci.chromium.org/p/chromium/g/luci.chromium.try/builders
[mac-rel]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac-rel
[tryserver.chromium.mac]: https://ci.chromium.org/p/chromium/g/tryserver.chromium.mac/builders
[win7-rel]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7-rel
[tryserver.chromium.win]: https://ci.chromium.org/p/chromium/g/tryserver.chromium.win/builders
[Chromium Try Flakes]: http://chromium-try-flakes.appspot.com/
<!-- TODO(kainino): link doesn't work, but is still included from chromium-swarm homepage so not removing it now -->
[Swarming Server Stats]: https://chromium-swarm.appspot.com/stats
[gold less common failures]: gpu_pixel_testing_with_gold.md#Triaging-Less-Common-Failures
[Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions
[src/content/test/gpu/gpu_tests/test_expectations]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/test_expectations
[webgl_conformance_expectations.txt]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/test_expectations/webgl_conformance_expectations.txt
[gpu_process_expectations.txt]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/test_expectations/gpu_process_expectations.txt
[pixel_expectations.txt]: https://chromium.googlesource.com/chromium/src/+/master/content/test/gpu/gpu_tests/test_expectations/pixel_expectations.txt
[stamping out flakiness]: gpu_testing.md#Stamping-out-Flakiness
[gtest-DISABLED]: https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#temporarily-disabling-tests
[rendering_representative_perf_tests]: ../testing/rendering_representative_perf_tests.md#Updating-Expectations
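
The same-machine check from the try server steps above can also be written
down as a small helper. This is only a sketch: it assumes the "Bot assigned to
task" values of the failing shards have been collected by hand, and
`single_failing_bot` and the bot names are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def single_failing_bot(bot_ids: Iterable[str]) -> Optional[str]:
    """Return the bot name if every failing shard ran on the same bot.

    At least two failures are required, since a single failure cannot show a
    pattern. Returns None when failures are spread over several bots.
    """
    counts = Counter(bot_ids)
    if len(counts) == 1 and sum(counts.values()) >= 2:
        return next(iter(counts))
    return None


if __name__ == "__main__":
    # Hypothetical bot names copied from three failing shards.
    print(single_failing_bot(["bot-a", "bot-a", "bot-a"]))
```

A non-`None` result points toward a hardware or machine problem; failures
spread across many bots point toward a flaky test or a bad CL instead.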

### When Bots Misbehave (SSHing into a bot)

1. See the [Chrome Internal GPU Pixel Wrangling Instructions] for information
   on ssh'ing in to the GPU bots.

[Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions

### Reproducing WebGL conformance test failures locally

1. From the buildbot build output page, click on the failed shard to get to
   the swarming task page. Scroll to the bottom of the left panel for a
   command to run the task locally. This will automatically download the build
   and any other inputs needed.
2. Alternatively, to run the test on a local build, pass the arguments
   `--browser=exact --browser-executable=/path/to/binary` to
   `content/test/gpu/run_gpu_integration_test.py`.
   Also see the [telemetry documentation].
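
For the second option, the argument list can be assembled as in the sketch
below. The script path and the two `--browser` flags come from the step above;
the helper name `local_gpu_test_command` and the positional suite name
`webgl_conformance` are assumptions for illustration, so check the script's own
help output for the exact form.

```python
from typing import List


def local_gpu_test_command(browser_executable: str,
                           suite: str = "webgl_conformance") -> List[str]:
    """Build the argument list for running a GPU integration test locally.

    The suite name is an assumed positional argument; the flags are the ones
    quoted in the instructions above.
    """
    if not browser_executable:
        raise ValueError("a path to the browser binary is required")
    return [
        "content/test/gpu/run_gpu_integration_test.py",
        suite,
        "--browser=exact",
        f"--browser-executable={browser_executable}",
    ]


if __name__ == "__main__":
    # Run the printed command with Python from the root of the src checkout.
    print(" ".join(local_gpu_test_command("/path/to/binary")))
```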

[telemetry documentation]: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/docs/run_benchmarks_locally.md

## Extending the GPU Pixel Wrangling Rotation

See the [Chrome Internal GPU Pixel Wrangling Instructions] for information on extending the rotation.

[Chrome Internal GPU Pixel Wrangling Instructions]: https://sites.google.com/a/google.com/client3d/documents/chrome-internal-gpu-pixel-wrangling-instructions