Performance/Runtime Hardware Testing

From MozillaWiki

Proposal

We propose a system for testing the rendering capabilities of the user's hardware before enabling features that depend on that hardware. The system will supplement our existing blocklists for incompatible hardware/driver versions. It will initially target basic rendering features like video decoding and layer compositing, but is designed to grow to cover more features over time.

Background

We've had a number of bugs reach the release channel because of incompatibilities with hardware/driver combinations that we were unable to test before shipping. These bugs are not due to poor software development practices; they arise because the matrix of hardware/driver/feature combinations is effectively unbounded. The proposal is to test hardware features at runtime, check for successful rendering, and enable the feature for the user only if these tests pass. We will collect telemetry from the runtime tests about incompatible hardware/driver versions and use that data to generate the blocklists. We will also monitor the data and design a Quality Program for responding to it.

Details

We run a sentinel function at startup that detects if the local environment can render the feature. If the sentinel function doesn't complete, then we assume a crash and disable the feature. If the sentinel function completes, but fails to match a known-good rendering, we assume hardware incompatibility and disable the feature or fall back to software. The sentinel function runs again after each Firefox update.
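
The crash-detection logic above can be sketched as follows. This is a minimal illustration in Python, not the proposed implementation: the state file, the state names, and the functions `load_state`, `save_state`, and `feature_enabled` are all hypothetical. The idea is to record that the test has started before touching the hardware, so that a run that never finishes leaves a marker behind for the next startup to find. Clearing the state after a Firefox update, which would trigger a re-run, is omitted here.

```python
import json

# Hypothetical state values; the proposal does not name them.
IN_PROGRESS = "in_progress"
PASSED = "passed"
FAILED = "failed"


def load_state(path):
    """Return the recorded sentinel state, or None if nothing usable is recorded."""
    try:
        with open(path) as f:
            return json.load(f).get("state")
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def save_state(path, state):
    """Persist the sentinel state so it survives a crash of the process."""
    with open(path, "w") as f:
        json.dump({"state": state}, f)


def feature_enabled(path, render_test):
    """Decide whether to enable a feature, running the sentinel if needed.

    render_test is a callable (an assumption of this sketch) that returns
    True when the rendered output matches the known-good rendering.
    """
    state = load_state(path)
    if state == IN_PROGRESS:
        # The previous run started the test but never finished: assume a crash.
        save_state(path, FAILED)
        return False
    if state in (PASSED, FAILED):
        # A result is already recorded; reuse it instead of re-testing.
        return state == PASSED
    # Record that the test has started *before* exercising the hardware.
    save_state(path, IN_PROGRESS)
    ok = render_test()
    save_state(path, PASSED if ok else FAILED)
    return ok
```

If `render_test` takes the process down, the file still says `in_progress` at the next startup, and the feature is disabled without running the test again.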

The system is designed so that additional startup tests can be added in the future. The user may be prompted for permission to send telemetry when we detect failures (data collected may include hardware ID, driver version, OpenGL extensions, and other system info). In the future, we may add longer-running tests (e.g., performance measurements for compute-intensive features like DEAA antialiasing) that the user can opt in to via a UI. A goal is to be able to modify the system and tests using only code that's available in chrome-JS, as doing so enables hotfixes via our add-on distribution mechanisms.

We use all-channel telemetry to detect GPU+driver combinations that are neither whitelisted nor blacklisted but have significant numbers of users, and focus QA on those combinations.
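
As a rough illustration of that triage step, the sketch below counts telemetry reports per GPU+driver combination and returns the combinations that appear on neither list and meet a user threshold. The report shape (a dict with `gpu` and `driver` keys, one report per user), the function name `untriaged_combinations`, and the `min_users` threshold are assumptions made for this example; the proposal does not define what counts as a significant number of users.

```python
from collections import Counter


def untriaged_combinations(reports, whitelist, blacklist, min_users):
    """Return (combination, user_count) pairs that QA should look at first.

    reports   -- iterable of dicts with "gpu" and "driver" keys (hypothetical shape)
    whitelist -- set of (gpu, driver) tuples known to work
    blacklist -- set of (gpu, driver) tuples known to fail
    min_users -- smallest user count treated as significant (an assumption)
    """
    counts = Counter((r["gpu"], r["driver"]) for r in reports)
    candidates = [
        (combo, n)
        for combo, n in counts.items()
        if combo not in whitelist and combo not in blacklist and n >= min_users
    ]
    # Most-used combinations first; ties broken by name for stable output.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))
```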

Although we're starting with tests to prevent catastrophic failure (e.g., no graphics or video), the goal is to build out a generalized and consistent user experience for enabling hardware-accelerated and hardware-enabled features. This system can include crash reporting, telemetry, blacklists and whitelists, tracking of configurations with alerting under some conditions, communication with the user about what’s wrong, requests to the user for help, community engagement, etc. For this to succeed, we want to manage it as a long-term Quality Program and not as a band-aid for a particular problem.

The proposal is to start on Windows and test video decoding and layer compositing at app startup, after Gecko is loaded. We record the current GPU driver version and re-run the tests if the driver changes. We also re-run the tests with each Firefox update. The goal is to run two tests that together complete in under 500ms:

Test 1: We composite a reference bitmap using the hardware-accelerated code path and read back the composited result from the hardware. We then compare the result with the original reference bitmap. We enable hardware compositing on success and report telemetry on failure.

Test 2: If hardware compositing was successful in Test 1, we decode a single frame of a sample video using the hardware decoder, composite the resulting frame, and compare the result with a reference bitmap. We enable hardware decoding on success and report telemetry on failure.
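
The dependency between the two tests can be sketched as below. This is an illustration under stated assumptions rather than the planned implementation: bitmaps are modelled as flat sequences of channel values, `composite`, `decode_frame`, and `report` are hypothetical callables standing in for the platform code and telemetry, and the `tolerance` parameter is an addition of this sketch (the proposal does not say whether the comparison is exact; the default of 0 means an exact match).

```python
def bitmaps_match(result, reference, tolerance=0):
    """Compare two bitmaps given as equal-length sequences of channel values."""
    if len(result) != len(reference):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(result, reference))


def run_startup_tests(composite, decode_frame, reference, video_reference, report):
    """Run Test 1, then Test 2 only if Test 1 passed.

    composite(bitmap) -- hypothetical: composites on hardware and reads back
    decode_frame()    -- hypothetical: hardware-decodes one sample video frame
    report(name)      -- hypothetical: sends failure telemetry for a test
    """
    features = {"hw_compositing": False, "hw_decoding": False}

    # Test 1: composite the reference bitmap and compare the readback.
    if not bitmaps_match(composite(reference), reference):
        report("compositing")
        return features  # Test 2 depends on compositing, so stop here.
    features["hw_compositing"] = True

    # Test 2: decode one frame, composite it, and compare with its reference.
    frame = composite(decode_frame())
    if bitmaps_match(frame, video_reference):
        features["hw_decoding"] = True
    else:
        report("decoding")
    return features
```

Returning early after a Test 1 failure mirrors the text: hardware decoding is never enabled on a system where hardware compositing could not be verified.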

  • Open Issue: Should we also report telemetry on success? Perhaps if both a blacklist and whitelist were maintained, we could populate the whitelist with versions that passed our tests.
  • Open Issue: How should we deal with systems that may work correctly on one GPU but fail on another (e.g., laptops that have both a low-power integrated GPU and a separate discrete GPU and can switch between them at runtime)?
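
The re-run rule stated before the two tests (re-test when the GPU driver changes or Firefox is updated) amounts to comparing the recorded environment with the current one. A minimal sketch, in which the `Environment` record and the `needs_retest` function are hypothetical names and the version strings are placeholders:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """What was recorded the last time the tests ran (hypothetical record)."""
    driver_version: str
    app_version: str


def needs_retest(recorded: Optional[Environment], current: Environment) -> bool:
    """Return True if the startup tests should run again."""
    if recorded is None:
        # Nothing recorded yet: the tests have never run on this profile.
        return True
    # Any change in driver or application version invalidates the old result.
    return recorded != current
```

For example, `needs_retest(Environment("1.0", "40"), Environment("1.1", "40"))` is true because the driver version changed. This sketch does not address the dual-GPU open issue above, since it records a single driver version.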

Next Steps

Matt Woodrow is prototyping an initial implementation of the Platform pieces for Windows. We'll have a meeting next week to discuss the various pieces (UX, Telemetry, Program Management, etc.) and find owners for each. Please attend if you're interested in contributing to this effort.