Performance/Fenix/Performance reviews

Do you want to know if your change impacts Fenix or Focus performance? If so, here are the methods you can use, in order of preference:

  1. Benchmark: use an automated test to measure the change in duration
  2. Timestamp benchmark: add temporary code and manually measure the change in duration. Practical for non-UI measurements or very simple UI measurements
  3. Profile: take a profile, identify the start and end points of your measurement, and measure the change in duration

We don't necessarily recommend the following techniques, though they have their place:

  1. Screen recording, side-by-side: take a screen recording of before and after your change, synchronize the videos, and put them side-by-side with timestamps using the perf-tools/combine-videos-side-by-side.sh script.

The trade-offs for each technique are mentioned in their respective sections.

Benchmark remotely

You can run benchmarks remotely in CI/automation now.

First, you will need to clone and set up mozilla-central or mozilla-unified locally. See these build instructions for how to get set up. When prompted for the kind of build you would like, select an Artifact build, as you won't need to build Firefox.

After (or while) building and setting up, you'll also need to get set up for pushing to try; follow this link to do so.

Once you've finished setting up, you should be able to run ./mach try perf --help. Now you're ready to test custom APKs on the try branch with the following instructions:

  • Get the path to your custom APK.
    • Ensure that you have a nightly APK build so that the activity, intents, and package name line up with existing tasks.
  • Run ./mach try perf --mozperftest-upload-apk /path/to/apk to copy the APK in-tree.
    • This will replace the APK used in mozperftest tests (e.g. startup tests).
    • Use --browsertime-upload-apk if you want to target Browsertime performance tests.
  • Commit the changes in-tree with: hg bookmark my-upload; hg commit -m "Upload APK"
  • Now you can re-run the performance selector to pick the tests you want to run, and perform the push: ./mach try perf --android --show-all
    • Search for 'perftest startup' to find the startup tests.
    • Once the try runs are pushed, you'll be provided with a PerfCompare View link that shows a before/after comparison of the performance differences.
    • You can also find all your try pushes at https://treeherder.mozilla.org/jobs?repo=try&author=YOUR_EMAIL.

Benchmark locally

A benchmark is an automated test that measures performance, usually the duration from point A to point B. Automated benchmarks have similar trade-offs to automated functionality tests when compared to one-off manual testing: they can continuously catch regressions and minimize human error. For manual benchmarks in particular, it can be tricky to be consistent about how we aggregate each test run into the results.

See the Benchmark remotely section for information about how you can run these tests in CI/automation.

To benchmark, do the following:

  1. Select a benchmark that measures your change or write a new one yourself
  2. Run the benchmark on the commit before your change
  3. Run the benchmark on the commit after your change
  4. Compare the results: generally, this means comparing the median
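
For illustration, here's a minimal Kotlin sketch of the comparison step; the durations are hypothetical, and in practice the perf-tools scripts (e.g. analyze_durations.py, described below) do this aggregation for you:

fun median(durationsMs: List<Long>): Double {
    val sorted = durationsMs.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 0) (sorted[mid - 1] + sorted[mid]) / 2.0 else sorted[mid].toDouble()
}

fun main() {
    val before = listOf(812L, 790L, 805L, 825L, 798L) // per-iteration durations in ms (hypothetical)
    val after = listOf(761L, 770L, 755L, 780L, 766L)
    println("before median = ${median(before)} ms, after median = ${median(after)} ms")
}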

We currently support the following benchmarks:

Measuring cold start up duration

To measure the cold start up duration, the approach is usually simple:

  1. From the mozilla-mobile/perf-tools repository, use measure_start_up.py.
    The arguments for start-up should include your target (Fenix or Focus).
  2. Determine the start-up path that your code affects; this could be:
    1. cold_main_first_frame: when clicking the app's homescreen icon, this is the duration from process start until the first frame drawn
    2. cold_view_nav_start: when opening the browser through an outside link (e.g. a link in gmail), this is the duration from process start until roughly Gecko's Navigation::Start event
  3. After determining the path your changes affect, these are the steps that you should follow:

Example:

  • Run measure_start_up.py located in perf-tools. Note:
    • The usual iteration count used is 25; running fewer iterations might affect the results due to noise.
    • Make sure the application you're testing is a fresh install. If testing the Main intent (which launches the browser to its homepage), make sure to clear the onboarding process before testing.
 python3 measure_start_up.py -c=25 --product=fenix nightly cold_view_nav_start results.txt

where -c refers to the iteration count. The default of 25 should be good.

  • Once you have gathered your results, you can analyze them using analyze_durations.py in perf-tools.
  python3 analyze_durations.py results.txt


NOTE: To compare before and after changes made to Fenix, repeat these steps for the code before your changes. To do so, you can check out the parent commit (i.e. using git rev-parse ${SHA}^, where ${SHA} is the first commit on the branch containing the changes).

An example of using these steps to review a PR can be found (here).

Testing non start-up changes

Testing non start-up changes is a bit different from the steps above, since the performance team doesn't currently have tools to test different parts of the browser.

  1. The first step here would be to instrument the code to take (manual timings). Comparing timings taken before and after your changes can indicate any change in performance; see the sketch after this list.
  2. Using profiles and markers.
    1. (Profiles) can be a good visual representation of performance changes. A simple way to find your code and its changes is through the call tree, the flame graph, or the stack chart. NOTE: some code may be missing from the stack because ProGuard may inline it, or because the code runs in less time than the profiler's sampling interval.
    2. Another useful tool for finding changes in performance is markers. Markers are good for showing the time elapsed between point A and point B or for pinpointing when a certain action happens.
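
For the manual-timing option (step 1 above), here's a minimal Kotlin sketch; the helper name logDuration and the wrapped function are hypothetical, not existing Fenix utilities:

import android.os.SystemClock
import android.util.Log

// Wrap the code path you care about, run the use case on builds before and
// after your change, and compare the logged values.
inline fun <T> logDuration(label: String, block: () -> T): T {
    val start = SystemClock.elapsedRealtime()
    val result = block()
    Log.e("benchmark", "$label took ${SystemClock.elapsedRealtime() - start} ms")
    return result
}

// Hypothetical usage around the code your change touches:
// val view = logDuration("inflate-history-view") { inflateHistoryView() }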

Timestamp benchmark

A timestamp benchmark is a manual test where a developer adds temporary code to log the duration they want to measure and then performs the use case on the device themselves to get the values printed. Here's a simple example:

import android.os.SystemClock
import android.util.Log

val start = SystemClock.elapsedRealtime()
thingWeWantToMeasure()
val end = SystemClock.elapsedRealtime()
Log.e("benchmark", "${end - start}") // result is in milliseconds

We recommend this approach for non-UI measurements only. Since the framework doesn't notify us when the UI is visually complete, it's challenging to instrument that point and thus accurately measure a duration that waits for the UI.

Like automated benchmarks, these tests can accurately measure what users experience. Compared to automated benchmarks, they are fairly quick to write, but they are tedious and time-consuming to carry out and leave many places to introduce errors.

Here's an outline of a typical timestamp benchmark:

  1. Decide the duration you want to measure
  2. Do the following once for the commit before your changes and once for the commit after your changes...
    1. Add code to measure the duration.
    2. Build & install a release build like Nightly or Beta (debug builds have unrepresentative perf)
    3. Do a "warm up" run first: the first run after an install will always be slower because the JIT cache isn't primed, so run it and discard the result, i.e. run your test case, wait a few seconds, force-stop the app, clear logcat, and then begin testing & measuring
    4. Run the use case several times (maybe 10 times if it's quick, 5 if it's slow). You probably want to measure "cold" performance: we assume users will generally only perform a use case a few times per process lifetime. However, the more times a code path runs during the process lifetime, the more likely it is to execute faster because it's cached. Thus, if we want to measure a use case in a way that is similar to what users experience, we must measure the first time the interaction occurs during the process. In practice, this means force-stopping the app after each execution of your use case
    5. Capture the results from logcat. If you log "average <number-in-ms>", you can use the following script to capture all the results and find the median: adb logcat -d > logcat && python3 perf-tools/analyze_durations.py logcat
  3. Compare the results, generally by comparing the median of the two runs
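
Building on the simple example above, here's a sketch of the measurement code for steps 2.1 and 2.5, logging in the "average <number-in-ms>" format that analyze_durations.py looks for; the measured function is hypothetical:

import android.os.SystemClock
import android.util.Log

fun runUseCase() { /* hypothetical: the code path being measured */ }

fun measureUseCase() {
    val start = SystemClock.elapsedRealtime()
    runUseCase()
    val durationMs = SystemClock.elapsedRealtime() - start
    // Logged in the format analyze_durations.py expects when reading logcat.
    Log.e("benchmark", "average $durationMs")
}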

Example: page load

TODO... the duration for a page load is a non-UI use case that is more complex than the very simple example provided above
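
In the meantime, here's a hedged sketch of one way to timestamp a page load, assuming direct access to a GeckoSession and its ProgressDelegate (in Fenix you would more likely instrument the engine/browser-state layer); the class name is hypothetical:

import android.os.SystemClock
import android.util.Log
import org.mozilla.geckoview.GeckoSession

class PageLoadTimer : GeckoSession.ProgressDelegate {
    private var start = 0L

    override fun onPageStart(session: GeckoSession, url: String) {
        start = SystemClock.elapsedRealtime()
    }

    override fun onPageStop(session: GeckoSession, success: Boolean) {
        if (success && start != 0L) {
            Log.e("benchmark", "${SystemClock.elapsedRealtime() - start}") // result is in milliseconds
        }
    }
}

// session.progressDelegate = PageLoadTimer()

Note that onPageStop marks the end of the load rather than visual completeness, so treat the result as an approximation.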

Example: very simple UI measurements

TODO... if the screen content is drawn synchronously, you can do something like:

import androidx.core.view.doOnPreDraw

// `start` is captured (as in the example above) just before triggering the UI change.
view.doOnPreDraw {
  val end = SystemClock.elapsedRealtime()
  Log.e("benchmark", "${end - start}") // result is in milliseconds
  // Be sure to verify that this draw call is the draw call where the UI is visually complete,
  // e.g. post to the front of the main thread queue and Thread.sleep(5000) and check the device
}

Profile

You can take profiles with the Firefox Profiler, identify the start and end points for the duration you're measuring in your profile, and use the difference between them to measure the duration. It's quick to take these profiles, but there are big downsides: profilers add overhead, so the duration will not be precise; it's difficult to avoid noise in the results because devs can only take so many profiles; and it may be non-trivial to correctly identify the start and end points of the duration, especially when the implementations you compare have big differences.

Follow the example below to see how to measure a change in duration with profiles.

Example: time to display homescreen

On a low-end device...

  1. We pick the specific duration we want to measure: the time from hitting the home button when a tab is open until the homescreen is visually complete.
  2. We build & install a release build (e.g. Nightly, Beta; debug builds have unrepresentative perf). You can also use a recent Nightly, like this example does.
  3. We do a "warm up" run to populate the JIT's cache (the first run has unrepresentative perf). We start the app, set up the start state (open a tab), do our use case (click the home button and wait for the UI to fully load). Then we force-stop the app.
  4. We profile: start the app (which should launch to the most recent tab), start the profiler (see here for instructions), perform the use case (click the home button in the toolbar and wait for the homescreen to finish loading), and stop the profiler. Don't forget to enable the profiler permissions (3-dot menu -> Settings -> Remote debugging via USB).
  5. We identify the duration of the change in the raw profile. The most accurate and reproducible way to do this is using the Marker Chart.
    1. In this case, we can identify the start point through the dispatchTouchEvent ACTION_UP marker, right-click it, and choose "Start selection here" to narrow the profile's timeline range. We can then click the magnifying glass with the + in the timeline to clamp the range.
    2. The end point is more tricky: we don't have a marker to identify that the UI is visually complete. As such, we can use the information in the Marker Chart and Stack Chart to make a best guess as to when the UI is visually complete (notice that this creates a point of inaccuracy). If we temporarily clamp our range to after the last marker (onGlobalLayout) is run, we see that there is a measure/layout pass for Compose after it. We make a best guess that the content isn't visually complete until this last measure/layout/draw pass completes. To clamp the range to this, we can double-click on the draw method above measureAndLayout to shrink our range to that method – this lets us accurately capture the end point. Then we can drag the selection handle to re-expand the range all the way to the left, back to our start point. Then we can clamp the range given that the start and end points we want to measure are the start and end points of the range. The final profile – https://share.firefox.dev/3o7EvOI – gives us our final duration, which we can see in the value at the top left of the profiler: 1.4s in this case.

With the measurement in hand, repeat these steps for your changes and compare the resulting times. Note: it's possible the device was under load when you took the profile so you may wish to take more than one profile if you suspect that is the case.

Backfilling to determine culprit commit

We now have alerting enabled on the firefox-android branch/project. If any changes are detected, an alert will be produced after ~6 days (once enough data is produced). With the alert, you'll need to determine the culprit commit, or the commit that caused the regression.

To do so, you can start with backfilling on the firefox-android branch. "Backfilling" is the act of running the regressing test on past pushes that didn't run it, which lets you fill in the holes in your data. It's likely that you'll find a different culprit commit than the one you originally identified.

If the culprit commit suggests that there's something coming from mozilla-central/geckoview, then you'll need to move to mozilla-central and start searching for the culprit commit there with the same backfilling process. That said, you'll need to use try runs to do this with custom Fenix APKs that are built using the current mozilla-central/geckoview commit. You should be able to find geckoview artifacts in autoland/mozilla-central so you won't need to remake them. See the Benchmark remotely section for information about how you can do those try runs. Please reach out in #perftest if you need any help, or if anything is unclear.