2014-10-02 Performance Acceptance Results

Overview

These are the results of performance release acceptance testing for FxOS 2.1, as of the Oct 02, 2014 build.

Our acceptance metric is startup time from launch to visually complete, as measured by the Gaia Performance Tests, with the system initialized using make reference-workload-light.

For this release, results are compared against two baselines: 2.0 performance, and our responsiveness guidelines, which target a startup time of no more than 1000 ms.

The Gecko and Gaia revisions of the builds being compared are:

2.0:

Gecko: mozilla-b2g32_v2_0/a9c1cb2bbcee
Gaia: 87ee41fcb3f9a14d7a8bb67f1dd7fd95a6bcd0f

2.1:

Gecko: mozilla-aurora/31af4315ffde
Gaia: d4d54850334c1222b505c99ee27e53aef157c5d5

Startup -> Visually Complete

Startup -> Visually Complete measures the interval from launch, when the application is not already loaded in memory (cold launch), until the application has initialized all initial onscreen content. Data might still be loading in the background, but only minor UI elements related to this background load, such as proportional scroll bar thumbs, may be changing at this time.

This is equivalent to Above the Fold in web development terms.

More information about this timing can be found on MDN.

Execution

These results are based on 480 data points per application per release, collected over 16 different runs of make test-perf as follows:

Flash to base build
Flash stable FxOS build from tinderbox
Constrain phone to 319MB via bootloader
Clone gaia
Check out the gaia revision referenced in the build's sources.xml
GAIA_OPTIMIZE=1 NOFTU=1 make reset-gaia
make reference-workload-light
For 16 repetitions:
1. Reboot the phone
2. Wait for the phone to appear to adb, and an additional 30 seconds for it to settle.
3. Run make test-perf with 31 replicates
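
The repetition loop above can be sketched as a small driver script. This is a minimal illustration rather than the harness actually used: the function names are hypothetical, and passing the replicate count through a RUNS environment variable is an assumption about how make test-perf is configured. The sketch defaults to a dry run, so it only lists the commands it would issue.

```python
import subprocess
import time

REPETITIONS = 16      # repetitions, per the procedure above
REPLICATES = 31       # launches per app within one repetition
SETTLE_SECONDS = 30   # extra wait after the phone appears to adb


def repetition_commands(replicates=REPLICATES):
    """Return the commands for one repetition, in order."""
    return [
        ["adb", "reboot"],
        ["adb", "wait-for-device"],
        # Assumption: RUNS is how the replicate count reaches make test-perf.
        ["env", "RUNS=%d" % replicates, "make", "test-perf"],
    ]


def run_all(repetitions=REPETITIONS, settle=SETTLE_SECONDS, dry_run=True):
    """Run (or, with dry_run, only list) every command for every repetition."""
    issued = []
    for _ in range(repetitions):
        for cmd in repetition_commands():
            issued.append(cmd)
            if not dry_run:
                subprocess.check_call(cmd)
                if cmd == ["adb", "wait-for-device"]:
                    time.sleep(settle)  # let the phone settle before testing
    return issued
```

Calling run_all(dry_run=False) from a gaia checkout would drive a connected device. The sketch does not capture or store the test output, which a real harness would need to do.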

Result Analysis

First, any repetitions showing app errors are thrown out.

Then, the first data point is eliminated from each repetition, as it has been shown to be a consistent outlier, likely because it is the first launch after reboot. The remaining results are typically consistent within a repetition, leaving 30 data points per repetition.

These are combined into a large data point set. Each set has been graphed as a 32-bin histogram so that its distribution is apparent, with comparable sets from 2.0 and 2.1 plotted on the same graph.
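
The filtering and pooling steps described above might look like the following sketch. The input layout (a list of dicts with "errors" and "times" keys) is hypothetical and is not the format produced by make test-perf, and the equal-width binning is one plausible reading of "32-bin histogram", not necessarily what compare_results.py does.

```python
def combine_repetitions(repetitions):
    """Pool launch times (ms) from all usable repetitions into one data set.

    Each repetition is a dict with two hypothetical keys:
      "errors": True if the app reported errors during the repetition
      "times":  launch times in ms, in launch order
    """
    pooled = []
    for rep in repetitions:
        if rep["errors"]:
            continue                      # throw out the whole repetition
        pooled.extend(rep["times"][1:])   # drop the first launch after reboot
    return pooled


def histogram(values, bins=32):
    """Count values into equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0       # avoid zero width if all values match
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts
```

With 16 repetitions of 31 replicates, a clean run yields 16 x 30 = 480 points, and each discarded repetition removes 30. That would be consistent with the 450-, 420-, and 390-point sets reported for some apps.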

For each set, the median and the 95th percentile results have been calculated. Their real-world significance is as follows:

Median: 50% of launches are faster than this. This can be considered typical performance, but it's important to note that 50% of launches are slower than this, and they could be much slower. The shape of the distribution is important.
95th Percentile (p95): 95% of launches are faster than this. This is a more quality-oriented statistic commonly used for page load and other task-time measurements. It is not dependent on the shape of the distribution and better represents a performance guarantee.
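
As an illustration of the two statistics, here is a minimal sketch using only the Python standard library. It uses the nearest-rank definition of a percentile, which is an assumption: crunch_perf_results.py may use a different interpolation method, so values could differ slightly from the reported ones.

```python
import statistics


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100, done in integer arithmetic to avoid float rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def summarize(values):
    """Return the count, median, and p95 of a pooled set of launch times (ms)."""
    return {
        "count": len(values),
        "median": statistics.median(values),
        "p95": percentile_nearest_rank(values, 95),
    }
```

For a 480-point set, nearest rank picks the 456th-fastest launch as p95, so 24 launches are slower than the reported value.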

Distributions for launch times are positively skewed and asymmetric, rather than normal. This is typical of load-time and other task-time tests where a hard lower bound on completion time applies. Therefore, other statistics that apply to normal distributions, such as mean, standard deviation, and confidence intervals, are potentially misleading and are not reported here. They are available in the summary data sets, but their validity is questionable.

On each graph, the solid line represents median and the broken line represents p95.

Pass/Fail Criteria

Pass/Fail is determined according to our documented release criteria for 2.1. This boils down to launch time being under 1000 ms.

Median launch time has been used for this, per current convention. However, as mentioned above, p95 launch time might better capture a guaranteed level of quality for the user. In cases where this is significantly over 1000 ms, more investigation might be warranted.
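
The decision rule can be written down in a few lines. This is a sketch of the convention as described here, not the documented release criteria themselves; the function name is hypothetical, and treating exactly 1000 ms as a failure is an assumption drawn from the phrase "under 1000 ms".

```python
GUIDELINE_MS = 1000  # startup-time guideline for 2.1


def judge(median_ms, p95_ms, guideline_ms=GUIDELINE_MS):
    """Return (result, p95_over) for one app on one release.

    result is decided by the median alone, per current convention.
    p95_over flags cases that may deserve a closer look even on a PASS.
    """
    result = "PASS" if median_ms < guideline_ms else "FAIL"
    p95_over = p95_ms >= guideline_ms
    return result, p95_over
```

Applied to the 2.1 figures reported in the Areas section, judge(928, 1052) for Contacts gives a PASS with p95 flagged as over the guideline, while judge(1045, 1178) for Clock gives a FAIL.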

Areas

Calendar

FxOS Performance Comparison Results, 2.1 2014-10-02 Calendar

2.0

480 data points
Median: 1142 ms
p95: 1368 ms

2.1

450 data points
Median: 1267 ms
p95: 1374 ms

Result: FAIL (regression, over guidelines)

Comment: Median launch performance for 2.1 has slowed by over 100ms. The p95 experience stays mostly the same. Even best-case performance exceeds guidelines.

Camera

FxOS Performance Comparison Results, 2.1 2014-10-02 Camera

2.0

390 data points
Median: 1485 ms
p95: 1741 ms

2.1

450 data points
Median: 1577 ms
p95: 1743 ms

Result: FAIL (regression, over guidelines)

Comment: Median launch performance for 2.1 has slowed by under 100ms. The p95 experience stays almost exactly the same. Best-case performance exceeds guidelines.

Clock

FxOS Performance Comparison Results, 2.1 2014-10-02 Clock

2.0

480 data points
Median: 919 ms
p95: 1164 ms

2.1

480 data points
Median: 1045 ms
p95: 1178 ms

Result: FAIL (regression, over guidelines. Very close.)

Comment: Median launch performance for 2.1 has slowed by over 100 ms and now exceeds the guidelines, though it remains close to them. The p95 experience stays mostly the same.

Contacts

FxOS Performance Comparison Results, 2.1 2014-10-02 Contacts

2.0

479 data points (one repetition exited prior to the last replicate, other points were examined for validity and retained)
Median: 811 ms
p95: 928 ms

2.1

450 data points
Median: 928 ms
p95: 1052 ms

Result: PASS

Comment: Median launch performance for 2.1 has slowed by over 100ms, but remains beneath guidelines. The p95 experience has regressed similarly, and now slightly exceeds guidelines.

Dialer

FxOS Performance Comparison Results, 2.1 2014-10-02 Dialer

2.0

480 data points
Median: 560 ms
p95: 669 ms

2.1

480 data points
Median: 635 ms
p95: 716 ms

Result: PASS

Comment: Median launch performance for 2.1 has slowed by around 75 ms. The p95 experience has slowed by around 50 ms. However, even p95 remains well within guidelines.

Email

FxOS Performance Comparison Results, 2.1 2014-10-02 Email

2.0

420 data points
Median: 351 ms
p95: 500 ms

2.1

480 data points
Median: 386 ms
p95: 441 ms

Result: PASS

Comment: Median launch performance for 2.1 has slowed slightly, with p95 showing a slight improvement. All launch times are well within guidelines. Note that this test is performed without a live email server, and real-life performance is likely network-bound.

FM Radio

FxOS Performance Comparison Results, 2.1 2014-10-02 FM Radio

2.0

480 data points
Median: 627 ms
p95: 793 ms

2.1

480 data points
Median: 709 ms
p95: 864 ms

Result: PASS

Comment: Median launch performance for 2.1 has slowed by around 85 ms. The p95 experience has slowed by about the same. However, all launch times are within guidelines.

Gallery

FxOS Performance Comparison Results, 2.1 2014-10-02 Gallery

2.0

480 data points
Median: 977 ms
p95: 1257 ms

2.1

450 data points
Median: 1009 ms
p95: 1196 ms

Result: FAIL (regression, over guidelines. Very close.)

Comment: Median launch performance for 2.1 has slowed slightly and now exceeds the guidelines, though only barely. The p95 experience has improved by around 60 ms.

Music

FxOS Performance Comparison Results, 2.1 2014-10-02 Music

2.1

480 data points
Median: 1093 ms
p95: 1237 ms

Result: FAIL (over guidelines. Very close.)

Comment: Music is not tested for 2.0. For 2.1, the calculated median launch time is slightly over guidelines, and p95 is higher still.

However, the distribution is bimodal, possibly indicating a hidden factor that occurs only intermittently and slows performance when it does. A rough estimate of the medians of the two modes is ~975 ms and ~1150 ms, suggesting that without the confounding factor this might have come in under guidelines, but this bears more examination by the app's owners.

Note that the same test run on an earlier build showed the same bimodality, indicating this isn't a fluke.

Settings

FxOS Performance Comparison Results, 2.1 2014-10-02 Settings

2.0

450 data points
Median: 3397 ms
p95: 3798 ms

2.1

450 data points
Median: 2577 ms
p95: 2790 ms

Result: FAIL (well over guidelines by numbers. Recommend looking at this more closely, see below)

Comment: Median and p95 performance for Settings show a vast improvement in numbers, coming down by over 800 ms and 1000 ms respectively.

Both still remain well over guidelines. However, given the kitchen-sink nature of Settings, launch time may be dominated by a slow status query to underlying hardware (GPS, Wi-Fi, etc.) and may not represent a bad user experience.

As with Music above, the 2.0 distribution for Settings is noticeably bimodal, suggesting a confounding factor that occurs intermittently and slows performance. However, even the best 2.0 mode is significantly slower than the 2.1 distribution. The 2.1 distribution seems unimodal.

SMS

FxOS Performance Comparison Results, 2.1 2014-10-02 SMS

2.0

480 data points
Median: 1751 ms
p95: 1963 ms

2.1

480 data points
Median: 1674 ms
p95: 1820 ms

Result: FAIL (over guidelines)

Comment: Median launch performance for 2.1 has slightly improved, as has the p95 experience. Both remain well over guidelines.

Both 2.0 and 2.1 distributions are bimodal, suggesting a confounding factor that sometimes happens and slows performance.

Video

FxOS Performance Comparison Results, 2.1 2014-10-02 Video

2.0

480 data points
Median: 984 ms
p95: 1220 ms

2.1

480 data points
Median: 954 ms
p95: 1071 ms

Result: PASS

Comment: Median launch performance for 2.1 has slightly improved, and remains slightly under guidelines. p95 performance has shown a more notable improvement of around 150ms and only slightly exceeds guidelines.

Raw Data

File:2.1-20141002120127-data.zip

Contents:

2.0-20140924123153-results.zip: make test-perf result sets for 2.0
2.1-20141002120127-results.zip: make test-perf result sets for 2.1
2.0-20140924123153-summary.json: combined test results with aggregate stats
2.1-20141002120127-summary.json: combined test results with aggregate stats
crunch_perf_results.py: creates the summary results
compare_results.py: generates the graphs on this page
requirements.txt: Python libraries necessary for the scripts
requirements-ipython.txt: Python libraries necessary for using the IPython qtconsole with the scripts
README.md: information regarding the scripts
