TestEngineering/Performance/Sheriffing
Overview
The code sheriff team does a great job of finding regressions in unittests and getting fixes for them or backing stuff out. This keeps our trees green and usable while thousands of checkins a month take place!
A Performance Sheriff's job is similar: to make sure that performance regressions in Firefox are detected and dealt with. They look at the data produced by performance test jobs (mainly Talos), find regressions, determine their root cause(s), and get bugs on file to track all issues and make interested parties aware of what is going on.
A short demo describing the performance sheriffing workflow: https://youtu.be/VdTC7Maa6hE
What is an alert
As of January 2016, alerts are generated in Perfherder. These are generated by programmatically verifying that there is a sustained regression over time (the original data point plus 12 future data points).
There is an alert summary outlining the alerts which match the same set of revisions. The summary contains a few pieces of information:
- Title (which makes a good bug title if filing one for a regression)
- date of the suspect revision push
- link to the Treeherder jobs view for the suspect revision
- suspected changeset [range] including commit summary
- a status field (default 'untriaged') where we can associate a bug, file a bug, and document the resolution.
Below the summary is a list of alerts; each alert will reference:
- Test name
- platform (including build type, such as opt, pgo)
- old score (median score of the previous 12 commits)
- new score (median score of the future 12 commits)
- % change / values
- bar chart to show severity, green = improvement, red = regression
- Confidence value (from the t-test code)
Keep in mind that alerts mention improvements and regressions, which is valuable for us to track the entire system as a whole. For filing bugs, we focus mostly on the regressions.
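To make the fields above concrete, here is a minimal sketch of how the old score, new score, % change and confidence could be derived from the 12 data points on either side of the suspect push. This is not Perfherder's actual implementation; the function name and sample numbers are hypothetical, and the Welch-style t statistic stands in for Perfherder's own t-test code.

    from math import sqrt
    from statistics import mean, median, variance

    def summarize_alert(before, after):
        """before/after: ~12 raw test scores on either side of the suspect push."""
        old_score = median(before)          # "old score" column in the alert
        new_score = median(after)           # "new score" column in the alert
        pct_change = 100.0 * (new_score - old_score) / old_score

        # Confidence: a Welch-style t statistic; a larger value means the
        # difference is less likely to be noise.
        confidence = abs(mean(after) - mean(before)) / sqrt(
            variance(before) / len(before) + variance(after) / len(after)
        )
        return old_score, new_score, pct_change, confidence

    # Hypothetical example: a test whose score jumps from ~200 to ~210.
    old, new, pct, conf = summarize_alert(
        [199, 201, 200, 202, 198, 200, 201, 199, 200, 202, 201, 200],
        [210, 212, 209, 211, 210, 213, 209, 210, 212, 211, 210, 209],
    )
    print(f"old={old} new={new} change={pct:.1f}% confidence={conf:.1f}")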
Investigating the alert
This is a manual process that needs to be done for every alert. We need to:
- Look at the graph and determine the original branch, date, revision where the alert occurred
- Look at Treeherder and determine if we have all the data.
- Retrigger jobs if needed (the noisier the test, the more retriggers needed)
- Once you have more data, look at the data in compare view to see if other tests/platforms have changed
- Add all related alerts you see to the summary with the reassign button
Determining the root cause from Perfherder
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform. This helps you determine the root branch. It is best to zoom in and out to verify where the regression is.
While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
In rare cases we do not automatically generate an alert on the original branch; in that case we want to manually create an alert there, then mark the first alert we were looking at as downstream of the new alert. These manually created alerts show a human icon instead of a confidence score.
Determining if we have all the data from Treeherder
Since an alert is only a suggestion of the original changeset, I always open the graph view, zoom in to a narrow window, then open the test job (from the link shown when clicking on a data point) of a job in the future. Then I filter the Treeherder view down and show the next 10 jobs. This gives you a range of pushes to see coalescing and retriggers, and allows you to fill in the holes by retriggering and scheduling jobs. Here we are looking for a few things:
- Do we have data for the revision before / after the revision we have identified as regressing? If not, we should consider filling in the missing data.
- Is our revision, or the revision before / after, a merge? If so, we should retrigger to ensure that we are not investigating a merged changeset; if we are on a merged changeset, we need to go to the original branch and bisect.
- Does it look like most of the other platforms/talos tests have completed in this range? If not, then we could have other alerts for tests/platforms arriving in the future.
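As a rough illustration of the checklist above, the sketch below walks a small window of pushes around the suspect revision and flags merges and missing data. The push dictionaries and their fields are hypothetical stand-ins for what you would read off the Treeherder pushes/jobs views, not real API objects.

    def coverage_report(pushes, suspect_index, window=2):
        """Flag pushes around the suspect revision that need attention."""
        notes = []
        for i in range(max(0, suspect_index - window),
                       min(len(pushes), suspect_index + window + 1)):
            push = pushes[i]
            if push["is_merge"]:
                # merged changeset: go to the original branch and bisect there
                notes.append((push["revision"], "merge - bisect on the original branch"))
            elif not push["has_talos_data"]:
                # coalesced or not yet run: backfill/retrigger the Talos job
                notes.append((push["revision"], "missing data - retrigger/backfill"))
        return notes

    # Hypothetical example: the suspect push sits next to a merge and a coalesced push.
    pushes = [
        {"revision": "aaa111", "is_merge": False, "has_talos_data": True},
        {"revision": "bbb222", "is_merge": True,  "has_talos_data": True},
        {"revision": "ccc333", "is_merge": False, "has_talos_data": False},  # suspect
        {"revision": "ddd444", "is_merge": False, "has_talos_data": True},
    ]
    for revision, action in coverage_report(pushes, suspect_index=2):
        print(revision, action)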
Retriggering jobs
In the case where we have:
- missing data
- an alert on a merge
- an alert on a pgo build (with no alert on a non-pgo)
- an alert where the range of the regression overlaps with the test's normal noise range (a small alert (<5%) or a noisy test)
We need to do some retriggers. I usually find it useful to retrigger 3 times on 5 revisions:
- target revision-2
- target revision-1
- target revision
- target revision+1
- target revision+2
In the case where there is missing data, target revision becomes a range of: [target revision, revisions with missing data]
This is important because we then have enough evidence to show that the regression is sustained through retriggers and over time. If you suspect alerts on other tests/platforms, please retrigger those as well.
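For illustration, here is a small sketch of that retrigger pattern: the target revision plus the two revisions on either side of it, each retriggered 3 times. The helper is hypothetical; the resulting plan would be carried out from the Treeherder UI or whatever retrigger tooling you use.

    def plan_retriggers(pushlog, target_index, times=3, radius=2):
        """pushlog: ordered list of revisions; target_index: the suspect push."""
        start = max(0, target_index - radius)
        end = min(len(pushlog), target_index + radius + 1)
        return [(rev, times) for rev in pushlog[start:end]]

    # Hypothetical example: 5 revisions with the suspect push in the middle.
    for revision, count in plan_retriggers(["rev1", "rev2", "rev3", "rev4", "rev5"], 2):
        print(f"retrigger {revision} x{count}")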
Determining the scope of the regression from Perfherder
Once you have pinpointed the regression, you can validate the other platforms by adding additional data sets to the graph. It is best here to zoom out a bit, as the regression might be a few revisions off on different platforms due to coalescing.
Cases to watch out for
There are many reasons for an alert and different scenarios to be aware of:
- backout (usually within 1 week causing a similar regression/improvement)
- pgo/nonpgo (some regressions are pgo only and might be a side effect of pgo). We only ship PGO builds, so these are the most important.
- test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
- Merged - sometimes the root cause looks to be a merge; this is normally a side effect of coalescing.
- Coalesced - this is when we don't run every job on every platform on every push, so a single data point can cover a set of changes.
- Regular regression - the normal case where we get an alert and see it propagate from branch to branch as changes are merged
Tracking bugs
Every release of Firefox we create a tracking bug (i.e. bug 1386631 - Firefox 57) which we use to associate all regressions found during that release. The reason for this is twofold:
- We can go to one spot and see what regressions we have for reference on new bugs or to follow up.
- When we uplift it is important to see which alerts we are expecting
These bugs just contain a set of links to other bugs, no conversation is needed.
Filing a bug
A lot of work has been done inside of Perfherder to make filing a bug easier. As each bug has unique attributes it is hard to handle this fully programmatically, but we can do our best. In fact, there is a 'File bug' link underneath each alert summary section in Perfherder Alerts. Clicking it will bring you to Bugzilla with as many fields filled in as possible.
Here are some things to check/verify when filing a bug:
- Product/Component - this should be the same as the bug which is the root cause; if >1 bug, file in Talos
- Dependent/Block bugs - For a new bug, add the tracking bug (for the current version) and root cause bug(s) as blocking this bug
- CC list - cc patch author(s), reviewer(s) and owner of the tests as documented in the Talos test definition; if we have >1 bug, we should cc everyone who worked on those bugs so we can all pitch in and answer questions faster
- Summary of the bug - check to make sure the revision is accurate
- The description is auto-suggested as well; please verify the revision there too
As a note, the generated description refers the patch author to guidelines and expectations about how and when to respond.
Once a bug is filed it is a good idea to do a few things in another comment:
- provide a link to compare view to show you have done retriggers and believe this is valid
- needinfo the patch author (if many patch authors, needinfo one of :davehunt, :igoldan or :rwood)
- mention how confident you are in the regression (more confidence if you have a lot of retriggers and there is only one patch, less confident if you are waiting on backfilling data, retriggers, try runs, etc.)
Other common tasks
The most common and most important task is investigating new alerts and filing bugs. Of course, as the system grows and scales, there are additional tasks to do.
Merge Day - Uplifts
Every 6 weeks we do an uplift, which typically results in dozens of alerts.
The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts, and the alerts correspond to changes already seen on the upstream branch. Take for example when we uplift Mozilla-Central to Mozilla-Beta: we have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix). In a perfect world (about half the time) we can match up the alerts that are showing up on Mozilla-Beta with the bugs that have already been filed. The job here is to verify and add bugs to keep track of what is there.
In addition, if there are bugs on file that do not show up on Mozilla-Beta, we need to indicate in the bug that we don't see the regression on the uplift and recommend closing the bug as fixed or worksforme. Likewise, for all bugs that are showing up on Beta, we need to comment in them, mention that the regression is showing up on Beta, and ask for action.
Resolved Bugs - Leftover Alerts
In many cases we resolve a bug and need to wait a day or two for enough data to show up so we can verify the bug is fixed. It is common for a single bug to be associated with many alerts.
It is good practice to update the state of the alert summary in Perfherder alerts so it accurately reflects the state of the related bug.