B2G/QA/Automation/UI/Strategy/Streamline End to end Execution

From MozillaWiki

Objective

Reduce handoffs and increase alert automation in end-to-end test execution so that failures can be triaged more quickly and by fewer people.

Challenges Addressed

  • Test results are hidden in Jenkins
  • Test results have too many spurious failures to sheriff
  • Test results are only looked at a limited number of times per day

The Problem

We've had too many moving parts in our execution process, and too much manual review. One team would examine the results and triage for automation failures. Things that didn't fail automation would be handed to a manual team, who'd triage for manual failures.

None of this happened unless the results were explicitly looked at. Further, looking at the results is time-intensive because something always fails, so one needs to read them line-by-line to understand what is important. By the time we report bugs from automation, they've frequently already been found through some other process.

The Solution

High-urgency test runs such as smoketests will be aggressively pruned of unreliable tests; these will be moved into their own test runs. Once the high-urgency runs are stabilized, they will be enhanced to alert QA proactively via email or a similar mechanism. An alert is an unambiguous indication to immediately test manually and file any necessary bugs.
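One way to picture the pruning step is to split tests by their historical pass rate. The following is a minimal sketch, not the team's actual tooling: the function names, the `(test_name, passed)` history format, and the 0.98 threshold are all assumptions made for illustration. Note that a raw pass rate cannot distinguish a flaky test from a real product regression, so the output would still need human review.

```python
from collections import defaultdict

# Hypothetical threshold: a test must pass at least this often
# to remain in a high-urgency (alerting) run.
MIN_PASS_RATE = 0.98


def pass_rates(history):
    """Compute the pass rate per test.

    `history` is an iterable of (test_name, passed) tuples collected
    from past runs (an assumed format, not a Jenkins export).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for name, passed in history:
        totals[name] += 1
        if passed:
            passes[name] += 1
    return {name: passes[name] / totals[name] for name in totals}


def split_by_confidence(history, min_pass_rate=MIN_PASS_RATE):
    """Return (stable, unreliable) lists of test names, each sorted."""
    rates = pass_rates(history)
    stable = sorted(n for n, r in rates.items() if r >= min_pass_rate)
    unreliable = sorted(n for n, r in rates.items() if r < min_pass_rate)
    return stable, unreliable
```

Tests in the first list would stay in the alerting run, while tests in the second would move to a separate run that is triaged by hand.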

We will also add a very high-urgency red alert smoke test (Brick Test) for build flashing, which will alert proactively to the larger organization in order to prevent merging code that will take down our automation without recovery.

The remaining tests, which cannot be reliably alerted upon, will be triaged as they are now, but the triaging team will immediately test any failures manually, file any product bugs right away, and escalate if necessary. We will also fix or modify the tests to make them more reliable whenever possible.

Timeline

Reduce Handoffs

Q1:

  • [MISSED] Documentation of existing execution/triage process into wiki [Geo]
  • [DONE] Plan/schedule for new team assuming execution/triage work. [Johan]
  • [DONE] Revision of plan for minimized execution/triage process [Johan][1]

Increase Alert Automation

Q1:

  • [MISSED] Implement Brick test and alert [John]
    • Brick test implemented, alert carries over
  • [DONE] Create failure analysis tools to analyze confidence [Geo]
  • [MISSED] Refactor suites by confidence [Martijn]
  • [MISSED] Make suites alert to appropriate email targets [John]
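The "alert to appropriate email targets" item could look roughly like the sketch below. It only builds the message; the routing table, addresses, suite names, and the `build_alert` function are hypothetical placeholders, not the team's actual configuration.

```python
from email.message import EmailMessage

# Hypothetical routing table: which suite alerts which audience.
# The addresses are placeholders, not real lists.
ALERT_TARGETS = {
    "brick": ["wider-org-alerts@example.com"],
    "smoketest": ["qa-alerts@example.com"],
}


def build_alert(suite, build_id, failed_tests, targets=ALERT_TARGETS):
    """Return an EmailMessage for a failed suite, or None if no alert is due.

    No alert is produced when nothing failed or when the suite has no
    configured target (for example, a low-confidence suite).
    """
    if not failed_tests or suite not in targets:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {suite} failed on build {build_id}"
    msg["From"] = "automation@example.com"  # placeholder sender
    msg["To"] = ", ".join(targets[suite])
    body = "Failed tests:\n" + "\n".join(f"  - {t}" for t in failed_tests)
    msg.set_content(body + "\n\nPlease test manually and file bugs as needed.")
    return msg
```

The returned message could then be handed to `smtplib.SMTP.send_message`. Keeping message construction separate from sending makes the routing logic easy to check without a mail server.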

Fix Unreliable Tests

Ongoing:

  • Disable/xfail tests that are making results unclear
  • Fix unreliable tests as solutions are identified
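As an illustration of disabling or xfailing a test, here is a minimal sketch using the standard `unittest` decorators. It assumes a `unittest`-style runner; the real suites may mark tests through a different mechanism, and the class name, test names, and reasons are invented for the example.

```python
import unittest


class TestDialerExample(unittest.TestCase):
    """Hypothetical test case showing the two ways to quiet a noisy test."""

    @unittest.skip("hypothetical: disabled until an intermittent timeout is fixed")
    def test_receive_call(self):
        # Never runs while the skip decorator is in place.
        self.fail("should not run")

    @unittest.expectedFailure
    def test_call_log_entry(self):
        # Known failure: reported as an expected failure, not a red result.
        self.assertEqual("missed", "received")

    def test_open_keypad(self):
        # A reliable test keeps running and reporting normally.
        self.assertTrue(True)
```

A skipped test is not executed at all, while an expected failure still runs, so it is reported as an unexpected success once the underlying problem is fixed.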

Risks

Very few. We need buy-in that the Brick Test can alert outside QA, but in the worst case it alerts within QA instead and we have to find another way to escalate ASAP.