Releases/Post-mortems/Firefox 3.6.9

Schedule / Location / Call Information

  • Thursday, 2010-09-30 @ 1:00 pm PDT
  • In Very Good Very Mighty (3rd floor)
  • 650-903-0800 x92 Conf# 8605 (US/INTL)
  • 1-800-707-2533 (pin 369) Conf# 8605 (US)
  • join irc.mozilla.org #post-mortem for back channel

Overview

  • 3.6.9 took bug 532730, which caused a crash on startup for some users
    • Notes
      • Need to take notes in triage meetings
      • Why wasn't this crash considered a top crash?
        • An existing crash with the same signature meant it wasn't seen as new or spiking in the crash data
      • Once feedback came in, took updates off the wire
      • Legnitto to Blake: your fix caused this...
        • Mark had a fix locally
        • still under investigation why the variable is null
        • took a while to diagnose this
        • bug for this? bug 594699
        • should have blocked 2.0
      • assumptions were not right initially
      • dev was all over this, not a lack of response
      • A lot of confusion as to why this occurred.
      • No STRs; the only evidence that the fix worked came from code inspection and expected behavior
  • Mark noticed the new Thunderbird top crash (also affected Seamonkey)
  • Updates were turned off during investigation?
    • Notes
      • stemmed bleeding with turning updates off
      • we managed not to 0-day ourselves, which is a good thing
      • did we see how many took the updates? About 38 million.
      • From J.O., the impression was that it was a chemspill release
        • email said less urgent, but that didn't get to everyone.
        • Al indicated he was on it in terms of urgency
        • Al completed testing 12:45p Sep 15.
      • There was confusion about the urgency; some thought it was lower, some higher
  • Press learned of updates being turned off
  • Engineering investigation took a while; the root cause still wasn't known
  • Cheng contacted people to try to figure out root cause
  • Eventually fixed by bug 594699
  • The 3.6.10 release itself was bumpy in a few areas:
    • Less hurry as the updates were turned off
    • Socorro roll-out
    • Mirror uptake and when/if to use the CDN (see the uptake-watch sketch after this list)
      • 12:45 go to mirrors
      • 13:10 rsync to pvt-mirror01
      • 14:20 0% mirror uptake
      • 14:36 rsync to pvt-mirror01 completed
      • 16:20 mirrors saturated enough for release
    • QA coverage in relation to the QA offsite
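
A minimal sketch of the uptake-watch idea implied by the timeline above; the
threshold, alarm window, and uptake numbers are illustrative assumptions, not
the actual release tooling or policy:

def watch_uptake(readings, threshold=60.0, stall_alarm_min=90):
    """readings: iterable of (minutes_since_push, uptake_percent)."""
    for minutes, uptake in readings:
        print(f"t+{minutes:>3}min  uptake={uptake:.0f}%")
        if uptake >= threshold:
            print("  -> mirrors saturated enough; OK to go live")
            return minutes
        if uptake == 0 and minutes >= stall_alarm_min:
            print("  -> WARNING: still 0% uptake; check the internal "
                  "ftp.m.o -> pvt-mirror01 sync or consider the CDN")
    return None

# Rough replay of 2010-09-15: "go to mirrors" at 12:45, still 0% at
# 14:20 (t+95), saturated enough around 16:20 (t+215).
watch_uptake([(30, 0), (60, 0), (95, 0), (150, 20), (215, 65)])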

Things that went right

  • chofmann had produced reports of new crashes
  • LegNeato sat on top of developers to keep a sense of urgency
  • week delay in opening bugs in 3.6.9 kept us from zero-daying ourselves
  • took 4.5 hours from "go to build" to "all signed builds handed to QA"

Things that went wrong

  • The crashes were there in nightlies (with much less volume), but masked by different crashes with the same signature
  • Unclear how to think of the severity of 3.6.10 since the 3.6.9 sg bugs were still closed, but MFSAs were published.
  • QA offsite may have resulted in resource contention with the 3.6.10 and 4.0b6 releases coming right on top of each other
  • Below is a copy of the notes Al Billings left for me:
3.6.10 and 3.5.13 Post-Mortem Items

1) Lack of Information on progress going live: Shipping 3.6.10 (and 3.5.13)
took an especially long time. The initial "go" to go live was given at 12:45
PM. More than an hour and a half later, QA and Release Management (Christian)
were told that pickup on mirrors was still low and we had to wait. Eventually,
Christian tried to get us to use the paid network (CDN?) because of uptake
issues but that was blocked by JustinF. It turns out that when the "go" was
given at 12:45 PM, RelEng began copying the release bits from one internal
server to another. Only when this copying was done could we go live. All of
the delay and lack of uptake turned out to be because it took over 2 hours for
the bits to all be copied internally. We were not waiting on external uptake,
because we had not offered the bits to external mirrors yet, but no one
outside of RelEng knew this. Once the internal copying was done, we got
enough uptake to do release testing within 30 minutes.

So the overall problem is:

a) RelEng may or may not have the right processes to enable a quick release.
For example, why did it take two hours to copy all of the data between internal
boxes but 30 minutes for third parties outside of Mozilla to deploy it to the
level of being able to go live? Can RelEng frontload certain internal tasks in
order to facilitate quicker releases?

b) Lack of RelEng transparency: Within the RelEng team, what was actually going
on appears to be known but it was not communicated to anyone outside RelEng.
Based on this lack of knowledge, decisions were made (such as the use of a paid
network that costs Mozilla money) that were not necessary. RelEng commonly
insists on transparency and detail about what and how other groups are
doing things, but not for its own processes. We should have a clearer idea
outside of RelEng as to what is going on during various parts of the release
process.

c) Lack of communication: As part of the lack of transparency, there was no
clear minute-by-minute communication. We (QA and Release Management) would be
told that things were in process and then have to ask again in 30 or 40 minutes
to get an update.


Potential solutions:

a) Clear checklists of what is being done and by whom at each stage in the release process.


b) As items in checklists are cleared, this status change should be
communicated via e-mail (as John O'Duinn insists for all official 
communications) to all parties (probably release-drivers e-mail list).


c) Clear criteria of expected times for each task in checklist: We need to know
what the expected time to complete an item is and at what point this value has
been exceeded enough to invoke some kind of emergency response (such as paying
for network bandwidth, throwing more on-call engineers at the problem, etc).

I hope that this is helpful.
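
To make the checklist / expected-times ideas in (a)-(c) above concrete, here
is a minimal sketch; the task names, durations, and notification hook are
illustrative assumptions, not an actual RelEng checklist:

from dataclasses import dataclass


@dataclass
class Task:
    name: str
    owner: str
    expected_minutes: int


CHECKLIST = [
    Task("sign builds and hand to QA", "RelEng", 270),
    Task("copy bits internally to pvt-mirror01", "RelEng", 60),
    Task("push to external mirrors", "RelEng", 30),
    Task("mirror uptake reaches go-live threshold", "RelEng/IT", 120),
    Task("release-channel update testing", "QA", 60),
]


def notify(message):
    # Placeholder: in practice this would go to the release-drivers list.
    print(f"[release-drivers] {message}")


def complete(task, actual_minutes):
    """Announce completion and flag overruns so someone can escalate."""
    notify(f"DONE: {task.name} ({task.owner}) in {actual_minutes} min, "
           f"expected {task.expected_minutes} min")
    if actual_minutes > task.expected_minutes:
        notify(f"OVERRUN on '{task.name}' -- consider escalating "
               "(CDN, more on-call engineers, ...)")


# Example: the internal copy expected to take ~1 hour actually took over 2.
complete(CHECKLIST[1], actual_minutes=125)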

  • Notes from joduinn after meeting with IT
*) mrz suggested there might be something he can do to speed up the
rsync from ftp.m.o -> pvt-mirror01. joduinn to follow up. {{bug|601025}}

*) part of the confusion was caused by different people having different
data and hence different opinions on how much CDN would help at
different times throughout the release day. This was resolved by having
people meet on a phone call at 15:20 to see if the CDN was still justified.
The same discussion at 14:36 or 12:45 could have had a different outcome.
Even better would be to avoid getting into this state in the first place,
which leads to:

*) Any proposed change to plan-of-record should be communicated to all
via release-drivers email. Sideline conversations, private messages, etc.
leave people out of the loop, causing confusion. It is also hard to
coordinate when people are geographically distributed.

*) propose: release-driver (legneato/beltzner) will give "go to mirrors"
to RelEng as usual. RelEng will email when the sync starts and when files are
visible on pvt-mirror01. Additionally, legneato/beltzner will now also state
what time they want the mirrors ready for release. Without CDN, this is
typically 4 hours, and for a non-chemspill that has been fine. Last time
we used CDN in a chemspill release, mirrors were ready in 2 hours. At
time of "push to mirrors", Legneato/beltzner to make call on how
time-critical the chemspill release is. RelEng will coordinate with IT
on this, and all other mirror settings, as usual.
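
A small sketch of the timing call described above: given when we push to
mirrors and when the driver wants mirrors ready, decide whether the CDN is
needed. The 4-hour / 2-hour figures come from the note above; the rest is an
illustrative assumption:

from datetime import datetime, timedelta

HOURS_WITHOUT_CDN = 4  # typical time for mirrors to saturate on their own
HOURS_WITH_CDN = 2     # what the last chemspill release that used the CDN saw


def mirrors_ready_by(push_time, use_cdn):
    """Rough estimate of when mirrors will be saturated enough to go live."""
    hours = HOURS_WITH_CDN if use_cdn else HOURS_WITHOUT_CDN
    return push_time + timedelta(hours=hours)


def needs_cdn(push_time, ready_deadline):
    """True if plain mirror uptake is unlikely to meet the stated deadline."""
    return mirrors_ready_by(push_time, use_cdn=False) > ready_deadline


push = datetime(2010, 9, 15, 12, 45)
print(needs_cdn(push, push + timedelta(hours=2, minutes=30)))  # chemspill deadline -> True
print(needs_cdn(push, push + timedelta(hours=5)))              # relaxed deadline -> False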

Suggested Improvements

  • clear statement on whether a release is a chemspill or not; if there is any doubt, restate it
  • bug approvals/rejections should always come with comments
  • Make sure we can differentiate different crashes with the same signature (bug 600929); one possible approach is sketched after this list
  • Come up with a policy on when/when not to use the CDN
    • ideal is to have CDN on as soon as we start pushing to mirrors, and scale it back as mirrors get saturated; this allows us to go live at any point that we'd like
  • Investigate improvements to the local IT mirroring infrastructure to let us get builds to the mirror sites faster
  • Come up with a contingency plan to rely less on mirrors for future updates (bug 596839?)
  • We didn't learn that much from the user outreach around crashes... but we got turnaround in about 48 hours. Let's keep that tool on the table.
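
One possible approach to the signature-differentiation item above, sketched
with made-up data (this is not Socorro's actual signature generation): bucket
reports by the signature plus a few frames below it instead of by the
signature alone.

from collections import Counter

# Invented sample reports: three crashes sharing one top-frame signature,
# but the third has a different stack underneath.
reports = [
    {"signature": "js_Interpret",
     "frames": ["js_Interpret", "js_Execute", "JS_EvaluateScript"]},
    {"signature": "js_Interpret",
     "frames": ["js_Interpret", "js_Execute", "JS_EvaluateScript"]},
    {"signature": "js_Interpret",
     "frames": ["js_Interpret", "nsXULDocument::StartLayout", "nsAppStartup::Run"]},
]


def bucket(report, extra_frames=2):
    """Key a report by its signature plus the next `extra_frames` frames."""
    return (report["signature"],) + tuple(report["frames"][1:1 + extra_frames])


counts = Counter(bucket(r) for r in reports)
for key, n in counts.most_common():
    print(n, "x", " | ".join(key))
# Two buckets instead of one, so a new crash (like the 3.6.9 startup crash)
# no longer hides behind an existing crash that shares its signature.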