Releases/Post-mortems/Firefox 3.6.9
Schedule / Location / Call Information
- Thursday, 2010-09-30 @ 1:00 pm PDT
- In Very Good Very Mighty (3rd floor)
- 650-903-0800 x92 Conf# 8605 (US/INTL)
- 1-800-707-2533 (pin 369) Conf# 8605 (US)
- join irc.mozilla.org #post-mortem for back channel
Overview
- 3.6.9 took bug 532730, which caused a crash on startup for some users
- Notes
- Need to take notes in triage meetings
- Why wasn't this crash considered a top crash?
- An existing crash with the same signature masked it, so it didn't look new or spiking in the crash data
- Once feedback came in, took updates off the wire
- Legnitto to Blake: your fix caused this...
- Mark had a fix locally
- still under investigation why the variable was null
- took a while to diagnose this
- bug for this? bug 594699
- should have blocked 2.0
- assumptions were not right initially
- dev was all over this, not a lack of response
- A lot of confusion as to why this occurred.
- No STRs; the only evidence the fix worked came from expectations about the code
- Notes
- Mark noticed the new Thunderbird top crash (also affected SeaMonkey)
- Updates were turned off during investigation?
- Notes
- stemmed the bleeding by turning updates off
- we managed not to zero-day ourselves, which is a good thing
- did we see how many took the update? 38 million.
- J.O. thought it was a chemspill release
- the email said it was less urgent, but that didn't get to everyone.
- Al indicated he was on it in terms of urgency
- Al completed testing at 12:45p on Sep 15.
- There was confusion about the urgency; some felt it was less urgent, some more
- Notes
- Press learned of updates being turned off
- The engineering investigation took a while, and we still didn't know why
- Cheng contacted people to try to figure out root cause
- Eventually fixed by bug 594699
- The 3.6.10 release itself was bumpy
- there was less hurry since updates were already turned off
- Socorro roll-out
- Mirror uptake and when/if to use the CDN (timeline below; see the arithmetic sketch after this list)
- 12:45 go to mirrors
- 13:10 rsync to pvt-mirror01 started
- 14:20 0% mirror uptake
- 14:36 rsync to pvt-mirror01 completed
- 16:20 mirrors saturated enough for release
- QA coverage in relation to the QA offsite
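The timeline above accounts for most of the release-day delay. Below is a minimal sketch of the arithmetic, assuming all of the listed times fall on the same day (PDT); the stage labels are ours, not an official RelEng breakdown.
<pre>
# Minimal sketch: restate the release-day timestamps listed above and print the
# gap between consecutive stages, plus the total "go" to "saturated" time.
from datetime import datetime

fmt = "%H:%M"
events = [
    ("go to mirrors",                "12:45"),
    ("rsync to pvt-mirror01 started","13:10"),
    ("rsync to pvt-mirror01 done",   "14:36"),
    ("mirrors saturated",            "16:20"),
]

times = [(label, datetime.strptime(t, fmt)) for label, t in events]
for (prev_label, prev), (label, cur) in zip(times, times[1:]):
    mins = int((cur - prev).total_seconds() // 60)
    print(f"{prev_label} -> {label}: {mins} min")

total = int((times[-1][1] - times[0][1]).total_seconds() // 60)
print(f"total, 'go to mirrors' -> 'mirrors saturated': {total} min")
</pre>
By these figures, about 111 minutes passed between "go to mirrors" and the bits finishing the internal copy to pvt-mirror01, and another 104 minutes before external mirrors were saturated enough to release, which lines up with the concerns in Al Billings' notes below.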
Things that went right
- chofmann had produced reports of new crashes
- LegNeato sat on top of developers to keep a sense of urgency
- week delay in opening bugs in 3.6.9 kept us from zero-daying ourselves
- took 4.5 hours from "go to build" to "all signed builds handed to QA"
Things that went wrong
- The crashes were there in nightlies (with much less volume), but masked by different crashes with the same signature
- It was unclear how to gauge the severity of 3.6.10 since the 3.6.9 security (sg) bugs were still closed even though the MFSAs had been published.
- QA offsite may have resulted in resource contention with the 3.6.10 and 4.0b6 releases coming right on top of each other
- Below is a copy of the notes Al Billings left for me:
3.6.10 and 3.5.13 Post-Mortem Items

1) Lack of information on progress going live: Shipping 3.6.10 (and 3.5.13) took an especially long time. The initial "go" to go live was given at 12:45 PM. More than an hour and a half later, QA and Release Management (Christian) were told that pickup on mirrors was still low and we had to wait. Eventually, Christian tried to get us to use the paid network (CDN?) because of uptake issues, but that was blocked by JustinF. It turns out that when the "go" was given at 12:45 PM, RelEng began copying the release bits from one internal server to another. Only when this copying was done could we go live. All of the delay and lack of uptake turned out to be because it took over 2 hours for the bits to all be copied internally. We were not waiting on external uptake, because we had not yet offered the bits to external mirrors, but no one outside of RelEng knew this. Once the internal copying was done, we were able to get enough uptake to do release testing within 30 minutes.

So the overall problem is:

a) RelEng may or may not have the right processes to enable a quick release. For example, why did it take two hours to copy all of the data between internal boxes but only 30 minutes for third parties outside of Mozilla to deploy it to the level of being able to go live? Can RelEng frontload certain internal tasks in order to facilitate quicker releases?

b) Lack of RelEng transparency: Within the RelEng team, what was actually going on appears to have been known, but it was not communicated to anyone outside RelEng. Based on this lack of knowledge, decisions were made (such as the use of a paid network that costs Mozilla money) that were not necessary. RelEng pretty commonly insists on transparency and details about what other groups are doing and how, but not for its own processes. Those of us outside RelEng should have a clearer idea of what is going on during the various parts of the release process.

c) Lack of communication: As part of the lack of transparency, there was no clear minute-by-minute communication. We (QA and Release Management) would be told that things were in process and then have to ask again 30 or 40 minutes later to get an update.

Potential solutions:

a) Clear checklists of what is being done and by whom at each stage of the release process.

b) As items in the checklists are cleared, the status change should be communicated via e-mail (as John O'Duinn insists for all official communications) to all parties (probably the release-drivers e-mail list).

c) Clear criteria for the expected time of each task in the checklist: we need to know what the expected time to complete an item is, and at what point that value has been exceeded enough to invoke some kind of emergency response (such as paying for network bandwidth, throwing more on-call engineers at the problem, etc.).

I hope that this is helpful.
- Notes from joduinn after meeting with IT
*) mrz suggested there might be something he can do to speed up the rsync from ftp.m.o -> pvt-mirror01. joduinn to follow up. {{bug|601025}}
*) Part of the confusion was caused by different people having different data, and hence different opinions on how much the CDN would help at different times throughout the release day. This was resolved by having people meet on a phone call at 15:20 to see if the CDN was still justified. The same discussion at 14:36 or 12:45 could have had a different outcome. Even better would be to avoid getting into this state in the first place, which leads to:
*) Any proposed change to the plan-of-record should be communicated to all via release-drivers email. Sideline conversations, private messages, etc. leave people out of the loop and cause confusion. It is also hard to coordinate when people are geographically distributed.
*) Propose: the release driver (legneato/beltzner) will give "go to mirrors" to RelEng as usual. RelEng will email when the sync starts and when the files are visible on pvt-mirror01. Additionally, legneato/beltzner will now also state what time they want the mirrors ready for release. Without the CDN this is typically 4 hours, which has been fine for a non-chemspill; the last time we used the CDN in a chemspill release, the mirrors were ready in 2 hours. At the time of "push to mirrors", legneato/beltzner will make the call on how time-critical the chemspill release is. RelEng will coordinate with IT on this, and all other mirror settings, as usual.
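To make the last proposal above concrete, here is a minimal sketch of what a "mail release-drivers at each step" wrapper around the internal push could look like. The hostnames (ftp.m.o, pvt-mirror01) and the release-drivers list are named in the notes; the rsync paths, exact list address, SMTP relay, sender address, and the wrapper itself are illustrative assumptions, not actual RelEng tooling.
<pre>
# Hypothetical wrapper, assumed to run on pvt-mirror01 itself: pull the release
# from ftp.m.o, mailing release-drivers when the sync starts and when the files
# are visible locally. Paths, addresses, and the relay host are placeholders.
import smtplib
import subprocess
from email.message import EmailMessage

RELEASE = "firefox-3.6.10"                                  # hypothetical label
SRC = "rsync://ftp.m.o/releases/" + RELEASE + "/"           # hypothetical rsync URL
DST = "/pub/mozilla.org/firefox/releases/" + RELEASE + "/"  # hypothetical local path
DRIVERS = "release-drivers@mozilla.org"                     # list named in the notes; address assumed
SMTP_RELAY = "smtp.example.com"                             # placeholder relay

def notify(subject, body):
    """Send a one-line status update to release-drivers."""
    msg = EmailMessage()
    msg["From"] = "releng-status@example.com"               # placeholder sender
    msg["To"] = DRIVERS
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_RELAY) as smtp:
        smtp.send_message(msg)

notify("[%s] push to mirrors: sync started" % RELEASE,
       "rsync from ftp.m.o to pvt-mirror01 starting now.")
subprocess.run(["rsync", "-av", SRC, DST], check=True)
notify("[%s] push to mirrors: files visible on pvt-mirror01" % RELEASE,
       "Internal copy complete; external mirror uptake can begin.")
</pre>
The two notify() calls correspond to the two checkpoints the notes ask for; anything heavier (expected-duration alarms, per-checklist-item mails) could hang off the same hooks.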
Suggested Improvements
- clear statement on whether release is chemspill or not; if any doubt, then restate
- bug approvals/rejections should always come with comments
- Make sure we can differentiate different crashes with the same signature (bug 600929; see the sketch at the end of this list)
- Come up with a policy on when/when not to use the CDN
- ideal is to have CDN on as soon as we start pushing to mirrors, and scale it back as mirrors get saturated; this allows us to go live at any point that we'd like
- Investigate improvements to the local IT mirroring infrastructure to let us get builds to the mirror sites faster
- Come up with a contingency plan to rely less on mirrors for future updates (bug 596839?)
- We didn't learn that much from the user outreach around crashes... but we got turnaround in about 48 hours. Let's keep that tool on the table.
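On the signature-collision item above (bug 600929), here is a minimal sketch of one way to tell apart different crashes that share a top-frame signature: sub-bucket reports by a few additional stack frames and watch for a spike in any one bucket. The field names and sample data are illustrative assumptions, not Socorro's actual report schema.
<pre>
# Minimal sketch: group crash reports by signature plus the next few stack
# frames so two different crashes that collide on the top frame land in
# different buckets. Report structure and frame names are made up.
from collections import Counter

def sub_signature(report, depth=3):
    """Key a report by its signature plus the next `depth` frames below it."""
    frames = report.get("frames", [])
    return (report["signature"],) + tuple(frames[1:1 + depth])

reports = [
    {"signature": "nsFoo::Bar", "frames": ["nsFoo::Bar", "nsOld::Path", "main"]},
    {"signature": "nsFoo::Bar", "frames": ["nsFoo::Bar", "nsNew::Startup", "main"]},
    {"signature": "nsFoo::Bar", "frames": ["nsFoo::Bar", "nsNew::Startup", "main"]},
]

buckets = Counter(sub_signature(r) for r in reports)
for key, count in buckets.most_common():
    print(count, "x", " > ".join(key))
</pre>
A spike in a single sub-bucket under a long-standing signature would have made the 3.6.9 startup crash stand out even though the overall volume for that signature did not look new.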