ReleaseEngineering:Buildduty:Downtimes

How do I schedule a downtime?

Whenever RelEng, IT, or WebDev wants a tree-closing downtime for work on any system, they should contact the CiDuty person of the day in #ci or by email. Preferably, they should put the exact wording they'd like in the downtime notice into the relevant bug, and then nominate the bug using the "needs-treeclosure?" flag.

  • NOTE: All ServerOps components now have "needs-treeclosure" flag enabled (previously only some ServerOps components had the flag).
  • NOTE: ServerOps "infra only" bugs are not visible to CiDuty by default. If the bug can be changed to "Moco specific", then it will be visible to all of RelEng, and show up in all the queries. If the bug needs to remain "infra only", then the bug assignee needs to file, and closely track all changes to, a new separate bug that CiDuty can use for scheduling.

When planning a downtime, CiDuty should consider:

  • the urgency of the work
  • what other work, if any, can be safely done in the same downtime

...and propose a time that:

  • is low impact to developers
  • is low impact to releases (assessed by asking a list of pre-approvers).
  • is not scheduled near any other downtimes / planned outages
  • fits the schedule of the person who understands the work
  • has CiDuty available to handle tree closing/opening and field questions in #developers (other channels as needed)
  • gives >1 day notice to newsgroups if at all possible

Preparing for the downtime

  • review all bugs nominated with "needs-treeclosure?"
  • for bugs approved to land in the downtime, CiDuty will:
    • verify that the bug is assigned to the person who will actually be doing the work in the downtime
    • set the "needs-treeclosure+" flag in the bugs
    • set the whiteboard field with the proposed time/date.
    • write a downtime notice that includes:
      • bug# and description for each item.
      • for security-sensitive work, be vague, but still include the bug# and a vague description (this reduces confusion about whether an item of work is in or out of a downtime).
      • boilerplate text about the timing of the downtime and how it will affect developers. I've included the boilerplate below:

Downtime Boilerplate

* When can I push?

We pride ourselves on having the self-serve tools[1] to make 
it easier to recover from build failures caused by a 
downtime. However, we understand that some developers may 
not be available to re-trigger failed runs after a downtime 
is done, or may not want to incur that hassle. Some would 
rather push early enough to receive all their results before 
the downtime starts, others would rather wait until the 
downtime is complete.

If you have LDAP access to Mozilla servers, which if you're 
landing code you likely do, you can check the current 
end-to-end times for your chosen development branch[2]. 
Compare your end-to-end time with the declared start of the 
downtime in order to make an informed decision about whether
you really want to push _now_.

1. https://build.mozilla.org/buildapi/self-serve 
2. https://build.mozilla.org/buildapi/reports/endtoend
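
The comparison described in the boilerplate above can be sketched in a few lines of Python. This is only an illustration, not part of any RelEng tool: the function name, the optional safety margin, and the sample times are all hypothetical, and the end-to-end time is assumed to have been read by hand from the end-to-end report.

```python
from datetime import datetime, timedelta


def results_arrive_before_downtime(push_time, end_to_end, downtime_start,
                                   margin=timedelta(0)):
    """Return True if a push made at push_time should have all of its
    results before the downtime starts.

    end_to_end is the current end-to-end time for the branch, entered
    manually from the report. margin is an optional safety buffer
    (hypothetical; the boilerplate does not prescribe one).
    """
    return push_time + end_to_end + margin <= downtime_start


# Hypothetical example values.
push_time = datetime(2012, 4, 25, 9, 0)
downtime_start = datetime(2012, 4, 25, 14, 0)

# A 4 hour end-to-end time finishes at 13:00, before the 14:00 downtime.
print(results_arrive_before_downtime(push_time, timedelta(hours=4), downtime_start))

# A 6 hour end-to-end time would finish at 15:00, after the downtime starts.
print(results_arrive_before_downtime(push_time, timedelta(hours=6), downtime_start))
```

If the function returns False, the developer can either wait until the downtime is complete or push anyway and re-trigger any failed runs afterwards.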

Who do I notify, and when?

  1. etherpad/email everyone who will be working in the downtime, and ask them to approve the draft downtime notice
  2. BEFORE posting to the newsgroups, send a draft copy of the downtime notice to the list of pre-approvers. Ask for any objections/questions. As of 2012/04/25, the current list of pre-approvers is:
    • Bob Moss <bmoss@m.c>, Chris Hofmann <chofmann@m.c>, Alex Keybl <akeybl@m.c>, Damon Sicore <dsicore@m.c>, Johnathan Nightingale <johnath@m.c>, JP Rosevear <jpr@m.c>, Sheila Mooney <smooney@m.c>
    • cc release@m.c and infra-all@m.c
  3. post the downtime notice to the dev.planning & dev.tree-management newsgroups, and send a copy of the notice to all@m.c.
  4. err on the side of over-communication, i.e. play it safe: if you think a group will be impacted by a downtime and they are not included in the lists above, contact them.
  5. All of the above notifications should go out *at least 24 hours* before the planned downtime.
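
As a rough aid for the 24-hour rule in the last step, the latest acceptable send time can be computed from the planned start of the downtime. The sketch below is only illustrative: the helper names and the sample dates are hypothetical, and it assumes all times are expressed in the same timezone.

```python
from datetime import datetime, timedelta

# Minimum notice required by the process described above.
MIN_NOTICE = timedelta(hours=24)


def latest_notice_time(downtime_start, min_notice=MIN_NOTICE):
    """Return the latest time at which notifications can still go out."""
    return downtime_start - min_notice


def notice_is_timely(sent_at, downtime_start, min_notice=MIN_NOTICE):
    """Return True if a notice sent at sent_at gives enough warning."""
    return sent_at <= latest_notice_time(downtime_start, min_notice)


# Hypothetical downtime planned for 2012-04-27 06:00.
downtime_start = datetime(2012, 4, 27, 6, 0)

# Latest acceptable send time: 24 hours earlier.
print(latest_notice_time(downtime_start))

# A notice sent the evening before is too late.
print(notice_is_timely(datetime(2012, 4, 26, 18, 0), downtime_start))
```

A check like this only covers timing; it does not replace sending the draft to the pre-approvers or contacting any additional groups that may be affected.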

Running the downtime

  1. Be sure to check dev.planning, dev.tree-management newsgroups and planet.m.o regularly to ensure nothing comes up in response that would require changes to, or outright cancelling of, the downtime. A standout example here would be a chem-spill release.
  2. Before starting the downtime, CiDuty will notify the sheriff in #developers, and close the trees.

After the downtime

CiDuty will:

  1. reopen the trees
  2. verify with the sheriff that the trees are open and everything is OK
  3. update the bugs with status ("landed-and-stuck", "rolled-back"). It's possible this will be done first by the person who attempted the downtime. However, if not, CiDuty should update the bugs for the record.
  4. send "TREE OPEN" newsgroup post/email, listing what did / didn't get done

How do I coordinate downtimes with IT?

Some IT maintenance requires tree closure. Maintaining or rebooting any of the systems listed below needs a downtime coordinated between RelEng and IT. It usually also needs advance notice of the tree closure posted to the usual sources.

  • build.m.o (clobberer, build data, tryserver symbols)
  • cruncher.build.m.o (graphs dashboard, dumping build data, dashboards, pulse)
  • cvs.mozilla.org (talos)
  • hg.mozilla.org (firefox source repos, build repos)
  • tinderbox.mozilla.org (central reporting point for all builds)
  • ftp.mozilla.org (release updates on beta channel)
  • stage.mozilla.org (publishing builds, downloading builds for talos/unittest)
  • graphs.mozilla.org (performance tracking)
  • buildbot-rw-vip.db.scl3.mozilla.com (buildbot scheduler db, graphserver?)
  • buildbot-ro-vip.db.scl3.mozilla.com (used by cruncher)
  • mail.build.mozilla.org - currently dm-mail03.m.o (build mail to tinderbox)
  • aus3-staging.mozilla.org (update snippets)
  • nm-ops03.build.mozilla.org (releng VMs)
  • nagios.mozilla.org (monitoring)
  • relengweb1.dmz.scl3.mozilla.com (replacement for build.m.o)
  • tbpl.mozilla.org (build status)

For any questions, or if you're not sure about a particular server, please check with |ciduty in #ci.

If possible, consolidate RelEng and IT downtimes that need tree closures, to avoid the disruption of having two tree closures soon after each other. This is "nice to do", not a "requirement"; if doing two separate downtimes reduces risk, that's fine!