CIDuty/Dealing With Outages
Because of the nature of releng systems, outages rarely affect just us. It is common for a releng service outage to also cause a tree closure, which will automatically involve sheriffs and, by proxy, developers.
Outages that affect more than just releng systems will involve the Mozilla Operations Center (MOC). Since there's always a chance that a "simple" tree closure might expand into a larger outage, we've tried to align our incident response process with that of the MOC to facilitate hand-offs and support.
The MOC has a three-step process that applies to incidents of all sizes:
Analyze
The first step in any outage is to determine what the symptoms are and which systems are affected. This establishes the impact and tells you who the right people are to reach out to for help.
Engage
Once you know what the problem is, you have several tasks to complete:
- file a bug for the incident with as much info as you have. Be sure to cc the people with the domain knowledge needed to address the issue.
- change the topic of the #buildduty channel to include the bug number for the outage
- reach out to people on IRC or via releng escalation protocols for help
Communicate
Communication is the most important thing during an outage.
As ciduty, you may not be actively fixing the issue yourself, but you can act as an effective status aggregator and coordinator. One person should always be in this role, handing off as necessary to others in different timezones.
Here is a quick list of communication tasks:
- Broadcast incident status to IRC every 30 minutes. While most of the discussion may be happening in the #ci channel, you should update other channels as appropriate: #developers, #sheriffs, ...
- Most outages will involve IT systems. Make yourself available in the appropriate IT IRC channels.
- Mirror that status into the relevant bugs as appropriate.
- Coordinate with others who are helping to avoid duplicating work.
- If debugging collaboratively with others, start a public etherpad to share state.
- Be aware of timezones. Keep tabs on who is working on what and what timezone they are in so hand-offs can be arranged. Be sure to include yourself in this calculus, i.e. if you are managing the incident, make sure you bring someone up to speed and hand off to them before you leave for the day.
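To keep the 30-minute broadcasts consistent across IRC and the bug, it can help to generate each update from the same template. The following is only a minimal sketch of that idea in Python: the `IncidentStatus` fields, the message layout, and the bug number in the usage note are illustrative assumptions, not part of any existing ciduty tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Broadcast cadence taken from the task list above.
BROADCAST_INTERVAL = timedelta(minutes=30)


@dataclass
class IncidentStatus:
    # Hypothetical field names, chosen for this sketch only.
    bug_number: int
    summary: str
    coordinator: str


def format_status(status: IncidentStatus, now: datetime) -> str:
    """Build one status line that can be pasted into IRC and the bug.

    `now` must be a timezone-aware datetime; it is converted to UTC so
    that people in different timezones read the same clock.
    """
    now_utc = now.astimezone(timezone.utc)
    next_update = now_utc + BROADCAST_INTERVAL
    return (
        f"[{now_utc:%H:%M} UTC] bug {status.bug_number}: {status.summary} "
        f"(coordinator: {status.coordinator}; "
        f"next update by {next_update:%H:%M} UTC)"
    )
```

For example, with a placeholder bug number, `format_status(IncidentStatus(1234567, "trees closed, investigating", "alice"), datetime(2020, 1, 1, 23, 45, tzinfo=timezone.utc))` returns `[23:45 UTC] bug 1234567: trees closed, investigating (coordinator: alice; next update by 00:15 UTC)`. Naming the coordinator in every update also makes hand-offs between timezones visible to everyone following along.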
Specific diagnostic aids
Problems with stage/ftp
Ganglia is useful for monitoring Mozilla-hosted services:
For problems with stage/ftp, the productdelivery cluster is informative:
Problems with networking
If you're experiencing network issues, particularly between colos (e.g. SCL3<->EC2), join the #netops-alerts IRC channel. Connection details are in https://mana.mozilla.org/wiki/display/SYSADMIN/IRC+use+within+IT
Smokeping can help you visualize outages in real time:
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.core1-releng-scl3
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw1
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
Investigate network packet loss between Mozilla and Amazon EC2 using mtr.
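One way to capture mtr results in a form that can be pasted into the incident bug is to run it in report mode from a small script. The sketch below assumes that `mtr` is installed on the machine you run it from; the target hostname in the usage note is a placeholder, not a real releng host.

```python
import shutil
import subprocess
from typing import List


def build_mtr_command(host: str, cycles: int = 10) -> List[str]:
    """Return an mtr invocation that prints a one-shot report.

    --report         run non-interactively and print a summary at the end
    --report-cycles  number of probes sent to each hop
    --no-dns         show IP addresses instead of doing reverse lookups
    """
    return ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", host]


def run_mtr(host: str, cycles: int = 10) -> str:
    """Run mtr against `host` and return its report as text."""
    if shutil.which("mtr") is None:
        raise RuntimeError("mtr is not installed or not on PATH")
    result = subprocess.run(
        build_mtr_command(host, cycles),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if mtr exits non-zero
    )
    return result.stdout
```

Calling `print(run_mtr("ec2-host.example.com"))` from a host on the Mozilla side prints a per-hop loss and latency table. Running the same report in the opposite direction, from an EC2 instance back towards Mozilla, helps show whether the loss is asymmetric.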