ReleaseEngineering/How To/Process nagios alerts
This page covers various nagios alerts and gives pointers on how to resolve them. It also marks alerts that should now be handled automatically by the new automation infrastructure, so that bugs can be filed as appropriate when those alerts still fire.
How to add/modify releng nagios alerts
The releng nagios alerts live in the sysadmins svn repo.
svn co svn+ssh://svn.mozilla.org/sysadmins/puppet/trunk/modules/nagios/manifests/releng
When adding new alerts, it's preferable to create hostgroups to define a class of machines that will share alerting characteristics, rather than adding alerts for single machines.
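For reference, a typical edit cycle looks something like the following. This is only a sketch: the manifest filename and the bug number in the commit message are hypothetical placeholders, not real values from the repo.

# Check out the releng nagios manifests from the sysadmins repo
svn co svn+ssh://svn.mozilla.org/sysadmins/puppet/trunk/modules/nagios/manifests/releng
cd releng
# Edit the relevant manifest; "buildmasters.pp" is a made-up example name
$EDITOR buildmasters.pp
# Review the change, then commit it
svn diff
svn commit -m "Bug NNNNNN: add a new releng nagios check"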
Processing existing alerts
Backlog Age
- Affects: end-to-end time for developers. When we hit the warning threshold (currently 6 hours), there have been builds waiting that long just to *start*.
- Runs on: nagios server, checking https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js
- Possible solutions:
- kill off unnecessary jobs
- make sure builds-pending.js isn't stale (see the staleness check after this list)
- restart buildbot masters if they are slow
- for the full set of options, see Dealing with high pending counts
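To check whether builds-pending.js is stale, it is usually enough to compare its Last-Modified header against the current time. A minimal sketch, assuming only that the web server reports that header:

# When was builds-pending.js last regenerated?
curl -sI https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js | grep -i '^last-modified'
# Compare against the current UTC time; if the file is much older than its
# expected regeneration interval, it is probably stale rather than the
# backlog being genuinely high.
date -u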
builds-4hr
- Affects: Treeherder, which uses this data to provide job history. The specific file is http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz
- Runs on: relengwebadm host, as a cronjob under the buildapi user
- Possible solutions: this script usually fails or runs slowly when there are problems with the buildbot status database: a lock, another long-running query, or simply load. Killing off the offending query and re-running the report-4hr script will fix this, but be aware that report-4hr can take a while to run, especially on a cold cache. A rough sketch of that recovery follows below.
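This sketch assumes shell access to the status database and to the relengwebadm cron environment; the script path is a hypothetical placeholder for whatever the buildapi cronjob actually invokes.

# Find the lock or long-running query on the buildbot status database
mysql -e 'SHOW FULL PROCESSLIST'
# Kill the offending query, using its Id from the processlist output
mysql -e 'KILL 123456'
# Re-run the report as the buildapi user; this can take a while to finish,
# especially on a cold cache
sudo -u buildapi /path/to/report-4hr-cronjob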
Command Queue
- Affects: buildbot masters. These are jobs that have become wedged (and possibly failed) in the queue and need to be resubmitted or deleted.
- See ReleaseEngineering/Queue_directories for debugging instructions; a rough sketch of the resubmit/delete cycle follows below.
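This is only an illustration of the resubmit-or-delete step, assuming the queue layout described on that page; every path below is a placeholder, not a real location.

# List items the command runner has given up on
ls -l /path/to/queue/dead
# Inspect a wedged item, then either resubmit it by moving it back into
# the incoming directory, or delete it if it is no longer wanted
cat /path/to/queue/dead/ITEM
mv /path/to/queue/dead/ITEM /path/to/queue/new/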