ReleaseEngineering/How To/Unstick a Stuck Slave From A Master

Sometimes a slave ends up in a wedged state that prevents a master reconfig from completing.

If this is the case, then you need to convince the master to drop the connection to that slave, with prejudice.

The Hard Way

First you'll need to find slaves that may be stuck. Check the list of slaves on the master you're trying to reconfig using the no_builders=1 pragma, e.g.:

http://buildbot-master11.build.scl1.mozilla.com:8201/buildslaves?no_builders=1

If any slave has a "Last heard from" time more than a few hours old (or more than 10 connections in the past hour), it is likely stuck.
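The rule of thumb above can be sketched as a small Python check. This is only an illustration: the 3-hour cutoff is an assumed reading of "a few hours", the function names and the input mapping are hypothetical, and you would have to fill in the last-heard times and connection counts yourself from the buildslaves page.

```python
from datetime import datetime, timedelta

# Assumed thresholds. The text only says "a few hours" and "> 10 connections".
MAX_SILENCE = timedelta(hours=3)
MAX_CONNECTIONS_PER_HOUR = 10


def likely_stuck(last_heard, connections_last_hour, now):
    """Return True if a slave matches either symptom of being stuck."""
    silent_too_long = (now - last_heard) > MAX_SILENCE
    reconnecting_too_often = connections_last_hour > MAX_CONNECTIONS_PER_HOUR
    return silent_too_long or reconnecting_too_often


def find_stuck_slaves(slaves, now):
    """slaves maps a slave name to (last_heard, connections_last_hour).

    The mapping is a hypothetical structure, filled in by hand from the
    buildslaves page. Returns the sorted names of suspect slaves.
    """
    return sorted(
        name
        for name, (last_heard, connections) in slaves.items()
        if likely_stuck(last_heard, connections, now)
    )
```

Call find_stuck_slaves(slaves, datetime.now()) with whatever you read off the page; anything it returns is a candidate for the steps that follow, not a confirmed diagnosis.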

Then, on the master, use lsof to figure out which file descriptor the socket is on for that slave:

$ /usr/sbin/lsof -p $master_pid | grep linux-ix-slave05
buildbot 2788 cltbld   16u  IPv4 471638980                TCP staging-master.build.mozilla.org:9012->linux-ix-slave05.build.mozilla.org:54714 (ESTABLISHED)

The '16u' in the fourth column gives the file descriptor within the master process: drop the trailing 'u' (lsof's marker for read/write access), which leaves 16.
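If you would rather not pick the number out by eye, the following is a rough Python sketch that pulls the descriptor from lsof output. The helper names are hypothetical, and it assumes the default lsof column layout shown above, where the fourth whitespace-separated field is the FD column.

```python
import re


def fd_from_lsof_line(line):
    """Return the numeric file descriptor from one lsof line, or None.

    Assumes the default lsof layout (COMMAND PID USER FD ...). Entries such
    as 'cwd' or 'txt' have no number in the FD column and yield None.
    """
    fields = line.split()
    if len(fields) < 4:
        return None
    match = re.match(r"(\d+)", fields[3])  # leading digits, e.g. '16u' -> '16'
    return int(match.group(1)) if match else None


def fds_for_slave(lsof_output, slave_name):
    """Return every numeric fd whose lsof line mentions slave_name."""
    fds = []
    for line in lsof_output.splitlines():
        if slave_name in line:
            fd = fd_from_lsof_line(line)
            if fd is not None:
                fds.append(fd)
    return fds
```

Feed fds_for_slave the captured output of the lsof command above. If it returns more than one descriptor, stop and check by hand before closing anything.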

Then, open up the manhole and close that file descriptor:

>>> import os
>>> os.close(16)

This will cause some weird tracebacks in the master log, but will let the reconfig finish. It's black magic, and leaves the master in a potentially very corrupted state: eventually Twisted will try to close fd 16 itself, and by then that number may have been reused for something completely unrelated. So disable the master in slavealloc and do a clean shutdown of this master once the reconfig is back on track.
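To see why the stale descriptor is dangerous, here is a minimal standalone sketch of descriptor reuse. It does not touch buildbot or Twisted: a pipe stands in for the slave socket and os.devnull for the "something completely unrelated", and the function name is hypothetical.

```python
import os


def demonstrate_fd_reuse():
    """Show a closed fd number being reused, then closed again by mistake.

    Returns (reused, victim_closed). A pipe end stands in for the slave
    socket and os.devnull for an unrelated file opened later.
    """
    read_fd, write_fd = os.pipe()

    # Like os.close(16) in the manhole: closed behind the owner's back.
    os.close(read_fd)

    # The OS normally hands out the lowest free number, so the next open
    # usually gets the number that was just freed.
    unrelated_fd = os.open(os.devnull, os.O_RDONLY)
    reused = unrelated_fd == read_fd

    if reused:
        # The original owner still believes read_fd is its socket and
        # closes it, which really closes the unrelated file.
        os.close(read_fd)

    victim_closed = False
    try:
        os.fstat(unrelated_fd)
    except OSError:
        victim_closed = True

    # Clean up whatever is still open.
    if not victim_closed:
        os.close(unrelated_fd)
    os.close(write_fd)
    return reused, victim_closed
```

Whenever the number is reused, the second close silently takes out the unrelated file. That is the failure the master is exposed to until it is shut down cleanly.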