ReleaseEngineering/How To/Manage Buildbot with Fabric

RelEng has started writing some tools to manage all the buildbot masters using fabric.

The manage_masters.py script is available from the tools repository, in the buildfarm/maintenance directory.

Fabric is a prerequisite for running these tools. It installs easily into a virtual environment. Setting up ssh-agent is strongly recommended (see below for details).

Setup

hg clone ssh://hg.mozilla.org/build/tools
cd tools
mkvirtualenv tools
pip install fabric

Usage

Usage: manage_masters.py [options] action [action ...]

Supported actions (run the script to see if there are more): check, checkconfig, show_revisions, reconfig, restart, graceful_restart, stop, graceful_stop, start, update

Example:

python manage_masters.py \
   -f http://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/maintenance/production-masters.json \
   -R scheduler check

or, if your tools checkout is up to date, simply:

python manage_masters.py -f production-masters.json -R scheduler check

buildbot-wrangler.py

Make sure you run fabric from "buildfarm/maintenance": buildbot-wrangler.py lives there and is uploaded to the masters whenever a reconfig is attempted. If you run from another directory, the reconfig fails with an error like this:

Traceback (most recent call last):
 File "build/bdist.macosx-10.6-universal/egg/fabric/main.py", line 540, in main
 File "/Users/armenzg/repos/releng/braindump/buildbot-related/master_fabric.py", line 99, in reconfig
   put('buildbot-wrangler.py', '%s/buildbot-wrangler.py' % m['master_dir'])
 File "build/bdist.macosx-10.6-universal/egg/fabric/network.py", line 391, in host_prompting_wrapper
 File "build/bdist.macosx-10.6-universal/egg/fabric/operations.py", line 283, in put
ValueError: 'buildbot-wrangler.py' is not a valid local path or glob.
Disconnecting from production-master02.build.mozilla.org... done.
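
A small pre-flight check can catch this before fabric connects to any master. The sketch below is not part of the tools repository: the helper name wrangler_present is hypothetical, and the only assumption is the one the traceback implies, namely that buildbot-wrangler.py must exist in the directory you run from.

```python
import os

def wrangler_present(directory="."):
    """Return True if buildbot-wrangler.py exists in `directory`."""
    return os.path.isfile(os.path.join(directory, "buildbot-wrangler.py"))

# Report on the current working directory before attempting a reconfig.
if wrangler_present():
    print("buildbot-wrangler.py found; reconfig can upload it from here")
else:
    print("buildbot-wrangler.py not found; cd to buildfarm/maintenance first")
```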

Suggestions

Don't use fabric to reconfig the test masters if you are in a rush (e.g. backing something out): the reconfigs run sequentially and take a very long time.

If you need to reconfig everything, it is much better to run four instances of fabric, each in a different terminal. The reconfig step is blocking: fabric won't continue to the next host in a role group until the current one finishes. (Remember that the reconfig step does NOT update.)

# Run each command in a separate window
python manage_masters.py -f production-masters.json -j16 -R scheduler update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -R build     update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -R try       update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -R tests     update checkconfig reconfig

The tests reconfig can take a really long time, so you can parallelize it by using -M {macosx|windows|linux|tegra|panda} instead of "-R tests", each in a different tab and still with -j16. That is, replace the last line/window above with these 5 (for a total of 8 windows):

python manage_masters.py -f production-masters.json -j16 -M macosx  update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M windows update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M linux   update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M tegra   update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M panda   update checkconfig reconfig

To validate the above (i.e. to check that no new platforms have been added since these docs were last updated), run:

diff -u \
 <(./manage_masters.py -f production-masters.json -l -R tests) \
 <(./manage_masters.py -f production-masters.json -l -M macosx \
     -M windows -M linux -M tegra -M panda)

If any differences are reported, include those platforms and update the docs.
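
The eight command lines above can also be generated from one list, which makes it harder to forget a platform. This is only a sketch under stated assumptions: build_commands and run_all are hypothetical helpers, not part of manage_masters.py, and the selections are copied from the commands in this section. Starting everything from one script interleaves the output of all eight runs in a single terminal, and it only makes sense when your ssh keys are loaded in ssh-agent, since concurrent password prompts would collide.

```python
import subprocess

MASTERS_FILE = "production-masters.json"
ACTIONS = ["update", "checkconfig", "reconfig"]

# Selections copied from the commands above: three roles plus five test platforms.
SELECTIONS = [
    ("-R", "scheduler"), ("-R", "build"), ("-R", "try"),
    ("-M", "macosx"), ("-M", "windows"), ("-M", "linux"),
    ("-M", "tegra"), ("-M", "panda"),
]

def build_commands(selections=SELECTIONS, jobs=16):
    """Return one manage_masters.py argument list per selection."""
    return [
        ["python", "manage_masters.py", "-f", MASTERS_FILE,
         "-j%d" % jobs, flag, value] + ACTIONS
        for flag, value in selections
    ]

def run_all(commands):
    """Start every command at once, wait for all, and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]

# Print the commands for review; call run_all(build_commands()) to launch them.
for cmd in build_commands():
    print(" ".join(cmd))
```

Printing first and launching only on request keeps the default behaviour harmless: you can paste each printed line into its own window, exactly as described above.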

Hosts and role groups

Fabric works on individual hosts, and supports organizing these hosts into groups. This is mostly a good fit for how we need to work, except we often have multiple buildbot masters on a single host, so there is a bit of hacking in master_fabric.py to pick out the right hosts to operate on depending on what the user has selected.

Hosts are selected with the -H flag, and roles are selected with the -R flag. Hosts correspond to the 'name' field in the masters json file, and are short abbreviations that refer to each master, e.g. bm13-build1, bm19-tests1-tegra, bm33-try1, bm36-build_scheduler. We have 4 roles defined: build, scheduler, try, and tests. Selecting a role restricts fabric to masters that have that role.

The string 'all', when specified via -H or -R, means that all masters in the masters file will be operated on. You can also use the -M flag to match on strings in the master name, e.g. -M tests1-windows to pick up all the windows test masters. Note that manage_masters.py will "or" all host specifications from the command line, e.g. "-R tests -M windows" will return all hosts in role "tests", not just the windows test masters.
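
The "or" behaviour is easy to get wrong, so here is a minimal sketch of it. This is not the code in manage_masters.py or master_fabric.py: select_masters is a hypothetical function, the 'role' field name is an assumption about the json layout (only 'name' is confirmed above), and bm00-tests1-windows is a made-up master name used for illustration.

```python
def select_masters(masters, hosts=(), roles=(), matches=()):
    """Return names of masters picked by any of -H, -R or -M (a union)."""
    selected = []
    for master in masters:
        if ("all" in hosts or "all" in roles
                or master["name"] in hosts                    # -H: exact name
                or master["role"] in roles                    # -R: role (assumed field)
                or any(s in master["name"] for s in matches)):  # -M: substring
            selected.append(master["name"])
    return selected

masters = [
    {"name": "bm13-build1", "role": "build"},
    {"name": "bm19-tests1-tegra", "role": "tests"},
    {"name": "bm00-tests1-windows", "role": "tests"},  # hypothetical master
    {"name": "bm33-try1", "role": "try"},
]

# "-R tests -M windows" selects every tests master, not only the windows one.
print(select_masters(masters, roles=["tests"], matches=["windows"]))
# ['bm19-tests1-tegra', 'bm00-tests1-windows']

# "-M windows" on its own narrows the selection to the windows master.
print(select_masters(masters, matches=["windows"]))
# ['bm00-tests1-windows']
```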

Fabric relies on being able to ssh to the masters without password authentication, so be sure to have your ssh keys set up. That means the required keys must be loaded into your running ssh-agent (Paramiko does not consult your "~/.ssh/config" file). If the keys are not set up, you will be asked for your password once per invocation, so run multiple actions per invocation where appropriate.
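
One way to confirm that an agent is running and holds at least one key before invoking fabric is sketched below. The helper agent_has_keys is hypothetical and not part of the tools repository. It is written for Python 3, assumes the OpenSSH ssh-add client is on your PATH, and relies on "ssh-add -l" exiting with status 0 only when the agent lists at least one identity.

```python
import os
import subprocess

def agent_has_keys(environ=None):
    """Return True if an ssh-agent is reachable and lists at least one key."""
    environ = dict(os.environ if environ is None else environ)
    # ssh-agent advertises its socket through SSH_AUTH_SOCK.
    if not environ.get("SSH_AUTH_SOCK"):
        return False
    try:
        status = subprocess.call(
            ["ssh-add", "-l"],
            env=environ,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ssh-add is not installed or not on PATH.
        return False
    return status == 0

if agent_has_keys():
    print("ssh-agent has keys loaded")
else:
    print("no usable keys in ssh-agent; run ssh-add first")
```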

Updating checkout

python manage_masters.py -f production-masters.json -R scheduler update
[production-master02.build.mozilla.org] run: hg pull
[production-master02.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbotcustom
[production-master02.build.mozilla.org] out: searching for changes
[production-master02.build.mozilla.org] out: adding changesets
[production-master02.build.mozilla.org] out: adding manifests
[production-master02.build.mozilla.org] out: adding file changes
[production-master02.build.mozilla.org] out: added 11 changesets with 19 changes to 12 files
[production-master02.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master02.build.mozilla.org] run: hg update -r default
[production-master02.build.mozilla.org] err: .hgtags@8546abc704ee, line 93: tag 'FIREFOX_3_6_9_BUILD1' refers to unknown node
[production-master02.build.mozilla.org] err: .hgtags@8546abc704ee, line 94: tag 'FIREFOX_3_6_9_RELEASE' refers to unknown node
[production-master02.build.mozilla.org] out: 12 files updated, 0 files merged, 2 files removed, 0 files unresolved
[production-master02.build.mozilla.org] run: hg pull
[production-master02.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbot-configs
[production-master02.build.mozilla.org] out: searching for changes
[production-master02.build.mozilla.org] out: adding changesets
[production-master02.build.mozilla.org] out: adding manifests
[production-master02.build.mozilla.org] out: adding file changes
[production-master02.build.mozilla.org] out: added 35 changesets with 49 changes to 32 files
[production-master02.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master02.build.mozilla.org] run: hg update -r default
[production-master02.build.mozilla.org] err: .hgtags@ac95f8973f7e, line 221: tag 'FIREFOX_3_6_13_RELEASE' refers to unknown node
[production-master02.build.mozilla.org] err: .hgtags@ac95f8973f7e, line 222: tag 'FIREFOX_3_6_13_BUILD1' refers to unknown node
[production-master02.build.mozilla.org] out: 32 files updated, 0 files merged, 0 files removed, 0 files unresolved
[production-master01.build.mozilla.org] run: hg pull
[production-master01.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbotcustom
[production-master01.build.mozilla.org] out: searching for changes
[production-master01.build.mozilla.org] out: adding changesets
[production-master01.build.mozilla.org] out: adding manifests
[production-master01.build.mozilla.org] out: adding file changes
[production-master01.build.mozilla.org] out: added 5 changesets with 13 changes to 10 files
[production-master01.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master01.build.mozilla.org] run: hg update -r default
[production-master01.build.mozilla.org] out: 10 files updated, 0 files merged, 2 files removed, 0 files unresolved
[production-master01.build.mozilla.org] run: hg pull
[production-master01.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbot-configs
[production-master01.build.mozilla.org] out: searching for changes
[production-master01.build.mozilla.org] out: adding changesets
[production-master01.build.mozilla.org] out: adding manifests
[production-master01.build.mozilla.org] out: adding file changes
[production-master01.build.mozilla.org] out: added 10 changesets with 11 changes to 9 files
[production-master01.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master01.build.mozilla.org] run: hg update -r default
[production-master01.build.mozilla.org] out: 9 files updated, 0 files merged, 0 files removed, 0 files unresolved

Done.
Disconnecting from production-master01.build.mozilla.org... done.
Disconnecting from production-master02.build.mozilla.org... done.

Show which revisions are checked out

Each output line shows the master name followed by the checked-out revisions of buildbotcustom, buildbot-configs, and tools, in that order.

python manage_masters.py -f production-masters.json -R build -R scheduler show_revisions
pm01-bm        1046bc8c7e00 57e8bc4354d2 cfca31588669
pm01-sm        1046bc8c7e00 57e8bc4354d2 cfca31588669
pm02-sm        1046bc8c7e00 57e8bc4354d2 cfca31588669
pm03-bm        1046bc8c7e00+ 57e8bc4354d2 cfca31588669
bm3            1046bc8c7e00+ 57e8bc4354d2 cfca31588669
bm4            1046bc8c7e00 57e8bc4354d2 cfca31588669

The trailing "+" on pm03-bm and bm3 means those checkouts have uncommitted local modifications! Bad Release Engineers, no scotch^W cookie for you.
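
Spotting a stray "+" by eye gets harder as the list of masters grows. The sketch below scans show_revisions output for it, relying on the Mercurial convention that a revision id followed by "+" has uncommitted changes. masters_with_local_changes is a hypothetical helper, not part of the tools repository, and it assumes the whitespace-separated layout shown above.

```python
def masters_with_local_changes(output):
    """Return names of masters where any listed revision ends with '+'."""
    dirty = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines and one-word lines such as "Done."
        name, revisions = fields[0], fields[1:]
        if any(rev.endswith("+") for rev in revisions):
            dirty.append(name)
    return dirty

# Sample copied from the show_revisions output above.
sample = """\
pm01-bm        1046bc8c7e00 57e8bc4354d2 cfca31588669
pm01-sm        1046bc8c7e00 57e8bc4354d2 cfca31588669
pm02-sm        1046bc8c7e00 57e8bc4354d2 cfca31588669
pm03-bm        1046bc8c7e00+ 57e8bc4354d2 cfca31588669
bm3            1046bc8c7e00+ 57e8bc4354d2 cfca31588669
bm4            1046bc8c7e00 57e8bc4354d2 cfca31588669
"""

print(masters_with_local_changes(sample))  # ['pm03-bm', 'bm3']
```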

Checkconfig

python manage_masters.py -f production-masters.json -R build -R scheduler checkconfig
bm3            OK
pm02-sm        OK
pm01-bm        OK
pm01-sm        OK
bm4            OK

Done.
Disconnecting from buildbot-master1.build.mozilla.org... done.
Disconnecting from production-master01.build.mozilla.org... done.
Disconnecting from buildbot-master2.build.mozilla.org... done.
Disconnecting from production-master02.build.mozilla.org... done.

Reconfigure

Reminder: reconfig only performs the reconfig; you need to have run 'update' and 'checkconfig' beforehand.

python manage_masters.py -f production-masters.json -R build reconfig    
[buildbot-master1.build.mozilla.org] put: buildbot-wrangler.py -> /builds/buildbot/build_master3/master/buildbot-wrangler.py
[buildbot-master1.build.mozilla.org] run: rm -f *.pyc
[buildbot-master1.build.mozilla.org] run: python buildbot-wrangler.py reconfig .
[production-master01.build.mozilla.org] put: buildbot-wrangler.py -> /builds/buildbot/builder_master1/buildbot-wrangler.py
[production-master01.build.mozilla.org] run: rm -f *.pyc
[production-master01.build.mozilla.org] run: python buildbot-wrangler.py reconfig .
[production-master01.build.mozilla.org] err: 2010-11-24 06:58:26-0800 [Broker,252,10.2.71.15] Unhandled Error
[production-master01.build.mozilla.org] err: 	Traceback (most recent call last):
[production-master01.build.mozilla.org] err: 	Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
[production-master01.build.mozilla.org] err: 	]
[buildbot-master2.build.mozilla.org] put: buildbot-wrangler.py -> /builds/buildbot/build_master4/master/buildbot-wrangler.py
[buildbot-master2.build.mozilla.org] run: rm -f *.pyc
[buildbot-master2.build.mozilla.org] run: python buildbot-wrangler.py reconfig .

Done.
Disconnecting from buildbot-master1.build.mozilla.org... done.
Disconnecting from production-master01.build.mozilla.org... done.
Disconnecting from buildbot-master2.build.mozilla.org... done.


If the reconfig gets stuck, see How To/Unstick a Stuck Slave From A Master.

As a special case for test masters, you can unstick things by either:

  • triggering a "Clean Shutdown" from the web UI for that master, or
  • using the graceful_restart action of manage_masters.py

Once the running jobs complete, the master will shut down (its web page will no longer be served). Fabric should notice and unstick itself at that point. If fabric doesn't notice, run the update and start actions individually in a separate window. If fabric still doesn't notice, good luck, and document what works.