ReleaseEngineering:2010-Q3-Workweek:DowntimeLessRestart
From MozillaWiki
- Goal:
- Downtime-less restarts
- the cheap way is to have many more masters and gracefully shut them down one at a time
- why does a running job have to die if its master is down? couldn't the slave talk to another master?
- more masters:
- currently 3, so taking down a master is a big deal
- set up extra masters (total 10? 15?), reshuffle existing slaves
- final step of the "24x7 pod plans": no more downtimes
- have slaves ask for name of master on reboot
- for now, this lets us move slaves between masters without having to log in to each slave
- next, have slaves dynamically rebalance across new masters
- next, dynamically rebalance masters as slaves come online or are powered off
- https://bugzilla.mozilla.org/show_bug.cgi?id=508673
- how do we track the job history of a given slave if that slave moves across different masters?
- use nthomas's utility that shows history of jobs for a slave *across* different masters
- could be implemented in puppet, or in separate web app
- long discussion about whether it's better/faster/easier/safer to do in puppet (no windows) or in a webapp
- nice to do in some way that can be upstreamed to buildbot
- optional component
- some discussion of using a "load-balancer-master" (one "master" master)
- all other masters would be hidden behind the "master" master
- need proof of concept
- big, scary, expensive?
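
A minimal sketch of the "slaves ask for name of master on reboot" idea from the notes above, written as if it lived in the separate web app. Everything here is hypothetical: the `MasterAllocator` class, its method names, and the master and slave names are illustrations, not part of buildbot or of any existing Mozilla tool. It assumes a slave keeps its master until that master is marked as draining, and that a new or displaced slave goes to the least-loaded remaining master.

```python
class MasterAllocator:
    """Hypothetical lookup service: a slave asks it which master to use on reboot."""

    def __init__(self, masters):
        self.masters = list(masters)
        self.assignments = {}   # slave name -> master name
        self.draining = set()   # masters being shut down gracefully

    def _load(self):
        # Count assigned slaves per master, ignoring masters that are draining.
        load = {m: 0 for m in self.masters if m not in self.draining}
        for master in self.assignments.values():
            if master in load:
                load[master] += 1
        return load

    def master_for(self, slave):
        """Return the master this slave should connect to after a reboot."""
        current = self.assignments.get(slave)
        if current in self.masters and current not in self.draining:
            return current  # sticky: do not move slaves without a reason
        load = self._load()
        if not load:
            raise RuntimeError("no masters available")
        # Least-loaded master wins; ties are broken by name so results are stable.
        master = min(load, key=lambda m: (load[m], m))
        self.assignments[slave] = master
        return master

    def drain(self, master):
        """Stop handing out this master; its slaves move on their next reboot."""
        self.draining.add(master)

    def enable(self, master):
        """Put a master back into rotation."""
        self.draining.discard(master)


if __name__ == "__main__":
    allocator = MasterAllocator(["master1", "master2", "master3"])
    for i in range(6):
        print(f"slave{i}", "->", allocator.master_for(f"slave{i}"))
    allocator.drain("master1")
    print("slave0 after drain ->", allocator.master_for("slave0"))
```

In this sketch, draining a master does not kill anything: slaves already attached stay put until they reboot and ask again, which matches the "graceful shutdowns one at a time" approach. It does not address the open questions about job history across masters or the single "master" master.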