ReleaseEngineering:2010-Q3-Workweek:DowntimeLessRestart

Goal:
- Downtime-less restarts
- the cheap way: run many more masters and gracefully shut them down one at a time
- open question: why does a running job have to die when its master is down? Couldn't the slave talk to another master instead?
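
A minimal sketch of the "graceful shutdowns one at a time" idea, to make the ordering concrete. The `Master` class and `rolling_restart` function are hypothetical stand-ins for illustration, not buildbot APIs, and job completion is only simulated.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical stand-in for a buildbot master; not a real buildbot class."""
    name: str
    running_jobs: int = 0
    accepting_jobs: bool = True
    log: list = field(default_factory=list)

    def graceful_shutdown(self):
        # Stop taking new jobs, then let the jobs already running finish.
        self.accepting_jobs = False
        while self.running_jobs > 0:
            self.running_jobs -= 1  # simulate one job completing
        self.log.append("stopped")

    def start(self):
        self.accepting_jobs = True
        self.log.append("started")


def rolling_restart(masters):
    """Restart masters one at a time.

    Returns the smallest number of masters that were still accepting jobs
    at any point during the restart.
    """
    min_available = len(masters)
    for master in masters:
        master.graceful_shutdown()
        available = sum(1 for m in masters if m.accepting_jobs)
        min_available = min(min_available, available)
        master.start()
    return min_available


if __name__ == "__main__":
    pool = [Master(f"master{i}", running_jobs=i) for i in range(1, 4)]
    print(rolling_restart(pool))
```

Because only one master is ever stopped at a time, the pool never drops below N-1 masters accepting jobs, and no running job is killed.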

more masters:
- currently 3 masters, so taking one down is a big deal
- set up extra masters (10 to 15 total?) and reshuffle existing slaves across them
- this is the final step of the "24x7 pod plans": no more downtimes
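
The arithmetic behind "more masters", as a rough sketch. It assumes slaves are spread evenly across masters, which the notes do not state; `capacity_lost` is a name made up for this example.

```python
def capacity_lost(total_masters, masters_down=1):
    """Fraction of masters unavailable, assuming slaves are spread evenly."""
    return masters_down / total_masters


if __name__ == "__main__":
    for total in (3, 10, 15):
        print(total, f"{capacity_lost(total):.0%}")
```

Under that assumption, taking one master down costs about a third of capacity with 3 masters, a tenth with 10, and roughly 7% with 15.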

have slaves ask for the name of their master on reboot
- for now, this lets us move slaves between masters without having to log in to each slave
- next, have slaves dynamically rebalance across new masters
- after that, dynamically rebalance masters as slaves come online or are powered off
- https://bugzilla.mozilla.org/show_bug.cgi?id=508673
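
One way the "ask for the name of the master on reboot" step could look, as a sketch only. `MasterAllocator` is a hypothetical lookup service, and the least-loaded policy is an assumption; the notes do not say how masters would be chosen.

```python
class MasterAllocator:
    """Hypothetical service that a slave queries on reboot; not part of buildbot."""

    def __init__(self, masters):
        self.masters = list(masters)
        self.assignments = {}  # slave name -> master name

    def add_master(self, master):
        self.masters.append(master)

    def master_for(self, slave):
        """Return the master this slave should connect to after a reboot."""
        # Count the slaves on each master, ignoring the slave that is asking.
        load = {m: 0 for m in self.masters}
        for other, master in self.assignments.items():
            if other != slave and master in load:
                load[master] += 1
        current = self.assignments.get(slave)
        # The least-loaded master wins. On a tie, stay on the current master,
        # then fall back to list order so the answer is deterministic.
        best = min(
            self.masters,
            key=lambda m: (load[m], m != current, self.masters.index(m)),
        )
        self.assignments[slave] = best
        return best


if __name__ == "__main__":
    alloc = MasterAllocator(["m1", "m2"])
    for name in ("s1", "s2", "s3", "s4"):
        print(name, alloc.master_for(name))
    alloc.add_master("m3")
    print("s1 after reboot:", alloc.master_for("s1"))
```

With this policy, adding a master needs no login to any slave: slaves drift onto the new master as they reboot, which covers the first two steps in the list above.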

how to track the job history of a given slave if that slave moves between masters?
- use nthomas's utility, which shows the history of jobs for a slave *across* different masters
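
To illustrate what a cross-master history means, here is a small sketch that merges per-master job records into one timeline for a slave. This is not nthomas's utility; the record layout (`slave`, `builder`, `start`) and the function name are assumptions made for the example.

```python
def slave_history(per_master_jobs, slave):
    """Merge one slave's jobs from every master into a single timeline."""
    merged = []
    for master, jobs in per_master_jobs.items():
        for job in jobs:
            if job["slave"] == slave:
                # Record which master ran the job so that moves stay visible.
                merged.append({**job, "master": master})
    return sorted(merged, key=lambda job: job["start"])


if __name__ == "__main__":
    # Hypothetical records; "start" is a Unix timestamp.
    jobs = {
        "m1": [
            {"slave": "s1", "builder": "linux-build", "start": 100},
            {"slave": "s2", "builder": "linux-build", "start": 150},
        ],
        "m2": [
            {"slave": "s1", "builder": "linux-test", "start": 50},
            {"slave": "s1", "builder": "linux-test", "start": 200},
        ],
    }
    for job in slave_history(jobs, "s1"):
        print(job["start"], job["master"], job["builder"])
```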

could be implemented in puppet, or in a separate web app
- long discussion about whether it's better/faster/easier/safer to do in puppet (no Windows) or in a web app
- nice to do it in some way that can be upstreamed to buildbot
- optional component

some discussion of using a "load-balancer-master" (one "master" master)
- all masters would be hidden behind the "master" master
- needs a proof of concept
- big, scary, expensive?
