ReleaseEngineering:2010-Q3-Workweek:DowntimeLessRestart

From MozillaWiki
  • Goal:
    • Downtime-less restarts
    • cheap way: run many more masters and gracefully shut them down one at a time
    • why does a running job have to die when its master goes down? couldn't the slave talk to another master instead?
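
A minimal sketch of the "graceful shutdowns one at a time" idea, as a simulation only. The `Master` class and its `graceful_shutdown`/`start` methods are hypothetical stand-ins, not Buildbot APIs, and the pool size of 10 is just one of the totals floated in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical stand-in for one buildbot master (not a real Buildbot API)."""
    name: str
    running_jobs: int = 0
    accepting: bool = True
    up: bool = True
    log: list = field(default_factory=list)

    def graceful_shutdown(self):
        # Stop taking new jobs, let running ones finish, then stop.
        self.accepting = False
        self.log.append(f"draining {self.running_jobs} job(s)")
        self.running_jobs = 0  # simulation: jobs finish instead of being killed
        self.up = False

    def start(self):
        self.up = True
        self.accepting = True


def rolling_restart(masters):
    """Restart masters one at a time; return the fewest masters up at any point."""
    min_up = len(masters)
    for m in masters:
        m.graceful_shutdown()
        min_up = min(min_up, sum(1 for x in masters if x.up))
        m.start()
    return min_up


masters = [Master(f"master{i:02d}", running_jobs=i) for i in range(1, 11)]
least_up = rolling_restart(masters)
print(f"fewest masters up during the restart: {least_up} of {len(masters)}")
```

With only 3 masters the same loop would take a third of capacity away at each step, which is why the larger pool makes the restart cheap.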


  • more masters:
    • currently 3, so taking down a master is a big deal
    • set up extra masters (total 10? 15?), reshuffle existing slaves across them
    • final step of the "24x7 pod plans": no more downtimes
  • have slaves ask for name of master on reboot
    • for now this lets us move slaves between masters without having to log in to each slave
    • next: have slaves dynamically rebalance across new masters
    • next: dynamically rebalance masters as new slaves come online or existing ones are powered off
    • https://bugzilla.mozilla.org/show_bug.cgi?id=508673
  • how to track the history of jobs for a given slave if that slave moves across different masters?
    • use nthomas's utility that shows history of jobs for a slave *across* different masters
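
To illustrate what a per-slave history *across* masters involves, here is a minimal sketch. It is not nthomas's utility, whose implementation is not described in these notes; the record layout, field names, and master/slave names are all assumptions for illustration.

```python
from operator import itemgetter

# Hypothetical per-master job records; the field names are assumptions.
HISTORY_BY_MASTER = {
    "master01": [
        {"slave": "slave-a", "builder": "linux-build", "started": 100},
        {"slave": "slave-b", "builder": "linux-test", "started": 110},
    ],
    "master02": [
        {"slave": "slave-a", "builder": "linux-test", "started": 250},
    ],
}


def slave_history(history_by_master, slave):
    """Collect one slave's jobs from every master, oldest first."""
    jobs = [
        dict(job, master=master)  # tag each record with the master it came from
        for master, records in history_by_master.items()
        for job in records
        if job["slave"] == slave
    ]
    return sorted(jobs, key=itemgetter("started"))


history = slave_history(HISTORY_BY_MASTER, "slave-a")
for job in history:
    print(job["started"], job["master"], job["builder"])
```

Tagging each record with its master keeps the move between masters visible in the merged timeline.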


  • could be implemented in puppet, or in separate web app
    • long discussion about whether it's better/faster/easier/safer to do this in puppet (no Windows) or in a web app
  • would be nice to do this in a way that can be upstreamed to buildbot
    • as an optional component
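
Whichever home is chosen (puppet or a web app), the core of "slaves ask for the name of their master" is a small allocation rule. The sketch below assumes a least-loaded policy with sticky assignments; the `MasterAllocator` class, its method names, and the host names are hypothetical, and nothing here is an existing Buildbot or puppet interface.

```python
from collections import Counter


class MasterAllocator:
    """Hypothetical allocator: tells each slave which master to connect to."""

    def __init__(self, masters):
        self.masters = list(masters)
        self.assignments = {}  # slave name -> master name

    def master_for(self, slave):
        # Keep an existing assignment so a rebooting slave stays put
        # as long as its master is still in the pool.
        current = self.assignments.get(slave)
        if current in self.masters:
            return current
        load = Counter({m: 0 for m in self.masters})
        load.update(m for m in self.assignments.values() if m in self.masters)
        # Pick the least-loaded master; ties go to the first one listed.
        chosen = min(self.masters, key=lambda m: load[m])
        self.assignments[slave] = chosen
        return chosen

    def remove_master(self, master):
        # Slaves on a removed master are reassigned the next time they ask.
        self.masters.remove(master)


allocator = MasterAllocator(["master01", "master02", "master03"])
first = [allocator.master_for(f"slave{i}") for i in range(6)]
allocator.remove_master("master03")
moved = allocator.master_for("slave2")  # was on master03, asks again on reboot
```

Because slaves only learn their master when they ask, moving a slave needs no login on the slave itself: change the pool and wait for the next reboot.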


  • some discussion of using a "load-balancer master" (one "master" master)
    • all other masters would be hidden behind the "master" master
    • need proof of concept
    • big, scary, expensive?