ReleaseEngineering/NoReboots

Warning: This RelEng page is obsolete!
This is largely based on Buildbot infra. Though some may apply to Taskcluster, this page needs to be updated.

Why no reboots:

Turning off reboots saves machine time in several ways: it avoids 70-120 seconds of reboot time, leaves more room for FS cache utilization, and makes it easier to share work (like clobbering) across jobs. If this is successful, we should expect to see fewer instances launched per day as the throughput of the infrastructure rises.

Results:

A preliminary survey of the time saved by non-rebooting spot instances suggests that, provided jobs are not failing more often as a result of the changes, a great deal of wasted time is being recovered:

   minimum time saved (seconds) = (<iterations_seen> - <halts_seen>) * <reboot_sec_avg>
One hour sample from 01-13-2015 01 CST Raw Data

The test/try machines spend more time rebooting (79s) and less time doing pre-flight tasks (~45s) on average while builders have an opposite skew (82s pre-flight tasks and 67s reboot time). We seem to have saved around 60 hours of machine time during this interval.

 start_to_bb_sec_avg  reboot_sec_avg  start_to_bb_sec_total  halts_seen  reboot_percentage  iterations_seen  type                reboot_sec_total
 47                   79              132104                 178         6                  2804             all spot instances  14070
 45                   79              121606                 167         6                  2676             test and try        13329
 82                   67              10498                  11          8                  128              builders            741

Three hour sample from 01-13-2015 02 CST Raw Data

During the longer sample, builder reboot times increased dramatically. This is likely the result of builder issues that were occurring around the time the data was taken; test/try results remained stable. We seem to have saved around 204 hours of machine time during this interval.

 start_to_bb_sec_avg  reboot_sec_avg  start_to_bb_sec_total  halts_seen  reboot_percentage  iterations_seen  type                reboot_sec_total
 47                   92              410104                 605         7                  8547             all spot instances  55769
 45                   80              362177                 553         7                  7884             test and try        44648
 72                   213             47927                  52          7                  663              builders            11121
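
As a worked check of the "minimum time saved" formula, the sketch below plugs in the "all spot instances" rows from the two samples above. The function and variable names are illustrative only and are not taken from the report script. The results come out at roughly 58 and 203 machine hours, in line with the rounded figures quoted in the text.

```python
def min_time_saved_seconds(iterations_seen, halts_seen, reboot_sec_avg):
    """Lower bound on seconds saved.

    Every runner loop that did not end in a halt avoided one reboot,
    which is costed here at the average observed reboot time.
    """
    return (iterations_seen - halts_seen) * reboot_sec_avg


# (iterations_seen, halts_seen, reboot_sec_avg) for "all spot instances"
samples = {
    "one hour": (2804, 178, 79),
    "three hour": (8547, 605, 92),
}

# Convert each lower bound from seconds to machine hours.
saved_hours = {
    name: min_time_saved_seconds(*row) / 3600.0
    for name, row in samples.items()
}

for name, hours in saved_hours.items():
    print("%s sample: about %.1f machine hours saved" % (name, hours))
```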

NOTE: The data above was gathered via Runner's InfluxDB logging and verified by spot checking runner logs (/var/log/runner.log). The reports themselves were generated by this script: http://pastebin.mozilla.org/8191890

No reboots timeline (puppet):

 date:        Fri Jan 16 17:47:33 2015 +0000
 summary:     Bug 1122601 - Coerce runner to reboot after particular job types; r=rail
 date:        Tue Jan 13 21:07:53 2015 +0000
 summary:     Bug 1109932 - Enable reboots for all try, talos, and test slaves; r=Callek
 date:        Tue Jan 06 14:26:55 2015 -0600
 summary:     Bug 1118125 - Turn off osx reboots; r=Callek
 date:        Tue Dec 30 19:40:23 2014 +0000
 summary:     Bug 1103123 - Turn off rebooting of all linux slaves; r=callek
 date:        Thu Dec 18 18:45:10 2014 +0000
 summary:     Bug 1113245 - Remove cleanslate process list on Linux and Mac machines during reboots with halt.py; r=rail
 date:        Fri Dec 12 21:33:04 2014 +0000
 summary:     Bug 1103123 - Turn off rebooting of talos machines; r=catlee

How no-reboot mode is enabled (idleizer, post_flight)

Buildbot is now started and managed by runner, which runs tasks in an infinite loop in a specified order (each task is blocking). Because of this, buildbot initiates a graceful shutdown immediately after accepting a job, so that runner can loop around again once the job has finished. A single runner loop looks like this:

   <tasks before buildbot> -> buildbot.py [graceful shutdown] -> <tasks after buildbot> -> post_flight.py

The graceful shutdown is initiated by idleizer.py. Afterwards, post_flight.py decides whether to shut down the machine or go forward with another loop.
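
The loop described above can be sketched as follows. This is a simplified illustration, not runner's actual implementation: the function names, the use of plain callables as tasks, and the `max_loops` cap are all assumptions made for the example.

```python
def run_loop(tasks, post_flight):
    """Run every task in order, then ask post_flight for a decision.

    Returns True if the machine should halt, False to loop again.
    """
    for task in tasks:
        task()  # blocking: the next task starts only after this returns
    return post_flight()


def runner(tasks, post_flight, max_loops=None):
    """Repeat run_loop until post_flight asks for a halt.

    max_loops is a hypothetical safety cap for demos; None means no cap.
    Returns the number of loops completed.
    """
    loops = 0
    while max_loops is None or loops < max_loops:
        loops += 1
        if run_loop(tasks, post_flight):
            break
    return loops


# Demo with stand-in tasks that only record that they ran.
log = []
tasks = [
    lambda: log.append("tasks before buildbot"),
    lambda: log.append("buildbot.py"),
    lambda: log.append("tasks after buildbot"),
]
# Stand-in for post_flight.py: keep going twice, then halt.
decisions = iter([False, False, True])
loops = runner(tasks, lambda: next(decisions))
```

In this demo the machine completes three loops before the stand-in post_flight asks for a halt, so only one reboot is paid for three jobs.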

post_flight checks:

hostname blacklist

Any machine whose hostname matches any regular expression in this list will be rebooted by post_flight. For example: ["^tst-", "^t-"] would reboot all test machines after any job.

build api

BuildAPI is used to fetch data about the most recent job; if the job failed, the slave is rebooted. This feature may need to be disabled, since it could mask failures. The reasoning behind turning it on was that we could track problems via logging and avoid tree closures when problems occur. This is likely too optimistic; failing hard may be better in the end.

jobname blacklist

Works like the hostname blacklist, except that it acts on the name of the most recently run job (which is known to BuildAPI).
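
Taken together, the three checks amount to a decision along the following lines. This is a minimal sketch rather than the real post_flight.py: the function names, the hostnames, the job name pattern, and the shape of the `last_job` record (a dict with `name` and `failed` keys standing in for whatever BuildAPI returns) are all assumptions for illustration. Only the hostname patterns come from the example above.

```python
import re


def matches_any(patterns, text):
    """True if any regular expression in patterns matches text."""
    return any(re.search(pattern, text) for pattern in patterns)


def should_reboot(hostname, last_job, hostname_blacklist, jobname_blacklist):
    """Decide whether the machine should reboot after the last job.

    last_job is a hypothetical record such as
    {"name": "...", "failed": False}, or None if nothing is known.
    """
    # 1. hostname blacklist: always reboot matching machines
    if matches_any(hostname_blacklist, hostname):
        return True
    if last_job is None:
        return False
    # 2. build api: reboot if the most recent job failed
    if last_job["failed"]:
        return True
    # 3. jobname blacklist: reboot after particular job types
    return matches_any(jobname_blacklist, last_job["name"])


HOSTNAME_BLACKLIST = ["^tst-", "^t-"]       # example from the text above
JOBNAME_BLACKLIST = ["^example-risky-job"]  # hypothetical pattern

# Hypothetical hostnames and job names, for illustration only.
test_slave = should_reboot(
    "tst-example-001", None, HOSTNAME_BLACKLIST, JOBNAME_BLACKLIST)
clean_build = should_reboot(
    "bld-example-001", {"name": "example build", "failed": False},
    HOSTNAME_BLACKLIST, JOBNAME_BLACKLIST)
failed_build = should_reboot(
    "bld-example-001", {"name": "example build", "failed": True},
    HOSTNAME_BLACKLIST, JOBNAME_BLACKLIST)
```

The order of the checks here is a design choice for the sketch: the hostname check comes first because it needs no job data, so it still works when nothing is known about the last job.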

What issues have been noted since no-reboot work started?

Since December 2014, the following bugs have been noted as possibly connected to Runner/NoReboots:

bug 1114541 bug 989048 bug 1109932 bug 1114688 bug 1111137

How are we tracking the status of machines, and measuring effectiveness?

Runner constantly uploads task stats to InfluxDB. For dashboards, see: https://stats.taskcluster.net/grafana/#/dashboard/db/runner