User:Armenzg/Test pool efficiency
This page collects information about running *tests*: how to understand where job time goes, which metrics to gather, and how to determine the efficiency of the system.
Information about jobs
Non-running-tests wall time:
- machine reboot time (if applicable)
- runner (if applicable)
- buildslave connecting to the master and the master assigning a job
- buildbot steps besides mozharness call
- buildbot steps lag (due to master lagginess)
- mozharness non-running-tests actions
- clobber
- download-and-extract
- checkout
About reboots
We always reboot Windows testers, since runner isn't managing all the processes there. We also reboot after any android, emulator, mochitest or reftest job, since those change the system state in ways we haven't been able to identify; the only way to get back to a known good state is to reboot.
Reboots can happen as part of runner (post mozharness run) or as part of mozharness.
From job to job
Finish job --> reboot (mh or runner) --> bootup --> runner does pre-flight checks, including running clobber --> starts buildbot
When runner is in place, we know when a machine starts buildbot again, since runner reports to InfluxDB.
NOTE: bootup time is not the same as AWS slave spin-up time.
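The job-to-job sequence can be turned into per-phase durations if a timestamp is recorded at each transition. This is a rough sketch under that assumption; the phase names and the `events` mapping are hypothetical, and any phase without a timestamp is skipped.

```python
from datetime import datetime

# Hypothetical phase names following the job-to-job sequence described above.
PHASES = (
    "job_finished",
    "reboot_started",
    "booted",
    "runner_started",
    "buildbot_started",
)


def phase_durations(events):
    """Return seconds elapsed between consecutive recorded phases.

    `events` maps a phase name to a datetime; phases that are absent are skipped.
    """
    present = [(name, events[name]) for name in PHASES if name in events]
    durations = {}
    for (prev_name, prev_ts), (name, ts) in zip(present, present[1:]):
        durations[f"{prev_name} -> {name}"] = (ts - prev_ts).total_seconds()
    return durations
```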
Known bugs
- We are currently experiencing lags introduced by masters
- reduce # of active jobs running on a master
- reduce # of buildbot steps
- reduce output
- the reason this impacts step lag is that the log processing is happening over the same channel as the start/stop commands
- can we make mozharness not output to stdio and have log_uploader.py upload the mozharness log and set log_url to it?
- send logs back to the master in bigger chunks (fewer interruptions of the masters)
- http://hg.mozilla.org/build/buildbotcustom/file/03644c855bb4/bin/log_uploader.py#l111
- the data is somewhat structured already - that function serializes it out to the current format
- bug 1209112 - Virtualenv cache always gets clobbered
- bug 1208223 - We lack Mozharness metrics for test jobs (per-action)
- We lack per-Buildbot-step metrics
- We have some data on pulse but we don't know the real elapsedTime
- We don't have runner for Windows test jobs
- This would move cleanup steps to before Buildbot starts up
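The idea of sending logs back to the master in bigger chunks can be illustrated independently of Buildbot. The sketch below is not Buildbot's or mozharness's API; `chunk_lines` and its `max_bytes` threshold are hypothetical names used only to show how batching reduces the number of messages sent.

```python
def chunk_lines(lines, max_bytes=64 * 1024):
    """Group log lines into chunks of at most `max_bytes` (UTF-8 encoded).

    A single line larger than `max_bytes` is emitted as its own chunk.
    """
    buffer = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Flush the current chunk before it would exceed the limit.
        if buffer and size + line_size > max_bytes:
            yield "".join(buffer)
            buffer, size = [], 0
        buffer.append(line)
        size += line_size
    if buffer:
        yield "".join(buffer)
```

With a larger `max_bytes`, the same log produces fewer chunks, which is the property that would reduce interruptions on the master.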
Optimizations
Auditing
- Evaluate which jobs can be combined or re-shuffled
Sources
Structured logs
- Jobs by status active data
buildbot_status  duration
exception        3473
failure          1353995
retry            107128
success          174430338
warnings         8688192
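To compare the statuses, the durations above can be converted into shares of the total. This is a small sketch using the figures from the table; the unit of `duration` is not stated in the source, so only the proportions are meaningful here.

```python
# Durations per buildbot_status, copied from the table above (unit not stated).
DURATION_BY_STATUS = {
    "exception": 3473,
    "failure": 1353995,
    "retry": 107128,
    "success": 174430338,
    "warnings": 8688192,
}


def duration_shares(durations):
    """Return each status's fraction of the summed duration."""
    total = sum(durations.values())
    if total == 0:
        raise ValueError("total duration must be non-zero")
    return {status: value / total for status, value in durations.items()}
```

Calling `duration_shares(DURATION_BY_STATUS)` shows how much of the recorded duration went to jobs that did not end in success.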
InfluxDB
- InfluxDB various DBs admin console
- Buildbot master lags: hosted graphite dashboard
- The master lag is calculated by measuring the reported time of one of the initial steps that should be nearly instantaneous
- What is the impact on jobs?
- Tree uptimes, end to end, branch load, time per push hosted graphite dashboard
- Per buildbot step metrics - pulse stream
- grafana runner dashboard
- We only have support for Linux and Mac
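The master lag calculation described above (the reported time of an initial step that should be nearly instantaneous) can be sketched as follows. The function names and the `expected_elapsed` parameter are hypothetical; this is an illustration of the idea, not the code behind the dashboard.

```python
from statistics import median


def master_lag_seconds(reported_elapsed, expected_elapsed=0.0):
    """Lag attributed to the master for one probe step, in seconds.

    The probe step should take roughly `expected_elapsed`; anything beyond
    that is treated as lag. Negative results are clamped to zero.
    """
    return max(0.0, reported_elapsed - expected_elapsed)


def median_master_lag(reported_elapsed_samples):
    """Median lag over several jobs' probe-step timings."""
    lags = [master_lag_seconds(sample) for sample in reported_elapsed_samples]
    if not lags:
        raise ValueError("at least one sample is required")
    return median(lags)
```

Using the median rather than the mean keeps a single very slow step from dominating the summary; that is a design choice made for this sketch.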