=Notes and Caveats=
Please contact Alice if you need to do a physical reboot.
=No Graphs=
==== Talos boxes are reporting green, but there are no graph links. It appears that no results from the tests were collected. ====
1. Turn on '--debug' for the Talos machines
* ssh to qm-rhel02
* in /build/perfmaster edit master.cfg
* change command=['python', 'run_tests.py', '--noisy'] to command=['python', 'run_tests.py', '--noisy', '--debug']
* have the buildmaster reload its configuration
** in /build
** buildbot reconfig perfmaster
2. Wait for the machines to cycle
3. Read the logs; they should contain graph server errors that indicate why sending data is failing
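The edit in step 1 amounts to appending one element to the command list in master.cfg. A minimal sketch of that change follows; <code>with_debug</code> is a hypothetical helper written for illustration, not part of buildbot or Talos, and the real file is normally edited by hand.

```python
def with_debug(command):
    """Return a copy of a Talos test command with '--debug' appended once."""
    if "--debug" in command:
        # Already in debug mode: return an unchanged copy.
        return list(command)
    return list(command) + ["--debug"]


# The command as it appears in master.cfg before the change.
command = ['python', 'run_tests.py', '--noisy']
debug_command = with_debug(command)
print(debug_command)
```

Once the problem is diagnosed, removing '--debug' again and running the same reconfig keeps the logs from growing unnecessarily.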
=Slave isn't reporting/is failing out=
=== A given Talos slave hasn't reported any numbers in a long time (upwards of 6 - 8 hours) OR A given Talos slave from a set has been consuming a lot of builds rapidly and failing out on browser download/installation ===
1. Check waterfall at: http://qm-rhel02.mozilla.org:2006/ (mpt-vpn)
* see if slave is connected.
2. Restart the slave
* log in to the machine using the provided credentials
** VNC (qm-pxp01-05, qm-mini*):
*** close running instances of Firefox and any dialog windows (make sure to check the taskbar)
** log in via ssh
** 'buildbot stop talos-slave' (ignore the 'never saw slave...' message on Mac)
** 'buildbot start talos-slave' (ignore the 'never saw slave...' message on Mac)
* verify that the slave reappears on the buildbot waterfall page
'''Note:''' Builds are triggered by finished builds on the Tinderbox (Firefox for trunk, Mozilla1.8 for branch). Depending on when the master was started, it may take up to 10 minutes to recognize a change. If the master has been restarted, the first completed Tinderbox builds are often missed, so it can take 30-40 minutes to verify that systems are working as expected.
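The stop/start sequence in step 2 can be wrapped in a small script when several slaves need the same treatment. This is only a sketch: <code>restart_commands</code> and <code>restart_slave</code> are hypothetical names, and the script assumes it runs on the slave itself, from the directory that contains talos-slave.

```python
import subprocess


def restart_commands(slave_dir="talos-slave"):
    """Return the two buildbot commands from step 2, in order."""
    return [["buildbot", "stop", slave_dir],
            ["buildbot", "start", slave_dir]]


def restart_slave(slave_dir="talos-slave", run=subprocess.call):
    """Stop, then start the slave and return both exit codes.

    Exit codes are returned rather than treated as fatal, since the
    'never saw slave...' message on Mac is expected and can be ignored.
    """
    return [run(cmd) for cmd in restart_commands(slave_dir)]


# On the slave machine this would be invoked as:
#   restart_slave()
```

After the restart, the waterfall page still has to be checked by hand, and the delays described in the note above still apply.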
=Odd Reports=
=== A given Talos machine is reporting significantly higher/lower numbers than matching machines. ===
*** Talos machines reporting to trunk come in sets of three (qm-mini-ubuntu01/02/03, qm-mini-vista01/02/03, etc.) so that outlier results can be spotted. If we see an outlier, we try to fix the configuration on that machine so that it matches its peers.
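The reasoning behind the sets of three can be sketched in a few lines: with three machines, the median is a stable reference and any machine far from it stands out. The 10% tolerance and the sample numbers below are assumptions chosen for illustration, not real Talos results or an official threshold, and <code>find_outliers</code> is a hypothetical helper.

```python
from statistics import median


def find_outliers(results, tolerance=0.10):
    """Return the machines whose result is far from the set's median.

    results   -- dict mapping machine name to a test result
    tolerance -- allowed deviation as a fraction of the median (assumed value)
    """
    mid = median(results.values())
    return sorted(name for name, value in results.items()
                  if abs(value - mid) > tolerance * abs(mid))


# Made-up numbers for one set of three machines.
sample = {
    "qm-mini-ubuntu01": 250.0,
    "qm-mini-ubuntu02": 252.0,
    "qm-mini-ubuntu03": 310.0,
}
print(find_outliers(sample))  # ['qm-mini-ubuntu03']
```

Comparing against the median rather than the mean keeps a single bad machine from dragging the reference value toward itself.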
==== Linux ====
1. Stop the build slave
~$ buildbot stop talos-slave
2. Is throttling on/correct?
* reset the throttling
~$ sudo cpufreq-set -g userspace
~$ sudo cpufreq-set -g userspace -c 1
~$ sudo cpufreq-set -f 1000
~$ sudo cpufreq-set -f 1000 -c 1
~$ cpufreq-info
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 0
hardware limits: 1000 MHz - 1.67 GHz
available frequency steps: 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: userspace, conservative, powersave, ondemand, performance
current policy: frequency should be within 1000 MHz and 1.67 GHz.
The governor "userspace" may decide which speed to use
within this range.
current CPU frequency is 1000 MHz.
analyzing CPU 1:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 1
hardware limits: 1000 MHz - 1.67 GHz
available frequency steps: 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: userspace, conservative, powersave, ondemand, performance
current policy: frequency should be within 1000 MHz and 1.67 GHz.
The governor "userspace" may decide which speed to use
within this range.
current CPU frequency is 1000 MHz.
3. Is the random number generator set up correctly? (/dev/random should be a character device with major 1, minor 9, the same numbers as urandom)
~$ cd /dev
~$ sudo rm random; sudo mknod random c 1 9
~$ ls -l | grep random
crw-r--r-- 1 root root 1, 9 2007-12-18 10:48 random
crw-rw-rw- 1 root root 1, 9 2007-12-17 22:24 urandom
4. Can you VNC to the machine?
* log in via VNC
* If this fails, log in via ssh and start the VNC server by hand:
~$ sudo x11vnc -display :0 -shared -forever -rfbauth /home/mozqa/.vnc/passwd -auth /var/lib/gdm/:0.Xauth -bg
5. Check settings
* Screensaver off
* Auto-update off
* All sleep features off
* Screen size 1280 x 1024
6. Re-start Apache
~$ sudo /etc/init.d/apache2 restart
7. Re-start the buildbot slave and check the numbers after the next successful machine cycle
~$ buildbot start talos-slave
8. If all else fails, reboot the machine
9. Ensure the settings as described above are correct
10. Re-start the buildbot slave
~$ buildbot start talos-slave
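The throttling check in step 2 can be automated by parsing a cpufreq-info report in the format shown above. The sketch below assumes the report text has already been captured (for example by redirecting the command's output to a file); <code>check_cpufreq</code> is a hypothetical helper, and the expected governor and frequency are taken from the sample output in step 2.

```python
import re


def check_cpufreq(output, governor="userspace", frequency="1000 MHz"):
    """Return {cpu_number: True/False} for each CPU in a cpufreq-info report.

    A CPU passes when the report names the expected governor and the
    expected current frequency for it.
    """
    status = {}
    # With one capture group, re.split yields
    # [preamble, "0", body_of_cpu0, "1", body_of_cpu1, ...].
    chunks = re.split(r"analyzing CPU (\d+):", output)
    for cpu, body in zip(chunks[1::2], chunks[2::2]):
        governor_ok = ('The governor "%s"' % governor) in body
        frequency_ok = ("current CPU frequency is %s" % frequency) in body
        status[int(cpu)] = governor_ok and frequency_ok
    return status


# Shortened report: CPU 0 is set as expected, CPU 1 is not.
report = """analyzing CPU 0:
The governor "userspace" may decide which speed to use
current CPU frequency is 1000 MHz.
analyzing CPU 1:
The governor "ondemand" may decide which speed to use
current CPU frequency is 1.67 GHz.
"""
print(check_cpufreq(report))  # {0: True, 1: False}
```

Any CPU reported as False needs the cpufreq-set commands from step 2 run again, with the matching -c value.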