Mobile/Maemo4 Testfarm Notes

Here is a writeup of a number of the issues we had to solve, work around, or live with (to this day) in ramping up the automated Maemo4 test device pool. Hopefully this will give some insights as to some possible hurdles we will face as we attempt to automate a large number of devices on other platforms.

I'm not sure if this is a comprehensive list of the issues we've faced, but it's what I can recall from memory. Hope it helps.


Power and Battery

Maemo Power Management

On the N810s, power management is software-based (and buggy). Also, the power delivered by the stock power supply appears to be less than what a running device draws, resulting in constant battery drain when plugged in and powered on.

On reboot [?], occasionally the devices will go into a) an infinite reboot cycle, where the device shows the Nokia screen, blanks the screen, then comes back, forever, or b) "ACTDEAD" mode. One theory for the first situation is JFFS2 filesystem corruption. The second failure is caused by the startup script (/mnt/initfs/linuxrc) entering ACTDEAD mode because the device thinks it was turned on by the charger being plugged into an otherwise off device.

Diablo has a bug where a reboot with the power plugged in can cause some of these infinite reboot situations. The fix is to unplug the power and reboot, then replug the power once it's booted; this is not really an option in an automated 24/7 farm of dozens of devices. The workaround, which I haven't gotten to work, is to let the battery drain all the way then launch a terminal with a

while [ 1 ] ; do sleep 1; done

running inside that terminal. We attempted this and found that all it really did was annoy us and hide the fact that the device hadn't connected to wifi.

(This bug is supposedly fixed in Fremantle.)

Test Harnesses or Power Bench

These devices were never meant to be turned on and left on 24/7. We've spent a significant amount of time getting them to do this.

The real solution here is [appears to be] to get rid of the battery and get the device onto a power supply that gives enough power. There are test harnesses for the Nokias that cost $500 a pop that do this (we don't need any of the serial I/O provided by the harnesses); we're also investigating a bench power supply and soldering connections to the back of the devices.

($500 may seem trivial until you realize we may need 30 more of them, at which time it starts looking like real money. Also, our current count of 40 devices is not anywhere near enough to keep up with 3 branches throttled. I hear rumors of covering all branches, as well as Try, per-checkin; this would require hundreds of working devices. All of a sudden this becomes 10s to 100s of thousands of dollars on power supplies alone.)

Screen Blanking and Power Save

As mentioned above, we've had to attempt a number of solutions to power saving. Power saving takes a number of guises:

  • wifi power saving (mentioned below)
  • screen power saving
  • cpu power saving

You can disable a number of these in the Control Panel, to a degree. After setting the screen dimming to 2min dim / 5min off / never turn off while charging, you can push it further by installing the third-party MoreDimmingOptions package and setting those times to 1440min (24 hours).
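
For reference, the same display timeouts can reportedly be driven from a shell via gconf; the key names and seconds-based units below are assumptions on my part, and the Control Panel plus MoreDimmingOptions route above is what we actually used.

# assumed Maemo gconf keys for display power saving (values in seconds)
gconftool-2 -s /system/osso/dsm/display/display_dim_timeout -t int 86400
gconftool-2 -s /system/osso/dsm/display/display_blank_timeout -t int 86400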

However, when the battery is full, the N810 no longer sees itself as "charging", even if it's still plugged in.

I've tried the Advanced Power Management package, which is great at keeping the device on with lots of information about the charge, but has a way of constantly popping up an annoying notification about the battery that you have to click to make go away. I suspect this will affect unit tests that require focus. I haven't tried rolling this out to production.

For cpu, we added a Talos buildbot step to

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

which should keep the cpu from falling asleep mid-testsuite.
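
To sanity-check that the step took (and to see which governors the kernel offers), something like the following works with the standard cpufreq sysfs layout:

# list the governors the kernel supports, then confirm the active one
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor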

There are other magic /sys files that have not been explored. Some of them are written to during the boot process by /mnt/initfs/linuxrc. We have tried to stop these values from being set, but have not had time to fully investigate this.

Physical Connectivity

A decent percentage of our issues have been caused by the physical [wall-wart] plug of the device. A standard [American] 3-prong power plug tends to stay connected very solidly to the wall or power strip. The Nokia power supply can easily be nudged out of the power strip, leading to battery drain on the device in question.

This is exacerbated by the shipment of European N810s, which came with UK power plugs and adapters. The adapters not only took up 2x the space on a power strip (so we could only plug 3-4 devices into one strip), but also further loosened the connection because of both faulty plug converters and general physical unwieldiness.

Network and Wifi

Wired vs. Wireless

In limited comparisons between wired and wireless ethernet, I found wireless more reliable. Very few people believe me when I state this fact. On the N8x0s, wired ethernet involves:

  • enabling host USB on the N8x0
  • attaching a USB ethernet device to the N8x0 via a USB cable and adapter which aren't terribly physically robust, so bumping into the table might disconnect the network
  • cross-compiling a third party driver that isn't supported by Nokia
  • expecting this whole thing to be at a level of enterprise-testing robustness

When I say that wireless was more reliable, I mean any of the following:

  • physical decoupling from the wire,
  • finding the network down on the wired device more often (requiring an /etc/init.d/network restart or a reboot), or
  • finding the wired device hibernating/turned off/infinite rebooting more often than the wireless device.

The latter may be due to additional power drain from the USB connection.

Wifi Specific Issues

Sometimes the devices have issues reconnecting to the wifi network automatically. This has been mostly resolved or worked around.

Turn wifi power saving off. When enabled, it can cause disconnects in processes that need to stay connected (ssh, twisted), and dougt says it can crash your wireless router.

We reduced the signal strength from 100mW to 10mW since the routers are right there and we don't want the devices interfering as much with each other.
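
For reference, a rough command-line equivalent of these two tweaks (power saving off, 10mW transmit) would be something along these lines, assuming the wireless interface shows up as wlan0, which is an assumption on my part:

# disable wifi power saving and drop transmit power to 10mW
iwconfig wlan0 power off
iwconfig wlan0 txpower 10mW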

zandr says we should keep the N810s 10cm apart for the same reason.

We're waiting on an RF-shielded room which should hopefully help our wifi stability; there are so many wifi networks all clamoring for the 802.11b/g space and possibly people attempting to hack in from elsewhere. A wifi network inside an RF shielded room should help with both those scenarios.

[Non] Disconnects

As a mobile device, the N810 [correctly] defaults to keeping network connections open even when the underlying link drops. This is useful for mobile users moving from one tower's (or wifi router's) range to another.

When the N810 is used as a networked 24/7 test device, this can become a point of aggravation or confusion. SSH connections take a long time to drop after the device loses network or reboots, and buildbot thinks that the device is still available days after it's dropped off the map.

We're still living with this but it certainly isn't the biggest problem we have.

We are investigating moving mobile-master to a Linux-based machine because we feel that part of the disconnect issue stems from the OS X networking stack. Some of our Xserves (which are unit test and build machines, not masters) are suffering from this issue, which further supports this notion.

/dev/random

At dougt's suggestion I moved /dev/random aside and made /dev/random a symlink to /dev/urandom, so reads of it never block waiting for entropy and ssh connections can be established faster. This makes sense since availability and speed matter more to us in the test infrastructure than data security.
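
Roughly, the change looked like this (the .orig name is just illustrative):

# make /dev/random behave like the non-blocking /dev/urandom
mv /dev/random /dev/random.orig
ln -s /dev/urandom /dev/random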

However, after I noticed that

  • /dev/random comes back at boot, and
  • the devices on which I made this change seemed to fall over [need reimaging] 2-3 times more often than devices without this change,

I stopped making this change. It's solving a non-critical issue and introducing a new one.

Update Icon

Every once in a while, the little orange icon will blink forever, notifying you that updates are available for your Debian packages.

Interestingly enough, when this was going on (on maybe half our devices), we noticed there was a large discrepancy between devices in performance numbers on certain Talos suites, and the slow devices corresponded with the devices with the update notification icon.

This discrepancy disappeared once I clicked on the orange update icon, chose "update now", then cancelled out of the update inside of App Manager. That made the orange icon go away but kept my packages at the same revision, and it brought those devices' perf numbers back in line with the other devices'.

I'm not sure what was causing the perf regression exactly: perhaps multiple pings to the repositories, or maybe just the cpu/graphics time needed to render a blinking icon. I'm sadly guessing the latter.

We can disable most of the update notifications, but not all (I think the system repo is hardcoded and not disable-able in the Control Panel), so this problem has mostly gone away. Also, we'd rather have each device at a known state of packages and only roll out updates when we plan to do so, so this is an acceptable solution for us for the moment.

Disk and Filesystems

Internal Flash

Out of the box, the N810 has /, which is raw flash memory formatted with JFFS2, and /media/mmc2 (the internal SD card), which has much more space but is formatted vfat for some reason. Due to vfat limitations, running executables is problematic, even if you fix the vfat mount options. I've hit enough errors trying to run fennec from /media/mmc2 that I've given up and installed as much as I can on /... all that remained in /media/mmc2 were the tp3/tp4 pagesets and maemkit (python was on /, though).
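
For the record, the mount-option fix mentioned above amounts to something like remounting the card with exec allowed; vfat still can't do symlinks or unix permissions, which is part of why it wasn't enough:

# allow running executables from the vfat card; symlinks and permissions still won't work
mount -o remount,exec /media/mmc2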

/ is small enough (256MB) that there isn't enough disk space to download and extract two tarballs of fennec, especially if the unit tests are there as well. jhford noticed that if we *generate* the filesystem image rather than writing onto the filesystem, it uses much less space... jffs2 is not very optimized for writing.
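
Generating the image offline would look roughly like the following; mkfs.jffs2 comes from mtd-utils, and the erase block size here is an assumption that would need to match the device's flash:

# build a JFFS2 image from a prepared root directory instead of writing
# files onto a live JFFS2 filesystem
mkfs.jffs2 -r rootfs/ -o rootfs.jffs2 -e 128KiB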

/media/mmc2 tends to become corrupted easily. This may have to do with swap living on /media/mmc2/.swap by default, maybe it's something else, but when /media/mmc2 goes read-only it's time to re-image, as I don't trust it after a fsck.vfat -y. (See the imaging section below.)

However, we have since switched to booting off of...

SD Cards

The main win here is reimaging turnaround time; we could have dozens of spare, pre-imaged SD cards that are ready to hot-swap into a device that needs reimaging. Power on, change the hostname, and you're set.
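
The per-card rename is nothing fancy; assuming the stock /etc/hostname setup, it's along the lines of the following (the device name is made up):

# give the freshly swapped card a unique identity
echo n810-042 > /etc/hostname
hostname n810-042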

If we get multiple SD card readers and/or a 20x SD card cloner, that could reduce the serial 13min/card time significantly.

We're running fairly well on these. We're still hitting a number of the above issues as we try to enable some sort of intelligent monitoring and/or logging so we can maintain this many devices without having to look at or poke at each one individually to tell if they're running smoothly.

We've got a larger amount of usable disk available, which is nice. We tried upping swap with little success (see below).

I have, however, run into some un-removable files. This might be filesystem corruption caused by hard reboots, since ext2 doesn't have journaling. However, in my searches I read that ext3 (journaling) isn't recommended on the device. We can try it and see if the filesystem reliability offsets the performance hit. Otherwise, the workaround is constant reimaging.
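
If we do try ext3, the switch itself is cheap: tune2fs can add a journal to an existing ext2 filesystem in place. A sketch, with the device node being an assumption that would differ per setup:

# add a journal to the existing ext2 filesystem, turning it into ext3
tune2fs -j /dev/mmcblk0p1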

Memory and Swap

Max Swap

We seem to have a max swap of 128MB. I think I read about this online somewhere. The Control Panel only allows you to create 128MB of swap.

Early on I tried adding a second swapfile of 128MB, and recently jhford attempted to create a 700MB swapfile on the internal flash memory. We hit corruption or other breakage a lot sooner than we would expect, and things started working again when we went back down to 128MB, so as far as we're concerned we have a max of 128MB of swap to work with.

Corruption

I think I covered this in the filesystem/disk section above. After corruption, I suppose you could nuke and recreate your swapfile and re-rsync whatever's on that disk, but that takes as much time as (or more than) reimaging an SD card, and seems less reliable to me.
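
Recreating the swapfile is straightforward, for what it's worth; a sketch assuming the default /media/mmc2/.swap location and the 128MB limit discussed above:

# recreate the default 128MB swapfile from scratch
swapoff /media/mmc2/.swap 2>/dev/null
rm -f /media/mmc2/.swap
dd if=/dev/zero of=/media/mmc2/.swap bs=1024 count=131072
mkswap /media/mmc2/.swap
swapon /media/mmc2/.swap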

Swap Required

We've attempted to run test suites on devices without swap (usually by mistake) and we can't. Not if we want any useful information from the tests other than "we *still* can't run these without swap".

I believe there's a growing QA bug about sites that don't work without swap. For the moment, swap is required.

Manual Intervention and Maintenance

The N810s require constant maintenance and manual intervention. Much less so than before, but still, if we were to leave for a week or two and come back, all of the N810s would be down.

We have a number of possible fixes that could improve this (e.g. test harness power) but time will tell.

Prompts

We've noticed we're hitting more and more prompts that expect user input in the automated unit tests; the browser waits on that user input until buildbot kills the test suite and reboots the device. This is ok because buildbot will kill the test after no output for X amount of time (where X is configurable in the buildbot configs). There is a bug open on this, but expect to need to know when the browser is hung.

Wake Me Up

This can be as easy as "touch my (blacked out) screen" or "press my power button until I turn back on", or as involved as "attempt to get me re-networked" or "remove my power, try to boot me up, and either reattach power or yank the battery and toss me in the reimage box".

It's best if you can track issues like these by device, so you know if one is acting up more than the others.

Reboots

Cleaning the System

We can spend a lot of code and time trying to detect and force the previous test run to quit if it has left a zombie process running. The easiest way to force the quit is to reboot the system.

This is also a good way to clear memory leaks and caches. We've noted a significant uptick in system stability after a reboot, and, correspondingly, more perma-reds and crashes in Fennec when the device has been up across multiple test runs.

Stabilizing Numbers

We are rebooting our desktop test systems regularly. Our Talos boxen reboot after every run and we have a bug open to reboot our unit test/build boxen every run as well (we're currently at a reboot every 5 runs).

What we've noticed on desktop talos is that uptime creates a corresponding upward movement in numbers (slower perf) that drops back down to the baseline after a reboot. Rebooting every run creates less variance in the numbers, resulting in more reliable perf numbers.

There is an argument that users don't reboot as regularly as our test boxen, but most of them don't load nearly as many pages in an hour as our talos boxen do either. We would need to write a test that we think mirrors user behavior if we want to measure that, and if that test takes days of uptime for a single run, that would require exponentially increasing our farm of talos boxen.

Clean Startup

Reboots are a source of much goodness here, but create a requirement of booting up cleanly. If the devices are rebooting around the clock 24/7, any sort of manual intervention requirement will shortly result in zero available devices for tests.

We've had to work around the lag in acquiring a wifi connection and IP address (generally by sleeping and attempting to connect to a remote server periodically for 30 minutes, and rebooting if that's not successful). There is a need to get the new build onto the device without prompts, which was mostly straightforward on Maemo but appears to be much less so on win{ce,mo}. Then we launch the tests automatically, record the results, and reboot.
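
The wait-for-wifi logic is nothing more sophisticated than a retry loop; here is a simplified sketch, where the hostname and intervals are placeholders rather than our actual configuration:

# poll until the network is reachable, reboot if it never comes up
tries=0
while ! ping -c 1 buildbot-master.example.com > /dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 180 ]; then    # roughly 30 minutes at 10-second intervals
        reboot
    fi
    sleep 10
done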

System Crashes

These are fairly rare in the desktop realm, but all too common in the device realm. At their most benign, the crashes kill the test run in progress and boot the device cleanly for the next test run. At their worst, they can corrupt the filesystem or otherwise render the device useless until we reimage.

For the former scenario, you want to have:

  • a "live" logging system so you know approximately where in the test run it crashed. This means output from the test runs as they go, hopefully not too far behind, that are kept and reported somewhere other than the device.
    • We use buildbot for Maemo tests, so twistd receives the logs with maybe up to a few hundred characters of lag. sshd would have a similar result, as could possibly any sort of remote command that shares stdout/stderr with the caller.
    • We are able to call tests quietly, so they log all of their output to a file and dump its contents at the end. This doesn't work well in the case of crashes, so I've asked Alice to re-enable live output from the new Talos (already taken care of) and I've hacked Maemkit to output test results per-chunk when verbose=True.

There isn't much to be done for the second scenario outside of having a good imaging solution.

There's another scenario in between, where the crash kills the test run but doesn't boot up the device cleanly, or possibly doesn't reboot the device at all. It's here that you need to fall back on attempts at detecting and killing previous test runs. The best situation here is to somehow realize that this happened, and force a reboot.

More on Imaging

It's easy for me to say "just reimage" now that jhford has rolled out an imaging solution. Previously, reimaging the device involved flashing the original device image, then installing and configuring all the packages and files I needed by hand. This was a highly time-consuming and error-prone process.

Rolling out and testing the two automated imaging processes that we've used took time too, but now we can reap the rewards. But any new platform that we expect to support to this degree will need a similar solution.

Imaging is needed regularly. I think the downtime helps any battery/heat issues, and re-creating the filesystem helps since the disks and filesystems don't seem terribly robust. I don't think anyone ever designed a phone expecting that anyone would want to read, write, and remove hundreds of MB per hour, 24/7.

Talos Setup

I'm not sure how this will be handled outside of a python framework. The current Maemo setup is documented here.

You do need a webserver unless you plan on using file://. Alice notes that http:// and file:// use different code paths in the browser, so it's preferable to use http://.

The main thing to keep in mind when setting up Talos (or other perf numbers) is that internal consistency is king. Yes, the tests are all controlled by a buildbot daemon and hit a local webserver (nginx) that reads the local filesystem to load the pagesets, and that takes up memory and disk throughput that could otherwise be utilized by the browser.

However, since every single device is configured in the same way, and the load on each particular device is only really affected by its own tests, we get a consistency level that is acceptable enough to give us semi-reliable perf numbers. Not as reliable as the best desktop talos boxes, but enough to tell when code changes affect those numbers.

One of the solutions being considered is an external webserver, which would lower the device load. Our main concern here is that it may be better for device reliability, but worse for consistency.

Yes, if you have an external webserver that serves pages to one device, it may be stable enough to give good numbers. However, if 50 devices hit the same webserver, we're going to see webserver lag in our numbers. If we find that at certain times of day or points in the release cycle all 50 are hitting the webserver simultaneously, and at other times only 3-4 are, we're going to start seeing large fluctuations in numbers on top of any fluctuations you'd normally see. Add in any sort of network weirdness and it's conceivable the tests won't have any real reliability.

This solution seems to have the most momentum, however, so this is just a warning about what we may end up seeing.

Unittest Setup

I think jmaher is pretty familiar with the details here.

The main issue once these are running is parsing and tracking test failures/runs, as well as differentiating the known/expected failures for platform X versus desktop Firefox. I think that's largely tracked in bug 511174.