Mozilla A-Team: Peptest results, an exercise in statistical analysis

Mar 04, 2012 14:45

UPDATE: It's been pointed out that the current metric (sum of squares of unresponsive periods, divided by 1000) is used in Talos and has had a fair bit of thought put into it. I was curious what not squaring the results would do, but I wouldn't go with another metric without more careful thought.

UPDATE 2: It has also been pointed out that peptest tests performance, not correctness, and hence should report its results elsewhere (essentially as I've done with the sampled data) and not be a strict pass/fail test. This approach definitely warrants some consideration.

About a week and a half ago, peptest was deployed to try. To recap, peptest identifies periods of unresponsiveness, where "unresponsiveness" is currently defined as any time the event loop takes more than 50 ms to complete. We have a very small suite of basic tests at the moment, looking for unresponsiveness while opening a blank tab, opening a new window, opening the bookmarks menu, opening and using context menus, and resizing a window.

The results are currently ignored, since we still don't know how useful they will be, but you can see them by going to https://tbpl.mozilla.org/?tree=Try&noignore=1. They are marked by a "U" (not sure why exactly, but it will change at some point to something more obvious).

At the moment, every platform fails at least one of these tests, and most of the time multiple tests fail. This isn't too surprising, since 50 ms is a pretty bold target. However, going forward, we need some sort of baseline so that we can identify real regressions. To accomplish this, peptest tests can be configured with a failure threshold: we calculate a metric for each test (see below), and, if a failure threshold is configured, a metric value below it is considered a pass. Hopefully, we can identify a threshold for each test (or, more likely, for each platform-test combination) such that all the tests pass but significant increases in unresponsiveness still trigger failures. At the same time, we will file bugs on all the tests so we don't forget that there are still unresponsive periods during their execution that are being hidden by the thresholds. We can lower or eliminate the thresholds as these bugs are partially or fully fixed.
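
To make the pass/fail logic concrete, here is a minimal sketch, in Python, of how a configured failure threshold might be applied. This is not peptest's actual code, and the function and parameter names are hypothetical; it just restates the rule described above.

    def test_passes(metric, failure_threshold=None):
        """Decide pass/fail for a single test run.

        metric: the test's unresponsiveness metric (see the discussion below).
        failure_threshold: hypothetical per-test (or per-platform-test) value;
        if configured, a metric below it counts as a pass. With no threshold,
        any unresponsiveness at all fails the test.
        """
        if failure_threshold is None:
            return metric == 0
        return metric < failure_threshold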

Things, of course, aren't that simple. I gathered and analyzed the peptest logs from try over a four-day period, and there is quite a lot of variance in the results, even on the same platform. With a sufficiently generous threshold, we could get the tests to pass most of the time, but there are occasionally some crazy outliers that no reasonable threshold could contain. However, it is probably okay for the tests to turn orange once in a while: zero oranges might be an unreasonable target for this project, and intermittent oranges would be a reminder that, sometimes, there really are unacceptable periods of unresponsiveness.

(By the way, one test, test_contextMenu.js, appears to fail only on Linux and Linux64, but this is actually a bug in the test: on all the other platforms, it errors out before it hits the end. I've since fixed this but haven't collected new data yet.)

I experimented a bit with the test metric to see if that improved the situation. Right now, as deployed on try, the metric is calculated as the sum of the squares of the unresponsive periods in a single test (an unresponsive period being, by definition, a value above 50 ms). I tried just summing the periods without squaring them, which seemingly increases the variance in some tests and decreases it in others. I also experimented with raising the minimum unresponsive period from 50 ms to 100 ms, since there are strong arguments that 50 ms is pretty unrealistic, at least at this stage.
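
For reference, here is a rough sketch of those metric variants, reconstructed from the description above rather than taken from the peptest source (the division by 1,000 comes from the update at the top of the post). durations stands for the list of event-loop delays, in milliseconds, recorded during one test.

    def unresponsive_periods(durations, cutoff=50):
        """Event-loop delays above the cutoff: 50 ms by default, or 100 ms
        in the relaxed variant."""
        return [d for d in durations if d > cutoff]

    def metric_sum_of_squares(durations, cutoff=50):
        """Metric currently deployed on try: sum of squares of the
        unresponsive periods, divided by 1000."""
        return sum(d ** 2 for d in unresponsive_periods(durations, cutoff)) / 1000.0

    def metric_sum(durations, cutoff=50):
        """Alternative metric: plain sum of the unresponsive periods."""
        return float(sum(unresponsive_periods(durations, cutoff)))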

I've graphed the failures, along with their means and standard deviations, at http://people.mozilla.com/~mcote/peptest/results/. I also plotted passes as 0s, in a different colour (there are certainly lots of event-loop delays below 50 ms in those passes, but for all intents and purposes they are 0). There are unique URLs for all combinations of platform, test, and metric. The raw data is also available there (in JSON).
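
The per-combination statistics behind those graphs amount to a few lines of Python. The sketch below assumes a JSON layout of {platform: {test: [metric values]}}, with passes recorded as 0; that layout and the file name are illustrative, not necessarily the actual format of the raw data.

    import json
    import math

    def summarize(values):
        """Mean and (population) standard deviation of a list of metric values."""
        mean = sum(values) / float(len(values))
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        return mean, math.sqrt(variance)

    # Hypothetical file name; assumed layout: {platform: {test: [metrics]}}.
    with open("results.json") as f:
        results = json.load(f)

    for platform, tests in sorted(results.items()):
        for test, metrics in sorted(tests.items()):
            failures = [m for m in metrics if m > 0]  # passes were plotted as 0
            if failures:
                mean, std = summarize(failures)
                print("%s %s: mean=%.1f stddev=%.1f (%d failures, %d passes)"
                      % (platform, test, mean, std,
                         len(failures), len(metrics) - len(failures)))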

Following is a brief discussion of some of the problems with identifying good failure thresholds.

Some of the simple tests don't have much variance. test_openBlankTab.js, which just measures the responsiveness when opening and closing a blank tab, mostly passes, with just a few outliers. Some slightly more complicated tests, however, have quite a bit of variance. The bookmarks-menu test, test_openBookmarksMenu.js, scrolls through the bookmarks menu and then opens the bookmarks window. The results on snowleopard are particularly egregious:

[Graph: test_openBookmarksMenu.js on snowleopard, sum of squares of unresponsive periods]

As you can see, most of the failures are clustered around the mean. The standard deviation encompasses most of them. Changing the metric from the sum of squares of unresponsive periods to just the sum of the periods improves things a little:

[Graph: test_openBookmarksMenu.js on snowleopard, sum of unresponsive periods]

There is only one point above a single standard deviation, although two are rather close. Increasing the allowable unresponsive period to 100 ms reduces the standard deviation, but only because a few low points become passes:

[Graph: test_openBookmarksMenu.js on snowleopard, 100 ms minimum unresponsive period]

So this is one example where we would expect to see at least one orange every few days, even if we set the failure threshold to about 25% higher than the mean.

In other cases we have mostly passes but some really crazy outliers. On snowleopard, test_openWindow.js, which merely opens a new window, mostly passes, but in this sample there is one run that had unresponsive periods totalling more than 250 ms.

[Graph: test_openWindow.js on snowleopard]

So here, we could leave the failure threshold at 0 ms, although we'd still have oranges every few days. In this case, raising the minimum unresponsive period to 100 ms wouldn't make a difference, since the few failures are significantly above 100 ms.

test_openWindow.js on leopard, however, is all over the place when using just the sum of unresponsive periods:

[Graph: test_openWindow.js on leopard, sum of unresponsive periods]

There aren't really any outliers here, just a large spread of values. A reasonable failure threshold would have to be twice the mean to ensure that oranges occur only occasionally.

In this case, switching to a sum of squares makes the outliers more obvious, although the standard deviation becomes quite large:

[Graph: test_openWindow.js on leopard, sum of squares of unresponsive periods]

And in case it wasn't obvious, the results are completely different on a different OS. Take test_openWindow.js on Windows 7:

[Graph: test_openWindow.js on Windows 7]

Most results are clustered, but there are 5-6 real outliers, depending on how you define an outlier. This test-platform combination looks like a real candidate for regular oranges unless an extremely generous failure threshold is defined.

In conclusion, it's going to be kind of tough to define failure thresholds such that most runs pass but real regressions are still identified. There doesn't seem to be a huge difference between using the sum of unresponsive periods versus the sum of their squares, although in some instances the latter makes the outliers more obvious. Raising the minimum acceptable unresponsive period unsurprisingly produces more passes but doesn't really reduce the variance among the failures. Regardless, it looks like I will have to go through the sampled results and, for each test, set a failure threshold that encompasses the majority of the failures; even then there will be intermittent oranges. Comments and suggestions welcome!
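
If it helps to make that process concrete, here is one way the sampled failures could be turned into candidate thresholds, along the lines mentioned above (roughly 25% above the mean, twice the mean, or the mean plus a couple of standard deviations). The rules and names here are just possibilities for discussion, not a decision; summarize() is the helper sketched earlier.

    def candidate_thresholds(failures):
        """Candidate failure thresholds derived from the sampled failing runs."""
        mean, std = summarize(failures)
        return {
            "mean * 1.25": mean * 1.25,
            "mean * 2": mean * 2,
            "mean + 2 * stddev": mean + 2 * std,
        }

    def orange_rate(failures, threshold):
        """Fraction of sampled failing runs that would still turn orange
        under a given threshold."""
        return sum(1 for m in failures if m >= threshold) / float(len(failures))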

peptest, mozilla
