David Baron's Weblog

Broadening crash analysis

Wednesday, 2009-10-14, 15:38 -0700

In my last blog entry I discussed tools to help us analyze the reports we get for a particular crash signature. While we don't yet have all the tools we want in order to do that analysis, we have a bunch of them, and a pretty good idea of what other things we need. As I mentioned there, I was looking at the relationship between libraries that are loaded and crash signatures to show the case where the cause of the crash is one of the libraries that was loaded. Since then, I've also looked at the relationship between addons installed and crash signatures: this shows many of the same relationships (but also loses a few and gains a few), but with a much cleaner and easier-to-interpret source of data. I've also been looking at the relationship between crash signatures and number of CPU cores, since crashes that happen disproportionately on multi-core machines tend to be caused by threading problems. Other things, per signature, that we need to look at more in the future, include change-over-time, where time refers both to the build that the user is using (for crashes caused by our code) and wall clock time (for crashes caused by other code). Which type of time the appearance of a crash correlates better with can even be a sign of its cause.
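As a rough sketch of the per-signature comparison described above: for one signature, compare how often each add-on appears in crashes with that signature against how often it appears in all crashes. The record shape (a signature paired with a set of add-on IDs) and the function name `addon_correlations` are assumptions made for illustration, not the format or tooling actually used for the crash data.

```python
from collections import Counter

def addon_correlations(reports, signature):
    """Rank add-ons by how over-represented they are in one signature.

    `reports` is an iterable of (signature, set_of_addon_ids) pairs.
    This record shape is hypothetical, chosen only for the sketch.
    Returns rows of (addon, rate_in_signature, rate_overall, difference),
    sorted with the most over-represented add-on first.
    """
    total = 0
    in_sig = 0
    all_counts = Counter()
    sig_counts = Counter()
    for sig, addons in reports:
        total += 1
        all_counts.update(addons)
        if sig == signature:
            in_sig += 1
            sig_counts.update(addons)

    rows = []
    for addon, n in sig_counts.items():
        rate_sig = n / in_sig              # share of this signature's crashes
        rate_all = all_counts[addon] / total  # share of all crashes
        rows.append((addon, rate_sig, rate_all, rate_sig - rate_all))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows
```

The same shape of comparison would apply to loaded libraries or to the number of CPU cores, with the set of add-on IDs swapped for whatever attribute is being examined.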

However, analyzing crashes per signature has significant weaknesses. In particular, there are a lot of crash signatures. We tend to look at the most common ones. When doing that, we end up gaining a reasonably good understanding of the 25 most common crash signatures. However, since there's a long tail of less frequent crash signatures, understanding the 25 most common crash signatures only gives us an understanding of about 20%-25% of our crashes. (Though if we claim to understand all the different crashes that happen inside the Flash plugin, which is probably a number slightly smaller than the number of crashes caused by that plugin, then we can quickly "understand" an additional 18% of our crashes on Windows and 31% on Mac. Hopefully the multi-process work will make those crashes no longer crash Firefox in a Firefox release not too far in the future.)
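The long-tail effect is easy to measure for any sample of reports. Here is a minimal sketch, assuming the input is just a list with one signature string per crash report; the function name `top_n_coverage` is made up for this example.

```python
from collections import Counter

def top_n_coverage(signatures, n=25):
    """Fraction of all crash reports covered by the n most common signatures.

    `signatures` holds one signature string per crash report
    (an assumed input format for this sketch).
    """
    counts = Counter(signatures)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total
```

Calling it with several values of `n` shows how slowly coverage grows once the most common signatures are accounted for.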

I'd like to be able to get a better understanding of our crashes by looking at all the crash data in aggregate, rather than grouped by signature. This would help us answer questions like which extensions cause the most Firefox crashes (which in turn tells us what the most important things to fix are), or what portion of our crashes are caused by extensions vs. plugins vs. our own code (which can help us allocate crash-fighting resources correctly). However, I haven't figured out how to do this, and I think it's reasonably hard.

I think it's hard because in the data we're looking at, there are a lot of correlations. From looking at a group of related crash reports, it's somewhat difficult to tell apart three cases: a crash caused by visiting pages with Japanese text, a crash caused by a particular piece of software that Japanese users use to input text, and a crash caused by a piece of software installed by default on many computers shipped in Japan in the past year. It's perhaps even harder if we actually have some of each and we want to know their proportions.
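One partial way to probe this kind of confounding (not a solution to the aggregate problem) is to stratify: compare how often a candidate factor shows up in a signature's crashes versus other crashes, separately within each value of a second variable such as locale. The sketch below assumes each report is a dict with hypothetical `"signature"` and `"addons"` keys plus whatever key is used for stratifying; none of these names come from the real crash data.

```python
from collections import defaultdict

def stratified_rates(reports, signature, factor, stratum_key):
    """Per stratum, the rate of `factor` in `signature` crashes vs. other crashes.

    Each report is a dict with "signature", "addons" (a set), and
    `stratum_key` entries; this layout is an assumption for the sketch.
    Returns {stratum: (rate_in_signature, rate_in_others)}, skipping any
    stratum that lacks crashes on one side of the comparison.
    """
    # counts[stratum] = [sig_with_factor, sig_total, other_with_factor, other_total]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for report in reports:
        c = counts[report[stratum_key]]
        has_factor = factor in report["addons"]
        if report["signature"] == signature:
            c[0] += has_factor
            c[1] += 1
        else:
            c[2] += has_factor
            c[3] += 1

    result = {}
    for stratum, (sig_with, sig_total, other_with, other_total) in counts.items():
        if sig_total and other_total:
            result[stratum] = (sig_with / sig_total, other_with / other_total)
    return result
```

If a factor looks over-represented in the pooled data but the two rates are about equal within every stratum, the factor may just be standing in for the stratifying variable. This only helps for confounders we think to stratify on, which is part of why the general problem stays hard.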

Analysis of all the crashes in aggregate might not be the only way forward, though. For example, we might be able to get significant benefits from tools that detect that a group of crash signatures are related, thus effectively reducing the number that we have to look at.
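To make the idea of detecting related signatures concrete, here is one possible sketch. It treats two signatures as related when the sets of stack frames seen with them overlap heavily (Jaccard similarity at or above a threshold), then merges related signatures into groups with a small union-find. The similarity measure, the 0.5 threshold, and the input mapping are all assumptions chosen for illustration; a real tool might relate signatures in some quite different way.

```python
def group_signatures(frames_by_signature, threshold=0.5):
    """Group signatures whose associated frame sets overlap heavily.

    `frames_by_signature` maps a signature string to a set of stack frame
    names seen with it (a hypothetical input for this sketch). Two
    signatures are linked when the Jaccard similarity of their frame sets
    is at least `threshold`; groups are the connected components of links.
    """
    sigs = list(frames_by_signature)
    parent = {s: s for s in sigs}

    def find(s):
        # Follow parent links to the group's representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, a in enumerate(sigs):
        for b in sigs[i + 1:]:
            frames_a = frames_by_signature[a]
            frames_b = frames_by_signature[b]
            union = frames_a | frames_b
            if union and len(frames_a & frames_b) / len(union) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for s in sigs:
        groups.setdefault(find(s), set()).add(s)
    # Sort groups by their members so the output order is deterministic.
    return sorted(groups.values(), key=lambda g: sorted(g))
```

The pairwise loop is quadratic in the number of signatures, so with a long tail of signatures it would need some pre-filtering before being practical.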