David Baron's weblog: Crash analysis in the future

Last year I wrote about some things I was doing to analyze the crash reports that Firefox users submit when Firefox crashes. I've been meaning for a while to explain a little more context about what we currently do with crash reports and what we could do in the future.

Fundamentally, we use crash reports for two different things: figuring out which crashes we should focus on, and fixing a specific crash.

We should aim to focus resources where they are most efficient, that is, where they produce the most benefit per unit cost. When we're fixing crashes, this means that:

We should focus on the crash bugs that happen the most often (affect the most users or crash each user most often). To put it another way, we should focus on crash reports where fixing the crash in that report will simultaneously fix the largest number of other reported crashes.
We should focus on crashes whose effects are more serious. For example, a crash that prevents a user from using Firefox is probably more serious than a crash that occurs intermittently, even though we will likely get more reports of an intermittent crash (since those users keep using Firefox) than of a permanent crash on startup (since those users will give up). It isn't always obvious which crashes are more important, but we're still better off thinking about it.
We should focus on crashes that are easier to fix. They may be easier to fix because we have relevant data (such as steps to reproduce the crash or data pointing to a particular extension that's responsible) or because the crash was recently introduced and therefore the code involved is fresh in the mind of its author.

Our current approach to focusing resources involves classifying crashes by their signature: the top meaningful frame in the stack trace of the crashing thread (or in some cases, more than one frame). We then look at which signatures are most frequent, and when new signatures appear. This fails to account for a number of common cases: when one signature shows up for multiple underlying bugs, when multiple signatures show up for a single underlying bug, or when a single underlying bug changes between different signatures over time due to unrelated code changes.
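To make the signature idea concrete, here is a minimal sketch in Python. It is not the code we actually run: the report format (a dict with a "stack" list, top frame first), the `SKIP_FRAMES` list, and the function names are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical list of frames too generic to identify a crash on their own.
SKIP_FRAMES = {"abort", "raise", "memcpy"}


def signature(stack):
    """Return the topmost frame that is not in SKIP_FRAMES.

    `stack` is a list of function names for the crashing thread,
    top of the stack first. Falls back to the top frame if every
    frame is in the skip list.
    """
    for frame in stack:
        if frame not in SKIP_FRAMES:
            return frame
    return stack[0] if stack else "(empty stack)"


def rank_signatures(reports):
    """Count reports per signature, most frequent first."""
    return Counter(signature(r["stack"]) for r in reports).most_common()
```

Because only the top meaningful frame is kept, two reports whose stacks differ in that one frame land in different buckets even when everything below it is identical, which is one way a single bug ends up under multiple signatures.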

In the future, we should use better cluster detection methods to group crash reports into groups of similar crashes, based on all the data we have in the crash report (including stack trace, installed extensions and other libraries present, CPU and OS versions, and URLs being visited). We should also use the data in the crash reports (such as time since startup) and the frequency of crashes over time (since crashes that cause users to stop using Firefox will spike after a release and then fall) to detect which crashes are of the more serious types. Together, these improvements should help with prioritization of crashes.
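As a rough illustration of grouping on more than the top frame, the sketch below turns each report into a set of features and groups reports whose sets overlap enough. The technique here (Jaccard similarity with a greedy single pass) is one simple choice among many, not a proposal for what we should deploy, and the report fields ("stack", "extensions", "os") and the threshold are assumptions.

```python
def features(report):
    """Flatten a report into a set of tagged features."""
    feats = {f"frame:{f}" for f in report["stack"]}
    feats |= {f"ext:{e}" for e in report["extensions"]}
    feats.add(f"os:{report['os']}")
    return feats


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(reports, threshold=0.5):
    """Greedy single-pass grouping.

    Each report joins the first existing cluster containing a member
    at least `threshold` similar to it, otherwise it starts a new
    cluster. Clusters are never merged afterwards, so the result can
    depend on the order of the reports.
    """
    clusters = []  # each cluster is a list of (report index, feature set)
    for i, report in enumerate(reports):
        f = features(report)
        for members in clusters:
            if any(jaccard(f, g) >= threshold for _, g in members):
                members.append((i, f))
                break
        else:
            clusters.append([(i, f)])
    return [[i for i, _ in members] for members in clusters]
```

With this kind of grouping, two reports that share most of their stack, extensions, and OS fall into the same cluster even if their top frames (and therefore their signatures) differ.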

Our use of data for fixing a specific crash is a bit closer to where I'd like it to be than our use of data for prioritizing crashes. A significant part of this is the data provided by Breakpad: the machine state at the time of the crash. In many cases, either we've been given steps from a user that are sufficient to see the crash for ourselves, or the machine state provided by Breakpad is sufficient information for us to fix the problem. However, there are also many crashes where neither is the case. The tools I worked on last year provide significant additional pieces of information. However, those tools are fundamentally tools to assist in playing a game of “spot the outlier.” The process of looking for clues involves looking for what's unusual about the crash reports. Do all the users who are crashing have the same extension? Are they all on multi-core CPUs? Do they all have the same version of Windows? Right now, we have the ability to look up each of these pieces of data. However, developers' lives would be significantly easier (and they'd be less likely to miss important clues) if we had tools that automatically detected which characteristics of the crashes were unusual.
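One simple way to automate “spot the outlier” is to compare how often each characteristic appears in the reports for one crash against how often it appears in a baseline set of reports. The sketch below does that in Python. The report format, the function names, and the `min_lift` cutoff are assumptions for illustration, and using all crash reports as the baseline is itself an approximation, since what we'd really like to compare against is the population of all users.

```python
from collections import Counter


def attribute_counts(reports):
    """Count how many reports have each (field, value) pair.

    Fields holding a list, set, or tuple (such as installed extensions)
    contribute one pair per distinct value.
    """
    counts = Counter()
    for report in reports:
        for key, value in report.items():
            values = value if isinstance(value, (list, set, tuple)) else [value]
            for v in set(values):
                counts[(key, v)] += 1
    return counts


def unusual_attributes(crash_group, all_reports, min_lift=2.0):
    """List attributes over-represented in crash_group versus all_reports.

    Returns (attribute, group rate, baseline rate, lift) tuples sorted by
    lift, where lift is the group rate divided by the baseline rate.
    """
    group_counts = attribute_counts(crash_group)
    base_counts = attribute_counts(all_reports)
    results = []
    for attr, n in group_counts.items():
        base = base_counts[attr]
        if base == 0:
            continue  # attribute never seen in the baseline; no rate to compare
        group_rate = n / len(crash_group)
        base_rate = base / len(all_reports)
        lift = group_rate / base_rate
        if lift >= min_lift:
            results.append((attr, group_rate, base_rate, lift))
    return sorted(results, key=lambda t: t[3], reverse=True)
```

A high lift is only a clue, not a diagnosis: with few reports in the group, an attribute can look unusual by chance, so a real tool would also want some measure of statistical significance.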

I'm hoping that future versions of Socorro will let us do many of these things.

Thursday, 2010-11-11, 14:58 -0800