artichoke

Mar 15, 2008 13:16


Sometimes the software we run generates errors.

I mean, it'd be nice if it didn't, but it does, and so what we hope for is that the error report includes enough information about the conditions that led to the error for us to track it down and fix it. Generating a report for a particular error is something pretty well understood; we generally use some variation on stack traces and core dumps, which works well enough.

The part I don't manage so well is what to do when you're receiving these errors not from a single customer working with your application, but from the global set of all your users at once. (This problem is most obvious in web applications, but plenty of desktop applications have "report crash to developer" functionality now as well.) The approaches I've seen so far are:
  1. show an error message to your user, and leave it to them to figure out how to communicate the error to you. I trust I don't need to go into all the ways in which this sucks.
  2. dump everything into a log file, which is easy to do, but has insufficient structure to get a high-level view of what the current state of things is.
  3. send an email with every error, which works fine in many cases, but makes bad problems worse: in addition to a bug that needs fixing, you now have to deal with a torrential flow of emails (with large chunks of debugging data attached) clogging your developers' inboxes.

The balance I need is to be informed of a new type of error as quickly as possible, but not to be flooded with redundant reports. I need to know if the problem is affecting 80% of our users, or just one in a thousand. I need all the debugging information stored somewhere for inspection if I need it, but not all pushed down to my email/phone/jabber/whatever in case I don't. I want to classify reports by exception type, code path, and perhaps other random details (browser version, IP address, etc.).
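
To make those requirements a bit more concrete, here is a minimal Python sketch of the behaviour I have in mind. It is not an existing tool; the names `fingerprint`, `ErrorAggregator`, `record`, and `affected_fraction` are made up for illustration, and the "storage" is just an in-memory dict. It groups reports by exception type plus code path, keeps every full report, and calls a notification hook only the first time a given kind of error shows up.

```python
import hashlib
import traceback
from collections import defaultdict


def fingerprint(exc):
    """Identify an error by its exception type and code path.

    The code path is approximated here (an assumption) by the file and
    function name of every stack frame, ignoring line numbers.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__] + [f"{f.filename}:{f.name}" for f in frames]
    return hashlib.sha1("|".join(parts).encode()).hexdigest()[:12]


class ErrorAggregator:
    """Hypothetical in-memory collector: store everything, notify once per type."""

    def __init__(self, notify):
        self.notify = notify              # called only for a new kind of error
        self.reports = defaultdict(list)  # fingerprint -> all full reports
        self.users = defaultdict(set)     # fingerprint -> affected user ids

    def record(self, exc, user_id, details=None):
        key = fingerprint(exc)
        is_new = key not in self.reports
        self.reports[key].append({
            "user": user_id,
            "details": details or {},     # e.g. browser version, IP address
            "trace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        })
        self.users[key].add(user_id)
        if is_new:
            self.notify(key, exc)         # one alert, not one per report
        return key

    def affected_fraction(self, key, total_users):
        """Share of all users who have hit this error at least once."""
        return len(self.users[key]) / total_users
```

In use, the web application's top-level exception handler would call something like `agg.record(exc, user_id, {"browser": ...})`, and `notify` would be whatever sends the email or jabber message. Everything else stays in storage until somebody goes looking for it.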

I know I'm not the only one with these requirements, so I'm sure an application for managing this exists somewhere; I just haven't found it yet. What is it?

programming, lazyweb, exceptions, event management, hacker
