Do you want a Time Travel Debugger

Sep 30, 2021 16:11

I never got around to talking about what my current work does: http://undo.io There was some previous discussion of the topic on Facebook: https://www.facebook.com/jack.vickeridge/posts/10103938681712440

What is a Time Travel Debugger

It records everything that happens in a program's execution, so you can step backwards as well as forwards, or rewind execution and then replay it again more carefully. Or you can "replay" it backward, e.g. going to the end of time, seeing your program crashed with a null pointer and then setting a watchpoint on that pointer and reverse-continuing until you find out where the pointer was set to that value.
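To make the "reverse-continue to a watchpoint" idea concrete, here is a toy sketch in Python. It is not how the real tool works: this version simply keeps a copy of every variable after every step, and the names `run_and_record` and `reverse_continue_to_write` are made up for illustration.

```python
def run_and_record(program):
    """Run each (name, value) assignment, keeping a copy of all variables after every step."""
    variables = {}
    history = []
    for line, (name, value) in enumerate(program, start=1):
        variables[name] = value
        history.append((line, dict(variables)))
    return history


def reverse_continue_to_write(history, name):
    """Walk backward from the end of time to the step that last changed `name`."""
    for index in range(len(history) - 1, -1, -1):
        line, after = history[index]
        before = history[index - 1][1] if index > 0 else {}
        if before.get(name) != after.get(name):
            return line, after.get(name)
    return None


# A pretend program: "ptr" is valid at first, then something sets it to None.
program = [("ptr", "buffer_a"), ("count", 3), ("ptr", None), ("count", 4)]
history = run_and_record(program)

# Start at the end (where the crash would be) and search backward for the culprit.
hit = reverse_continue_to_write(history, "ptr")
print(hit)
```

Here the search starts at the last step, skips the unrelated write to `count`, and stops at step 3, which is where `ptr` became `None`.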

There are two main modes of use: using it like a debugger sitting in front of a program, or using a companion recorder (which is actually an executable with much of the same code but packaged differently) to record your program in your overnight test suite, or while running it to replicate a bug that happens in a very long running process. Once you've reproduced the bug once, you've almost finished: you can just load up the recording and step forward and back in a debugger until you figure out what went wrong.

That sounds impossible!

Yes, it does sound impossible, but it works.

It records literally everything the program does that interacts with the outside world in any way, e.g. any system call (including any file access, network access, gettime, even getpid, etc, etc), any instructions which write to shared memory, etc. That can get large for some programs (but customers do use it successfully!)
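As a rough illustration of the record-then-replay idea (a toy model, not the real implementation, which works on system calls and machine instructions), here is a Python sketch. The `EventLog` class and its methods are hypothetical names invented for this example.

```python
import random
import time


class EventLog:
    """Toy event log: records results of outside-world calls, then replays them."""

    def __init__(self):
        self.events = []
        self.replaying = False
        self._cursor = 0

    def call(self, name, fn):
        if self.replaying:
            # Replay: do not touch the outside world, return the saved result.
            recorded_name, value = self.events[self._cursor]
            assert recorded_name == name, "replay diverged from the recording"
            self._cursor += 1
            return value
        # Record: really perform the call and remember what it returned.
        value = fn()
        self.events.append((name, value))
        return value

    def start_replay(self):
        self.replaying = True
        self._cursor = 0


def program(log):
    """A 'program' whose result depends on the clock and on randomness."""
    now = log.call("time", time.time)
    roll = log.call("random", lambda: random.randint(1, 6))
    return (now, roll)


log = EventLog()
recorded = program(log)   # first run talks to the real world
log.start_replay()
replayed = program(log)   # second run is fed from the log
print(recorded == replayed)
```

Because every nondeterministic input comes from the log during replay, the second run sees exactly the same values as the first, which is what makes the rest of the execution repeatable.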

It saves a snapshot at several points during history (by forking the process there), so it can create the state of any point in history by forking another process from that snapshot, and playing it forward using the saved events instead of actually doing any of the things that interact with the outside world.
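The snapshot-plus-replay scheme can be sketched in Python as follows. This is only a toy model under some loud assumptions: the "process" is a small dictionary, a snapshot is a `copy.deepcopy` rather than a forked process, and the `Recording` class and `step` function are names made up for this example.

```python
import copy


def step(state, value):
    """One deterministic step of the toy program, driven by a recorded input."""
    state["total"] += value
    state["steps"] += 1


class Recording:
    """Toy recording: saved inputs plus periodic snapshots of program state."""

    def __init__(self, inputs, snapshot_every=4):
        self.inputs = list(inputs)
        self.snapshot_every = snapshot_every
        state = {"total": 0, "steps": 0}
        # snapshots[t] is the state after the first t inputs have been applied.
        self.snapshots = {0: copy.deepcopy(state)}
        for i, value in enumerate(self.inputs):
            step(state, value)
            if (i + 1) % snapshot_every == 0:
                self.snapshots[i + 1] = copy.deepcopy(state)

    def state_at(self, t):
        """Rebuild the state after t inputs: nearest earlier snapshot, then replay."""
        if not 0 <= t <= len(self.inputs):
            raise ValueError("time is outside the recording")
        base = (t // self.snapshot_every) * self.snapshot_every
        state = copy.deepcopy(self.snapshots[base])
        for value in self.inputs[base:t]:
            step(state, value)
        return state


recording = Recording(range(1, 11), snapshot_every=4)
print(recording.state_at(7))
print(recording.state_at(3))  # "going backward" is just another lookup
```

Going to an earlier time never runs anything in reverse: it restores the closest snapshot at or before that time and replays forward from there, so the cost is bounded by the gap between snapshots.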

It does all this by rewriting the compiled program in memory, and maintaining a mapping between the rewritten memory and the original assembly. So you, the user, see the original source code and original assembly, with whatever level of debug info you originally compiled the program with. But behind the scenes, almost any non-trivial instruction is rewritten to do something else: either to save the result of the instruction in the event log, or to replay the value from the event log.

That means that you can attach it to any program, compiled any way, just like any debugger can. You don't need to compile it with some magic -- people keep expecting this, and it could have been written that way, but instead, you can just connect it to any program you could attach gdb to.

Caveats

Recording multiple threads is slow, and recording multiple processes doesn't exist yet. We're working on it, but right now it can help with some multithreading bugs and not with others.

Program execution is slower, between 2x and 10x. We are working to improve that. Replaying through execution can be faster than that (and you can usually go directly to the beginning, end, etc without any replaying).

This is all on linux only.

The interface and implementation are based on the gdb frontend/gdb server protocol. So by default it looks like debugging with gdb but with "reverse-next" as well as "next". And it works with any program which uses gdb as a backend, e.g. Visual Studio Code or emacs, although some of those are tested extensively and some aren't.

But no linux debugger has a very good UI, so currently it is mainly used by people who have to debug using something like gdb anyway, but want to be able to solve harder bugs quicker. We are trying hard to make it easy for languages like python and java where the translation has to understand an interpreter as well as the code. This works in the sense that it can be recorded and replayed, but getting a good user experience is a lot harder.

Worth and Price

I always describe it as, the difference between "not having a debugger" and "having a debugger". If you have a debugger, maybe actually 90% of problems you can solve with print statements. But the 10% that you can't fix with print statements could take months to solve without a debugger, or hours with a debugger. It's hard to describe why you need a debugger to someone who hasn't tried using one. But almost no-one would go back to not having one.

A time travel debugger makes trivial the small proportion of issues that still feel impossible even with a debugger. You say, "yes, it fails intermittently but we don't know if we'll ever track it down unless someone wants to study the failure for nine months", but that might be only hours with the right tool.

Unfortunately, this tool takes a large amount of programmer effort to create, and is only viable if it's sold commercially. If you view it as "The 5% of bugs we have that take 9 months to track down, instead get solved in a few hours", and compare the cost to the salary for an extra programmer or two, it's very reasonable. But most people including me hate paying for tools, so it's hard to sell.

It has a great retention rate -- companies which have signed up for a contract have almost always kept it, and programmers who have used it regularly (including me) are very, very eager to keep having it available.

Currently there are several introductory offers. There's an educational license which is cheaper or free. There might be an offer of free licences to the right open source project if you're interested. There's a 30 day free trial, and a personal license, in the hopes people will become converts and persuade their employer to adopt it. There is a standing offer that if you have an intractable, hard to reproduce bug that you'd like to see just go away, we can arrange some sort of trial to have someone come and help capture and diagnose that bug, and see if that leads to a longer term arrangement.

Ask questions in the comments. Feel free to download the trial -- if you've used gdb, it's fairly straightforward to try out, and it's magical to see "step back, step forward".

Or if it sounds like you might be someone who would actually benefit from acquiring a license, I can put you in touch with helpful people -- we used to focus on big clients because there was a lot of shakedown, but now it works more reliably out of the box, it's plausible for a wider spectrum of companies and people. You can also comment at https://jack.dreamwidth.org/1137863.html using OpenID.

work, tech
