Tuesday 14 October 2008

Debugging war stories

Fishermen tell of the one that got away. Golfers tell of the amazing shot that happened when there was no-one to see. People who like debugging (and we are an odd breed) tell of the worst bug that they ever faced.

Well, there have been some really obscure ones. There was one that I tried to find every working day for 4 months, in an operating system where the problem took 40 minutes to reproduce and couldn’t be automated, where there was no debugger, and where the crash killed the OS stone dead with no diagnostics. That was one to remember, but with modern tools you don’t get that sort of thing any more. Modern nightmares are a bit different and I would like to talk about some of the ones that I sometimes see. Oh, most of these will be in C++ because it makes more sense that way. They also happen in the runtime systems of various languages, most of which are written in C or C++.

References to COM objects fail apparently randomly with a null pointer or a pointer that leads to garbage, but there doesn’t seem to be any error in the code. Ah, how often have we seen this one? A variant is that a DLL has disappeared between function calls into it. The explanation is simple – the reference count is wrong, so the object (or DLL, or whatever type of thing it was) was unloaded. You can’t see what unloaded it because that happened on another thread, or because the system cleaned it up under you, without you doing anything, because it looked unused. That is always fun because there can be dozens of areas in the code where you are seeing the access violation and you don’t know if you are seeing one bug or a dozen. It is relatively easy to track these down with a little judicious breakpointing and stepping, just so long as you remember that you are altering the behaviour as soon as you attach a debugger. If it doesn’t reproduce when there is debugging or tracing, oh, that can be a horror.
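To make the reference-counting rule concrete, here is a minimal sketch in portable C++. It is not real COM: `Counted`, `Ref` and `demo_balanced_refs` are hypothetical names invented for illustration, and the count is not thread-safe as a real COM object's would need to be. The point is only that every extra reference must be matched by exactly one release, and that a live-instance counter makes an imbalance visible.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a COM-style object; NOT a real COM interface.
// Single-threaded sketch: a real implementation would use atomic counts.
class Counted {
public:
    static int live;                  // number of instances currently alive
    Counted() { ++live; }
    void add_ref() { ++refs_; }
    // Returns the new count; the object deletes itself at zero.
    int release() {
        const int remaining = --refs_;
        if (remaining == 0) delete this;
        return remaining;
    }
private:
    ~Counted() { --live; }            // private: only release() may destroy
    int refs_ = 1;                    // the creator holds the first reference
};
int Counted::live = 0;

// Minimal RAII holder: one add_ref per copy, one release per destruction,
// so the count cannot drift the way hand-written calls can.
class Ref {
public:
    explicit Ref(Counted* p) : p_(p) {}   // adopts the creator's reference
    Ref(const Ref& other) : p_(other.p_) { if (p_) p_->add_ref(); }
    Ref& operator=(Ref other) { std::swap(p_, other.p_); return *this; }
    ~Ref() { if (p_) p_->release(); }
    Counted* get() const { return p_; }
private:
    Counted* p_;
};

void demo_balanced_refs() {
    {
        Ref first(new Counted());
        Ref second = first;           // second reference to the same object
        assert(Counted::live == 1);
    }                                 // both holders gone: object destroyed
    assert(Counted::live == 0);
}
```

If some code path called `release()` once too often by hand, the object would be destroyed while a `Ref` still pointed at it, which is exactly the "pointer that leads to garbage" symptom described above.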

Data being wildly wrong for no obvious reason, more or less at random – for example, maybe you get a currency value that was fine when it went into the record but is NaN (a bit pattern that does not represent a valid number) when you come to use it. Old hands will recognise that one as probable heap corruption. There are great tools to help you with that one. If you are a fan of WinDbg, have a look at the GFlags tool that ships alongside it. In managed code, you can get similar things if you pass a data structure of some kind to an unmanaged DLL and don’t pin it in memory, because the garbage collector may move it while the unmanaged code is still using the old address. As with the previous example, the cause of the crash is nowhere near where the actual error is. These are nasty types of error for most people but there are techniques for dealing with them.
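One such technique is a guard value placed next to the data you care about, checked before use. The sketch below is only an illustration under stated assumptions: `Record`, `kGuard`, `simulate_overrun` and `looks_corrupt` are hypothetical names, and the "overrun" is simulated by writing over the record's own bytes rather than by a genuine out-of-bounds write. It also shows why NaN turns up so often: a double whose bytes are all 0xFF is a NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t kGuard = 0xFEEDFACEu;   // arbitrary marker value

// Hypothetical record: a text buffer followed by a guard and a currency value.
struct Record {
    char name[8] = {};
    std::uint32_t guard = kGuard;
    double amount = 0.0;
};

// Simulates a buggy write that starts in `name` and runs on through
// `amount`, filling everything in between with 0xFF bytes.
void simulate_overrun(Record& r) {
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&r);
    const std::size_t begin = offsetof(Record, name);
    const std::size_t end = offsetof(Record, amount) + sizeof(double);
    std::memset(bytes + begin, 0xFF, end - begin);
}

// Cheap sanity check to run at the point of use.
bool looks_corrupt(const Record& r) {
    return r.guard != kGuard || std::isnan(r.amount);
}

void demo_overrun() {
    Record r;
    r.amount = 12.50;
    assert(!looks_corrupt(r));   // fine when it went into the record
    simulate_overrun(r);
    assert(looks_corrupt(r));    // guard trampled and amount is now NaN
}
```

A check like this does not tell you who did the damage, but calling it at several points narrows down when the damage happened.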

Memory leaks used to be very popular – and very often misdiagnosed. People are sometimes a bit confused by memory usage. As regular readers of my old blog know, I am a big fan of object brokers. If you haven’t come across them before, they are memory allocators that you write yourself that will give you an object to use when you need it and you return it when you are done. From the point of view of the client code, what you have looks a lot like the heap – I ask for a blank MyObj structure by calling a function and I get a pointer. When I am done, I return it with a different function. They are not called new and delete but so what? The difference is that the object broker isn’t creating and destroying them – it is maintaining a pool of them and they are not taken from and returned to the heap. I always like to have my object broker tell me how many objects it currently has on loan. That makes debugging memory issues much simpler. Oh, and some people will tell you that there is no need for object brokers now there is the low fragmentation heap. Well, I will hang on to mine. Why have the system do work that it doesn’t need to do? However…
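For anyone who has not seen one, a very small object broker might look something like the sketch below. This is an assumption-laden illustration rather than the broker described in the old blog: the names `ObjectBroker`, `acquire`, `give_back`, `on_loan` and `pool_size` are hypothetical, the fields of `MyObj` are invented, and there is no thread safety.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical pooled allocator: objects are created once, then recycled.
template <typename T>
class ObjectBroker {
public:
    // Hand out a blank object, reusing a pooled one when available.
    T* acquire() {
        T* obj = nullptr;
        if (free_.empty()) {
            owned_.push_back(std::unique_ptr<T>(new T()));
            obj = owned_.back().get();
        } else {
            obj = free_.back();
            free_.pop_back();
            *obj = T();               // reset to a blank state before reuse
        }
        ++on_loan_;
        return obj;
    }

    // Return an object to the pool; it is not given back to the heap.
    void give_back(T* obj) {
        assert(on_loan_ > 0);
        free_.push_back(obj);
        --on_loan_;
    }

    std::size_t on_loan() const { return on_loan_; }        // objects lent out
    std::size_t pool_size() const { return owned_.size(); } // objects ever made

private:
    std::vector<std::unique_ptr<T>> owned_;  // keeps every object alive
    std::vector<T*> free_;                   // objects available for reuse
    std::size_t on_loan_ = 0;
};

// Illustrative payload; the field names are assumptions.
struct MyObj {
    int id = 0;
    double amount = 0.0;
};

void demo_broker() {
    ObjectBroker<MyObj> broker;
    MyObj* a = broker.acquire();
    broker.give_back(a);
    MyObj* b = broker.acquire();      // the same storage comes back
    assert(b == a);
    assert(broker.on_loan() == 1);
    broker.give_back(b);
    assert(broker.on_loan() == 0);
}
```

If `on_loan()` is non-zero when the work is finished, something forgot to hand an object back, and that is a far quicker check than staring at process memory.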

Object brokers often cause reports of memory leakage. A common concern was that more memory was being held after an operation than before it. A lot of people raised this issue in the early days of managed code. What you commonly see with code that uses one or more brokers is that the memory usage will grow and then reach a stable plateau, with a little variance caused by allocations that are not brokered – and there will always be some of those. It is always worth waiting to see if a rise in memory levels off after a while before deciding that you have a leak. However, you can get a situation with managed code where the garbage collector is overwhelmed: under very heavy load, the memory grows until the GC is forced to collect because allocations would otherwise be impossible. This is a pretty major housekeeping job and it requires access to a good deal of memory to keep track of what is going on – and there isn’t much memory around because the process space is full of objects waiting for GC. Things get messy then.

Multithreaded hangs are always tricky and I have spoken at length about them before in my old blog. Nothing much has changed about how you debug those. It is still like trying to untangle a mad woman’s knitting in the dark while wearing gloves. This is certainly one case where prevention is much better than cure.
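As one example of prevention, the classic two-lock deadlock can be designed out by never taking the locks one at a time. The sketch below is an assumption on my part about what prevention might look like, not a technique taken from the old blog; `Account` and `transfer` are hypothetical names. It uses the standard `std::lock`, which acquires both mutexes without deadlocking whatever order the callers name them in.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical shared resource guarded by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves money between two accounts. Both mutexes are acquired together
// by std::lock, so transfer(a, b) on one thread and transfer(b, a) on
// another cannot each end up holding one lock and waiting for the other.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;         // locking the same mutex twice is an error
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

void demo_transfer() {
    Account a;
    Account b;
    a.balance = 100;
    transfer(a, b, 30);
    transfer(b, a, 10);               // opposite order is safe too
    assert(a.balance == 80);
    assert(b.balance == 20);
}
```

The naive version, which locks `from.m` and then `to.m` separately, works in every single-threaded test and hangs only when two opposite transfers interleave, which is what makes this family of bugs so miserable to find after the fact.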

Of course, there are also logic bugs but each one of those is subtly different and it is hard to come up with a common approach more detailed than “Step through it and see what it really does”.

When I was a dev, I was told that I spent too much time debugging code but I have to say that the experience has stood me in excellent stead.

Signing off

Mark Long, Digital Looking Glass Ltd
