Friday 22 August 2008

Coding practices

Hello all

Bit of a change of pace today. I would like to talk about some common coding issues that I see when reviewing code. A lot of them are very natural things to do because of the way that a person thinks through a problem. These won’t be security related, at least not in the main.

Oh, and these points will be largely language agnostic though there are a few specifics. I will call them out when we get to them.

The biggest mistake that I see is doing unnecessary work. People have come to rely on the optimiser to generate code that doesn’t do unneeded work. Well, the optimiser is a good thing and micro-optimisations are generally best left to it. Let’s consider an easy example (pseudo-code):

If (Object.Method(Param1.Field, Param2(a))) && (a = TRUE) then

Ok, testing whether a Boolean is true or not is very, very easy and quick. Processor-level optimisation means that both outcomes can be pipelined at once. Cool. The other test, the method call, is much, much more expensive. A good compiler will want to avoid the call if possible, so it is best to check the Boolean first. Oh, and is this always a safe optimisation? Well, no, not if there are useful side-effects of the called function, but very few people write code that intentionally changes state in what is essentially supposed to be a check of state. When they do, a lot more debugging is generally needed. Compilers can do this sort of optimisation very well indeed.

They will also do a lot of other cool things – if you have one local floating point variable that you only ever use in the first part of a function and another that you only use in the second part, well, there is no sense in having two variables at all from the compiler’s viewpoint. In practice, hand optimising this would probably be harmful since it would reduce the maintainability of the code.
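If that is easier to see in real code, here is a tiny C++ sketch (isReady and ExpensiveLookup are made-up names for illustration): with the cheap Boolean first, the short-circuit && skips the expensive call whenever the flag is false.

#include <iostream>

// A deliberately expensive check - imagine a database round trip.
// (Hypothetical function, not from any real codebase.)
bool ExpensiveLookup(int key)
{
    std::cout << "ExpensiveLookup called\n";
    return key % 2 == 0;
}

int main()
{
    bool isReady = false;   // the cheap flag
    int key = 42;

    // Cheap test first: when isReady is false, && short-circuits and
    // ExpensiveLookup is never called.
    if (isReady && ExpensiveLookup(key))
        std::cout << "Doing the work\n";

    // The other order always pays for the expensive call, even though
    // the flag alone could have answered the question.
    if (ExpensiveLookup(key) && isReady)
        std::cout << "Doing the work\n";

    return 0;
}

The same caveat applies as above: reordering by hand is only safe when the expensive call has no side effects that you rely on.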

However, the optimiser is not going to help if the algorithm is not right. Here is a simple example:

For (i = 1; i < 1000; i++)
{
    o = new MyObj;
    // some more code
    if (o) then release o;
}

Now, the logic is clear to a person. I will need an object of type MyObj and I should clear it up as soon as I am done with it. That is just plain good programming practice. It is nice that we check that we actually have something to dispose of before we call release.

Do we need a new instance each time around the loop? In this case, almost certainly not, since there is no constructor. We can probably just allocate an object once outside of the loop and dispose of it at the end, like so:

o = new MyObj;
if (!o) then ThrowException(UnableToCreate);
For (i = 1; i < 1000; i++)
{
    // some more code
}
release o;

Hmmm. More lines of code but much more efficient. What if there were a complex constructor? In that case, we would need to create a new one each time, yes? Well, only if there were no other way of setting the fields. The less work that you make for the heap manager, the better. This is true for every language and every operating system. It will almost invariably be cheaper to reset the fields of an existing object than to make a brand new allocation and bring forward the time of the next heap compaction or garbage collection or what have you.
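To show what I mean by resetting fields, here is a small C++ sketch (WorkItem and Reset are invented names, not from any particular codebase): one object lives outside the loop and a cheap Reset puts it back into a known state each time round.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type - the names are illustrative only.
struct WorkItem
{
    int id = 0;
    std::string payload;

    // Reset the fields rather than paying for a fresh allocation.
    void Reset()
    {
        id = 0;
        payload.clear();   // keeps the string's capacity for reuse
    }
};

void ProcessAll(const std::vector<std::string>& inputs)
{
    WorkItem item;                    // one object, created outside the loop
    for (std::size_t i = 0; i < inputs.size(); ++i)
    {
        item.Reset();                 // cheap field reset, no heap manager involved
        item.id = static_cast<int>(i);
        item.payload = inputs[i];
        // ... do something with item ...
    }
}                                     // cleaned up once, here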

Possibly interesting story from my Microsoft days. A common complaint in the early days of .NET was that garbage collection was often taking more than 20% of a process’s run time. The GC is actually pretty damned efficient, but people saw this as an unacceptable overhead. In fact, their unmanaged apps were probably spending about the same time in memory allocation and reclamation, or more, but there was no perfmon counter for it, so people thought of it as a free action. Making the numbers visible made people see the problem, but they assumed that it was a new problem.

Object brokers (object pools, if you prefer) help code efficiency a great deal. The downside is that time spent thinking about object brokers is time not spent thinking about the problem that the code is meant to solve – and conventional wisdom seems to be that performance tuning is something that you do when you see where the problems are. I can see the merit in this argument, but poor memory management will sap the life out of the whole process and it will not generally show up as a hotspot because the cost is buried in the runtime.

Any other good points for object brokers? Why yes, thanks for asking. There are several but the ones that I like best are:

1. You can keep track of the number of each type of object and track down leaked resources very quickly indeed. Memory usage bloats. You look at the brokers and see that there are 20,000 extant employee records in a company with 300 employees. I guess that someone isn’t returning them and there should only be a few places that do that operation.

2. Debugging – since you have access to them all, you can have a debug function dump them all to a file and get a snapshot of the state.

3. Need to change the algorithm for providing the objects? Just the one place to change.
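To make points 1 and 3 concrete, here is about the smallest broker I could sketch, in C++ with made-up names (a real one would also want thread safety and proper error handling): it recycles released objects, keeps a live count for leak hunting, and keeps the allocation policy in one place.

#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical pooled type.
struct Employee
{
    int id = 0;
    void Reset() { id = 0; }
};

class EmployeeBroker
{
public:
    // Hand out an object, reusing a released one if we have it.
    std::unique_ptr<Employee> Acquire()
    {
        ++liveCount_;
        if (!spares_.empty())
        {
            std::unique_ptr<Employee> e = std::move(spares_.back());
            spares_.pop_back();
            return e;
        }
        return std::make_unique<Employee>();   // the single place that allocates
    }

    // Take the object back so it can be recycled.
    void Release(std::unique_ptr<Employee> e)
    {
        --liveCount_;
        e->Reset();
        spares_.push_back(std::move(e));
    }

    // Point 1: how many objects are out in the wild right now?
    int LiveCount() const { return liveCount_; }

private:
    std::vector<std::unique_ptr<Employee>> spares_;
    int liveCount_ = 0;
};

int main()
{
    EmployeeBroker broker;
    auto e = broker.Acquire();
    std::printf("Live employees: %d\n", broker.LiveCount());   // 1
    broker.Release(std::move(e));
    std::printf("Live employees: %d\n", broker.LiveCount());   // 0
    return 0;
}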


Now, I spoke at some length on my old blog about exceptions, but they are a very good thing and a very bad thing. The good thing about them is that they are a powerful and structured way of handling what has gone wrong with your processing. You throw an exception in the case of an error (raise an error if you are a VB6 programmer – and thank you for keeping the faith) and your code merrily goes on its way, confident that it passed that check and that the state is as expected. Exceptions are great for signalling that something exceptional has happened and needs to be handled. When you see them used for anything else, you have to consider whether this is a stunningly brilliant move or simply a really bad idea. I am still waiting to see the brilliant alternative use of exceptions, but feel free to mail me if you have one.

Just to recap then, an exception is the highest priority change of execution that you can have in user mode. The processor pipeline? Gone. The processor cache? Gone. Probability of page faults? Very high. To quote figures from Raymond Chen, a reference to memory in L1 cache takes about 2-3 times longer than an average instruction. Level 2 cache costs about the same as 6-10 instructions. Main memory is 25-50 instructions. If it is not in the working set and has to come off disk? That is 10,000,000 instructions. If your exception causes 10 page faults and 2 of them are not in the working set, that gives you the same overhead as 20 million instructions. Did you really want that much overhead for a mechanism to tell you that you have reached the end of the string that you were parsing, or some other routine thing like that? No, probably not. Of course, you don’t do that, gentle reader, but I bet you have less faith in Bob down the hall or the intern who wrote what turned out to be your best selling product.
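Here is a small C++ sketch of the parsing point (the tokeniser is invented for illustration): running out of input is the routine case, so it comes back through the return value, and the exception is saved for something that genuinely should never happen.

#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical tokeniser. Running out of input is the routine, expected
// case, so it is reported with an empty optional rather than an exception.
std::optional<std::string> NextToken(const std::string& text, std::size_t& pos)
{
    while (pos < text.size() && text[pos] == ' ')
        ++pos;
    if (pos >= text.size())
        return std::nullopt;             // expected: end of the string

    const std::size_t start = pos;
    while (pos < text.size() && text[pos] != ' ')
        ++pos;
    return text.substr(start, pos - start);
}

std::size_t CountTokens(const std::string& text)
{
    std::size_t pos = 0;
    std::size_t count = 0;
    while (auto token = NextToken(text, pos))
        ++count;                         // ... or actually handle *token ...

    // Exceptions are kept for things that genuinely should not happen.
    if (pos > text.size())
        throw std::logic_error("parser walked past the end of the buffer");
    return count;
}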

Oh, and on the subject of exceptions, here is my biggest red flag (shown here in Pseudo.NET but it is common to all languages; VB6 users would call it On Error Resume Next):

Try
{
// some code
} Catch (…) {}

An empty catch block is a way of suppressing all exceptions. You didn’t get memory when you asked for it? Never mind, carry on. You didn’t save the file? Never mind, carry on. I have only once written code that intentionally ignored failures, and that was used to emergency shut down a bit of high voltage equipment that normally needed staged shutdowns to avoid damage. A typical case where the emergency shutdown would be called was when the hardware was electrocuting someone. At that point, all considerations were secondary to stopping the power, and if hardware cooked, hardware cooked. I would be very inclined to ask questions if I saw someone else doing the same.
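For contrast, here is a C++ sketch (SaveReport is a made-up function) of the empty catch against a catch that names the failure it expects and actually does something about it.

#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical save routine - throws if the file cannot be written.
void SaveReport(const std::string& path, const std::string& body)
{
    std::ofstream out(path);
    if (!out)
        throw std::runtime_error("could not open " + path);
    out << body;
}

int main()
{
    // Bad: swallows every failure, including ones we cannot safely ignore.
    try
    {
        SaveReport("report.txt", "quarterly numbers");
    }
    catch (...) {}   // the red flag from the post

    // Better: catch the specific failure we expect and do something about it.
    try
    {
        SaveReport("report.txt", "quarterly numbers");
    }
    catch (const std::runtime_error& e)
    {
        std::cerr << "Save failed: " << e.what() << '\n';
        return 1;
    }
    return 0;
}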

So, that is all that I have time for today but my next entry will be on debug code, when and where to have it. Or possibly something different if there is breaking news in another area of interest.

By the way, questions are welcome. You can reach me at Contact@DigitalLookingGlass.co.uk and don’t be shy. You can have 2 hours’ free consultancy on debugging, code reviews or malware, or ask me to address a point in my blog if you want. Feel free to disagree with anything that I say. Debate is good.

Signing off

Mark Long, Digital Looking Glass

6 comments:

Anonymous said...

Two minor comments:

Usually, it's a potentially wasteful thing to write a test such as

if (a=true)

because it will (if the compiler isn't smart enough) compile down to an unnecessary test and jump when it evaluates a=true and then tests the if on the resulting true or false.

if (a)

would do the same, but without that extra test.

I've also run into a few cases where an empty error trap is useful. For instance, when filling a treeview, instead of checking whether parent items exist before trying to create them (which is very slow, so slow that it's unworkable on large data sets), I just add them anyway and ignore the error if they already exist. Much faster, and I know what error I will get and I know that it's safe to ignore it. These are, of course, not cases one sees every day, and in the typical case, I agree completely with you. Even if such an empty error trap is used, it should be limited to exactly the line where the problem will occur, in order to minimize the risk, and, if the language permits it, be limited to the specific error expected.
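For what it's worth, here is roughly the shape of it in C++ (the Tree and DuplicateNodeError names are made up, standing in for the real treeview API): the trap covers exactly one call and exactly one expected error.

#include <set>
#include <stdexcept>
#include <string>

// Hypothetical error thrown when a tree node already exists.
struct DuplicateNodeError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Hypothetical tree wrapper that throws on duplicates, standing in for
// the treeview control described above.
class Tree
{
public:
    void AddNode(const std::string& key)
    {
        if (!keys_.insert(key).second)
            throw DuplicateNodeError("node already exists: " + key);
    }
private:
    std::set<std::string> keys_;
};

void AddRow(Tree& tree, const std::string& parent, const std::string& child)
{
    // The trap is scoped to the single call and the single expected error.
    try
    {
        tree.AddNode(parent);
    }
    catch (const DuplicateNodeError&)
    {
        // Expected when the parent was created by an earlier row: safe to ignore.
    }
    tree.AddNode(child);   // anything that goes wrong here still propagates
}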

Mark Long said...

Hi there

Any good optimiser should generate the same code for those two statements so I wouldn't worry too much about how you code it. I am certain the Borland and Microsoft compilers would do the load, test and branch as 3 instructions only in either case and the pipeline would magic away the jump for you.

Hmmm, catching and ignoring a specific exception is not an empty trap because it filters on the exception type. It is the ignoring of all exceptions that I think is the very bad thing. Yes, I agree that adding to a collection is the main scenario where you will see many exceptions and not be able to do much about it. It makes collections much less suitable for tasks where the input stream may contain repeats.

Glad to see that you are reading the blog.

Mark

Anonymous said...

I agree that a compiler should recognize and optimize the "=true" and "=false" tests, but I am a nervous guy and prefer to be sure. I also think it makes the code more readable, so there is really no reason against it.

A somewhat similar case is the:

if not a then
    do stuff
else
    do other stuff
end if

The not is not needed and makes misinterpretation more likely (somewhat like a double negation). The preferable way to write it is:

if a then
    do other stuff
else
    do stuff
end if

Of course, clean examples like these are simple, but reality is usually a lot more messy, and every little bit helps. It's even worse if a happens to be a "not variable", such as DontWrite or FileNotOpen, in which case we stack another layer of logical nots on it.

These cases have no performance impact, but thinking about them makes nicer code.
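A tiny made-up C++ illustration of the "not variable" point:

// A "not variable" invites double negation at the call site.
void Save_Negative(bool dontWrite)
{
    if (!dontWrite)      // "if not don't write" - easy to misread
    {
        // ... write the file ...
    }
}

// Naming the flag positively keeps the test straightforward.
void Save_Positive(bool shouldWrite)
{
    if (shouldWrite)
    {
        // ... write the file ...
    }
}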

Of course I'm reading the blog, it's good!

Mark Long said...

This raises an interesting question. Is it better to write readable code or try to write code that the optimiser will like? I would generally say that writing the code in a readable way is best because:

1. Second guessing the optimiser is hard
2. Second guessing the next version of the optimiser is even harder because it hasn't been written yet
3. Gains are likely to be very small at best unless you are optimising a hotspot in the code. If that bit of the code is run only on the second Tuesday when there is an "r" in the month, an optimisation to save an instruction probably isn't worth it.

Fortunately, parse trees turn complex logic into simple logic because all the twiddly stuff is in the nodes. When you walk the parse tree to generate the code, the structure is plain - though generating a nice parse tree from complex code can be tricky. When you are considering an if statement, you don't need to worry about where the nodes lead. If you are switching around the logic, you can do it at this level.

Have you looked at compiler design at all? Pure compiler design (as described in the "dragon" book among others) is rarely seen in practice because blurring the lines between the stages gives much better compiler performance and richer information for code optimisation.

As for the negation of the term, different compilers may approach this differently, but one approach is to think of it as

if (!a)
{
// code from the else clause
}
else
{
// code from the then clause
}

Of course, with managed code or classic VB, no point in thinking about it since micro-optimisations will be lost in the wash and even with C++, a slightly better algorithm will generally do much more for performance than reordering your code.
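If you are curious why this is cheap for a compiler, here is a toy C++ sketch (nothing like a real compiler's data structures, just the shape of the idea): once the if is a node with two branches hanging off it, folding away a not in the condition is a couple of pointer moves.

#include <memory>
#include <utility>

// Toy parse-tree nodes - just enough structure to show the idea.
struct Node
{
    virtual ~Node() = default;
};

struct NotExpr : Node
{
    std::unique_ptr<Node> operand;
};

struct IfNode : Node
{
    std::unique_ptr<Node> condition;
    std::unique_ptr<Node> thenBranch;
    std::unique_ptr<Node> elseBranch;
};

// If the condition is "not X", rewrite the node as "if X" with the
// branches swapped. The twiddly details live in the child nodes, so
// the transformation itself is only a couple of pointer moves.
void SimplifyNegatedIf(IfNode& node)
{
    if (auto* negation = dynamic_cast<NotExpr*>(node.condition.get()))
    {
        node.condition = std::move(negation->operand);
        std::swap(node.thenBranch, node.elseBranch);
    }
}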

Glad that you are liking it

Mark

Anonymous said...

I agree, readability is almost always top priority, especially as often, readable code shows clear thinking and clear thinking makes for effective code.

Also, readability is so much more than jihads about where the opening bracket should be, which is sadly where it often stops. It is, much like any writing, a way to convey your thinking in a clear manner to a reader.

That said, there are small things one can do to avoid occasional performance hits, such as avoiding unnecessary type conversions, using ++a instead of a++, having the most exclusive test (or the least expensive) first in an if and so on.
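For example, a made-up C++ snippet where the pre-increment actually matters because the "counter" is an iterator rather than a plain integer:

#include <list>

// For plain integers ++i and i++ compile to the same thing when the
// result is unused. For iterator-like types, i++ has to produce a copy
// of the old value, which pre-increment avoids.
int SumAll(const std::list<int>& values)
{
    int total = 0;
    for (std::list<int>::const_iterator it = values.begin();
         it != values.end(); ++it)       // pre-increment: no temporary copy
    {
        total += *it;
    }
    return total;
}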

I'm by no means an expert on compilers. I've written a simple one for a project, and I've written a few parsers for script languages for scripting my software, that's all. All those have focused on simplicity rather than performance. I do, however, like to occasionally disassemble and look at what the compiler has made out of my code.

Mark Long said...

I think that you will find having written any compilers puts you ahead of the crowd.

I once wrote a compiler for a very simple language in 4 hours for a bet. Error handling was not in the specification and it was rough as old boots but I won the bet. The prize was a meal and a bottle of wine as I recall.

One thing that compilers have to consider during code generation now is thread safety. That is an interesting little minefield to wander in.

Mark