May 17, 2006 12:19
Yesterday I was messing around with 64-bit coding using Win XP x64 and the Platform SDK... After slaving away at writing a double-precision Mandelbrot loop using the SSE2 intrinsics, it turns out that the simple C++ loop was 30% faster.
So either: 1) I suck at hand optimizing, 2) there's something wrong with the intrinsics, 3) there's something wrong with SSE2 on the Athlon 64, or 4) the 64-bit compiler is REALLY good.
The 64-bit compiler already uses SSE and SSE2 automatically (x64 code does its floating point in SSE2 rather than x87), so I'm really trying to beat the compiler at its own game.
However, my code used the vector SSE2 instructions (SIMD) whereas I think the compiler only uses the scalar instructions, so theoretically it could/should be 2x faster... but that assumes the Athlon 64's implementation of SSE2 really has parallel double-precision FP units, which, come to think of it, is rather unlikely.
So it's probably having to serialize the execution of the vector operations, which would make the vector code slower than the scalar code simply because branch prediction and speculative execution aren't going to work as well with deeper dependencies.
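For concreteness, here is a rough sketch of the two kinds of loop being compared. This is not the code I actually timed; the names mandel_scalar and mandel_sse2 and their signatures are made up for illustration, and the vector version simply assumes two points are iterated side by side, one per 64-bit lane.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (__m128d, _mm_*_pd)

// Scalar escape-time loop for a single point c = (cr, ci).
// Returns how many iterations ran before |z|^2 exceeded 4.
int mandel_scalar(double cr, double ci, int max_iter) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter) {
        double zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0) break;
        zi = 2.0 * zr * zi + ci;
        zr = zr2 - zi2 + cr;
        ++n;
    }
    return n;
}

// Hypothetical SSE2 version: iterates two points at once.
// A lane stops counting once it escapes; the loop exits when both have.
void mandel_sse2(const double cr[2], const double ci[2],
                 int max_iter, int out[2]) {
    const __m128d four = _mm_set1_pd(4.0);
    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d vcr  = _mm_loadu_pd(cr);
    const __m128d vci  = _mm_loadu_pd(ci);
    __m128d zr = _mm_setzero_pd();
    __m128d zi = _mm_setzero_pd();
    __m128d count = _mm_setzero_pd();  // per-lane iteration counts

    for (int n = 0; n < max_iter; ++n) {
        __m128d zr2 = _mm_mul_pd(zr, zr);
        __m128d zi2 = _mm_mul_pd(zi, zi);
        // All-ones mask in each lane that is still within radius 2.
        __m128d active = _mm_cmple_pd(_mm_add_pd(zr2, zi2), four);
        if (_mm_movemask_pd(active) == 0) break;  // both lanes escaped
        // Add 1.0 only to the lanes that are still active.
        count = _mm_add_pd(count, _mm_and_pd(active, one));
        __m128d zri = _mm_mul_pd(zr, zi);
        zi = _mm_add_pd(_mm_add_pd(zri, zri), vci);  // 2*zr*zi + ci
        zr = _mm_add_pd(_mm_sub_pd(zr2, zi2), vcr);  // zr^2 - zi^2 + cr
    }

    double c[2];
    _mm_storeu_pd(c, count);
    out[0] = static_cast<int>(c[0]);
    out[1] = static_cast<int>(c[1]);
}
```

Note that the vector loop keeps running until both lanes have escaped, and each iteration carries extra compare/mask/add work, so even with true 128-bit FP units it wouldn't be a clean 2x.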
AMD's CodeAnalyst could yield some good information, but for now I'm going to suspect the CPU itself until I actually have a look at the disassembly.