Jan 05, 2005 17:01
Hello, friends!
Question for the coders around...
Why is just putting floating point math into my code slow as fuck? In pseudocode something like this:
Loop(
Vin= random
Loop2(
i1= ((dt * Vin) + (L * i1))/((dt * R) + L)
VR1= i1 * R
i2= ((dt * VR1) + (L * i2))/((dt * R) + L)
VR2= i2 * R
i3= ((dt * VR2) + (L * i3))/((dt * R) + L)
Dummyout= i3 * R
)
)
That's about 24 floating point operations, and in my current setup the inner loop is 100 iterations and the outer loop is 44100 (OMG AUDIO), so each second of audio takes about 105 million floating point ops. I complete a minute's worth in about 15 seconds, meaning I'm getting about 420 million floats per second on my 1.8 GHz Athlon Barton. I guess this is the best I can hope for? Athlons should do a couple billion, but I suppose that is only with SSE or 3DNow?
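For reference, here is a minimal C sketch of the pseudocode above, written out literally with nothing optimized. The names `step_naive`, `run_naive` and `flops_per_audio_second`, the use of `float`, and the `i[3]` state layout are all assumptions for illustration; the post doesn't say what language or types the real code uses.

```c
#include <stddef.h>

/* One inner-loop pass, exactly as in the pseudocode.
   i[0..2] hold i1, i2, i3 (assumed layout). Returns Dummyout. */
float step_naive(float i[3], float vin, float dt, float L, float R)
{
    i[0] = ((dt * vin) + (L * i[0])) / ((dt * R) + L);
    float vr1 = i[0] * R;
    i[1] = ((dt * vr1) + (L * i[1])) / ((dt * R) + L);
    float vr2 = i[1] * R;
    i[2] = ((dt * vr2) + (L * i[2])) / ((dt * R) + L);
    return i[2] * R;
}

/* Outer loop over n samples, inner loop of `iters` passes per sample. */
void run_naive(float i[3], const float *in, float *out, size_t n,
               int iters, float dt, float L, float R)
{
    for (size_t k = 0; k < n; ++k) {
        float y = 0.0f;
        for (int j = 0; j < iters; ++j)
            y = step_naive(i, in[k], dt, L, R);
        out[k] = y;
    }
}

/* Bookkeeping from the post: floating point ops per second of audio. */
double flops_per_audio_second(double flops_per_pass, double iters,
                              double sample_rate)
{
    return flops_per_pass * iters * sample_rate;
}
```

With the post's numbers, `flops_per_audio_second(24, 100, 44100)` is 105,840,000, the "about 105 million" above. Counting literally, each stage is six operations plus one multiply by R, so 21 per pass rather than 24, and three of those are divides, which are far slower than a multiply or an add.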
I'd like to get the iterations up from 100 to around 512, but that would require 550 million floats per second per channel just to keep even, or about 1.2 billion floats per second for two channels. CPUs SHOULD do this but... looks like it will take more than just normal calls to the FPU?
EDIT: Oops, this IS three stages, forgot that. I could almost do 256 iterations (512 total then) but I still need to speed up from 420 million->550 million. Of course just a faster CPU could do that but there must be some way to extract more performance from the FPU.
Is it because the floats are surrounded by integer-math loops? I seem to remember x86 having to switch modes in & out of the FPU or something...
EDIT: First thing I did was pre-compute ((dt * R) + L) since it doesn't change, and then take the reciprocal of that, since multiplying is faster than dividing. Sped it up from about 15 seconds down to 12 or so. That alone gets me to about 499 iterations possible. (249 per channel). Getting close, but unfortunately there's going to be a LOT of other stuff going on if I ever actually do this, so I need a lot more performance still. The fact that such a huge reduction in the number of operations performed only bought me a little improvement tells me the program is spending a LOT of time doing stuff other than the operations themselves...
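A rough C sketch of that first change, assuming the same `float` state array as a stand-in for i1..i3. `make_divisor` and `step_recip` are made-up names; the only point is that ((dt * R) + L) is inverted once, outside both loops, and the three divides become multiplies.

```c
/* Computed once, outside both loops: 1 / (dt*R + L). */
float make_divisor(float dt, float L, float R)
{
    return 1.0f / ((dt * R) + L);
}

/* One inner-loop pass using the hoisted reciprocal.
   i[0..2] hold i1, i2, i3 (assumed layout). Returns Dummyout. */
float step_recip(float i[3], float vin, float dt, float L, float R,
                 float divisor)
{
    i[0] = ((dt * vin) + (L * i[0])) * divisor;
    float vr1 = i[0] * R;
    i[1] = ((dt * vr1) + (L * i[1])) * divisor;
    float vr2 = i[1] * R;
    i[2] = ((dt * vr2) + (L * i[2])) * divisor;
    return i[2] * R;
}
```

Multiplying by a reciprocal is not guaranteed to round identically to dividing, so the output can differ from the original in the last bit or so. For an audio filter that is usually harmless, but it is worth knowing when comparing results against the old version.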
EDIT: The term was now ((dt * V) + (L * i#))*Divisor... so I distributed Divisor into dt and L, precomputing those, so now it looks like ((Divdt * V) + (DivL * i#)). That gets me up to around 960 iterations per sample, so I've easily got 256/channel covered atm, which MIGHT be good enough. Also, when the second stereo channel is added, it goes inside the same loops, so it won't quite double the time, so I COULD probably get close to 512/channel right now.
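Sketched in C, the distributed form might look like the following. The `Coeffs` struct and the function names are hypothetical; `div_dt` and `div_l` correspond to Divdt and DivL in the post.

```c
/* Coefficients with the divisor already multiplied in. */
typedef struct {
    float div_dt;  /* dt / (dt*R + L) */
    float div_l;   /* L  / (dt*R + L) */
} Coeffs;

/* Computed once, outside both loops. */
Coeffs make_coeffs(float dt, float L, float R)
{
    float divisor = 1.0f / ((dt * R) + L);
    Coeffs c = { dt * divisor, L * divisor };
    return c;
}

/* One inner-loop pass: two multiplies and one add per stage,
   plus the multiply by R. i[0..2] hold i1, i2, i3. Returns Dummyout. */
float step_distributed(float i[3], float v, float R, Coeffs c)
{
    i[0] = (c.div_dt * v) + (c.div_l * i[0]);
    float vr1 = i[0] * R;
    i[1] = (c.div_dt * vr1) + (c.div_l * i[1]);
    float vr2 = i[1] * R;
    i[2] = (c.div_dt * vr2) + (c.div_l * i[2]);
    return i[2] * R;
}
```

This drops one multiply per stage compared with multiplying the whole sum by Divisor, and again the rounding can differ slightly from the original expression.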
EDIT: VR1 & VR2 are only used as inputs to the next computation, so I rolled R into Divdt for those two, giving DivdtR, and then replaced VR1 & VR2 with i1 and i2.
Now it looks like:
Loop(
Loop2(
i1= ((Divdt * V) + (DivL * i1))
i2= ((DivdtR * i1) + (DivL * i2))
i3= ((DivdtR * i2) + (DivL * i3))
// Now unrolled 15 more times
Dummyout= i3 * R
)
)
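The folded form above could be written in C roughly like this. Everything named here (`FoldedCoeffs`, `pass_folded`, `run_sample_folded`) is an assumption, and the unroll factor is 4 rather than the post's 16 just to keep the listing short. Dummyout is taken once after the loop here instead of inside it.

```c
#include <assert.h>

typedef struct {
    float div_dt;    /* dt   / (dt*R + L) */
    float div_dt_r;  /* dt*R / (dt*R + L) */
    float div_l;     /* L    / (dt*R + L) */
} FoldedCoeffs;

/* Computed once, outside both loops. */
FoldedCoeffs make_folded(float dt, float L, float R)
{
    float divisor = 1.0f / ((dt * R) + L);
    FoldedCoeffs c;
    c.div_dt   = dt * divisor;
    c.div_dt_r = c.div_dt * R;
    c.div_l    = L * divisor;
    return c;
}

/* One pass through the three stages: no divides, no VR temporaries. */
static inline void pass_folded(float i[3], float v, const FoldedCoeffs *c)
{
    i[0] = (c->div_dt   * v)    + (c->div_l * i[0]);
    i[1] = (c->div_dt_r * i[0]) + (c->div_l * i[1]);
    i[2] = (c->div_dt_r * i[1]) + (c->div_l * i[2]);
}

/* Runs `passes` passes for one input sample, manually unrolled
   four at a time. Returns Dummyout. */
float run_sample_folded(float i[3], float v, float R,
                        const FoldedCoeffs *c, int passes)
{
    assert(passes % 4 == 0);  /* unrolled body handles 4 passes per trip */
    for (int j = 0; j < passes; j += 4) {
        pass_folded(i, v, c);
        pass_folded(i, v, c);
        pass_folded(i, v, c);
        pass_folded(i, v, c);
    }
    return i[2] * R;
}
```

Each pass is now six multiplies and three adds. Depending on the compiler and flags, the unrolling may also happen automatically, so hand-unrolling is worth timing both ways.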
Up to close to 1400 iterations per sample, i.e. 740 million floats/second or so. Any ideas on further improvement? I'm 99% certain these equations can't be simplified any further.
EDIT: I precomputed Divdt * V since it doesn't change in the inner loop, unrolled it 64 times, and interleaved the second channel. Adding the second channel doubles my performance because the two channels are done completely in parallel, keeping the pipelines more full. Up to 2560+ iterations per sample possible, or about 1.4 billion floats/second.
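A sketch of what the interleaved version might look like in C, under the same assumptions as before (hypothetical names, `float` state, one three-element array per channel). The unrolling is left out so the interleaving is easier to see.

```c
typedef struct {
    float div_dt;    /* dt   / (dt*R + L) */
    float div_dt_r;  /* dt*R / (dt*R + L) */
    float div_l;     /* L    / (dt*R + L) */
} FoldedCoeffs;

/* Processes one stereo sample pair. l[] and r[] hold i1..i3 for the
   left and right channel respectively. */
void run_stereo_sample(float l[3], float r[3], float v_left, float v_right,
                       const FoldedCoeffs *c, int passes)
{
    /* Divdt * V is constant across the inner loop: once per sample. */
    const float in_l = c->div_dt * v_left;
    const float in_r = c->div_dt * v_right;

    for (int j = 0; j < passes; ++j) {
        /* The channels never read each other's state, so each pair of
           lines below belongs to two independent dependency chains. */
        l[0] = in_l + (c->div_l * l[0]);
        r[0] = in_r + (c->div_l * r[0]);
        l[1] = (c->div_dt_r * l[0]) + (c->div_l * l[1]);
        r[1] = (c->div_dt_r * r[0]) + (c->div_l * r[1]);
        l[2] = (c->div_dt_r * l[1]) + (c->div_l * l[2]);
        r[2] = (c->div_dt_r * r[1]) + (c->div_l * r[2]);
    }
}
```

Within one channel, i2 has to wait for i1 and i3 has to wait for i2, so a single channel is one long serial chain. The second channel gives the FPU unrelated work to start while the first channel's result is still in flight, which would explain why adding it comes out nearly free.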