a cache puzzle

May 18, 2005 08:30

I have N (> 500) elements of 64 bytes each. I have a level-1 cache of 8K, so it could hold at most 128 elements ( Read more... )

geek

Comments 10

anonymous May 18 2005, 17:57:04 UTC
My first thought was to do them in 11x11 blocks... You would have 22 cache misses and 99 cache hits on each block. Basically the 11x11 block should fit within the cache.

But yah, no reason to be symmetrical, so actually I think your 127x1 works better. Keep 127 in the cache, and take the one miss to process those 127. Each block costs 127 + N cache misses, with N/254 blocks to process (half of N/127 because of the reciprocity). So, if I got my math right, your cache miss / hit rate is close to (N/2 + N^2/254) / N(factorial)... not taking into account the diagonal or partial-block rounding. The ratio improves as N increases, so that sounds like the most cache-friendly way to do it to me. (Not to mention the code to deal with Nx1 is probably a lot simpler and more flexible than calculating 11x11 with remainders, etc.!)
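Roughly, the Nx1 strip traversal might look like the sketch below (Element and interact() are just placeholders here, since the actual data layout isn't shown):

    /* Sketch of the Nx1 (strip) traversal; Element and interact() are
       placeholders, not the actual types from the post. */
    #include <stddef.h>

    #define STRIP 127                 /* elements kept hot in L1, leaving one slot free */

    typedef struct { float f[16]; } Element;    /* 64 bytes */
    void interact(Element *a, Element *b);      /* pairwise computation, defined elsewhere */

    void pairwise_strips(Element *e, size_t n)
    {
        for (size_t base = 0; base < n; base += STRIP) {
            size_t top = base + STRIP < n ? base + STRIP : n;

            /* pairs that lie entirely inside the hot strip */
            for (size_t i = base; i < top; i++)
                for (size_t j = i + 1; j < top; j++)
                    interact(&e[i], &e[j]);

            /* stream everything after the strip past it: roughly one miss
               per streamed element, up to STRIP interactions computed */
            for (size_t k = top; k < n; k++)
                for (size_t i = base; i < top; i++)
                    interact(&e[i], &e[k]);
        }
    }

Every unordered pair gets handled exactly once: in-strip pairs in the first double loop, strip-versus-rest pairs in the streaming loop, which is where the ~127 interactions per miss come from.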

anonymous May 18 2005, 18:08:50 UTC
Sorry, that should read cache misses / total calculations, not cache miss / hit rate. The hit count is obviously the total minus the misses.

If N < 127 then just substitute 2N for 254 (same for the remainder).

If N <= 120 the whole thing fits in the cache and then your miss count is simply N, so there's no real need to optimize when N <= 120, eh?

mister_borogove May 18 2005, 19:16:40 UTC
It's ~60x~60, not ~11x~11 - i+j wants to be < 128, not i*j. I had the same brain fart. :)

anonymous May 19 2005, 14:11:03 UTC
add, multiply, whatever...it's all "makes bigger" to me... ;)

and that should be N(factorial)/2 in the formula above... the total number of calculations gets >>1 (halved) because of the symmetry.

Obviously cache misses are painful, but I think that factorial is really gonna get ya. When I did an N-particle gravity simulator, I reduced the calculation space by partitioning the world and only calculating interactions between particles in the same partition plus the 26 neighboring ones. It assumes that particles more than one partition apart have negligible effect on each other (although there's a worst-case max calculation distance of just under 2*sqrt(3) partition widths). And there's a Maxwell's-demon worst case too. Of course, this was a decade ago... CPU speeds have improved somewhat.
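For what it's worth, that partition-plus-26-neighbors pass is roughly the shape sketched below; GRID, Particle, and accumulate_force() are made-up placeholders, not the original code.

    /* Rough shape of the cell-partitioned pass; GRID, Particle and
       accumulate_force() are placeholders. */
    #include <stddef.h>

    #define GRID 16                                  /* cells per axis (made up) */

    typedef struct { float x, y, z, fx, fy, fz, m; } Particle;
    typedef struct { Particle *p; size_t count; } Cell;

    void accumulate_force(Particle *a, const Particle *b);   /* adds b's pull on a */

    void gravity_pass(Cell cells[GRID][GRID][GRID])
    {
        for (int cx = 0; cx < GRID; cx++)
        for (int cy = 0; cy < GRID; cy++)
        for (int cz = 0; cz < GRID; cz++) {
            Cell *c = &cells[cx][cy][cz];
            /* this cell plus its (up to) 26 neighbors */
            for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
            for (int dz = -1; dz <= 1; dz++) {
                int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                if (nx < 0 || ny < 0 || nz < 0 ||
                    nx >= GRID || ny >= GRID || nz >= GRID)
                    continue;
                Cell *nb = &cells[nx][ny][nz];
                for (size_t i = 0; i < c->count; i++)
                    for (size_t j = 0; j < nb->count; j++)
                        if (nb != c || i != j)        /* skip self-interaction */
                            accumulate_force(&c->p[i], &nb->p[j]);
            }
        }
    }

Each in-range ordered pair is visited once, so accumulate_force() only adds b's pull on a; b picks up its share when the roles are reversed later in the pass.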

anonymous May 19 2005, 14:13:30 UTC
The element size of 64 bytes is convenient... do the cache lines divide evenly along that boundary? Oh, and make damn sure you're aligned properly. ;)

mister_borogove May 19 2005, 18:13:20 UTC
I could pack them down to 44 bytes (pos, vel, acc vectors = 9, plus mass and radius = 11 single-precision floats), but if I pad them out to 64 I get cache-line alignment, 16-byte-aligned vectors that work well with SSE, and room for additional properties TBD such as charge, optical color, elasticity, squamosity, mimsiness...
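One possible 64-byte layout, purely as a sketch (the field names and the GCC-style alignment attribute are assumptions, not the actual struct):

    /* Possible 64-byte, cache-line-aligned layout; field names are guesses. */
    #include <xmmintrin.h>                  /* __m128: 16-byte SSE vector */

    typedef struct {
        __m128 pos;                         /* x, y, z, pad            16 bytes */
        __m128 vel;                         /* x, y, z, pad            16 bytes */
        __m128 acc;                         /* x, y, z, pad            16 bytes */
        float  mass, radius;                /*                          8 bytes */
        float  tbd[2];                      /* charge, mimsiness, ...   8 bytes */
    } __attribute__((aligned(64))) Body;    /* sizeof(Body) == 64 */

The __m128 members already force 16-byte alignment; aligned(64) (or __declspec(align(64)) on MSVC) bumps the whole struct to a cache-line boundary, and any heap allocation needs a 64-byte-aligned allocator (e.g. _mm_malloc), since plain malloc won't guarantee that.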

sambushell May 19 2005, 14:30:55 UTC
I claim that the way to approach this is to consider: how much work could I do on data that's entirely within the cache? As you say, that means working in 64x64 blocks, since x*y is maximised over x+y<=c at x=y=c/2.

so:

    for (i_major = 0; i_major < N; i_major += 64)
        for (j_major = i_major; j_major < N; j_major += 64)
            for (i = i_major; (i < N) && (i < i_major + 64); i++)
                for (j = max(i+1, j_major); (j < N) && (j < j_major + 64); j++)
                    ... /* the actual i-j interaction goes here */

If you step away from square blocks (eg, 65x63), then the blocks are smaller and there are more of them. Initially, I'd guess that the average cache-miss count per block is still the same (given that we fill the cache with each block), so we're worse off. But you can actually compute the number of cache misses in the entire process (making some smart-LRU assumptions, counting the actual number of entries in each block which weren't in the previous block), which might help test the theory that this is a local minimum ( ... )
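Counting those misses is easy to script. A toy sketch under those assumptions (fully associative, true-LRU, 128 whole-element lines, which is optimistic for a real L1 but fine for comparing traversal orders):

    /* Toy cache-miss counter for a blocked triangular traversal.  The cache
       model (128-entry, fully associative, true LRU) is an idealisation. */
    #include <stdio.h>
    #include <string.h>

    #define CACHE 128
    #define B      64                 /* block size to try */

    static int  slot[CACHE];          /* element held in each slot, -1 = empty */
    static long used[CACHE];          /* timestamp of last use                 */
    static long now, misses;

    static void touch(int elem)
    {
        int victim = 0;
        now++;
        for (int s = 0; s < CACHE; s++) {
            if (slot[s] == elem) { used[s] = now; return; }   /* hit            */
            if (used[s] < used[victim]) victim = s;           /* LRU (or empty) */
        }
        misses++;
        slot[victim] = elem;
        used[victim] = now;
    }

    int main(void)
    {
        int n = 1000;
        memset(slot, -1, sizeof slot);

        for (int im = 0; im < n; im += B)
            for (int jm = im; jm < n; jm += B)
                for (int i = im; i < n && i < im + B; i++)
                    for (int j = (i + 1 > jm ? i + 1 : jm); j < n && j < jm + B; j++) {
                        touch(i);
                        touch(j);
                    }

        printf("N=%d, %dx%d blocks: %ld misses\n", n, B, B, misses);
        return 0;
    }

Swap the loop nest in main() for the 127x1 strip order (or for lopsided blocks) and compare the printed counts.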

mister_borogove May 19 2005, 15:17:42 UTC
"How much work could I do on data that's entirely within the cache? As you say, that means working in 64x64 blocks, since x*y is maximised over x+y<=c at x=y=c/2."

But it takes 64 cache misses to refill for the next 64x64 block, so you're getting 64 computed cells per cache miss. That's why we came to 127x1 - you get nearly twice as many cells per cache miss.

Cache line length is either 32 or 64, and we're going to try to align 64-byte elements to cache lines and pack them tight. This means that, what, the effective size of the cache is divided by the number of ways?

I'm actually targeting all processors that can do SSE (P3, P4, Pentium M, Athlon XP, Athlon 64 FX, etc.), which is a lot of different cache architectures, but most of them have at least 8K of L1 data cache. I was considering splitting things into supercolumns of ~120 and superrows of X (where X takes advantage of the L2 cache size).
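If I'm reading the supercolumn/superrow idea right, it might come out something like the sketch below; the sizes (120 and 3000) and the placeholder Element/interact() are assumptions:

    /* Sketch of the two-level idea: a ~120-element supercolumn stays hot in
       L1 while a superrow-sized slab of rows streams past it out of L2.
       Sizes and types are made up. */
    #include <stddef.h>

    #define COL  120                   /* supercolumn: fits in L1 with room to spare      */
    #define ROW 3000                   /* superrow: COL + ROW elements ~195K, for a 256K L2 */

    typedef struct { float f[16]; } Element;     /* 64 bytes */
    void interact(Element *a, Element *b);       /* defined elsewhere */

    void pairwise_two_level(Element *e, size_t n)
    {
        for (size_t r0 = 0; r0 < n; r0 += ROW) {             /* superrow [r0, r1)    */
            size_t r1 = r0 + ROW < n ? r0 + ROW : n;
            for (size_t c0 = r0; c0 < n; c0 += COL) {        /* supercolumn [c0, c1) */
                size_t c1 = c0 + COL < n ? c0 + COL : n;
                /* the supercolumn stays in L1; the superrow's elements are
                   L2 hits after the first supercolumn pass over them */
                for (size_t i = r0; i < r1 && i < c1; i++)
                    for (size_t j = (i + 1 > c0 ? i + 1 : c0); j < c1; j++)
                        interact(&e[i], &e[j]);
            }
        }
    }

Each unordered pair is still handled exactly once, since j always starts above both i and the supercolumn base.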
