May 18, 2005 08:30
I have N (> 500) elements of 64 bytes each. I have a level-1 cache of 8K, so it could hold at most 128 elements.
I have some operations which can be done on an element completely independently of other elements. I have some other operations which require that each element interact with the N-1 other elements.[1] Reciprocal interactions can be computed together (i.e. it's most efficient that, at the same time that I process the pair (i,j), I also process the pair (j,i)).[2]
So what's the best way to arrange the processing in order to minimize cache misses? If I do:
for (i = 0; i < N; i++)
    for (j = i+1; j < N; j++)
        ...
Then during the early part of the processing, the i element will stay in cache, but basically all the j's will miss. Towards the end where there are < 128 elements to go, the j's will start to stick, but as N gets much larger, the miss proportion gets a lot worse.
The best thing I could think of is to do 64x64 blocks at a time - when not close to the "diagonal", each axis will get half the cache. Anything more effective or more elegant?
Hmm, actually, no, as N approaches infinity (and so i and j ranges overlap rarely), it's better to do 127x1 -- each cache miss on the short side lets you do another 127 ops without a miss, whereas in the 64x64 case each miss only pays for 64 ops.[3] Is there any reason to depart from this strategy for smaller N?
[1] It's a particle physics simulation. They move independently, but they gravitationally attract one another. I'm aware of the existence of O(N log N) solutions like Barnes-Hut, but for the moment I'm only considering how to best optimize the straight O(N-squared) solution.
[2] The distance-between-cities grid in your atlas is arranged the same way, an N(N-1)/2 triangle.
[3] In practice, I'll want to save a few cache lines for stack, constants, other needed sim data, and competing threads, so I'll actually aim to use, say, 120 elements at once.