Premature optimization

Feb 07, 2013 05:15

Here are some things that keep biting me, so I'll try to write them down; maybe somebody else will learn a bit too.

1. At some point, every increasingly sophisticated algorithm gives diminishing returns.

This is the same as with compression. There are only so many kinds of compressible data; most data is incompressible, we just happen to mostly work with the compressible kinds.

In the general case, an algorithm can't be optimized further; what you can do is pick the corner cases that matter to you and optimize for those, as in the sketch below.
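
A minimal sketch of that idea, with a made-up example: keep the general algorithm as the fallback, but add a cheap check for the corner case your workload actually hits.

```cpp
// Sketch: optimize the common case, not the general algorithm.
// The special-cased path only wins when the data happens to look like the
// data we usually see (here: already sorted).
#include <algorithm>
#include <vector>

void sort_mostly_sorted(std::vector<int>& v) {
    // Cheap O(n) check for the corner case we actually care about.
    if (std::is_sorted(v.begin(), v.end()))
        return;                       // common case in our workload: nothing to do
    std::sort(v.begin(), v.end());    // general case: fall back to the stock algorithm
}
```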

2. Today, every machine is a multiprocessor supercomputer. In this setting, you're not ALU-bound, you're memory-bound. And latency-bound (remember Amdahl's law?).

Hardware is getting cheaper. But your time is not. And latencies are not.

So, one should always optimize for latency, and sometimes for memory bandwidth, but not for ALU throughput. This means optimizing for data layout. Forget algorithms. Layout is everything. Seek-friendly, cache-friendly, etc etc etc.
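
A minimal sketch of what "layout is everything" means in practice: the same reduction over an array-of-structs versus a struct-of-arrays layout. The type and field names are made up for illustration; the point is how many useful bytes each fetched cache line carries.

```cpp
// Same computation, two layouts.
#include <vector>

// Array-of-structs: to read x, we drag y, z and mass through the cache too.
struct ParticleAoS { float x, y, z, mass; };

float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0.f;
    for (const auto& q : p) s += q.x;   // 16 bytes fetched per 4 useful bytes
    return s;
}

// Struct-of-arrays: the x values are contiguous, so every cache line fetched
// is 100% useful data and the hardware prefetcher can stream it.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float sum_x_soa(const ParticlesSoA& p) {
    float s = 0.f;
    for (float v : p.x) s += v;
    return s;
}
```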

You can afford about 1 HDD seek per frame, and about 100-200 SSD seeks. SSDs rock!
And while you can process up to 100M floats per frame (on today's desktop CPU), you can only really touch about 100-150K distinct memory locations.
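
A quick back-of-the-envelope check of those budgets at 60 fps, using assumed round-number latencies (roughly 2013-era hardware: ~10 ms per HDD seek, ~0.1 ms per SSD random read, ~100 ns per cache-missing DRAM access):

```cpp
// Per-frame budgets from typical latencies (assumed round numbers, not measurements).
#include <cstdio>

int main() {
    const double frame_s     = 1.0 / 60.0;   // 60 fps => ~16.7 ms per frame
    const double hdd_seek_s  = 10e-3;        // ~10 ms per HDD seek
    const double ssd_seek_s  = 100e-6;       // ~0.1 ms per SSD random read
    const double dram_miss_s = 100e-9;       // ~100 ns per cache-missing access

    std::printf("HDD seeks per frame:   %.1f\n", frame_s / hdd_seek_s);   // ~1.7
    std::printf("SSD seeks per frame:   %.0f\n", frame_s / ssd_seek_s);   // ~167
    std::printf("DRAM misses per frame: %.0f\n", frame_s / dram_miss_s);  // ~167000
    return 0;
}
```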

Basically, you can make only so many "decisions" per frame. SIMD and VLIW are cool and nice, but they don't mean you can do "more". They only mean you can work with bigger entities, that's all. Vectors instead of bytes, but not "more bytes".
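
A small sketch with SSE intrinsics to make that concrete: one vector add handles four lanes, but it is still one instruction, one loop step, one "decision" per iteration. The function name and the assumption that n is a multiple of 4 are mine, for illustration only.

```cpp
// "Bigger entities, not more decisions": one SSE add processes four floats,
// but the loop still makes exactly one decision per iteration.
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

float sum4_at_a_time(const float* data, std::size_t n) {  // n assumed to be a multiple of 4
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));     // one add, four lanes
    // Horizontal reduction of the four lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```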

One thing that makes GPUs so fast is that they can nicely mask these latencies... but only as long as their beefy memory controller can keep up dispatching bytes to the 'threads'. So never, ever buy a GPU with a 128-bit (or, worse, 64-bit) memory bus.
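
To put rough numbers on the bus-width point: peak bandwidth is just bus width times per-pin data rate, so a 128-bit card has half the bandwidth of an otherwise identical 256-bit one. The per-pin rate below is an assumed typical GDDR5 figure, not the spec of any particular card.

```cpp
// Rough GPU memory-bandwidth arithmetic: bandwidth scales linearly with bus width.
#include <cstdio>

int main() {
    const double gbits_per_pin = 6.0;                 // ~6 Gbit/s per pin (typical GDDR5, assumed)
    const int bus_widths_bits[] = {64, 128, 256, 384};
    for (int bus_bits : bus_widths_bits) {
        double gbytes_per_s = bus_bits * gbits_per_pin / 8.0;
        std::printf("%3d-bit bus: ~%.0f GB/s peak\n", bus_bits, gbytes_per_s);
    }
    return 0;
}
```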

3. When buying hardware, it's always better to buy a top previous-generation model than a mid-tier current-gen one. GPUs are the most notable example, but it's the same with CPUs; just look at cpubenchmark.net and the like. The only exception currently is power efficiency.

But hey, if you're concerned about power, you're probably buying an assembled device (a laptop, etc.) anyway, so just read the reviews.

EDIT: Mobile CPUs are quickly catching up with desktop ones, including on latency. But throughput is severely capped there because of power consumption. In some sense, mobile devices are more "balanced" than desktops now. And simply downscaling does not always work, because screens have basically the same resolution on computers and phones; only the DPI differs.

EDIT2: IMHO, latency (and security) are the real limits on composability and abstraction. When things stacked on top of each other suddenly start to run slowly, the problem can be solved neither by better hardware nor by adding another abstraction layer. The same applies when the system just goes haywire in an unpredictable way.

This entry was originally posted at http://wizzard.dreamwidth.org/262249.html. Please comment there using OpenID.

gadgets, thoughts, hardware, programming
