Cross-posted from
the author's blog. It is better to comment
there, but here works too.
Agner Fog, dearly loved by everyone,
writes:
The optimization manuals at www.agner.org/optimize/#manuals have now been updated. The most important additions are:
AMD Piledriver and Jaguar processors are now described in the microarchitecture manual and the instruction tables.
Intel Ivy Bridge and Haswell processors are now described in the microarchitecture manual and the instruction tables.
The micro-op cache of Intel processors is analyzed in more detail.
The assembly manual has more information on the AVX2 instruction set.
Already downloading it; this will be my bedtime reading.
While he's at it, he also gives the AMD Piledriver a hard time:
Supports fused multiply-and-add instructions in both the FMA3 and FMA4 form. FMA3 is compatible with Intel processors. See Wikipedia for a discussion of the incompatibility between these instruction sets.
The throughput of FMA3 instructions is only half as much as the throughput of FMA4 instructions, even though they are doing exactly the same calculations.
Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes. No explanation for this has been found. This design flaw is likely to negate any advantage of using the AVX instruction set.
The VMASKMOVPS instruction with a memory source operand takes more than 300 clock cycles on the Jaguar when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw. This instruction is not very common, though.
So what now, detect the CPU and disable AVX for these clunkers?