Tonight, I started on Calyx again. I started with the display/hardware layer. One of my design requirements was alpha-blending support in the system.
Long ago, when I started on dGUI/Calyx, I found what was purported to be a very fast software alpha blender, which is great as a fallback for drivers that won't implement blending themselves. So, today, I went looking for the document from which I got the concept and base code. This time around, I realised that it could be made better than it was.
The following is contained in a comment from the Calyx source tree. Prettified for HTML, of course. Yeah, there's probably an error or two in here somewhere. I don't care.
http://www.gamedev.net/reference/articles/article1594.asp That's where the idea behind this blending method comes from. I intend to take his idea and make it even faster.
His table initialiser (reformatted for readability):
int InitTable() {
    float fValue, fAlpha;
    int iValue, iAlpha;

    for (iAlpha = 0; iAlpha < 256; iAlpha++) {
        fAlpha = ((float)iAlpha) / 255;
        for (iValue = 0; iValue < 256; iValue++) {
            fValue = ((float)iValue) / 255;
            AlphaTable.Levels[iAlpha].Values[iValue] =
                clipByte((int)((fValue * fAlpha) * 255));
        }
    }
    return true;
}
For each of the 256 alpha levels, 256 values are generated. For some unknown reason, he decided to do the math piecewise, though I guess that might make sense for explanation's sake.
fAlpha = ((float)iAlpha) / 255;
fValue = ((float)iValue) / 255;
the actual value = (fAlpha * fValue) * 255;
Let's do some simple replacement (float casting removed):
the actual value = ((iAlpha / 255) * (iValue / 255)) * 255;
Combine the two divisions:
the actual value = ((iAlpha * iValue) / 65025) * 255;
Cancel the trailing 255 against it (65025 / 255 = 255):
the actual value = (iAlpha * iValue) / 255;
BOOYA! One multiply and one divide, as opposed to two of each. I win.
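Integer truncation and float truncation aren't guaranteed to land on the same value in every single case, so here's a throwaway standalone check (not Calyx code; the names are mine) that counts how often the two formulas actually disagree across the full 256x256 range:

#include <stdio.h>

int main(void) {
    int ia, iv, diffs = 0;
    for (ia = 0; ia < 256; ia++) {
        for (iv = 0; iv < 256; iv++) {
            /* The article's piecewise float math, casts and all. */
            int slow = (int)(((((float)iv) / 255) * (((float)ia) / 255)) * 255);
            /* The collapsed integer version. */
            int fast = (ia * iv) / 255;
            if (slow != fast)
                diffs++;
        }
    }
    printf("%d mismatches out of 65536\n", diffs);
    return 0;
}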
The clipByte function:
__inline unsigned __int8 clipByte(int value) {
    value = (0   & (-(int)(value < 0)))   | (value & (-(int)!(value < 0)));
    value = (255 & (-(int)(value > 255))) | (value & (-(int)!(value > 255)));
    return value;
}
The idea here is to take a 32-bit number and strip it down to 8 bits. Because our math here is unsigned, there's no sign bit worth keeping; we just want the low 8 bits, nothing else matters.
We can do this in a macro:
#define clipByte(x) ((x) & 255)
I imagine that works out to rather less than 15 ix86 instructions (it's just a single 8-bit AND)...
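For anything already in 0..255, the mask and the branchless clamp give the same answer, and that's all the table initialiser ever produces, since (ia * iv) / 255 can't exceed 255. A trivial standalone illustration (the main and its asserts are mine, not Calyx code):

#include <assert.h>

#define clipByte(x) ((x) & 255)

int main(void) {
    /* Anything already in 0..255 passes through the mask untouched. */
    assert(clipByte(0)   == 0);
    assert(clipByte(200) == 200);
    assert(clipByte(255) == 255);

    /* Out-of-range values keep their low byte rather than clamping:
     * 256 -> 0, 300 -> 44.  Fine for the table initialiser, which
     * never produces anything above 255. */
    assert(clipByte(256) == 0);
    assert(clipByte(300) == 44);
    return 0;
}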
And, now, my version of the additive alpha blend code:
#define cb(x) ((x) & 255)

uint8 _cx_alpha_values[256][256];

void cx_init_alpha() {
    int iv, ia;
    for (ia = 0; ia < 256; ia++)
        for (iv = 0; iv < 256; iv++)
            _cx_alpha_values[ia][iv] = (ia * iv) / 255;
}
/* Six-line alpha blend. Oh yeah. */
uint32 cx_alpha_blend(uint32 top, uint32 bottom)
{
    uint32 blend = 0;
    uint8 *st = _cx_alpha_values[(top >> 24)];   /* premultiply row for the top pixel's alpha */

    blend |= cb(st[cb(top >> 16)] + cb(bottom >> 16)) << 16;
    blend |= cb(st[cb(top >> 8)]  + cb(bottom >> 8))  << 8;
    blend |= cb(st[cb(top)]       + cb(bottom));
    return (blend | 0xff000000);                 /* force the result fully opaque */
}
Sadly, I can't correctly test it until I get enough of the Calyx framework together...
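In the meantime, the two functions are self-contained enough to poke at numerically outside the framework. Something like the following would do for a first sanity pass; the typedefs are placeholders (Calyx will declare its own), and it assumes cx_init_alpha and cx_alpha_blend from above are pasted into the same file between the typedefs and main:

#include <stdio.h>

/* Placeholder typedefs -- Calyx will declare its own. */
typedef unsigned char uint8;
typedef unsigned int  uint32;

/* Prototypes for the two functions above, assumed to live in this
 * same file, after the typedefs and before main. */
void   cx_init_alpha(void);
uint32 cx_alpha_blend(uint32 top, uint32 bottom);

int main(void) {
    cx_init_alpha();
    /* 50%-alpha mid grey composited onto an opaque dark grey. */
    printf("%08x\n", cx_alpha_blend(0x80808080, 0xff202020));
    return 0;
}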