**Code:**

    uint16_t blend2(uint32_t A, uint32_t B, uint32_t C) {
      grow(A); grow(B); grow(C);
      return pack((A * 2 + B + C) >> 2);
    }

**Code:**

    #define Interp02(c1, c2, c3) \
      (((((c1 & Mask_2) * 2 + (c2 & Mask_2) + (c3 & Mask_2) ) >> 2) & Mask_2) + \
      ((((c1 & Mask13) * 2 + (c2 & Mask13) + (c3 & Mask13) ) >> 2) & Mask13))

Unsure whether the equality test in some of those functions will help or not. It certainly will for solid-color screens, but how common is that? The extra test could make it slower in some cases.

Ignoring that ... Lots of masking and repeated multiplications there.

It's masking FF00FF, performing math on that, then masking 00FF00 and doing the same again, and combining the results. Looks to be working on 24-bit input.
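To make the two-pass masking concrete, here is a rough sketch of the same arithmetic written as a function instead of a macro. The mask values (0x00FF00 for `Mask_2`, 0xFF00FF for `Mask13`) and the name `interp02` are my assumptions based on the description above, not copied from the original source.

```cpp
#include <cstdint>

// Assumed mask values for 24-bit input; the macro above does not show them.
static const uint32_t Mask_2 = 0x00FF00;  // middle channel only
static const uint32_t Mask13 = 0xFF00FF;  // the two outer channels

// Hypothetical function form of Interp02: weights c1 twice, c2 and c3 once,
// then divides by 4. Each masked pass leaves a gap for the sums to carry into.
static uint32_t interp02(uint32_t c1, uint32_t c2, uint32_t c3) {
    uint32_t mid   = (((c1 & Mask_2) * 2 + (c2 & Mask_2) + (c3 & Mask_2)) >> 2) & Mask_2;
    uint32_t outer = (((c1 & Mask13) * 2 + (c2 & Mask13) + (c3 & Mask13)) >> 2) & Mask13;
    return mid + outer;
}
```

Every input gets masked twice and the weighted sum is computed twice, which is the repeated work I mean.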

Mine splits the channels apart and does the multiplication only once. It works on SNES 15-bit input (it can do 16-bit too).

The idea is that n*4 in the worst case can spill over by two extra bits:

%11111*4=%(11)11100, where the bits in parentheses have spilled over and would alias into the next color channel. But if we have some zero bits between the channels, we can shift around and mask. So mine turns:

0rrrrrgggggbbbbb into:

000000ggggg00000 0rrrrr00000bbbbb

Then does the math on them, shifts back, and then packs it back together.
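For reference, a minimal sketch of what `grow` and `pack` could look like for the 15-bit case. The bodies and the mask constant 0x03E07C1F are my reconstruction from the bit layout shown above, so treat them as assumptions rather than the exact code; `grow` has to take its argument by reference for `blend2` to work as written.

```cpp
#include <cassert>
#include <cstdint>

// Assumed mask: 000000ggggg00000 0rrrrr00000bbbbb with every channel bit set.
static const uint32_t kSpreadMask = 0x03E07C1F;

// Spread 0rrrrrgggggbbbbb so each channel has zero bits above it to carry into.
static void grow(uint32_t &n) {
    assert(n <= 0x7FFF);  // this sketch only handles 15-bit input
    n |= n << 16;         // copy the pixel into the upper half-word
    n &= kSpreadMask;     // keep green up high, red and blue down low
}

// Drop the leftover fraction/carry bits and fold green back into place.
static uint16_t pack(uint32_t n) {
    n &= kSpreadMask;
    return static_cast<uint16_t>(n | (n >> 16));
}

uint16_t blend2(uint32_t A, uint32_t B, uint32_t C) {
    grow(A); grow(B); grow(C);
    // Worst case per channel is 31*4 = 124, which fits in the 7 bits available.
    return pack((A * 2 + B + C) >> 2);
}
```

With this layout the three channels are summed and shifted in a single 32-bit operation, and the mask in `pack` discards the low bits that the shift pushed into the gaps.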

I couldn't say which is faster (I would guess mine); you'd have to benchmark it. I just like mine more for readability.