To be totally honest....I'm slightly bummed by your post because that was going to be the "upgrade" that I implemented today and was going to post about it!
Oops, sorry about that :(
(Less the rotation of the pattern tree for the corners - I don't know what that means yet but I'll definitely check it out.)
Okay, the original HQ2x code is here:
http://code.google.com/p/hqx/source/bro ... src/hq2x.c
Basic pattern explanations:
#define PIXEL00_0 *dp = w;
#define PIXEL00_10 Interp1(dp, w, w);
#define PIXEL00_11 Interp1(dp, w, w);
#define PIXEL00_20 Interp2(dp, w, w, w);
#define PIXEL01_0 *(dp+1) = w;
#define PIXEL01_10 Interp1(dp+1, w, w);
#define PIXEL01_11 Interp1(dp+1, w, w);
#define PIXEL01_20 Interp2(dp+1, w, w, w);
PIXEL00 = top-left, 01=top-right, 10=bottom-left, 11=bottom-right
Then there is a 256-entry switch on the 8-diff pattern for each pixel of input that produces four unique outputs:
Case 177 = rule 4: blend2(E, D, B);
In HQ2x official, the pixel grid around the current pixel is used:
So for rule 1, PIXEL00_20 = 542. Blend what is above and to the left only.
(side-note: blend rules only ever work with five adjacent pixels, for the top-left, that's 12468, IIRC.)
Yet PIXEL01_20 = 526. Blend with what is above and to the right only. This leads way to a symmetry. We don't even need to look at PIXEL10_20 to know that it will be 584.
So rather than encode four rule sets per output pixel, we only really need to store the first rule set, that is, PIXEL00.
So we ignore 5, since we don't "diff" (compare to see if two pixels are similar or not) 5 to 5. So our pattern is actually "12346789" in one byte.
So after rendering the top-left, we can rotate the pattern matches by two steps: each step moves everything counter-clockwise two times. So let's do this:
Notice how 01_20 = 526 in the same spot that 00_20 had 542?
The 256-byte rotate table stores this calculation for us, since there's no easy bit-twiddling way to do it. So our new pattern byte after rotation is 74182963. We can do this again for the bottom right, and one last time for the bottom left pixel.
The PIXEL01/10/11 rules are only different because they're all considering 1-9 patterns to be the same for each corner.
So in our case 177, PIXEL01_22 is used: Interp2(dp+1, w, w, w);
=10100110=$a5=rule 165. And surprise, there we have: PIXEL01_20
I built my symmetry table off of PIXEL00 only. And as it turns out, there are actually six? cases where the symmetry does not hold up in MaxSt's original tables.
I believe them to be unintentional bugs as a result of MaxSt presumably building this table by hand. I did a blind test on my forum of my rotation HQ2x versus MaxSt's HQ2x table, and everyone favored mine, as it eliminated a couple extra jaggy points. But I admit that is subjective.
The difference was something like 6 pixels per video frame, because the differences were in very uncommon pattern matches.
I do wish I knew how MaxSt generated his initial switch table, or if he really just did it painstakingly by hand. I'd like to think there is some sort of computer algorithm that can build that comparison table, and that we can improve upon that algorithm to consider more than five surrounding pixels, to make a super-HQNx algorithm. But ... I haven't been able to figure it out.
> But you and blargg beat me to it!
Let's see if I remember ...
I came up with the pattern table + rotation, and used a script to build the pattern array by parsing the original hq2x.c file.
The 16->32->16 bit-packing for simple blending rules without having to split things out per channel came from Allegro's alpha-blending code, that I remembered from the '90s.
blargg took the original pixel differencing code, and implemented a much faster version using a simplified YUV table and some magic constants. He also came up with a distinction between same/diff that is used in there at some point. This differencing is also not a 100% match to the original algorithm, but is 99.9% the same, and way faster.
I came up with a tiny optimization where the pattern match for 6 can be moved to 4 for the next pixel, but it didn't seem to help performance and required extra code.
blargg had some amazing non-grow/pack variants for the most common blend cases using his clamped-RGB add/sub code, but it turned out to be slower on more modern x86 processors for some crazy reason (I guess multiplies have gotten a lot faster lately.)
blargg also has a lot of optimizations that help on the PowerPC more than on the x86. I kept them out to keep the code simpler.
Hopefully that's everything. I'd be really appreciative if anyone can find a way to reduce my code even more. The algorithm becomes somewhat beautiful with the symmetry, and the smaller it is, the more impressive it is.