HQ2X Algorithm Ported to Verilog
Hello all!
I have some very exciting news! Well, it's exciting to me, at least. I have successfully ported the HQ2X pixel scaling algorithm to Verilog HDL. Here are some of my test image results, captured from my FPGA emulator using my VGA capture card. I also provide links to zoomed-in/magnified versions of the test images because the actual images are so small.
Test 2 Image (Input - Unmagnified)
Test 2 Image (Input - Magnified)
Test 2 Image (Output - Unmagnified)
Test 2 Image (Output - Magnified)
Test 3 Image (Input - Unmagnified)
Test 3 Image (Input - Magnified)
Test 3 Image (Output - Unmagnified)
Test 3 Image (Output - Magnified)
Test 4 Image (Input - Unmagnified)
Test 4 Image (Input - Magnified)
Test 4 Image (Output - Unmagnified)
Test 4 Image (Output - Magnified)
Test 5 Image (Input - Unmagnified)
Test 5 Image (Input - Magnified)
Test 5 Image (Output - Unmagnified)
Test 5 Image (Output - Magnified)
Test 6 Image (Input - Unmagnified)
Test 6 Image (Input - Magnified)
Test 6 Image (Output - Unmagnified)
Test 6 Image (Output - Magnified)
I will be posting a gameplay video of the scaler operating live on my VeriNES very soon!
UPDATE: Videos have been added in a later post. Enjoy!
Pz!
Jonathon
Last edited by jwdonal on Mon Sep 05, 2011 2:03 pm, edited 4 times in total.
It's probably too late, especially considering you've already made it.
But I have a 6KB C++ implementation of HQ2x that relies on a 256-byte condensed pattern table and rotation of the pattern tree for the four corners. blargg helped with much of this, too. Compiles down to maybe 1KB of binary code. Might be more suitable for saving transistors.
Source here if you're interested, consider it public domain.
http://pastebin.com/YXpmqvW5
You'd want to do the YUV conversion manually instead of via table, of course.
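For a manual conversion, something along these lines would work. The thresholds (trY/trU/trV) are from memory of MaxSt's original source, and the integer chroma terms here are a rough approximation rather than a verified match to his table, so double-check against hq2x.c before using:

```c
#include <stdint.h>
#include <stdlib.h>

// Sketch of hq2x's per-pixel "diff" test, computed directly
// instead of via a 16M-entry RGB-to-YUV lookup table.
static int yuv_diff(uint32_t c1, uint32_t c2) {
    const int trY = 0x30, trU = 0x07, trV = 0x06;  // assumed thresholds

    int r1 = (c1 >> 16) & 0xFF, g1 = (c1 >> 8) & 0xFF, b1 = c1 & 0xFF;
    int r2 = (c2 >> 16) & 0xFF, g2 = (c2 >> 8) & 0xFF, b2 = c2 & 0xFF;

    // Integer BT.601-style luma (coefficients scaled by 1000)
    int y1 = (299 * r1 + 587 * g1 + 114 * b1) / 1000;
    int y2 = (299 * r2 + 587 * g2 + 114 * b2) / 1000;
    int u1 = (b1 - y1) / 2, u2 = (b2 - y2) / 2;  // rough U (blue difference)
    int v1 = (r1 - y1) / 2, v2 = (r2 - y2) / 2;  // rough V (red difference)

    // Pixels are "different" if any component exceeds its threshold
    return abs(y1 - y2) > trY || abs(u1 - u2) > trU || abs(v1 - v2) > trV;
}
```

In hardware the divides reduce to constant multiplies and shifts, which is presumably why the table-free form is attractive for an FPGA.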
3gengames wrote:
> I barely know what this is, but it's sweet! Good job.

hq2x is one of the well-known pixel art interpolation algorithms.
Oh and thanks byuu.
jwdonal: That's the oldest trick in the book! You made the "output" images slightly brighter than the input images. And studies show louder is better.
Thank you all for the kind words.
byuu wrote:
> It's probably too late, especially considering you've already made it.
> But I have a 6KB C++ implementation of HQ2x that relies on a 256-byte condensed pattern table and rotation of the pattern tree for the four corners. blargg helped with much of this, too. Compiles down to maybe 1KB of binary code. Might be more suitable for saving transistors.

It absolutely is more resource efficient for both software and hardware implementations. To be totally honest, I'm slightly bummed by your post, because that was going to be the "upgrade" that I implemented today, and I was going to post about it! (Less the rotation of the pattern tree for the corners; I don't know what that means yet, but I'll definitely check it out.) But you and blargg beat me to it!

I will post here again once I complete the upgrade. Thanks byuu!
infiniteneslives wrote:
> So do you have an application of this planned for the NES or your NOAC? Or just playing around with the stuff?

Yes, I will be integrating it into my VeriNES and will post some videos on Youtube. But I want to make some optimizations to the code first (e.g. what was stated above).
jwdonal wrote:
> To be totally honest, I'm slightly bummed by your post because that was going to be the "upgrade" that I implemented today and was going to post about it!

Oops, sorry about that :(

jwdonal wrote:
> (Less the rotation of the pattern tree for the corners - I don't know what that means yet but I'll definitely check it out.)

Okay, the original HQ2x code is here:
http://code.google.com/p/hqx/source/bro ... src/hq2x.c
Basic pattern explanations:
Code:
#define PIXEL00_0 *dp = w[5];
#define PIXEL00_10 Interp1(dp, w[5], w[1]);
#define PIXEL00_11 Interp1(dp, w[5], w[4]);
#define PIXEL00_20 Interp2(dp, w[5], w[4], w[2]);
#define PIXEL01_0 *(dp+1) = w[5];
#define PIXEL01_10 Interp1(dp+1, w[5], w[3]);
#define PIXEL01_11 Interp1(dp+1, w[5], w[2]);
#define PIXEL01_20 Interp2(dp+1, w[5], w[2], w[6]);
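For readers without the source handy: Interp1 and Interp2 in the original are simple weighted averages, Interp1 being (3*A + B)/4 and Interp2 being (2*A + B + C)/4. Here is a plain per-channel C sketch of those blends on 32-bit pixels. The real hq2x.c writes through destination pointers and uses packed-pixel arithmetic, so treat this as an illustration of the math, not the original code:

```c
#include <stdint.h>

// (3*c1 + c2) / 4 applied to each byte of a 32-bit pixel
static uint32_t interp1(uint32_t c1, uint32_t c2) {
    uint32_t r = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (c1 >> shift) & 0xFF;
        uint32_t b = (c2 >> shift) & 0xFF;
        r |= (((a * 3 + b) >> 2) & 0xFF) << shift;
    }
    return r;
}

// (2*c1 + c2 + c3) / 4 applied to each byte of a 32-bit pixel
static uint32_t interp2(uint32_t c1, uint32_t c2, uint32_t c3) {
    uint32_t r = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (c1 >> shift) & 0xFF;
        uint32_t b = (c2 >> shift) & 0xFF;
        uint32_t c = (c3 >> shift) & 0xFF;
        r |= (((a * 2 + b + c) >> 2) & 0xFF) << shift;
    }
    return r;
}
```

So PIXEL00_10, for example, writes a pixel that is three parts w[5] and one part w[1].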
Then there is a 256-entry switch on the 8-diff pattern for each pixel of input that produces four unique outputs:
Code:
case 177:
{
PIXEL00_20
PIXEL01_22
PIXEL10_20
PIXEL11_12
break;
}
In HQ2x official, the pixel grid around the current pixel is used:
123
456
789
So for rule 1, PIXEL00_20 = 542. Blend what is above and to the left only.
(side-note: blend rules only ever work with five adjacent pixels, for the top-left, that's 12468, IIRC.)
Yet PIXEL01_20 = 526. Blend with what is above and to the right only. This points the way to a symmetry. We don't even need to look at PIXEL10_20 to know that it will be 584.
So rather than encode four rule sets per output pixel, we only really need to store the first rule set, that is, PIXEL00.
So we ignore 5, since we don't "diff" (compare to see if two pixels are similar or not) 5 to 5. So our pattern is actually "12346789" in one byte.
123
4X6
789
So after rendering the top-left, we can rotate the pattern matches by two steps: each step moves everything counter-clockwise two times. So let's do this:
369
2X8
147
Notice how 01_20 = 526 in the same spot that 00_20 had 542?
The 256-byte rotate table stores this calculation for us, since there's no easy bit-twiddling way to do it. So our new pattern byte after rotation is 74182963. We can do this again for the bottom right, and one last time for the bottom left pixel.
The PIXEL01/10/11 rules are only different because they're all considering 1-9 patterns to be the same for each corner.
So in our case 177, PIXEL01_22 is used: Interp2(dp+1, w[5], w[3], w[2]);
177=B1=pattern match:
101
1X0
001
Rotated:
101
0X0
110
=10100110=$a5=rule 165. And surprise, there we have: PIXEL01_20
I built my symmetry table off of PIXEL00 only. And as it turns out, there are actually six? cases where the symmetry does not hold up in MaxSt's original tables.
I believe them to be unintentional bugs as a result of MaxSt presumably building this table by hand. I did a blind test on my forum of my rotation HQ2x versus MaxSt's HQ2x table, and everyone favored mine, as it eliminated a couple extra jaggy points. But I admit that is subjective.
The difference was something like 6 pixels per video frame, because the differences were in very uncommon pattern matches.
I do wish I knew how MaxSt generated his initial switch table, or if he really just did it painstakingly by hand. I'd like to think there is some sort of computer algorithm that can build that comparison table, and that we can improve upon that algorithm to consider more than five surrounding pixels, to make a super-HQNx algorithm. But ... I haven't been able to figure it out.
> But you and blargg beat me to it!
Let's see if I remember ...
I came up with the pattern table + rotation, and used a script to build the pattern[] array by parsing the original hq2x.c file.
The 16->32->16 bit-packing for simple blending rules without having to split things out per channel came from Allegro's alpha-blending code, that I remembered from the '90s.
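The packing trick, as I understand the Allegro-style version, spreads an RGB565 pixel's three fields into a 32-bit word with carry gaps between them, so one multiply-and-shift blends all three channels at once. A sketch of the idea (the function name and the 3:1 blend case are mine for illustration; the masks follow the standard 565 layout):

```c
#include <stdint.h>

// Blend two RGB565 pixels as (3*a + b) / 4 without unpacking each
// channel separately: R and B stay in the low 16 bits, G moves to
// the high 16 bits, leaving gaps so the per-field sums can't carry
// into a neighboring field.
static uint16_t blend_3_1(uint16_t a, uint16_t b) {
    uint32_t x = (a | (uint32_t)a << 16) & 0x07E0F81F;  // G | R..B spread out
    uint32_t y = (b | (uint32_t)b << 16) & 0x07E0F81F;
    uint32_t r = ((x * 3 + y) >> 2) & 0x07E0F81F;       // per-field (3x+y)/4
    return (uint16_t)(r | r >> 16);                     // fold back to 565
}
```

The gaps matter: a 5-bit field times 3 plus another 5-bit field needs 7 bits, and the layout guarantees those extra bits land in unused space before the >> 2 and mask clean them up.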
blargg took the original pixel differencing code, and implemented a much faster version using a simplified YUV table and some magic constants. He also came up with a distinction between same/diff that is used in there at some point. This differencing is also not a 100% match to the original algorithm, but is 99.9% the same, and way faster.
I came up with a tiny optimization where the pattern match for 6 can be moved to 4 for the next pixel, but it didn't seem to help performance and required extra code.
blargg had some amazing non-grow/pack variants for the most common blend cases using his clamped-RGB add/sub code, but it turned out to be slower on more modern x86 processors for some crazy reason (I guess multiplies have gotten a lot faster lately.)
blargg also has a lot of optimizations that help on the PowerPC more than on the x86. I kept them out to keep the code simpler.
Hopefully that's everything. I'd be really appreciative if anyone can find a way to reduce my code even more. The algorithm becomes somewhat beautiful with the symmetry, and the smaller it is, the more impressive it is.
byuu, that - is - freakin - awesome! I'm still deciphering your notes and matching them up to your C code, but just from skimming it, it's really cool! It's amazing that you guys figured all this out. I will post a more in-depth reply once I implement these upgrades myself, but I can't thank you enough for sharing this.
If I figure out any more optimizations I will def be happy to tell you what they were (if they relate to something that could be done in software and not just hardware).
UPDATE:
byuu wrote:
> So our pattern is actually "12346789" in one byte.

Very minor fix, but that should say "98764321".
> It's amazing that you guys figured all this out.
I really have no idea how I came up with the pattern rotation. I think I just got really lucky.
I still don't understand blargg's super-optimized YUV diff/same functions, but they're definitely a lot faster than the LGPL-C version from MaxSt.
> Very minor fix, but that should say "98764321".
Sorry, it's all pretty much from memory. May be other small mistakes as well, as I did this quite a long time ago.
Okay, so I've finally figured out how (or more precisely, why) this works. And just for everyone's benefit I'm going to go ahead and re-post what byuu posted with some corrections, because the errors threw me off for a while (no offense _whatsoever_ intended toward byuu - I am very thankful for his help; the errors, like he said, are just because he was doing everything from memory).
Anyway, the errors result from the incorrect assumed pattern of "12346789". There were actually just a couple of errors as a result of that first error, but I figured it would make more sense to re-post without snipping just so people don't have to look back and forth between the two posts. Here goes...
-------------------------------------------------------
So we ignore 5, since we don't "diff" (compare to see if two pixels are similar or not) 5 to 5. So our pattern is actually "98764321" in one byte.
123
4X6
789
So after rendering the top-left, we can rotate the pattern matches by two steps: each step moves everything counter-clockwise two times. So let's do this:
369
2X8
147
Notice how 01_20 = 526 in the same spot that 00_20 had 542?
The 256-byte rotate table stores this calculation for us, since there's no easy bit-twiddling way to do it. So our new pattern byte after rotation is 74182963. We can do this again for the bottom right, and one last time for the bottom left pixel.
The PIXEL01/10/11 rules are only different because they're all considering 1-9 patterns to be the same for each corner.
So in our case 177, PIXEL01_22 is used: Interp2(dp+1, w[5], w[3], w[2]);
177=B1=10110001=[98764321]=pattern match:
100
0X1
101
Rotated counter-clockwise by 2 positions:
011
0X0
101
New pattern=[74182963]=10100110=$A6=rule 166
And surprise, there we have: PIXEL01_12
The significance of this (i.e. the "surprise") is that PIXEL01_12 blends the (now rotated) pixels at locations 5 & 6. Which are the exact same pixels (i.e. pixels 5 & 8) that PIXEL11_12 would have blended had we stuck with rule 177 using the non-rotated pattern.
-------------------------------------------------------
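To make the corrected table concrete: with the [98764321] bit layout, one grid rotation permutes the byte into the [74182963] order, and the 256-byte rotate table just caches that permutation. Here is one way such a table could be generated; this is my own sketch of the construction, not code from byuu's implementation:

```c
#include <stdint.h>

// Build a 256-byte rotate table for the pattern byte.
// Bit layout (b7..b0) holds grid positions [9 8 7 6 4 3 2 1];
// after one rotation each bit instead holds [7 4 1 8 2 9 6 3].
static void build_rotate_table(uint8_t table[256]) {
    static const int layout[8] = { 9, 8, 7, 6, 4, 3, 2, 1 };  // bit 7..0
    static const int source[8] = { 7, 4, 1, 8, 2, 9, 6, 3 };  // fed by rotation
    for (int n = 0; n < 256; n++) {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++) {
            // output bit (7-i) takes the old value of grid position source[i],
            // which lives at the input bit whose layout entry matches it
            for (int j = 0; j < 8; j++) {
                if (layout[j] == source[i] && (n & (0x80 >> j)))
                    out |= 0x80 >> i;
            }
        }
        table[n] = out;
    }
}
```

With this table, all four output pixels can share the PIXEL00 rule set: use the pattern as-is for the top-left, table[pattern] for the top-right, and apply the table again for the bottom-right and bottom-left, matching the worked $B1 -> $A6 example above.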
The optimization that byuu created is extremely elegant, which consequently makes it difficult to explain well with words alone. I plan on adding some hand-drawn diagrams to this post to visually describe the process. Fixing the errors and then drawing all this out is the only thing that allowed me to finally understand it. :-P I just need to clean up what I've already created and scan it into my machine. I'll put the files on Dropbox and post a link.
UPDATE:
Here ya go. Hopefully this doesn't just confuse people more. :( lol
http://dl.dropbox.com/u/36237540/hq2x/h ... xample.pdf
Pz!
Jonathon W. Donaldson
> The significance of this (i.e. the "surprise") is that PIXEL01_12 blends the (now rotated) pixels at locations 5 & 6. Which are the exact same pixels (i.e. pixels 5 & 8) that PIXEL11_12 would have blended had we stuck with rule 177 using the non-rotated pattern.
Typo in bold should be PIXEL11_12.
Yeah, sounds like you have it now, you're reminding me how it works again :D
I unfortunately had my forum in purge mode to conserve space, and lost the old post where I explained this when I first figured it out.
Thanks for the corrections.
Always felt some sort of animation showing the rotation in real-time would be kinda neat to explain this.
You can technically forgo the need to rotate if you store a 1024-byte table, but I like the smaller version. Would be really cool if someone could come up with a better 'rotate' bit-twiddling algorithm so that we don't need the second 256-byte table.
Oh, I also don't think I mentioned it, but I load in all eight surrounding pixels, but since there are only five "points" that HQ2x ever blends against, I rearrange their order when calling the blend functions, which is based around the same rotation, but I saw no reason to shuffle eight 16-bit values.
Code:
n = ((n >> 2) & 0x11) | ((n << 2) & 0x88)
| ((n & 0x01) << 5) | ((n & 0x08) << 3)
| ((n & 0x10) >> 3) | ((n & 0x80) >> 5);
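For what it's worth, that expression agrees with the worked example earlier in the thread: wrapped in a function (the wrapper name is mine), it takes rule 177 ($B1) to rule 166 ($A6), and four applications return any pattern to itself, since the rotation is two 4-cycles over the eight bits:

```c
#include <stdint.h>

// One 90-degree rotation of the neighbor pattern, expressed as
// bit-twiddling on the [98764321] layout (no lookup table needed).
static uint8_t rotate_pattern(uint8_t n) {
    return ((n >> 2) & 0x11) | ((n << 2) & 0x88)
         | ((n & 0x01) << 5) | ((n & 0x08) << 3)
         | ((n & 0x10) >> 3) | ((n & 0x80) >> 5);
}
```

This would replace the second 256-byte table outright, at the cost of a few more ALU operations per corner.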
To make the grid rotation as easy as a bit rotation, order the bits as
Code:
0 1 2
7 x 3
6 5 4
byuu wrote:
> Typo in bold should be PIXEL11_12.

Not as I understand the algorithm. 01_12 is correct as I have it. Maybe it's just that I worded it funny or something.
01_12 is blending 5 & 6. Pixel 6 _was_ the pixel that was in location 8 prior to the matrix rotation. And prior to the rotation you would have had to use 11_12 to blend 5 & 8. So, 01_12 (5,6) in rule 166 has taken the place of 11_12 (5,8) in rule 177.
Maybe the new wording will help. I drew this all out on paper and it does work. Maybe I should scan that one too. We may just be thinking about it differently. I'm guessing there is more than one way to look at this optimization - maybe that's what happening.
Pz!
Jonathon
> To make the grid rotation as easy as a bit rotation, order the bits as
0 1 2
7 x 3
6 5 4
Reordering won't help.
01273654 ROL 2 = 27365401
273
6x5
401
That is not rotated in any way. We would want:
234
1x5
076
> Not as I understand the algorithm
Okay, I'll just defer to your judgment then, you've studied a lot more than I have lately now =)