Fast 2D blitting on Super FX

93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

Stef wrote: Wow, that is really impressive :)
Thanks!
Stef wrote: Something I wonder: do you have any kind of double buffering with the SFX memory?
Yes. The SNES can change the screen base register after the frame has been drawn, which allows the Super FX to start work on the next frame before the current one has been fully transferred. This can actually happen before the first available VBlank, as the Super FX's stop instruction issues an interrupt to the SNES by default (you can mask it if you want). The SNES can also temporarily suspend the Super FX's access to Game Pak RAM by changing a flag, which puts the Super FX in a wait state the next time it needs RAM access; this allows the SNES to proceed with the transfer during VBlank without having to forcibly kill the Super FX program.

I'm hoping to be able to keep lag to a minimum when doing 4bpp by drawing one half of the playfield first (possibly the bottom half, because it's likely to be faster) and copying it to VRAM before the other half is finished. Hopefully this won't result in glaring priority issues near the seam... actually, now that I think of it, I could use the anti-wrapping method in FXtest1 to prevent that...
psycopathicteen wrote: I did a demo like this a while ago and I got 256 bullets at 30fps, though I had to make the bullets really small, like 5x5.
For the convenience of the reader: viewtopic.php?f=12&t=13834&start=45#p164693

Impressive work. (Also gave me an idea for transferring 2bpp graphics into a sprite table - rendering into the source format is going to be interesting, but the transfer itself works beautifully...)
Stef
Posts: 263
Joined: Mon Jul 01, 2013 11:25 am

Post by Stef »

Oh yeah, I remember that 256-bullet stuff at 30 FPS on a stock SNES, very impressive as well ;)
You are both obsessed with bullet hell shooters X'D
93143 wrote: Yes. The SNES can change the screen base register after the frame has been drawn, which allows the Super FX to start work on the next frame before the current one has been fully transferred. This can actually happen before the first available VBlank, as the Super FX's stop instruction issues an interrupt to the SNES by default (you can mask it if you want). The SNES can also temporarily suspend the Super FX's access to Game Pak RAM by changing a flag, which puts the Super FX in a wait state the next time it needs RAM access; this allows the SNES to proceed with the transfer during VBlank without having to forcibly kill the Super FX program.
That is cool, so the SFX can be used at almost 100% if you use the double buffering cleverly :)
I wanted to do that sort of software sprite rendering (mostly for bullets) on the Megadrive, as unfortunately the sprite multiplexing does not work as I expected. You cannot use 2bpp rendering on the Megadrive (or, to be more precise, you can, but it won't bring any speed improvement in that case), so I have to use classic 4bpp rendering. To be honest, given the code done by psycopathicteen, I believe it will be quite difficult to match the same performance level on the MD using 4bpp rendering. That kind of code is perfectly adapted to the 65816: it uses the fast disp+indexed addressing mode and fast immediate ops, and also takes advantage of the 16-bit memory operations allowed by the 65816.
Last edited by Stef on Fri Mar 17, 2017 2:31 am, edited 1 time in total.
ARM9
Posts: 57
Joined: Sun Aug 11, 2013 6:07 am

Post by ARM9 »

Very cool tech demo, great to see more people programming the superfx!

If you were to target PAL you could do 4bpp at 50fps.

One problem with double buffering is that no cart has more than 64K ram, so if someone were to do double buffering at 8bpp they'd either have to gut the resolution or put 128K on donor carts. I suppose you could do the latter regardless as some sort of weak, makeshift copy protection.
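To put rough numbers on that RAM constraint, here is a quick sketch of the framebuffer arithmetic. It assumes a packed framebuffer of width x height x bpp / 8 bytes with no padding, ignores everything else that has to live in Game Pak RAM, and uses 256x224 purely as an example resolution; the function names are made up for illustration.

```python
GAME_PAK_RAM = 64 * 1024  # bytes; the largest RAM on existing carts, per the post above


def framebuffer_bytes(width, height, bpp):
    """Size of one packed framebuffer in bytes (assumes no padding)."""
    return width * height * bpp // 8


def double_buffer_fits(width, height, bpp, ram=GAME_PAK_RAM):
    """True if two full framebuffers fit in the given amount of RAM."""
    return 2 * framebuffer_bytes(width, height, bpp) <= ram


if __name__ == "__main__":
    for bpp in (2, 4, 8):
        size = framebuffer_bytes(256, 224, bpp)
        print(bpp, "bpp:", size, "bytes per buffer, double buffer fits in 64K:",
              double_buffer_fits(256, 224, bpp))
```

Under these assumptions a single 256x224 buffer at 8bpp already takes 57344 bytes, so two of them only fit once the RAM is doubled to 128K.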
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Post by 93143 »

ARM9 wrote: Very cool tech demo, great to see more people programming the superfx!
Thanks! Hopefully this is just the beginning...
ARM9 wrote: If you were to target PAL you could do 4bpp at 50fps.
That's true. The demo resolution of 224x192 doesn't seem to quite fit, but a slightly smaller screen would. In fact, at 2bpp there seems to be enough room for the whole screen (256x224), as long as you don't use overscan...

In my case, though, this demo is just an algorithm test/training exercise. I'm attempting a faithful port of an existing game, and while most of it is indeed too colourful for 2bpp, large chunks of it are too busy for the Super FX to maintain 50 fps at 4bpp. I have serious doubts about holding 30. Unless I've grossly overestimated the rendering load, the extra DMA bandwidth would be an embarrassment of riches for the most part.

Besides, I'm Canadian and the game is Japanese...
ARM9 wrote: One problem with double buffering is that no cart has more than 64K ram, so if someone were to do double buffering at 8bpp they'd either have to gut the resolution or put 128K on donor carts. I suppose you could do the latter regardless as some sort of weak, makeshift copy protection.
I may not end up needing 128 KB of GPRAM (the only 8bpp rendering I've encountered so far is a fairly small chunk of the title screen), but I do need CPU ROM, which no existing Super FX cart has any of... I don't think the usual emulators can even load CPU ROM - not that it matters, as due to mid-scanline shenanigans nothing below higan v095 accuracy can so much as run my display engine, so I suppose I can just use a manifest...
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Post by 93143 »

The KISS principle is real.

Last night I spent quite some time on what I thought was a really clever method of drawing a 1bpp dictionary-compressed graphic with the Super FX. It included brain-bending interwoven branches (at least, it was brain-bending at 3 in the morning). It took advantage of the following fact: if you test a bit, branch on it, increment a non-negative number and then (if the branch didn't happen) arrive at the point where an earlier unconditional branch in a different code path arrives after testing the same bit, you can branch on that tested bit and it will work fine in all cases. This is because the increment unsets the zero flag, and you only arrive at the latest branch via the increment if the previous branch didn't happen; that is, if the zero flag was unset by the test.

Here's a snippet of the unrolled inner loop:

Code:

    [...]
    beq 2zero+1	; branch on second bit or R1
    to R14	; preserve R0 (value not used)
    and R5	; check third bit
    bra 2one	; skip increment of R1
    plot	; only executes if second bit was one
2zero:
    to R14	; preserve R0 (value not used)
    and R5	; check third bit
    beq 3zero	; branch on third bit
    inc R1	; only executes if second bit was zero
2one:
    beq 3zero+1	; branch on third bit or R1
    [...]
This method uses a quarter of the ROM of a simple method I rejected in the OP for being too slow, and... almost matches its performance in the best case, being over 50% slower in the worst case.

Code:

    getb
    inc R14
    color
    plot
    mult R3	; where R3 contains 0010h
    swap
    color
   ; loop	; unroll for more speed if you know something about the length, which I do
    plot
I guess I'm using the simple method, unless there's a way to substantially improve the branch hell method...

Actually, even a completely naive method has the potential to be faster when using the clock trick, since the ROM buffer would load in 3 cycles. Of course it doubles the ROM usage again, wasting seven bits per byte if what you want is 1bpp...

Code:

    getc
    inc R14
    loop
    plot
My dither-based methods are faster for relatively substantial mostly solid graphics (like large bullets), but don't really work well at all with very short solid runs interspersed with lots of short gaps (like text). And the hardcoding method I used in the bullet hell demo upthread is impractical for large data sets.

The dictionary compression method is independent of the inner drawing loop, so I can still use it with an adjustment for the larger data size. This mostly saves RAM and compute time. The graphics table really wasn't that large to begin with, so the ROM savings from the branch method aren't really critical right now.
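The ROM trade-off between the three inner loops above (one quarter, one, and two times the ROM of the simple method) can be sketched by packing the same row of on/off pixels at 8, 2 and 1 pixels per byte. This is only an illustration: the bit order of the 1bpp format and the colour index are assumptions, and the nibble order (low nibble plotted first) is my reading of the simple loop rather than something stated in the post.

```python
def pack_1bpp(pixels):
    """8 pixels per byte, most significant bit first (bit order is an assumption)."""
    out = bytearray()
    for i in range(0, len(pixels), 8):
        byte = 0
        for bit, p in enumerate(pixels[i:i + 8]):
            if p:
                byte |= 0x80 >> bit
        out.append(byte)
    return bytes(out)


def pack_nibbles(pixels, color=1):
    """2 pixels per byte: first pixel in the low nibble, second in the high nibble."""
    out = bytearray()
    for i in range(0, len(pixels), 2):
        lo = color if pixels[i] else 0
        hi = color if i + 1 < len(pixels) and pixels[i + 1] else 0
        out.append(lo | (hi << 4))
    return bytes(out)


def pack_bytes(pixels, color=1):
    """1 pixel per byte, ready to be loaded straight into the colour register."""
    return bytes(color if p else 0 for p in pixels)


if __name__ == "__main__":
    row = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0]
    for name, packer in (("1bpp", pack_1bpp), ("nibbles", pack_nibbles), ("bytes", pack_bytes)):
        print(name, len(packer(row)), "bytes")
```

For a 16-pixel row this gives 2, 8 and 16 bytes respectively, which is the 1:4:8 ratio implied by the post.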
none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Post by none »

93143 wrote: I had also considered using a sort of data-as-code format for large quantities of small identical objects. The entire graphic would be hardcoded and require no ROM access, metadata handling, or branching.
This is called a compiled sprite; knowing the term may help when searching for ideas.

Also maybe you can treat sprites that clip against the screen border differently and use a slower method there.

To save on ROM, you could compile the optimized sprite blitting routines on startup from your compressed image data and store them in RAM (idk if that's viable with Super FX).
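As a rough illustration of the compiled-sprite idea, here is a minimal sketch in Python rather than Super FX code. A real compiled sprite is emitted as machine instructions; in this sketch a precomputed list of plot operations stands in for that code, and the function names and the small example bitmap are hypothetical. There is no clipping, so the caller has to keep the sprite inside the buffer.

```python
def compile_sprite(bitmap):
    """Turn a 2D list of colour indices (0 = transparent) into a flat list of
    (dx, dy, colour) plot operations. This is done once, ahead of time."""
    return [(dx, dy, c)
            for dy, row in enumerate(bitmap)
            for dx, c in enumerate(row)
            if c != 0]


def draw_compiled(framebuffer, ops, x, y):
    """Replay the precomputed operations at (x, y). Nothing is read from the
    source bitmap and no transparency test happens at draw time."""
    for dx, dy, c in ops:
        framebuffer[y + dy][x + dx] = c


if __name__ == "__main__":
    bullet = [[0, 1, 0],
              [1, 2, 1],
              [0, 1, 0]]
    ops = compile_sprite(bullet)
    fb = [[0] * 8 for _ in range(4)]
    draw_compiled(fb, ops, 2, 1)
    for row in fb:
        print(row)
```

Compiling at startup, as suggested above, would amount to running the first step at boot and keeping its output in RAM instead of shipping it in ROM.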
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Post by 93143 »

...uh, that post is five years old. If you look a bit down the thread, you'll find that I've already tried it. It worked great - on a small round bullet, I got about 4/3 of the performance of a fast dither method. That's whole-program performance, not inner-loop performance, implying an even larger improvement to the drawing itself. I don't need horizontal clipping (which is the hard part when coding for speed), because my window is less than 256 pixels wide, so I can safely draw off the edge.

It's interesting to see that this has already been thought of and tried on other systems. If what you want is speed above all else, it's pretty clearly a good idea.

Putting code in RAM is possible, but if it's rendering code it will compete with the PLOT circuitry for access to the RAM buffer while it's loading. And even with the tiny bullet in my demo upthread, the whole routine doesn't quite fit in the cache - some of the bullet deletion code runs off the end and has to be loaded from ROM every time it's used.

I don't think I'm going to be that desperate for ROM - I've got 2 MB of it, and only 128 KB of RAM at most, including the framebuffers and game state. Furthermore, anything that the Super FX doesn't need to know (static graphics and HDMA tables, music and sound effects, SNES CPU code) can go in the 6 MB of additional ROM exclusive to the SNES CPU.

Larger, more colourful bullets work fine with the dither method. Anything much bigger and more complicated than that 6x7 glowball has to be done that way in any case, because anything that doesn't fit in the cache takes five times as long to execute even if there are no buffer clashes...
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Post by 93143 »

93143 wrote: Thu Feb 23, 2017 3:49 am
Speaking of shifting, I found that using a table of random numbers was (in my application, which required simultaneous access to a sine table in a different bank) not dramatically faster than just running xorshift16
93143 wrote: Thu Feb 23, 2017 11:43 pm
I think I can do better than that: [...]
23 cycles instead of 28. Nipping at the heels of a reasonable table load algorithm
I managed to shave off another cycle by being less dumb with the initialization (from R1; add R1 rather than move R0,R1; add R0), but...

...it turns out that not only has the original game's PRNG been reverse-engineered, but I can replicate it exactly in 15 cycles on the GSU. The FFT is a bit lumpy, and I'm not sure it would do as well against diehard as what I've been using, but that's not really the point, is it?
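For readers unfamiliar with xorshift16, here is a minimal sketch of one step of a 16-bit xorshift generator. The shift constants used in the posts are not shown in this thread, so the (7, 9, 8) triple below is an assumption (it is a commonly cited choice for 16-bit xorshift), and the code says nothing about the cycle counts on the GSU or about the original game's PRNG.

```python
def xorshift16(state):
    """One step of a 16-bit xorshift generator.

    The (7, 9, 8) shift triple is an assumption, not taken from the thread.
    The state must be non-zero, since zero maps to zero.
    """
    state ^= (state << 7) & 0xFFFF
    state ^= state >> 9
    state ^= (state << 8) & 0xFFFF
    return state


if __name__ == "__main__":
    s = 1
    for _ in range(5):
        s = xorshift16(s)
        print(hex(s))
```

Each step is three shift-and-XOR pairs on a single 16-bit value, which is why it can compete with a table lookup when the table lives in a different bank.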
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Post by 93143 »

So, uh... I recently found out that Nikku4211 posted a video of my 640-bullet render test on YouTube. It looked somewhat glitchy, but I put that down to bad encoding (it is pretty bad; I personally think that lower-resolution videos should be encoded with at least as much quality as high-res videos, but it seems YouTube disagrees).

However, I got an FXPak Pro for Christmas, I loaded my demo, and... it's doing the same thing, large as life on my very own Panasonic Tau CRT TV set.

I've posted on the EverDrive forum about this, but it's a slow board and it may be a while before someone notices.

Does anyone have any idea what could be causing this? I haven't done any actual debugging yet, but my WAG right now is that Redguy's core doesn't properly block the ROM/spoof the IRQ vector and runs slower than the emulated Super FX in bsnes/higan, Mesen, or Snes9X. This would cause my code to occasionally load garbage or an incomplete frame rather than flashing the screen red when the Super FX misses the end of active display. But I haven't thought it through intensively, and it could easily be something else.

EDIT: I just tried setting the Super FX to "fast". It doesn't seem to make much difference. I'm not sure of the implications of this.