New technique for pushing video data faster

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.


tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: New technique for pushing video data faster

Post by tepples »

In this post about the Spectre attacks on branch prediction, Dwedit mentioned ROP, or return-oriented programming. Before ROP became famous for use in stack-smashing attacks, this technique of storing a list of subroutines to be called was known as threaded code.

The name Popslide was derived from NOP slide and clockslide, both of which involve a jump into an unrolled loop, like a computed Duff's device.

The technique discussed here combines storing a subroutine sequence in the return stack (the "ROP") with jumping into an unrolled copy loop (like the "slides"). So could we call it ROPslide?
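For anyone unfamiliar with threaded code, the dispatch part boils down to this (labels made up for illustration; the update buffers discussed in this thread also interleave VRAM addresses and data between the return addresses):

Code: Select all

  ;queue RoutineA followed by RoutineB: push the last routine first,
  ;high byte before low byte, and subtract 1 because RTS adds 1
  lda #>(RoutineB-1)
  pha
  lda #<(RoutineB-1)
  pha
  lda #>(RoutineA-1)
  pha
  lda #<(RoutineA-1)
  pha
  rts               ;"returns" into RoutineA

RoutineA:
  ;(...)
  rts               ;"returns" into RoutineB

RoutineB:
  ;(...)
  rts               ;continues with whatever was queued next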
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Sounds good :D

I've come up with another neat little trick. Just like with unrolled loops, it can sometimes be beneficial to jump partway into a subroutine to skip some unnecessary code. If you set the PPU "increment mode" at the beginning of every subroutine, you can opt to keep track of it manually as you build your video stack outside of vblank, and simply add 5 to the subroutine's address to skip over the initial LDA #/STA $2000 pair when you know the increment mode is already correct.

Unlike having separate "set increment mode to 1" and "set increment mode to 32" subroutines, this trick doesn't carry the extra overhead of jumping around additional times in the video stack.
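A rough sketch of what one such subroutine might look like (the exact $2000 value is just an example; the point is that LDA # is 2 bytes and STA $2000 is 3, so the alternate entry point sits exactly 5 bytes in):

Code: Select all

UpdateColumn:
  lda #%10000100    ;NMI on, VRAM increment = 32
  sta $2000
UpdateColumn_Skip:  ;= UpdateColumn+5: push this address instead when the
                    ; increment mode is already what you need
  pla
  sta $2006
  pla
  sta $2006
  pla
  sta $2007
  ;(...)
  rts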
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: New technique for pushing video data faster

Post by tokumaru »

Drakim wrote: Actually, re-reading it a couple of times, what do you mean by "But the extra JMP and the bogus VRAM address are used only once"? How does that work? What exactly do you do to end your video subroutines if you're only using the JMP once during the entire sequence? Don't you have to do it once per subroutine?
I went looking for the posts I made about this, and apparently this is the last piece of code I posted on the subject. The basic idea was to have 2 unrolled loops, both copying up to 32 bytes: one sets the VRAM address before RTS'ing to the next update, while the other is meant to be used for the final update, so it doesn't set the VRAM address at the end. With this setup there's no overhead at all; even the JMP at the end can be eliminated if the 2nd unrolled loop flows directly into the rest of the NMI handler.

But then I decided that having two unrolled loops was too much trouble, and that it'd be better to just let the last update set a bogus VRAM address and RTS to the remainder of the NMI handler. Here's what the code could look like:

As soon as possible in the NMI handler:

Code: Select all

  ;(swap the stack pointer)

  ;execute the first update
  pla
  sta $2006
  pla
  sta $2006
  rts

RestoreStackPointer:

  ;(restore the stack pointer)
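In case it isn't obvious, those two placeholder comments would expand to something along these lines (MainStackPtr is just a zero page variable, and VIDEO_STACK_INIT is one less than the low byte of the address where the queued data starts):

Code: Select all

  ;(swap the stack pointer)
  tsx
  stx MainStackPtr       ;remember the main thread's stack pointer
  ldx #VIDEO_STACK_INIT  ;point SP just below the first queued byte
  txs

  ;(...)

RestoreStackPointer:

  ;(restore the stack pointer)
  ldx MainStackPtr
  txs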
The unrolled data transfer loop, which has 32 possible entry points:

Code: Select all

Copy32Bytes:

  pla
  sta $2007

Copy31Bytes:

  pla
  sta $2007

  ;(...)

Copy1Byte:

  pla
  sta $2007

  ;execute the next update
  pla
  sta $2006
  pla
  sta $2006
  rts
In this case, the only overhead is setting up a bogus VRAM address that won't be used for anything, something that happens only once per vblank, after the final data transfer (can be during the pre-render scanline, so it doesn't waste any actual vblank time).
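To make the buffer format explicit, each queued update is laid out like this in the stack page (lowest address first, since the pulls walk upwards; ca65-style notation, with CopyNBytes standing for whichever entry point matches the transfer length):

Code: Select all

  ;one update:
  .byte >vram_addr, <vram_addr   ;pulled by the PLA/STA $2006 pairs
  .word CopyNBytes-1             ;pulled by the RTS (low byte first)
  .byte d0, d1, ...              ;pulled by the PLA/STA $2007 copies

  ;after the final update:
  .byte >bogus_addr, <bogus_addr ;the unused VRAM address
  .word RestoreStackPointer-1    ;the final RTS lands back in the NMI handler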

Like I said before, this is mostly for those looking to use generic VRAM update code without giving up speed or spending lots of ROM space. I personally prefer to use specialized update routines so I can take advantage of certain characteristics of the NES architecture (e.g. palette mirroring, changing only the low byte of the VRAM address, etc.) to make updates even faster.
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Oh, I finally understand!

Your technique is cool in that you shave off the RTS of the subroutine that sets the address before the mass copy:

Code: Select all

Video_SetAddress:

  pla
  sta $2006
  pla
  sta $2006
  rts
...by baking the address setting into the end of the previous mass copy. Very clever!

While this is probably the fastest way of pushing raw bytes from the faux stack to VRAM in a "video command" manner (probably only beaten by doing everything statically every vblank and not having any "command" system at all), I think you were right in switching to specialized routines; there are just too many situations on the NES that don't involve copying enormous amounts of bytes mindlessly. Plus, specialized routines allow for micro-optimizations here and there, like using LDA # instead of PLA when the VRAM address is static, or skipping over the "set increment" code if the increment mode is already correct, or having specialized unrolled loops that don't PLA when the same value appears multiple times in a row. The sum of those optimizations might very well end up being faster.
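For example, a specialized routine along those lines might look something like this (purely illustrative, with a made-up address and a run of identical tiles):

Code: Select all

ClearStatusRow:
  lda #$20          ;static VRAM address: LDA # instead of PLA
  sta $2006
  lda #$40
  sta $2006
  pla               ;one PLA...
  sta $2007
  sta $2007         ;...reused for a run of identical tiles
  sta $2007
  sta $2007
  ;(...)
  rts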

Still, I'm going to have a think about the possibility of using this in my own specialized system in the case of several mass-copying routines being queued up in a row. :beer:
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: New technique for pushing video data faster

Post by tokumaru »

Yeah, there are several opportunities for optimizations we can do with specialized update routines. You already mentioned how you can use a hardcoded VRAM address for palette updates, but there are many other optimizations you can do in this particular case:

- start writing at $3F01 rather than $3F00, and set the background color later, when you reach $3F10 (a mirror of $3F00);

- don't update any of the "non-displayable" entries ($3F04, $3F08, $3F0C) or their mirrors ($3F14, $3F18, $3F1C), just bit $2007 instead, to advance the VRAM address and skip these;

- read the palette data straight from the place where it's normally stored, not from the stack, saving the time it'd take to copy it there.

Let's see how much time we could save:

- set VRAM address to $3F01 (12 cycles);
- update 3 colors (24 cycles);
- skip $3F04 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F08 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F0C (4 cycles);
- update 3 colors (24 cycles);
- update the background color via $3F10 (8 cycles);
- update 3 colors (24 cycles);
- skip $3F14 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F18 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F1C (4 cycles);
- update 3 colors (24 cycles);
- call the next update (6 cycles);

That's a total of 242 cycles, compared to the 278 cycles that a "raw transfer" of 32 bytes would take, and don't forget the time you save outside of vblank, by not copying the palette data to the stack.
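A sketch of what that routine could look like, reading straight from a 32-byte shadow palette in RAM (shortened with (...) like the loops above; PaletteBuffer is whatever you call your palette copy):

Code: Select all

UpdatePalettes:
  lda #$3F
  sta $2006
  lda #$01
  sta $2006           ;start at $3F01, not $3F00

  lda PaletteBuffer+1
  sta $2007
  lda PaletteBuffer+2
  sta $2007
  lda PaletteBuffer+3
  sta $2007
  bit $2007           ;skip $3F04: the read advances the VRAM address

  ;(...same pattern up to $3F0F...)

  lda PaletteBuffer+0
  sta $2007           ;we're now at $3F10, a mirror of $3F00: backdrop color

  lda PaletteBuffer+17
  sta $2007
  ;(...same pattern for the rest of the sprite palettes...)

  rts                 ;on to the next queued update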
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

I'm reviving this old thread just to add a very important discovery I made, which has cost me months of my life.

Since the NMI interrupt can't be blocked by the SEI instruction, you will eventually be unlucky and have the NMI hit right smack dab in the middle of using this technique. Normally this would only be a problem if you started doing a lot of complex operations in the NMI handler that involve the stack, so I thought I had things under control by only doing this in my NMI handler:

Code: Select all

NMI_Interrupt:
        INC VBlankFlag
        RTI
I mean, it's so tiny, it would just run, and then execution should resume exactly where we left off as if nothing had happened, right?

Wrong!

Because this video technique "flips" the CPU stack when applying the video updates, I set my stack pointer to the very end before doing an RTS to reverse-trampoline into my queued-up video routine. But before I got to that RTS instruction, the NMI kicked in, and like all interrupts it pushed stuff onto the stack so it could hop back to my code when it was done. And since the stack was already at the very end, it ended up overflowing, meaning the newly pushed values overwrote legit ones we were saving! You get a weird corruption at the very start of your stack that won't show up until way, way later in your code. You will only be burned once you return to your normal code and RTS your way all the way back to the beginning of your stack. In my case, that was after I'd cleared a level and returned to the world map.

I see two solutions to this: either start filling the video stack a little earlier than $FF, or intentionally save the bottom stack values and restore them later.

Thank you for coming to my obscure bug TED talk. I hope this saves somebody else from suffering my fate of wasted eternities spent debugging.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: New technique for pushing video data faster

Post by tokumaru »

In my own programs that use the stack method for VRAM updates, I have the initial SP value stored in a variable instead of using a constant. This way I can make use of the whole buffer when doing updates from within the NMI handler (because I know those updates will never be interrupted), and if I ever need to do VRAM updates from the main thread, I can just artificially shrink the update buffer a little in order to leave space for possible interrupts. I usually skip 16 bytes or so, because my NMI handlers often update the music/audio engine, among other things, and those tasks need a bit more stack space.
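Something like this (SavedSP and VramStackSP are just whatever variables you use; the slack size depends on what your NMI handler pushes):

Code: Select all

  ;at the start of the transfer, instead of loading a constant:
  tsx
  stx SavedSP       ;remember the real stack pointer
  ldx VramStackSP   ;initial SP for the update buffer, kept in a variable
  txs
  rts               ;trampoline into the first queued update

  ;when queueing from the NMI handler, VramStackSP can point at the very
  ;start of the buffer; when queueing from the main thread, set it 16
  ;bytes higher so an interrupt frame (plus whatever the NMI handler
  ;pushes) lands in the unused slack instead of on the queued data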
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

That totally works.

I've been ignoring my NMI interrupts in favor of using IRQ for everything as I have more control over where, when and how it will run. For example, I have my IRQ set to trigger so that I can turn off rendering for a thin slice at the bottom of the TV, giving myself more vblank-time to do video updates, and eliminating scrolling artifacts as well.

The way I solved this particular stack issue was simply by leaving 3 bytes on the stack for the NMI to overwrite without corrupting my video data. Works like a charm.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: New technique for pushing video data faster

Post by tepples »

Popslide works similarly, reserving POPSLIDE_SLACK bytes (default 8) for the NMI handler. This means the buffer starts at $0108, and the slack below it can hold the interrupt frame (return address and processor status), all three registers, and even one JSR. (I tend to use fairly simple NMI handlers that only do PPU things and increment a timer, leaving things like sound for the main thread.)
Fiskbit
Posts: 891
Joined: Sat Nov 18, 2017 9:15 pm

Re: New technique for pushing video data faster

Post by Fiskbit »

Drakim wrote: Fri Sep 29, 2023 3:25 pm I've been ignoring my NMI interrupts in favor of using IRQ for everything as I have more control over where, when and how it will run. For example, I have my IRQ set to trigger so that I can turn off rendering for a thin slice at the bottom of the TV, giving myself more vblank-time to do video updates, and eliminating scrolling artifacts as well.
Be careful with turning rendering off mid-screen. Depending on when rendering is turned off, it may cause 2 sprite tiles to disappear the next time rendering starts again, even if OAM DMA is performed in between. Mesen can emulate this somewhat with the optional "Enable PPU OAM row corruption emulation" feature, but you'll want to test on real hardware (and test multiple times, power cycling the console between tests).

Regarding popslide, leaving space for the interrupt is definitely the way to go, but you also want to be careful in case your buffer is something intended to persist, such as an attributes buffer. Most transfer buffers are temporary and it's OK if the stack frame ends up clobbering data you've already sent to the PPU, but for buffers with persistent data, it's a problem.
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Fiskbit wrote: Fri Sep 29, 2023 6:28 pm Be careful with turning rendering off mid-screen. Depending on when rendering is turned off, it may cause 2 sprite tiles to disappear the next time rendering starts again, even if OAM DMA is performed in between. Mesen can emulate this somewhat with the optional "Enable PPU OAM row corruption emulation" feature, but you'll want to test on real hardware (and test multiple times, power cycling the console between tests).
I see...

Can I mitigate this by having 2 dummy sprites at the very top?
Fiskbit
Posts: 891
Joined: Sat Nov 18, 2017 9:15 pm

Re: New technique for pushing video data faster

Post by Fiskbit »

No, that won't work. When rendering is disabled in certain parts of the scanline, the exact dot on which it was disabled determines which 2 sprites are lost when rendering is next enabled. The corruption causes those sprites to be replaced with the data for sprites 0 and 1, so they're invisibly drawn under sprites 0 and 1 (and can cause other sprites on the scanline to drop out due to the 8-sprites-per-scanline limit). You can find more details and test ROMs in this and the following post.

Since you're turning rendering back on in vblank rather than mid-screen, you have two safe regions you can target to avoid this behavior. The first is approximately dots 320-340 (as seen in Mesen's event viewer), which is safe regardless of when you turn rendering back on, but this window is very narrow and hard to reliably hit. If you're targeting that window, you'll want to do a lot of real hardware testing to make sure you're landing there. The other is approximately dots 65-256, but this region is visible, so turning rendering off here is not ideal. There are tricks you can do to land in this large window without visible glitching, such as swapping in blank pattern tables in the previous hblank. You could also minimize glitching by turning off background rendering in the previous hblank and then sprites in the target window (which is less likely to be noticed).

Another approach could be to turn off rendering in hblank, use the forced blank for VRAM transfers, and then turn sprite rendering back on very briefly before the end of the frame. This will cause the corruption to apply when rendering turns back on, allowing you to then do OAM DMA in vblank so sprites are always correct on the next frame.

If you go for the smaller window and can't quite get the timing down, the result could just be corruption of a few specific sprite slots, so you could avoid using those sprites if that timing works for you.


Edit: An alternative approach would be to turn rendering on at the top of the screen late, instead of off at the bottom early. You'll probably want to turn sprites on a scanline later than the background to give the PPU time to handle OAM evaluation. The downside of this approach is that the PPU will no longer skip a dot every 2 frames, which will change the pattern of artifacts in the video output over time (with 3 states instead of 2). Battletoads does this, if you want to see it in action.

Whatever you do, test on real hardware. Mesen's emulation of the rendering-disable bug seems pretty decent, but it's not complete and the bug isn't understood yet at the hardware level.