New technique for pushing video data faster
Re: New technique for pushing video data faster
In this post about the Spectre attacks on branch prediction, Dwedit mentioned ROP, or return-oriented programming. Before ROP became famous for use in stack-smashing attacks, this technique of storing a list of subroutines to be called was known as threaded code.
The name Popslide was derived from NOP slide and clockslide, both of which involve a jump into an unrolled loop, like a computed Duff's device.
The technique discussed here combines storing a subroutine sequence in the return stack (the "ROP") with jumping into an unrolled copy loop (like the "slides"). So could we call it ROPslide?
Re: New technique for pushing video data faster
Sounds good
I've come up with another neat little trick. Just like with unrolled loops, it can sometimes be beneficial to jump a short way into a subroutine to skip some unnecessary code. If you set the PPU "increment mode" at the beginning of every subroutine, you can opt to keep track of it manually as you build your video stack outside of vblank, and simply add 5 to the subroutine's address to skip over the initial LDA #/STA $2000 when you know the increment mode is already correct.
Unlike having "set increment mode to 1" and "set increment mode to 32" subroutines, this trick doesn't carry any extra overhead of jumping around additional times in the video stack.
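The "+5" falls out of the 6502 instruction encoding: LDA # is 2 bytes and STA absolute is 3. Here's a quick Python sketch of how a buffer builder might pick the entry point (the function name and addresses are made up for illustration; only the instruction sizes are real):

```python
# Sketch of the "+5" entry-point trick. The prefix being skipped is:
#   LDA #ctrl   ; 2 bytes (opcode + immediate operand)
#   STA $2000   ; 3 bytes (opcode + 16-bit address)
LDA_IMM_SIZE = 2
STA_ABS_SIZE = 3
SKIP_OFFSET = LDA_IMM_SIZE + STA_ABS_SIZE  # 5 bytes

def entry_point(sub_addr, wanted_inc, current_inc):
    """Return the address to queue for this update, skipping the
    LDA #/STA $2000 prefix when the increment mode is already right."""
    if wanted_inc == current_inc:
        return sub_addr + SKIP_OFFSET
    return sub_addr

# Building the video stack outside of vblank, tracking the mode manually:
addr = entry_point(0x8000, wanted_inc=1, current_inc=1)  # -> 0x8005
```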
Re: New technique for pushing video data faster
Drakim wrote:
Actually, re-reading a couple of times, what do you mean "But the extra JMP and the bogus VRAM address are used only once", how does that work? What exactly do you do to end your video subroutines if you are only using the JMP one time during the entire sequence? Don't you have to do it one time per subroutine?

I went looking for the posts I made about this, and apparently this is the last piece of code I posted about it. The basic idea was to have 2 unrolled loops, both copying up to 32 bytes, but one sets the VRAM address before RTS'ing to the next update, while the other is meant to be used for the final update, so it doesn't set the VRAM address at the end. With this setup there's no overhead at all; even the JMP at the end can be eliminated if the 2nd unrolled loop flows directly into the rest of the NMI handler.
But then I decided that having two unrolled loops was too much trouble, and that it'd be better to just let the last update set a bogus VRAM address and RTS to the remainder of the NMI handler. Here's what the code could look like:
As soon as possible in the NMI handler:
Code: Select all
;(swap the stack pointer)
;execute the first update
pla
sta $2006
pla
sta $2006
rts
RestoreStackPointer:
;(restore the stack pointer)
Code: Select all
Copy32Bytes:
pla
sta $2007
Copy31Bytes:
pla
sta $2007
;(...)
Copy1Byte:
pla
sta $2007
;execute the next update
pla
sta $2006
pla
sta $2006
rts
Like I said before, this is mostly for those looking to use generic VRAM update code without giving up on speed or lots of ROM space. I personally prefer to use specialized update routines so I can take advantage of certain characteristics of the NES architecture (e.g. palette mirroring, changing only the low byte of the VRAM address, etc.) to make updates even faster.
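The byte order in the buffer follows directly from how PLA and RTS work: PLA consumes bytes in ascending order, the first write to $2006 sets the high byte of the VRAM address, and RTS pulls PCL then PCH and resumes at that address plus 1, so routine entry points must be stored as (entry - 1), low byte first. A toy Python sketch of one queued update's layout (the exact buffer format and the Copy3Bytes address are illustrative, not from the post):

```python
# Toy model of the "video stack" layout for one queued update.
def push_update(buffer, vram_addr, copy_entry, data):
    """Append: VRAM address (high byte first, since the first STA $2006
    write sets the high byte), the RTS target, then the data bytes."""
    buffer.append((vram_addr >> 8) & 0xFF)  # pla / sta $2006 (high)
    buffer.append(vram_addr & 0xFF)         # pla / sta $2006 (low)
    target = (copy_entry - 1) & 0xFFFF      # RTS adds 1 on return
    buffer.append(target & 0xFF)            # PCL is pulled first...
    buffer.append(target >> 8)              # ...then PCH
    buffer.extend(data)                     # pla / sta $2007, repeated

buf = []
COPY3 = 0x8123  # hypothetical Copy3Bytes entry inside the unrolled loop
push_update(buf, 0x23C0, COPY3, [0x55, 0xAA, 0x55])
# buf == [0x23, 0xC0, 0x22, 0x81, 0x55, 0xAA, 0x55]
```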
Re: New technique for pushing video data faster
Oh, I understand finally!
Your technique is cool in that you shave off the RTS in the subroutine that sets the address before the mass copy:
Code: Select all
Video_SetAddress:
pla
sta $2006
pla
sta $2006
rts
By baking the address setting into the previous mass copy. Very clever!

While this is probably the fastest way of pushing raw bytes from the faux stack to the VRAM directly in a "video command" manner (probably only beaten by doing it all statically every vblank and not having any "command" system at all), I think you were right to switch to specialized routines; there are just too many situations on the NES that don't involve copying enormous amounts of bytes mindlessly. Plus, the specialized routines allow for micro-optimizations here and there, like using LDA # instead of PLA when it's a static VRAM address, or skipping over the "set increment" code if the increment mode is already correct, or having specialized unrolled loops that don't PLA if the same value appears multiple times in a row. The sum of optimizations might very well end up being faster.

Still, I'm going to think about the possibility of using this in my own specialized system for the case of several mass-copying routines being queued up in a row.
Re: New technique for pushing video data faster
Yeah, there are several opportunities for optimizations we can do with specialized update routines. You already mentioned how you can use a hardcoded VRAM address for palette updates, but there are many other optimizations you can do in this particular case:
- start writing at $3F01 rather than $3F00, and set the background color later, when you reach $3F10 (a mirror of $3F00);
- don't update any of the "non-displayable" entries ($3F04, $3F08, $3F0C) or their mirrors ($3F14, $3F18, $3F1C), just bit $2007 instead, to advance the VRAM address and skip these;
- read the palette data straight from the place where it's normally stored, not from the stack, saving the time it'd take to copy it there.
Let's see how much time we could save:
- set VRAM address to $3F01 (12 cycles);
- update 3 colors (24 cycles);
- skip $3F04 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F08 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F0C (4 cycles);
- update 3 colors (24 cycles);
- update the background color via $3F10 (8 cycles);
- update 3 colors (24 cycles);
- skip $3F14 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F18 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F1C (4 cycles);
- update 3 colors (24 cycles);
- call the next update (6 cycles);
That's a total of 242 cycles, compared to the 278 cycles that a "raw transfer" of 32 bytes would take, and don't forget the time you save outside of vblank, by not copying the palette data to the stack.
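The arithmetic above can be double-checked with a quick script, using standard 6502 cycle counts (LDA abs, STA abs, PLA, and BIT abs are 4 cycles each; LDA # is 2; RTS is 6):

```python
# Per-color cost: LDA abs (4) + STA $2007 (4) = 8 cycles.
# Skips: BIT $2007 = 4. Hardcoded address set: two LDA #/STA pairs = 12.
COLOR, SKIP, SET_ADDR, BG, NEXT = 8, 4, 12, 8, 6
# 8 groups of 3 colors, 6 skipped entries, background via $3F10, then RTS.
palette_update = SET_ADDR + 8 * (3 * COLOR) + 6 * SKIP + BG + NEXT

# Raw 32-byte transfer: PLA/STA $2006 twice (16 cycles), then 32
# PLA/STA $2007 pairs (8 each), then RTS (6).
raw_transfer = 16 + 32 * 8 + 6

print(palette_update, raw_transfer)  # 242 278
```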
Re: New technique for pushing video data faster
I'm reviving this old thread just to add a very important discovery I made, which has cost me months of my life.
Since the NMI interrupt cannot be prevented by the SEI instruction, you will eventually be unlucky and have the NMI hit right smack dab in the middle of using this technique. Normally this would only be a problem if you did a lot of complex operations in the NMI handler that involve the stack, so I thought I had things under control by only doing this in my NMI handler:
Code: Select all
NMI_Interrupt:
INC VBlankFlag
RTI
I mean, it's so tiny, it would just run, and then execution should resume exactly where we left off as if nothing happened, right?
Wrong!
Because this video technique "flips" the CPU stack when applying the video updates, I set my stack pointer to the very end before doing the RTS to reverse-trampoline into my queued-up video routine. But before I got to that RTS instruction, the NMI kicked in, and like all interrupts it pushed stuff to the stack so it could hop back to my code afterwards. And since the stack pointer was at the very end, it overflowed: the new stack values ended up overwriting legit ones we were saving! You get a weird corruption at the very start of your stack that won't surface until way, way later in your code; you only get burned once you return to your normal code and RTS your way all the way back to the beginning of the stack. In my case, that was after I'd cleared a level and returned to the world map.
I see two solutions to this: either start filling the video stack a little earlier than $FF, or intentionally save the bottom stack values and restore them later.
Thank you for coming to my obscure-bug TED talk. I hope this saves somebody else from suffering my fate of wasted eternities spent debugging.
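A toy Python model of the clobber (illustrative; only the 6502 stack mechanics are real). The 6502 stack occupies $0100-$01FF, pushes write to $0100+S and then decrement S (wrapping within the page), and an interrupt pushes a 3-byte frame (PCH, PCL, status). If S sits at $FF so the first PLA reads $0100, that frame lands on $01FD-$01FF, right on top of the oldest return addresses the main thread saved there:

```python
def push(mem, s, byte):
    """One 6502 stack push: write to $0100+S, then decrement S,
    wrapping within the stack page."""
    mem[0x100 + s] = byte
    return (s - 1) & 0xFF

# Hypothetical deepest main-thread return-address bytes at the stack top.
mem = {0x1FF: 0xC0, 0x1FE: 0x12, 0x1FD: 0xC0}
s = 0xFF                               # SP positioned for the video stack
for frame_byte in (0x81, 0x23, 0x34):  # NMI pushes PCH, PCL, P
    s = push(mem, s, frame_byte)
# $01FD-$01FF no longer hold the saved return addresses, so the bug
# only surfaces much later, when the main thread RTSes back that far.
```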
Re: New technique for pushing video data faster
In my own programs that use the stack method for VRAM updates, I have the initial SP value stored in a variable instead of using a constant. This way I can make use of the whole buffer when doing updates from within the NMI handler (because I know those updates will never be interrupted), and if I ever need to do VRAM updates from the main thread, I can just artificially shrink the update buffer a little in order to leave space for possible interrupts. I usually skip 16 bytes or so, because my NMI handlers often update the music/audio engine, as well as other things, and those tasks need a bit more stack space.
Re: New technique for pushing video data faster
That totally works.
I've been ignoring my NMI interrupts in favor of using IRQ for everything as I have more control over where, when and how it will run. For example, I have my IRQ set to trigger so that I can turn off rendering for a thin slice at the bottom of the TV, giving myself more vblank-time to do video updates, and eliminating scrolling artifacts as well.
The way I solved this particular stack issue was simply by leaving 3 bytes on the stack for the NMI to overwrite without corrupting my video data. Works like a charm.
Re: New technique for pushing video data faster
Popslide works similarly, reserving POPSLIDE_SLACK bytes (default 8) for the NMI handler. This means the buffer starts at $0108 and can handle the interrupt frame (return address and processor status), all three registers, and even one JSR. (I tend to use fairly simple NMI handlers that only do PPU things and increment a timer, leaving things like sound for the main thread.)
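The default of 8 bytes adds up as follows (a quick sanity check; the breakdown matches the description above): the interrupt frame is 3 bytes (PCH, PCL, processor status), saving A/X/Y with PHA/TXA/PHA/TYA/PHA takes 3 more, and one JSR inside the handler pushes a 2-byte return address.

```python
# Slack budget for an NMI hitting mid-popslide, per the post above.
INTERRUPT_FRAME = 3   # PCH, PCL, processor status
SAVED_REGISTERS = 3   # A, X, Y pushed by the handler
ONE_JSR = 2           # 16-bit return address
slack_needed = INTERRUPT_FRAME + SAVED_REGISTERS + ONE_JSR  # 8
```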
Re: New technique for pushing video data faster
Drakim wrote: ↑Fri Sep 29, 2023 3:25 pm
I've been ignoring my NMI interrupts in favor of using IRQ for everything as I have more control over where, when and how it will run. For example, I have my IRQ set to trigger so that I can turn off rendering for a thin slice at the bottom of the TV, giving myself more vblank-time to do video updates, and eliminating scrolling artifacts as well.

Be careful with turning rendering off mid-screen. Depending on when rendering is turned off, it may cause 2 sprite tiles to disappear the next time rendering starts again, even if OAM DMA is performed in between. Mesen can emulate this somewhat with the optional "Enable PPU OAM row corruption emulation" feature, but you'll want to test on real hardware (and test multiple times, power cycling the console between tests).
Regarding popslide, leaving space for the interrupt is definitely the way to go, but you also want to be careful in case your buffer is something intended to persist, such as an attributes buffer. Most transfer buffers are temporary and it's OK if the stack frame ends up clobbering data you've already sent to the PPU, but for buffers with persistent data, it's a problem.
Re: New technique for pushing video data faster
Fiskbit wrote: ↑Fri Sep 29, 2023 6:28 pm
Be careful with turning rendering off mid-screen. Depending on when rendering is turned off, it may cause 2 sprite tiles to disappear the next time rendering starts again, even if OAM DMA is performed in between. Mesen can emulate this somewhat with the optional "Enable PPU OAM row corruption emulation" feature, but you'll want to test on real hardware (and test multiple times, power cycling the console between tests).

I see...
Can I mitigate this by having 2 dummy sprites at the very top?
Re: New technique for pushing video data faster
No, that won't work. When rendering is disabled in certain parts of the scanline, the exact dot on which it was disabled determines which 2 sprites are lost when rendering is next enabled. The corruption causes those sprites to be replaced with the data for sprites 0 and 1, so they're invisibly drawn under sprites 0 and 1 (and can cause other sprites on the scanline to drop out due to the 8-sprites-per-scanline limit). You can find more details and test ROMs in this and the following post.
Since you're turning rendering back on in vblank rather than mid-screen, you have two safe regions you can target to avoid this behavior. The first is approximately dots 320-340 (as seen in Mesen's event viewer), which is safe regardless of when you turn rendering back on, but this window is very narrow and hard to reliably hit. If you're targeting that window, you'll want to do a lot of real hardware testing to make sure you're landing there. The other is approximately dots 65-256, but this region is visible, so turning rendering off here is not ideal. There are tricks you can do to land in this large window without visible glitching, such as swapping in blank pattern tables in the previous hblank. You could also minimize glitching by turning off background rendering in the previous hblank and then sprites in the target window (which is less likely to be noticed).
Another approach could be to turn off rendering in hblank, use the forced blank for VRAM transfers, and then turn sprite rendering back on very briefly before the end of the frame. This causes the corruption to occur at that controlled point, allowing you to then do OAM DMA in vblank so sprites are always correct on the next frame.
If you go for the smaller window and can't quite get the timing down, the result could just be corruption of a few specific sprite slots, so you could avoid using those sprites if that timing works for you.
Edit: An alternative approach would be to turn rendering on at the top of the screen late, instead of off at the bottom early. You'll probably want to turn sprites on a scanline later than the background to give the PPU time to handle OAM evaluation. The downside of this approach is that the PPU will no longer skip a dot every 2 frames, which will change the pattern of artifacts in the video output over time (with 3 states instead of 2). Battletoads does this, if you want to see it in action.
Whatever you do, test on real hardware. Mesen's emulation of the rendering-disable bug seems pretty decent, but it's not complete and the bug isn't understood yet at the hardware level.