New technique for pushing video data faster

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.

Moderator: Moderators

Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

New technique for pushing video data faster

Post by Drakim »

(If this technique is already known, then excuse my tomfoolery :lol: )

So I was looking into popslide by tepples, and while reading though the code another technique occurred to me. It doesn't work with the NES Stripe Image format, but it should be quite a lot faster (and thus be able to push more VRAM data per vblank).

While popslide and similar techniques works like a sort of mini-interpreter, using an unused part of the stack as a blob of instructions and data, my technique drops the "interpreter" part in favor of using the "Reverse RTS trick", filling the stack with addresses to various video updater methods followed by their data.

So, it might look something like this:

Code: Select all

SetPalette1:
  ; Set vram address to palette 1
  LDA #$3F
  STA $2006
  PLA #$01
  STA $2006
  ; Pull out the palette values and give them to VRAM
  PLA
  STA $2007
  PLA
  STA $2007
  PLA
  STA $2007
  RTS

FillArbitraryData:
  ; Set arbitrary address
  PHA
  STA $2006
  PHA
  STA $2006
  ; Loop here filling $2007
  RTS

Terminator:
  ; Restore stack to it's normal self here
  RTS

Stack:
SetPalette1_Low
SetPalette1_High
Color1
Color2
Color3
FillArbitraryData_Low
FillArbitraryData_High
DataLength
Data1
Data2
Data3
Data4
Terminator_Low
Terminator_High


Each "RTS" call on those methods will magically chain into the next one (costing only 6 cycles!), until it hits the terminator method which takes us out of this loop. Unlike a mini-interpreter, things don't become slower the more alternatives we create, so we can create very fast highly specialized methods for setting specific kinds of VRAM (like the palette).

We can even use a macro to create an unrolled "LDA # -> STA" loop for the absolutely fastest bandwidth 1 byte per 6 cycles for static ROM data like text.

Furthermore, since we can jump to arbitrary addresses, we can create huge unrolled loops of "PLA -> STA", and then jump an arbitrary distance into that unrolled loop and use that as our starting point to push a certain amount of bytes, at no additional cost!

The possibilities are limitless. You can write methods that push data exactly to the dot, not even needing to push a length byte to the stack.

The cost of this technique:
Adjusting the stack and starting the process: 16 cycles (including initial JSR)
Static cost to start a segment: 6 cycles
Cost per segment: variable (but always lower than popslide due to specialization)
Cost to end the process: 15 cycles (including final RTS)
Last edited by Drakim on Tue Jan 23, 2018 6:10 am, edited 1 time in total.
User avatar
pubby
Posts: 583
Joined: Thu Mar 31, 2016 11:15 am

Re: New technique for pushing video data faster

Post by pubby »

This has been used before :wink: but it's still a really cool use of CPS and I don't think most people know about it.

I still prefer zeropage buffers for most updates though.
User avatar
Dwedit
Posts: 4922
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Re: New technique for pushing video data faster

Post by Dwedit »

Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: New technique for pushing video data faster

Post by tepples »

Dwedit wrote:Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.
Technically correct, in that you can empty the buffer one-third faster (6 cycles per byte compared to 8 cycles per byte). But with the stride of five bytes per immediate value and the update buffer straddling more than one 256-byte page, it might not be possible to fill the buffer quite as fast. In addition, devoting more of the 2048 bytes of unexpanded CPU RAM to such a buffer is likely to cause other parts of the program, such as game logic, to become slower as algorithmic space/time tradeoffs elsewhere in the program get tilted toward using more time. This is why the 30 fps version of "Bad Apple" requires WRAM expansion (the Kirby's Adventure board), whereas the 15 fps version does not (the Mega Man 4 and Mega Man 6 board).
User avatar
slembcke
Posts: 172
Joined: Fri Nov 24, 2017 2:40 pm
Location: Minnesota

Re: New technique for pushing video data faster

Post by slembcke »

That's pretty clever. I've been musing about similar stack abuses for tail call hacking. In hindsight now it seems perfectly clear, push the destination to the stack and return. Thanks. :D
User avatar
Bregalad
Posts: 8055
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Re: New technique for pushing video data faster

Post by Bregalad »

Why do you use those strange names instead of $2006 and $2007? Makes the code completely unreadable to me.
pubby wrote:This has been used before :wink:
PLA/STA chains, definitely. The RTS I'm not so sure whether it had been used before in this context.
I still prefer zeropage buffers for most updates though.
Me too, but in some (rare) cases it's not fast enough.
Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.
LDA #imm / STA $2006 is about the fasest you can go, the only faster possibility is to include clever use of X and Y register for 2 most common values and avoid a LDA here; also some clever usage of INX/INY/DEX/DEY/LSR/ASL/ROL/ROR will be the same speed but with less bytes. I think Tokumaru already made a thread about that. Since generating such code in RAM would be difficult, it was mostly intended at staying in ROM, even if this eats a lot of it.
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Bregalad wrote:Why do you use those strange names instead of $2006 and $2007? Makes the code completely unreadable to me.
Fixed!
Bregalad wrote:
pubby wrote:Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.
LDA #imm / STA $2006 is about the fasest you can go, the only faster possibility is to include clever use of X and Y register for 2 most common values and avoid a LDA here; also some clever usage of INX/INY/DEX/DEY/LSR/ASL/ROL/ROR will be the same speed but with less bytes. I think Tokumaru already made a thread about that. Since generating such code in RAM would be difficult, it was mostly intended at staying in ROM, even if this eats a lot of it.
With my RTS technique, you can point to RAM and execute it as a routine just like pubby prefers, or you can point to ROM to a series of tightly placed opcodes like Tokumaru's tricks, or you can put the values you want on the stack itself and point to a routine that pulls them out.

I got a lot of ROM space so I've been creating various specialized routines to do certain tasks as fast as physically possible. One of my routines is just straight up 1024 "STA $2007" in a row, I jump to the routine start plus an offset so that if I only want to write 50 bytes, it jumps to start+(1024-50)*3, and bam, 50 unrolled STA statements, all at the same low cost of an RTS statement like everything else.

Compared to having to do LDA $,X -> CMP $ -> BEQ or a jumptable for "custom" VRAM vblank routines it's both easier to implement, faster AND gives more options. Although RTS is more expensive than a relative branch, you'd have to make your video stripe uselessly primitive to not go above 6 cycles for the selection process (and then you'd still lose out at length due to lack of specialization routines).
Bregalad wrote:
pubby wrote:This has been used before :wink:
PLA/STA chains, definitely. The RTS I'm not so sure whether it had been used before in this context.
Does that mean I get to give the technique a fancy name? :o
User avatar
FrankenGraphics
Formerly WheelInventor
Posts: 2064
Joined: Thu Apr 14, 2016 2:55 am
Location: Gothenburg, Sweden
Contact:

Re: New technique for pushing video data faster

Post by FrankenGraphics »

Does that mean I get to give the technique a fancy name? :o
By all means, yes! Also makes future convos easier to have when you can name drop techniques.
On the other hand if you don't people are going to refer to it as "drakim slide" and thus you'd be immortalized, like how i'd refer to a certain controller/dpcm fix as "rahsennors' controller/dpcm fix". :wink:
Rahsennor
Posts: 479
Joined: Thu Aug 20, 2015 3:09 am

Re: New technique for pushing video data faster

Post by Rahsennor »

I used exactly this technique for an unreleased game called Hull Breach!. I got it straight off the wiki, and thought it was a well-known method. I called it "RTS chaining".

Then I decided assembler wasn't for me, and my compiler can't cope with this trick, so I haven't used it since.
User avatar
FrankenGraphics
Formerly WheelInventor
Posts: 2064
Joined: Thu Apr 14, 2016 2:55 am
Location: Gothenburg, Sweden
Contact:

Re: New technique for pushing video data faster

Post by FrankenGraphics »

an unreleased game called Hull Breach!.
You rang?
User avatar
pubby
Posts: 583
Joined: Thu Mar 31, 2016 11:15 am

Re: New technique for pushing video data faster

Post by pubby »

I've used it before too.

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Rahsennor wrote:I used exactly this technique for an unreleased game called Hull Breach!. I got it straight off the wiki, and thought it was a well-known method. I called it "RTS chaining".

Then I decided assembler wasn't for me, and my compiler can't cope with this trick, so I haven't used it since.
Are you sure you got it off the wiki? I can't find a single instance of the phrase "RTS chaining" or even "chaining" on the wiki. Heck, I can't even find "RTS chaining" related to programming anywhere on the web. Are you sure that's the proper name for the technique, and that you aren't just mixing it up with the "RTS Trick"? (or is my google-fu too weak?)

Edit: Oh wait, I read that wrong, you called it RTS chaining. Any idea what it was called in the wiki?
pubby wrote:I've used it before too.

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.
I'm not sure I understand, how can you use JSR to push addresses onto the stack without actually jumping there at the same time? If this worked, shouldn't the vanilla "RTS trick" be using this as well?

Edit: Ah, I think I may have found something similar being discussed before:

https://forums.nesdev.com/viewtopic.php?f=2&t=13285

I'm unsure why this technique hasn't caught on more, it seems like the strictly superior way, unless you really need a full 256 sized stack for gameplay.
User avatar
Bregalad
Posts: 8055
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Re: New technique for pushing video data faster

Post by Bregalad »

Drakim wrote: I'm unsure why this technique hasn't caught on more, it seems like the strictly superior way, unless you really need a full 256 sized stack for gameplay.
Well this technique is very great but it's neither the easiest (using "normal" buffer with "normal" code is easier to handle, but slower) nor the fastest (creating raw chains of lda #imm - sta $2007 is faster), so people coding games where standard VRAM bandwith is enough will do it the standard way, and people doing fancy stuff pushing the system to its extreme limits will use the fastest way. For this technique to be useful, you need significantly extra VRAM bandwidth, but still not enough so that extreme technique is not worth it.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: New technique for pushing video data faster

Post by Oziphantom »

Drakim wrote:Compared to having to do LDA $,X -> CMP $ -> BEQ or a jumptable for "custom" VRAM vblank routines it's both easier to implement, faster AND gives more options. Although RTS is more expensive than a relative branch, you'd have to make your video stripe uselessly primitive to not go above 6 cycles for the selection process (and then you'd still lose out at length due to lack of specialization routines).
Nah

Put your On X table at the bottom of the Stack so $100 + then you do a 1 byte stack dispatch

get X to have value x2
txs
rts

If all your code runs in 1 frame then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI. If you go over a frame and interrupt yourself X_X

The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it not very practical, and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.
Drakim
Posts: 97
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Bregalad wrote:Well this technique is very great but it's neither the easiest (using "normal" buffer with "normal" code is easier to handle, but slower)
Is it? Hmmm, I mean, I concede the point that just setting up a "normal dumb" buffer is way easier, and it's probably what you should do when you are learning what's going on.

I was more thinking about popslide's way of using the bottom-stack as a faster buffer (which seems like the de facto way of doing it, to the point where people are comparing alternative implementations), which involves the exact same stack switching techniques as I use. But where popslide (and any similar interpreter) has to write a tiny mini-parser for the stack data, consuming bytes and acting on it, my code is (after the stack switch) just an "RTS" statement.

It jumps you right into a subroutine such as Set_BackgroundColor or Set_Palette, it's way easier and more straightforward at that point, in my opinion. Adding another subroutine is super trivial, you don't have to dig into popslide's source code to carefully figure out where to insert your new functionality. So I'd personally grade it easier and faster.
nor the fastest (creating raw chains of lda #imm - sta $2007 is faster), so people coding games where standard VRAM bandwith is enough will do it the standard way, and people doing fancy stuff pushing the system to its extreme limits will use the fastest way.
Which what you are saying is true in the strict technical sense, I'm not sure I agree with your overall point.

While I do think that if you are writing some psedo-3d tech demo that needs to upload to vram really fast in a specific way, then nothing can beat your handcrafted code in speed.

But if you are making an actual real game, which sometimes has a screen full of text, sometimes scrolls horizontally, sometimes vertically, sometimes cycles colors for shiny treasures, sometimes prints one and one letter in dialog, I'm not so certain you could beat the RTS technique. Simply because the RTS technique works like a reverse jumptable, and the overhead setup for stack switching is only 16 cycles. And no matter how crafty you are, your game will need some means of selectively pushing vram since you can't do everything every frame. And whatever you pick, I think the RTS technique would beat it, but maybe I'm being overly optimistic!
Post Reply