PostPosted: Mon Jan 22, 2018 6:38 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
(If this technique is already known, then excuse my tomfoolery :lol: )

So I was looking into popslide by tepples, and while reading through the code another technique occurred to me. It doesn't work with the NES Stripe Image format, but it should be quite a lot faster (and thus able to push more VRAM data per vblank).

While popslide and similar techniques work like a sort of mini-interpreter, using an unused part of the stack as a blob of instructions and data, my technique drops the "interpreter" part in favor of the "Reverse RTS trick", filling the stack with the addresses of various video updater methods followed by their data.

So, it might look something like this:
Code:
SetPalette1:
  ; Set vram address to palette 1
  LDA #$3F
  STA $2006
  LDA #$01
  STA $2006
  ; Pull out the palette values and give them to VRAM
  PLA
  STA $2007
  PLA
  STA $2007
  PLA
  STA $2007
  RTS

FillArbitraryData:
  ; Pull the target VRAM address off the stack and set it
  PLA
  STA $2006
  PLA
  STA $2006
  ; Pull the length, then loop pulling data bytes into $2007
  PLA
  TAX
FillLoop:
  PLA
  STA $2007
  DEX
  BNE FillLoop
  RTS

Terminator:
  ; Restore the stack to its normal self here
  RTS


Stack:
SetPalette1_Low
SetPalette1_High
Color1
Color2
Color3
FillArbitraryData_Low
FillArbitraryData_High
TargetAddress_High
TargetAddress_Low
DataLength
Data1
Data2
Data3
Data4
Terminator_Low
Terminator_High


Each "RTS" call on those methods will magically chain into the next one (costing only 6 cycles!), until it hits the terminator method which takes us out of this loop. Unlike a mini-interpreter, things don't become slower the more alternatives we create, so we can create very fast highly specialized methods for setting specific kinds of VRAM (like the palette).

We can even use a macro to create an unrolled "LDA #imm -> STA $2007" sequence for the absolute fastest bandwidth, 1 byte per 6 cycles, for static ROM data like text.
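
For illustration, here's a minimal sketch of such a macro in ca65 syntax (the macro and label names are made up, and it assumes the character codes map directly to tile numbers):
Code:
; Hypothetical ca65 macro: unrolls a string into LDA #imm / STA $2007
; pairs, 6 cycles per byte written.
.macro TEXT_SLIDE str
  .repeat .strlen(str), i
    LDA #.strat(str, i)   ; immediate value baked in at assembly time
    STA $2007
  .endrepeat
  RTS                     ; chains into the next routine on the stack
.endmacro

PrintHello:
  TEXT_SLIDE "HELLO"      ; 5 bytes -> 30 cycles of $2007 writes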

Furthermore, since we can jump to arbitrary addresses, we can create huge unrolled runs of "PLA -> STA", and then jump an arbitrary distance into that unrolled run, using that as our starting point to push a certain number of bytes at no additional cost!
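
A sketch of that idea (ca65-style .repeat assumed, and the labels are made up); each PLA / STA $2007 pair is 4 bytes, so the entry point for N bytes is the end of the run minus 4*N, minus the usual 1 for RTS:
Code:
; Hypothetical unrolled PLA -> STA run, 8 cycles per byte.
PullSlide:
  .repeat 64
    PLA
    STA $2007
  .endrepeat
PullSlide_End:
  RTS

; To transfer N bytes (N <= 64), push the address
; PullSlide_End - 4*N - 1 onto the chain, followed by the N data bytes
; and then the address of the next routine.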

The possibilities are limitless. You can write methods that push data exactly to the dot, not even needing to push a length byte to the stack.

The cost of this technique (a sketch of the start/terminator glue follows the list):
Adjusting the stack and starting the process: 16 cycles (including initial JSR)
Static cost to start a segment: 6 cycles
Cost per segment: variable (but always lower than popslide due to specialization)
Cost to end the process: 15 cycles (including final RTS)
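
To make the start cost concrete, here is a minimal sketch of the glue; the labels, the chain location, and the assumption that the stack pointer sits at a known value ($FF here) when the chain is kicked off are mine:
Code:
CHAIN = $0140        ; hypothetical spot for the chain data in the stack page

StartChain:          ; reached via JSR StartChain = 6 cycles
  LDX #<(CHAIN-1)    ; 2: point SP one byte below the first pushed address
  TXS                ; 2
  RTS                ; 6: "returns" into the first chained routine (16 total)

Terminator:
  LDX #$FD           ; assumed: SP was $FF before the JSR, minus the two
  TXS                ; bytes of the return address it pushed
  RTS                ; returns to whoever did JSR StartChain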


Last edited by Drakim on Tue Jan 23, 2018 6:10 am, edited 1 time in total.

PostPosted: Mon Jan 22, 2018 1:41 pm

Joined: Thu Mar 31, 2016 11:15 am
Posts: 308
This has been used before :wink: but it's still a really cool use of CPS (continuation-passing style) and I don't think most people know about it.

I still prefer zeropage buffers for most updates though.


PostPosted: Mon Jan 22, 2018 2:11 pm

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 4050
Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


PostPosted: Mon Jan 22, 2018 7:51 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20160
Location: NE Indiana, USA (NTSC)
Dwedit wrote:
Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

Technically correct, in that you can empty the buffer one-third faster (6 cycles per byte compared to 8 cycles per byte). But with the stride of five bytes per immediate value and the update buffer straddling more than one 256-byte page, it might not be possible to fill the buffer quite as fast. In addition, devoting more of the 2048 bytes of unexpanded CPU RAM to such a buffer is likely to cause other parts of the program, such as game logic, to become slower as algorithmic space/time tradeoffs elsewhere in the program get tilted toward using more time. This is why the 30 fps version of "Bad Apple" requires WRAM expansion (the Kirby's Adventure board), whereas the 15 fps version does not (the Mega Man 4 and Mega Man 6 board).
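
To illustrate the stride issue, here is a hypothetical sketch of refilling the immediate operands of such an in-RAM slide (all names and addresses are made up). Each LDA #imm / STA $2007 pair is 5 bytes, so the operands sit 5 bytes apart, and an 8-bit index covers only about 50 such pairs before a new base address is needed:
Code:
COUNT     = 32        ; hypothetical number of bytes queued this frame
src       = $0300     ; hypothetical source buffer
ram_slide = $0400     ; hypothetical in-RAM LDA #imm / STA $2007 run

  LDY #0              ; index into the source buffer
  LDX #0              ; offset of the current immediate operand
fill_loop:
  LDA src,Y
  STA ram_slide+1,X   ; patch the operand of the Nth LDA #imm
  INY
  TXA                 ; step X by 5 to reach the next operand
  CLC
  ADC #5
  TAX
  CPY #COUNT
  BNE fill_loop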


PostPosted: Mon Jan 22, 2018 11:01 pm

Joined: Fri Nov 24, 2017 2:40 pm
Posts: 31
That's pretty clever. I've been musing about similar stack abuses for tail call hacking. In hindsight it now seems perfectly clear: push the destination to the stack and return. Thanks. :D


PostPosted: Tue Jan 23, 2018 3:21 am

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7436
Location: Chexbres, VD, Switzerland
Why do you use those strange names instead of $2006 and $2007? Makes the code completely unreadable to me.

pubby wrote:
This has been used before :wink:

PLA/STA chains, definitely. The RTS part, I'm not so sure whether it has been used before in this context.

Quote:
I still prefer zeropage buffers for most updates though.

Me too, but in some (rare) cases it's not fast enough.

Quote:
Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

LDA #imm / STA $2007 is about the fastest you can go; the only faster possibility is clever use of the X and Y registers for the 2 most common values, avoiding the LDA there. Some clever usage of INX/INY/DEX/DEY/LSR/ASL/ROL/ROR will also be the same speed but with fewer bytes. I think Tokumaru already made a thread about that. Since generating such code in RAM would be difficult, it was mostly intended to stay in ROM, even if this eats a lot of it.
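
(For illustration, a rough sketch of the X/Y idea; the tile values are made up:)
Code:
  ; Keep the two most common bytes in X and Y: STX/STY $2007 is
  ; 4 cycles and 3 bytes, versus 6 cycles and 5 bytes for LDA #imm/STA.
  LDX #$24            ; most common tile
  LDY #$62            ; second most common tile
  STX $2007
  STX $2007
  STY $2007
  LDA #$3A            ; a one-off value still costs the full LDA #imm
  STA $2007
  INX                 ; INX + STX is still 6 cycles but only 4 bytes,
  STX $2007           ; handy when the next byte is "previous value + 1"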


PostPosted: Tue Jan 23, 2018 6:47 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Bregalad wrote:
Why do you use those strange names instead of $2006 and $2007? Makes the code completely unreadable to me.


Fixed!

Bregalad wrote:
pubby wrote:
Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

LDA #imm / STA $2007 is about the fastest you can go; the only faster possibility is clever use of the X and Y registers for the 2 most common values, avoiding the LDA there. Some clever usage of INX/INY/DEX/DEY/LSR/ASL/ROL/ROR will also be the same speed but with fewer bytes. I think Tokumaru already made a thread about that. Since generating such code in RAM would be difficult, it was mostly intended to stay in ROM, even if this eats a lot of it.


With my RTS technique, you can point to RAM and execute it as a routine just like pubby prefers, or you can point into ROM at a series of tightly packed opcodes like Tokumaru's tricks, or you can put the values you want on the stack itself and point to a routine that pulls them out.

I've got a lot of ROM space, so I've been creating various specialized routines to do certain tasks as fast as physically possible. One of my routines is just straight up 1024 "STA $2007" in a row; I jump to the routine start plus an offset, so that if I only want to write 50 bytes, it jumps to start+(1024-50)*3, and bam, 50 unrolled STA instructions, all at the same low cost of an RTS like everything else.
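
(Roughly like this, as a sketch; the ca65-style .repeat and the labels are my own shorthand, and the pushed address gets the usual minus-one adjustment for RTS:)
Code:
; A is assumed to hold the fill value when the chain lands here.
FillSlide:
  .repeat 1024
    STA $2007         ; 3 bytes, 4 cycles each
  .endrepeat
FillSlide_End:
  RTS

; To write N copies of A, push FillSlide_End - 3*N - 1 onto the chain,
; which is the same as start + (1024-N)*3, minus one for RTS.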

Compared to having to do an LDA abs,X -> CMP -> BEQ chain or a jump table to select "custom" VRAM vblank routines, it's easier to implement, faster, AND gives more options. Although RTS is more expensive than a relative branch, you'd have to make your video stripe format uselessly primitive for its selection process to stay under 6 cycles (and then you'd still lose out in the end due to the lack of specialized routines).

Bregalad wrote:
pubby wrote:
This has been used before :wink:

PLA/STA chains, definitely. The RTS I'm not so sure whether it had been used before in this context.


Does that mean I get to give the technique a fancy name? :o


PostPosted: Tue Jan 23, 2018 6:53 am
Formerly WheelInventor

Joined: Thu Apr 14, 2016 2:55 am
Posts: 1606
Location: Gothenburg, Sweden
Quote:
Does that mean I get to give the technique a fancy name? :o


By all means, yes! It also makes future conversations easier when you can name-drop techniques.
On the other hand, if you don't, people are going to refer to it as the "Drakim slide" and thus you'd be immortalized, like how I'd refer to a certain controller/DPCM fix as "Rahsennor's controller/DPCM fix". :wink:

_________________
http://www.frankengraphics.com - personal NES blog


PostPosted: Tue Jan 23, 2018 6:02 pm

Joined: Thu Aug 20, 2015 3:09 am
Posts: 371
I used exactly this technique for an unreleased game called Hull Breach!. I got it straight off the wiki, and thought it was a well-known method. I called it "RTS chaining".

Then I decided assembler wasn't for me, and my compiler can't cope with this trick, so I haven't used it since.


PostPosted: Tue Jan 23, 2018 6:20 pm
Formerly WheelInventor

Joined: Thu Apr 14, 2016 2:55 am
Posts: 1606
Location: Gothenburg, Sweden
Quote:
an unreleased game called Hull Breach!.


You rang?

_________________
http://www.frankengraphics.com - personal NES blog


PostPosted: Tue Jan 23, 2018 10:38 pm

Joined: Thu Mar 31, 2016 11:15 am
Posts: 308
I've used it before too.

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.


PostPosted: Wed Jan 24, 2018 1:26 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Rahsennor wrote:
I used exactly this technique for an unreleased game called Hull Breach!. I got it straight off the wiki, and thought it was a well-known method. I called it "RTS chaining".

Then I decided assembler wasn't for me, and my compiler can't cope with this trick, so I haven't used it since.


Are you sure you got it off the wiki? I can't find a single instance of the phrase "RTS chaining" or even "chaining" on the wiki. Heck, I can't even find "RTS chaining" related to programming anywhere on the web. Are you sure that's the proper name for the technique, and that you aren't just mixing it up with the "RTS Trick"? (or is my google-fu too weak?)

Edit: Oh wait, I read that wrong, you called it RTS chaining. Any idea what it was called in the wiki?

pubby wrote:
I've used it before too.

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.


I'm not sure I understand, how can you use JSR to push addresses onto the stack without actually jumping there at the same time? If this worked, shouldn't the vanilla "RTS trick" be using this as well?

Edit: Ah, I think I may have found something similar being discussed before:

https://forums.nesdev.com/viewtopic.php?f=2&t=13285

I'm unsure why this technique hasn't caught on more; it seems like the strictly superior way, unless you really need a full 256-byte stack for gameplay.


PostPosted: Wed Jan 24, 2018 3:26 am

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7436
Location: Chexbres, VD, Switzerland
Drakim wrote:
I'm unsure why this technique hasn't caught on more, it seems like the strictly superior way, unless you really need a full 256 sized stack for gameplay.

Well, this technique is great, but it's neither the easiest (using a "normal" buffer with "normal" code is easier to handle, but slower) nor the fastest (creating raw chains of lda #imm - sta $2007 is faster), so people coding games where standard VRAM bandwidth is enough will do it the standard way, and people doing fancy stuff pushing the system to its extreme limits will use the fastest way. For this technique to be useful, you need significantly more VRAM bandwidth than the standard way gives, but not so much that the extreme technique becomes worth it.


PostPosted: Wed Jan 24, 2018 3:39 am

Joined: Tue Feb 07, 2017 2:03 am
Posts: 447
Drakim wrote:
Compared to having to do an LDA abs,X -> CMP -> BEQ chain or a jump table to select "custom" VRAM vblank routines, it's easier to implement, faster, AND gives more options. Although RTS is more expensive than a relative branch, you'd have to make your video stripe format uselessly primitive for its selection process to stay under 6 cycles (and then you'd still lose out in the end due to the lack of specialized routines).


Nah.

Put your "ON X"-style dispatch table at the bottom of the stack page (so $0100+), then you can do a 1-byte stack dispatch (sketched below):

get X to hold the entry index times 2
txs
rts

If all your code runs in 1 frame, then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI. If you go over a frame and interrupt yourself... X_X

The other catch with RTS chaining is that you can't use the stack. That's not really a problem if you are just doing speed code from an NMI, but as a general system it's not very practical, and a self-modifying JSR chain (+6 cycles per call) gives almost as much speed without the limitations.
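
(A sketch of that dispatch, with made-up labels; each table entry is a handler address minus one, as RTS requires:)
Code:
Table = $0100          ; word entries at the bottom of the stack page:
                       ; handler0-1, handler1-1, handler2-1, ...
index = $10            ; hypothetical zeropage variable, 0..n

Dispatch:
  LDA index
  ASL A                ; two bytes per table entry
  TAX
  DEX                  ; SP must point one byte below the entry's low byte
  TXS
  RTS                  ; pulls the handler address and runs it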


PostPosted: Wed Jan 24, 2018 8:12 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Bregalad wrote:
Well, this technique is great, but it's neither the easiest (using a "normal" buffer with "normal" code is easier to handle, but slower)


Is it? Hmmm, I mean, I concede the point that just setting up a "normal dumb" buffer is way easier, and it's probably what you should do when you are learning what's going on.

I was thinking more about popslide's way of using the bottom of the stack as a faster buffer (which seems like the de facto way of doing it, to the point where people are comparing alternative implementations), which involves the exact same stack-switching technique I use. But where popslide (and any similar interpreter) has to run a tiny mini-parser over the stack data, consuming bytes and acting on them, my code is (after the stack switch) just an "RTS" instruction.

It jumps you right into a subroutine such as Set_BackgroundColor or Set_Palette, which is way easier and more straightforward at that point, in my opinion. Adding another subroutine is trivial; you don't have to dig into popslide's source code to carefully figure out where to insert your new functionality. So I'd personally grade it both easier and faster.

Quote:
nor the fastest (creating raw chains of lda #imm - sta $2007 is faster), so people coding games where standard VRAM bandwidth is enough will do it the standard way, and people doing fancy stuff pushing the system to its extreme limits will use the fastest way.


While what you are saying is true in the strict technical sense, I'm not sure I agree with your overall point.

I do think that if you are writing some pseudo-3D tech demo that needs to upload to VRAM really fast in a specific way, then nothing can beat handcrafted code for speed.

But if you are making an actual game, which sometimes has a screen full of text, sometimes scrolls horizontally, sometimes vertically, sometimes cycles colors for shiny treasures, sometimes prints dialog one letter at a time, I'm not so certain you could beat the RTS technique, simply because it works like a reverse jump table and the setup overhead for stack switching is only 16 cycles. No matter how crafty you are, your game will need some means of selectively pushing VRAM updates, since you can't do everything every frame. And whatever scheme you pick, I think the RTS technique would beat it, but maybe I'm being overly optimistic!

