PostPosted: Wed Jan 24, 2018 8:44 am 

Joined: Fri Nov 24, 2017 2:40 pm
Posts: 28
RTS Chaining: I think they meant this:
http://wiki.nesdev.com/w/index.php/RTS_Trick

I was experimenting with this stuff yesterday and had pretty good luck. Other than it being non-obvious, it's pretty easy to do. Basically you just write a 2-byte command token instead of just one. You can easily avoid interfering with the stack if you write to the very bottom. My terminator function restores the stack, so to execute the buffer I just write the terminator func and the current stack pointer, then reset the stack pointer to $FF and return. Voila! I like it.
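
Something like this, in sketch form (buf_idx and the labels are just placeholders, not my exact code):

Code:
run_buffer:
   ldy buf_idx          ; next free slot in the buffer (made-up zero-page index)
   lda #<(terminator-1)
   sta $0100,y          ; append the terminator token (low byte first, as RTS pulls it)
   lda #>(terminator-1)
   sta $0101,y
   tsx
   txa
   sta $0102,y          ; current (real) stack pointer, for the terminator to restore
   ldx #$FF
   txs                  ; pulls now wrap around to $0100, the bottom of the page
   rts                  ; "return" into the first 2-byte token at $0100/$0101

terminator:
   pla                  ; the saved stack pointer queued right after this token
   tax
   txs                  ; real stack restored
   rts                  ; back to whatever called run_buffer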

pubby wrote:
One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.


Say what? How does that work? Won't you end up jumping to the address you push? Not sure I understand.


PostPosted: Wed Jan 24, 2018 9:27 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
slembcke wrote:
I was experimenting with this stuff yesterday and had pretty good luck. Other than it being non-obvious, it's pretty easy to do. Basically you just write a 2-byte command token instead of just one. You can easily avoid interfering with the stack if you write to the very bottom. My terminator function restores the stack, so to execute the buffer I just write the terminator func and the current stack pointer, then reset the stack pointer to $FF and return. Voila! I like it.


Yup, that's the beauty of it. If you add more stuff later in the frame you simply overwrite the terminator function and write a new terminator function at the new offset in the stack.

I've decided to use 127 bytes for my video stack so that I can use the BMI instruction to quickly check if the video stack pointer has become too large each time I try to push in more.
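
Roughly how I picture the check (simplified sketch; vid_sp and check_room are made-up names):

Code:
check_room:
   lda vid_sp       ; zero-page index into the $0100 page
   clc
   adc #4           ; however many bytes the next entry needs
   bmi no_room      ; result went past the 127-byte mark: video stack is full
   clc              ; carry clear = ok to push
   rts
no_room:
   sec              ; carry set = skip or flush this update
   rts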

I've also been experimenting with making the terminator function's address symmetrical to speed things up further (something like $C0C0), since then you can just mindlessly fill the bottom of the stack in advance (with an unrolled loop) and not have to bother appending it after having written your 2-byte command token. It speeds up high-video-update situations but slows down low-video-update situations, which is generally a good tradeoff.
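
In sketch form (assuming the terminator routine is placed so both bytes of its RTS target are the same, e.g. the routine starts at $C0C1 so the value on the stack is $C0C0):

Code:
   lda #$C0        ; low and high byte of (terminator-1) are identical
   sta $0100       ; unrolled fill of the bottom of the stack page
   sta $0101
   sta $0102
   sta $0103
   sta $0104
   sta $0105
   sta $0106
   sta $0107       ; ...and so on for as much of the page as you ever use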

Another cool trick I've mentioned earlier is that you can jump an arbitrary length into any of your subroutines as a starting point, which can be used in very neat ways.

slembcke wrote:
pubby wrote:
One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.


Say what? How does that work? Won't you end up jumping to the address you push? Not sure I understand.


Yeah, I've been twisting my head trying to figure out how it would work, but I just can't see it. Does he somehow exit out of the routines with a JMP? But even if he did that it would be the wrong part of the stack that's filled up. :|


PostPosted: Wed Jan 24, 2018 9:36 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Oziphantom wrote:
Nah

Put your On X table at the bottom of the Stack so $100 + then you do a 1 byte stack dispatch

get X to have value x2
txs
rts

If all your code runs in 1 frame then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI.


You are going to have to explain that again, I'm not able to follow. :oops:

Quote:
If you go over a frame and interrupt yourself X_X

Now that's a pretty good point. I guess I should either make really sure that no interrupt can happen while my video routine does its work, or that interrupts are disabled while it runs.

Quote:
The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it not very practical

That's not completely true; if you leave yourself some room you can still use the faux video stack as a regular stack as long as you are careful. JSR, RTS, PHA and PLA will work just as expected as long as you don't push the stack past the $00/$FF wraparound point.

Quote:
and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.

It increases the static cost of each video "segment" from 6 to 12 cycles, and you lose the ability to cram in extra values for each video subroutine to pull out with PLA, and each "jump entry" in your queue takes 4 bytes of RAM instead of 2.

I'm not so sure I agree it's almost as good; it seems like an inferior technique (but with the added bonus of being able to place it anywhere in RAM, or even have ROM versions for cutscenes and the like where things are exactly the same).


PostPosted: Wed Jan 24, 2018 10:41 am 

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7375
Location: Chexbres, VD, Switzerland
Drakim wrote:
While what you are saying is true in the strict technical sense, I'm not sure I agree with your overall point.

Well, there's nothing to agree or disagree upon. I was just trying to explain why using the stack as a VRAM buffer is not so common even though we all agree it's a cool trick. My point was that while this trick is indeed very cool, it's neither the fastest nor the easiest. To which you respond, "but this trick is so cool" - I never said it wasn't!

Quote:
But if you are making an actual real game, which sometimes has a screen full of text, sometimes scrolls horizontally, sometimes vertically, sometimes cycles colors for shiny treasures, sometimes prints dialogue one letter at a time, I'm not so certain you could beat the RTS technique.

Doing it with normal buffers without any optimisation allows a bandwidth of approx. 80 bytes + sprite DMA, which is enough to update one line of nametable plus one line of attribute table plus palettes, or two lines of nametable plus attribute table but without palettes. With some partially unrolled loops or zero-page usage, it's easily possible to update two lines + attribute table + palettes + sprite DMA, with only light optimization. So the only application where further optimization is needed is CHR-RAM updates, or maybe updating a really large NT area in a single frame.
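
(By "normal buffers without any optimisation" I mean a plain copy loop along these lines - a sketch only, roughly 16 cycles per byte plus the address setup for each run:)

Code:
   ldx #0
copy_loop:
   lda vram_buffer,x   ; vram_buffer / buffer_len are made-up names
   sta $2007
   inx
   cpx buffer_len
   bne copy_loop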


PostPosted: Wed Jan 24, 2018 12:02 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19643
Location: NE Indiana, USA (NTSC)
Drakim wrote:
Adding another subroutine is super trivial, you don't have to dig into popslide's source code to carefully figure out where to insert your new functionality. So I'd personally grade it easier and faster.

Popslide uses the same packet format as Super Mario Bros., which is described on both Data Crystal and NESdev Wiki. There are four packet types (horizontal literal, horizontal run, vertical literal, and vertical run). It's designed to be generic, where you can make your own subroutine that adds packets in a particular shape to the list without having to touch the updater code. The stuff in nstripe.s is mostly convenience for forming packets based on shapes that arose during the development of The Curse of Possum Hollow.

Having only one interpreter also simplifies the NMI handler, which is important to reduce the size of code in the fixed bank or the pseudo-fixed repeating area in a game using 32K switching.


PostPosted: Wed Jan 24, 2018 4:36 pm 

Joined: Thu Mar 31, 2016 11:15 am
Posts: 275
Drakim wrote:
Yeah, I've been twisting my head trying to figure out how it would work, but I just can't see it. Does he somehow exit out of the routines with a JMP? But even if he did that it would be the wrong part of the stack that's filled up. :|


It's just a jmp to a jsr which jumps back, and is only useful in a few very specific situations. I did use it in my music engine though.

Code:
    jmp foo
continue:

; ...

foo:
    jsr continue
    ; some code
    rts


PostPosted: Wed Jan 24, 2018 11:03 pm 

Joined: Thu Aug 20, 2015 3:09 am
Posts: 325
FrankenGraphics wrote:
You rang?

Sorry...

Drakim wrote:
Are you sure you got it off the wiki? I can't find a single instance of the phrase "RTS chaining" or even "chaining" on the wiki. Heck, I can't even find "RTS chaining" related to programming anywhere on the web. Are you sure that's the proper name for the technique, and that you aren't just mixing it up with the "RTS Trick"? (or is my google-fu too weak?)

Edit: Oh wait, I read that wrong, you called it RTS chaining. Any idea what it was called in the wiki?

I can't find it on the wiki now either. My devlog notes the technique on 2016-05-18 as "RTS-display-list-on-stack" and then never mentions it again, but it's definitely right there in the source:

Code:
.org $C000

bank .byte 0,1,2,3,4,5,6,7 ; UNROM conflict-avoidance table

irq:
reset:
   lda #0
   sta $2000  ; disable NMI
   sta $2001  ; disable rendering
   sta bank+0 ; switch in bank 0
   jmp init


.dsb $FF00 - *

; copy 0-32 bytes to the PPU
nmi_copy:
   nop ; offset for RTS (I'm lazy)
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20 ; PLA/STA x32
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20 ; 4 bytes each
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
   pla
   sta $2006
   pla
   sta $2006
   rts

; write 32 zeroes to the PPU
nmi_zero:
   nop
   ldy #8
   lda #0
 nmi_zero1:
   sta $2007
   sta $2007
   sta $2007
   sta $2007
   dey
   bne nmi_zero1
   pla
   sta $2006
   pla
   sta $2006
   rts

; close the RTS chain
nmi_end:
   nop

   lda scroll_x
   sta $2005
   lda scroll_y
   sta $2005
   lda #$88     ; TODO: nametable select
   sta $2000

   ldx nmi_s
   txs
   ldy nmi_y
   ldx nmi_x
   lda nmi_a
nmi_abort:
   rti

nmi:
   lsr nmi_flag ; shift the "update ready" flag into carry (and clear it)
   bcc nmi_abort ; nothing queued: leave with a bare RTI
   sta nmi_a    ; save registers
   stx nmi_x
   sty nmi_y
   tsx
   stx nmi_s    ; save the main thread's stack pointer
   lda #>oam
   sta $4014    ; sprite OAM DMA
   ldx #$FF
   txs          ; wrap the stack pointer so pulls start at $0100
   pla
   sta $2006    ; first PPU address: high byte, then low
   pla
   sta $2006
   rts          ; RTS into the first routine of the chain

.dsb $FFFA - *       ; pad to interrupt vectors

.word nmi
.word reset
.word irq

I also used this method on PC a fair bit, back when I was doing assembler coding for DOS and so on, so maybe I mentally inserted it into the "RTS Trick" page as a matter of course?

EDIT: I also saw Tokumaru's thread the day it was posted, so I may have got it from there. My treatment of $2006 is identical so it's likely that was it.


PostPosted: Thu Jan 25, 2018 4:07 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Bregalad wrote:
Well, there's nothing to agree or disagree upon. I was just trying to explain why using the stack as a VRAM buffer is not so common even though we all agree it's a cool trick. My point was that while this trick is indeed very cool, it's neither the fastest nor the easiest. To which you respond, "but this trick is so cool" - I never said it wasn't!


Ah, my bad, sorry!

Quote:
Doing it with normal buffers without any optimisation allows a bandwidth of approx. 80 bytes + sprite DMA, which is enough to update one line of nametable plus one line of attribute table plus palettes, or two lines of nametable plus attribute table but without palettes. With some partially unrolled loops or zero-page usage, it's easily possible to update two lines + attribute table + palettes + sprite DMA, with only light optimization. So the only application where further optimization is needed is CHR-RAM updates, or maybe updating a really large NT area in a single frame.


Thanks, that's a pretty good point. It didn't occur to me that a game's scope might allow it to push everything to VRAM on every frame without hitting the ceiling; my mental idea of how much you could push was much, much lower.

pubby wrote:
It's just a jmp to a jsr which jumps back, and is only useful in a few very specific situations. I did use it in my music engine though.

Ah, I see. That does indeed shave off 1 cycle (10 vs 9).

If you REALLY wanna go crazy on the savings, you can do LDA, PHA, PHA if the subroutine is placed at an address where its high and low bytes are identical; then you have it down to 8 cycles. :D
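
That is (sketch; assumes the routine starts at, say, $C0C1 so the value RTS needs on the stack is $C0C0):

Code:
   lda #$C0
   pha             ; high byte of (routine-1)
   pha             ; low byte is the same, so push it again: 2+3+3 = 8 cycles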


PostPosted: Thu Jan 25, 2018 9:23 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10288
Location: Rio de Janeiro - Brazil
I definitely posted a method exactly like this a while back.


PostPosted: Thu Jan 25, 2018 9:14 pm 

Joined: Tue Feb 07, 2017 2:03 am
Posts: 279
Drakim wrote:
Oziphantom wrote:
Nah

Put your On X table at the bottom of the Stack so $100 + then you do a 1 byte stack dispatch

get X to have value x2
txs
rts

If all your code runs in 1 frame then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI.


You are going to have to explain that again, I'm not able to follow. :oops:

It's a one-byte dispatch method, i.e. your $100 area looks like this:
$100 <MyFunc-1
$101 >MyFunc-1
$102 <OtherFunc-1
$103 >OtherFunc-1

then if you want to dispatch OtherFunc you do
LDX #2
TXS
RTS
However, normally you would not load X with an immediate value; you'd load it from some "on X" variable.
Quote:
Quote:
The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it not very practical

That's not completely true; if you leave yourself some room you can still use the faux video stack as a regular stack as long as you are careful. JSR, RTS, PHA and PLA will work just as expected as long as you don't push the stack past the $00/$FF wraparound point.

Quote:
and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.

It increases the static cost of each video "segment" from 6 to 12 cycles, and you lose the ability to cram in extra values for each video subroutine to pull out with PLA, and each "jump entry" in your queue takes 4 bytes of RAM instead of 2.

I'm not so sure I agree it's almost as good; it seems like an inferior technique (but with the added bonus of being able to place it anywhere in RAM, or even have ROM versions for cutscenes and the like where things are exactly the same).
I was imagining that you would have a mostly fixed "display list", since things move but don't get added or removed from the screen that often. Thus you save time by not regenerating the whole list each frame. If you are regenerating it each frame then sure, any other stack use will trash stuff, but you don't care as you are going to rebuild it before the next NMI. However, if you FRAME out and don't get all the stack rebuilt before the NMI and some of your other systems have "trashed" the display list... X_X


PostPosted: Fri Jan 26, 2018 3:54 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
tokumaru wrote:
I definitely posted a method exactly like this a while back.


Indeed you did, so I'm calling it tokumaru's video stack from now on, if you don't mind :D

In your thread you asked for improvements, and there are a couple:

1. Your stack format is 2 bytes for the desired vram address, then 2 bytes for the desired subroutine, chained as many times as you need, and then ending with 2 bytes for a fake vram address $0000 and 2 bytes for the terminator function that restores the stack.

You can improve this by putting the vram address AFTER the desired subroutine address in your stack, and letting the subroutine itself PLA out the vram address and set it. Why? Because we can create hundreds of different subroutines if we desire, to do specialized tasks. For instance, we can make one to change the background fill color in the PPU, which is always $3F00, so it's better to write LDA #$3F, LDA #$00 than to use PLA, PLA. Not only is it faster cyclewise during the vblank, but it uses up less of the faux stack.

But even more important than that....

2. To loop back and continue to the next subroutine, you use "JMP UpdateVRAM -> PLAx2 vram address -> RTS", but now that the vram address isn't in the way you can drop all that and just use a normal "RTS" at the end of each subroutine, and it will magically jump to the next subroutine! This shaves off a LOT of cycles and simplifies the code as well.
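
For example, a specialized backdrop color routine could look something like this (sketch, made-up labels; the stack holds [set_bg_color-1][color][next routine-1]...):

Code:
set_bg_color:
   lda #$3F
   sta $2006
   lda #$00
   sta $2006       ; hardcoded $3F00, no PLA needed for the address
   pla
   sta $2007       ; the color byte queued right after this routine's address
   rts             ; a bare RTS chains straight into the next queued routine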


PostPosted: Fri Jan 26, 2018 7:06 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10288
Location: Rio de Janeiro - Brazil
Drakim wrote:
Indeed you did, so I'm calling it tokumaru's video stack from now on, if you don't mind :D

If this is supposed to be immortalized in the wiki or something I don't mind being credited somewhere, but having it named after me is a bit too much IMO. :mrgreen:

Quote:
1. Your stack format is 2 bytes for the desired vram address, then 2 bytes for the desired subroutine, chained as many times as you need, and then ending with 2 bytes for a fake vram address $0000 and 2 bytes for the terminator function that restores the stack.

You can improve this by putting the vram address AFTER the desired subroutine address in your stack, and letting the subroutine itself PLA out the vram address and set it.

The reason I did it like that was so I could reuse the same 32x unrolled loop of PLA+STA to transfer any number of bytes from 1 to 32, by simply JSRing to the middle of the sequence. If I were to set the address inside the routine, I'd need 32 separate routines for the same functionality without any loss of performance.

Quote:
Why? Because we can create hundreds of different subroutines if we desire, to do specialized tasks.

I did end up ditching the generic approach in favor of completely specialized routines myself, but aside from the rare instances where you'd use hardcoded addresses, my old method could do specialized routines too, which could set other VRAM addresses if needed.

Quote:
2. To loop back and continue to the next subroutine, you use "JMP UpdateVRAM -> PLAx2 vram address -> RTS", but now that the vram address isn't in the way you can drop all that and just use a normal "RTS" at the end of each subroutine, and it will magically jump to the next subroutine! This shaves off a LOT of cycles and simplifies the code as well.

Sure, that's much cleaner and straightforward, but like I said, setting the VRAM address before calling the next routine is merely a ROM saving feature so that the same generic transfer routine can deal with data lengths from 1 to 32. The superfluous address setting at the end of the chain will waste a few cycles, but that's all.

I wasn't 100% satisfied with that solution either, but someone looking for a completely generic VRAM update system of "copy X bytes from the stack to PPU address $XXXX" that doesn't have much ROM space might want to go with that.

I ended up switching to completely specialized update routines, so I have complete freedom to store the data in whatever format is more convenient for each task (many times outside of the stack, even), but I still have the RTS directly call the next update routine.


PostPosted: Fri Jan 26, 2018 7:48 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
tokumaru wrote:
The reason I did it like that was so I could reuse the same 32x unrolled loop of PLA+STA to transfer any number of bytes from 1 to 32, by simply JSRing to the middle of the sequence. If I were to set the address inside the routine, I'd need 32 separate routines for the same functionality without any loss of performance.


You can solve this by first queuing up a specialized "set address" subroutine, and then queuing up the unrolled loop subroutine right after it (plus the offset to get the desired number of loops). It adds 6 cycles (the extra RTS) to do this, but the unnecessary JMP statement at the end of the old solution costs 3 cycles by itself, so this technique's unrolled loop only really costs 6 - 3 = 3 cycles more. As long as you have a few more video operations queued up afterwards, you end up using fewer cycles overall by avoiding the extra JMP each time.
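
The "set address" subroutine is just this (sketch, made-up label):

Code:
set_vram_addr:
   pla
   sta $2006       ; high byte of the address, queued after this routine's entry
   pla
   sta $2006       ; low byte
   rts             ; chains into the unrolled loop entry queued right after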


PostPosted: Fri Jan 26, 2018 8:12 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10288
Location: Rio de Janeiro - Brazil
But the extra JMP and the bogus VRAM address are used only once, and during a not-so-precious time (i.e. they can safely take place during the pre-render scanline). An extra 6 cycles per update routine call when CPU time is at a premium (i.e. during vblank) is much worse IMO.

Note that I'm not arguing that this is the best possible solution, just that there might be demand out there for all of these solutions. The VRAM address presetting is for those who're looking for a fast generic vblank handler but don't have much ROM to spare (e.g. NROM projects with more VRAM updates than typical).


PostPosted: Fri Jan 26, 2018 8:53 am 

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
I think I'm misunderstanding you, or you are misunderstanding me. :wink:

Here is the gist of what I understood you to be saying:

Quote:
We have to arrange the stack so that it's [videoaddress] first and [subroutineaddress-1] afterwards. When it's vblank time, we pull out the [videoaddress], use it to set $2006 correctly, and then we use RTS to jump to our [subroutineaddress-1], which now begins pushing data into $2007. When it's done, it calls JMP UpdateVRAM to begin the process again.

The reason we cannot do the opposite order, beginning by using RTS to jump to [subroutineaddress-1], and then have the subroutine itself pull the [videoaddress] and set $2006 correctly, and then begin pushing data into $2007, is because the subroutine might be an unrolled loop, and we might want to jump to [subroutineaddress-1+60] to skip a bunch of the unrolled loop.

If we did that, we'd be jumping past the part of the subroutine that pulls out the [videoaddress] and sets $2006 correctly. Therefore that part cannot be inside the subroutine.


And here is the gist of what I'm saying:

Quote:
To solve this issue, we simply have to avoid setting the address in the unrolled loop's subroutine. We do this by setting the stack like this: [subroutineSetAddress-1][videoaddress][subroutineUnrolledLoop-1+60]

What happens now is that our first RTS jumps us to subroutineSetAddress, which uses two PLA to set $2006 correctly, and then calls RTS, which jumps us to subroutineUnrolledLoop (plus the offset!) with the address already set.

This means we are using an additional RTS instruction for unrolled loops, which costs 6 more cycles, but we save 3 cycles for every single video subroutine by avoiding the "JMP UpdateVRAM", which more than makes up for it, and we end up using fewer cycles in the long run.


Edit: Actually, after re-reading a couple of times: what do you mean by "the extra JMP and the bogus VRAM address are used only once"? How does that work? What exactly do you do to end your video subroutines if you only use the JMP once during the entire sequence? Don't you have to do it once per subroutine?

