All times are UTC - 7 hours

[ 26 posts ]  Go to page 1, 2  Next
PostPosted: Mon Sep 21, 2015 2:37 pm 
Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10484
Location: Rio de Janeiro - Brazil
In another thread I started talking about a way to do VRAM updates that is completely stack-based, and can be used for the entire game. I decided to start this new thread mainly to share what I have designed so far, possibly helping people in search of a VRAM update system, but also to get some input from you guys and maybe make this better.

The first thing I have is this piece of code in the NMI handler, after the OAM DMA, that takes care of running all the updates:

Code:
   ;swap stack pointers
   tsx
   stx RealSP
   ldx FakeSP
   txs

UpdateVRAM:

   ;set the output address and jump to the byte copy code
   pla
   sta $2006
   pla
   sta $2006
   rts

RestoreSP:

   ;restore the stack pointer
   ldx RealSP
   txs

For each VRAM update, an output VRAM address is pulled from the stack and written to $2006, and the RTS jumps to some point in the middle of the following unrolled loop, depending on the number of bytes that need to be copied:

Code:
Copy32Bytes:

   pla
   sta $2007

Copy31Bytes:

   pla
   sta $2007

   (...)

Copy2Bytes:

   pla
   sta $2007

Copy1Byte:

   pla
   sta $2007

CopyNothing:

   jmp UpdateVRAM

This will copy a certain number of bytes and JMP back to the update loop, so the next update can be processed. I decided that 32 bytes is the ideal length for this unrolled loop because it allows you to update the whole palette or a whole row of name table entries. You'll most likely need more than that only for pattern updates, but there's nothing wrong with breaking those up into pairs of tiles (32 bytes = 2 tiles).

When does the update loop end then, if the program keeps JMPing back to UpdateVRAM? Well, in order to process the updates as fast as possible, I decided not to explicitly check for the end of the updates, but instead feed the RTS of the last iteration of the update loop with the address of RestoreSP. Yes, this means that the last address written to $2006 is a bogus one (if it makes you feel better, you can write $0000, like some people do anyway), but since it takes 16 cycles to set that address, that's equivalent to 8 non-taken branches that would be necessary inside the loop to check for a flag that breaks out of the loop. I'm optimizing for the worst case here, so whenever there are more than 8 updates, it's cheaper to set the address unnecessarily than to check for a flag every update.
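Here's a quick sanity check of that trade-off, in Python for convenience (not part of the original code; the constants mirror the cycle figures quoted above):

```python
# Trade-off described above: one unnecessary $2006 address setup at the
# end (16 cycles) versus a not-taken branch check (2 cycles) on every
# update.

BOGUS_ADDRESS_SETUP = 16   # pla/sta $2006, twice
BRANCH_NOT_TAKEN = 2       # one not-taken branch per update

def flag_check_cost(num_updates):
    return BRANCH_NOT_TAKEN * num_updates

# At exactly 8 updates the two approaches tie (16 cycles each); past
# that, the single bogus write is strictly cheaper.
assert flag_check_cost(8) == BOGUS_ADDRESS_SETUP
assert all(flag_check_cost(n) > BOGUS_ADDRESS_SETUP for n in range(9, 64))
```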

What about PPU address increments? Same thing. Whenever the increment mode has to be changed, the RTS is fed with the address of a routine that changes the increment mode, before RTSing again to reach the actual byte copy code.

All decisions so far were made to make the most out of the vblank time, meaning there's a lot of work to do before vblank starts, so the stack is formatted correctly. Personally, I'd rather use indexed addressing to fill the update stack, to avoid having to manipulate the stack pointer mid-frame and to avoid having to write all the data backwards. If you want to build the update stack using PHA, you'll have to do the next steps in the opposite order.

For each update you need to "schedule", you have to first check whether the current PPU address increment mode is the one you need. If it isn't, you have to write the address (minus 1, since it will be called with an RTS) of the routine that changes the increment mode. Then, you have to write the output address for the data, and then the data.

Once all updates have been written, you need a bogus VRAM address, which can be anything really, but you can use $0000 if that makes you more comfortable. Then you need the address of RestoreSP - 1, because that's what will allow the update loop to end. Even if there are no updates at all, you still need these last 4 bytes so the program doesn't crash.

Another interesting advantage of this method is that you can cheat, and use the RTS to jump to other types of byte copying routines besides the chain of PLA + STA. For example, to update attribute table data, I prefer to copy the bytes directly from my shadow attribute tables, instead of wasting time copying them over to the stack. In cases like this, you can simply use the address of the specialized byte copy code (minus 1) instead of the addresses from the regular look-up table. This allows you to have a constant NMI handler for the whole program, but doesn't prevent you from doing more specialized updates if you need to. This is useful if you need to switch banks and copy data directly from ROM, for example. As long as you JMP back to UpdateVRAM in the end, everything is fair game.
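To make the control flow concrete, here's a rough Python model of the scheme (not cycle-accurate, and the tuple-based buffer is only an illustration of the semantics, not the actual 6502 stack layout; the routine names are made up to echo the labels above):

```python
# Rough Python model of the stack-based update scheme. On the 6502,
# the stack holds, per update: a routine address (consumed by RTS), a
# 2-byte VRAM address, and the data bytes. Here the buffer is a list
# consumed front-to-back, and a "RestoreSP" entry terminates the loop.

def run_updates(buffer, vram):
    for routine, addr, data in buffer:
        if routine == "RestoreSP":      # terminator: bogus address, no data
            break
        # "UpdateVRAM": set $2006 twice, then RTS into the unrolled loop
        for offset, byte in enumerate(data):
            vram[addr + offset] = byte  # one PLA + STA $2007 pair per byte

vram = [0] * 0x4000
buffer = [
    ("Copy3Bytes", 0x3F00, [0x0F, 0x16, 0x27]),  # palette update
    ("Copy2Bytes", 0x2041, [0x01, 0x02]),        # name table update
    ("RestoreSP", 0x0000, []),                   # end marker
]
run_updates(buffer, vram)
```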

Does anyone have any comments or suggestions? The thing that's bothering me the most is the bogus VRAM address at the end, but like I said, in the worst case, it's cheaper to have that than checking a flag for every update.


PostPosted: Mon Sep 21, 2015 8:20 pm 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10484
Location: Rio de Janeiro - Brazil
There's one important thing I forgot to talk about: when trying to buffer an update, there must be a way to tell whether there's still enough time left to process it. I'm not sure whether going by how full the buffer is would be enough, since the same buffer is used for addresses and data, and these take different amounts of time to process (let's not forget the increment mode change, which is an additional routine). The most precise thing to do would probably be to count cycles. Initialize a variable with the total amount of cycles that can be used for VRAM updates and subtract from it whenever a new update is buffered. A look-up table could hold the number of cycles necessary for byte transfers of different lengths.
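A sketch of that cycle-budget idea, again in Python for illustration (the 25-cycle overhead and 8 cycles/byte match figures quoted later in the thread; the total budget is a made-up round number):

```python
# Each buffered transfer costs a fixed overhead plus a per-byte cost;
# an update is rejected once the remaining vblank budget would go
# negative.

OVERHEAD = 25          # $2006 setup + RTS + JMP per transfer
CYCLES_PER_BYTE = 8    # PLA (4) + STA $2007 (4)

# Precomputed look-up table: total cost of a transfer of each length
COST = [OVERHEAD + CYCLES_PER_BYTE * n for n in range(33)]

class UpdateBudget:
    def __init__(self, total_cycles):
        self.remaining = total_cycles

    def try_schedule(self, length):
        """Return True and deduct the cost if the transfer still fits."""
        if COST[length] <= self.remaining:
            self.remaining -= COST[length]
            return True
        return False

budget = UpdateBudget(1500)
assert budget.try_schedule(32)           # 25 + 256 = 281 cycles
assert budget.remaining == 1500 - 281
```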


PostPosted: Mon Sep 21, 2015 8:38 pm 

Joined: Sun Jan 02, 2011 11:50 am
Posts: 522
Reminds me of how I add routines to NMI. I wanted it to be easy to hook anything into NMI. I load $0100 and up with the addresses of the tasks I want to complete (minus 1), set the SP and RTS to the first routine. The next RTS will load the next task, and whatever the last task is, its RTS will jump to the routine that restores the stack pointer. I just avoid using the stack for anything in my "NMI" routines. If I really need to, I can swap the stack pointer inside the routine.


PostPosted: Mon Sep 21, 2015 9:31 pm 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10484
Location: Rio de Janeiro - Brazil
Movax12 wrote:
I wanted it to be easy to hook anything into NMI. I load $0100 and up with the addresses of the tasks I want to complete (minus 1), set the SP and RTS to the first routine. The next RTS will load the next task, and whatever the last task is, its RTS will jump to the routine that restores the stack pointer.

Cool idea, very clever and elegant! And you can still have VRAM addresses and data on the stack that the tasks can use, if you want to.

Doing the math though, it sounds like your way ends up being slightly slower, at least for the common case of simply copying variable amounts of bytes to a dynamic VRAM location. For each task, there's 6 cycles for the RTS, and then inside the task you need 16 cycles to set the VRAM address, then another RTS to jump to the middle of the unrolled loop. That's 28 cycles of preparation before the transfer itself.

In the code I wrote above, there's 16 cycles of setting up the VRAM address, followed by the 6 cycles of the RTS that jumps to the unrolled loop. Then, after all bytes are copied, there's 3 cycles for the JMP, for a total of 25 cycles of overhead for each block of data. There's also what's lost with the dummy $2006 writes at the end, which adds a bit more overhead to the whole process, so it's not as elegant as your solution.

I was just thinking, and realized that the overhead for updating small amounts of data is quite annoying. Say, when updating a single metatile for a block that's been destroyed: that's 2 rows of 2 tiles followed by 1 attribute byte. These need 3 separate transfers (3 * 25 cycles of overhead = 75 cycles) and 5 bytes of data (5 * 8 cycles = 40 cycles). I thought that 115 cycles sounded like a lot to update just one metatile, but then it hit me: all metatile updates have the same structure, so instead of calling the generic byte copy routine 3 times I can make a routine specifically for updating single metatiles, that doesn't JMP back to the NMI handler until it's done with the 3 transfers. That brings the amount of time it takes to update the metatile down to 97 cycles, which is much more reasonable. So I guess I can have the simple byte copy routine for the more generic stuff, but I have the option to create new specialized routines whenever they seem necessary, so this solution is still very flexible even though the first couple of $2006 writes of each update are done outside the update routine itself. A little messy, but flexible. :lol:
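A quick Python check of the cycle math in that paragraph (the constants are the figures used throughout this thread, not anything new):

```python
# Generic path: each of the 3 transfers pays the full 25-cycle
# overhead ($2006 setup 16 + RTS 6 + JMP 3). Specialized metatile
# routine: still 3 address setups, but only one RTS into the routine
# and one JMP back out.

SETUP_2006 = 16
RTS = 6
JMP = 3
PER_BYTE = 8

generic = 3 * (SETUP_2006 + RTS + JMP) + 5 * PER_BYTE
specialized = 3 * SETUP_2006 + RTS + JMP + 5 * PER_BYTE

assert generic == 115        # as computed in the post
assert specialized == 97     # saves two RTS+JMP pairs (18 cycles)
```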

Quote:
I just avoid using the stack for anything in my "NMI" routines. If I really need to I can swap the stack pointer inside the routine.

Yeah, that doesn't bother me at all. The goal is to move as much processing as possible away from the PPU update code, which should eliminate the need for regular stack operations.


PostPosted: Mon Sep 21, 2015 9:49 pm 

Joined: Wed Apr 02, 2008 2:09 pm
Posts: 1145
I realized too late this is probably describing what Movax12 does, but I've already gone and written it.
Quote:
but instead feed the RTS of the last iteration of the update loop with the address to RestoreSP. Yes, this means that the last address written to $2006 is a bogus one

Couldn't you use the same trick for the end of the loop?
i.e:
Code:
Copy1Byte:

   pla
   sta $2007

CopyNothing:

   rts

3 cycles slower per case, but since the extra setup of $2006 at the end is 22 cycles, it'd require more than 7 different update streams before it matters. Then again, I suppose attribute bytes would do that. But, because the routine would no longer need to always start from UpdateVRAM, different addresses could be pushed for different types of streams like:

Code:
UpdateVRAMinc32:
   lda shadow2000
   ora #%00000100
   sta shadow2000
   sta $2000

   pla
   sta $2006
   pla
   sta $2006
   rts

So something that needed to change the increment type would do it, set the address, and RTS to the right number of updates. Whereas yours would presumably have to JMP to UpdateVRAM, RTS to the routine that changes the increment type, then RTS again to the right number of updates. (Unless you had a change-increment-type version of every number of updates duplicated as well.)
Quote:
so instead of calling the generic byte copy routine 3 times I can make a routine specifically for updating single metatiles, that doesn't JMP back to the NMI handler until it's done with the 3 transfers

That's more or less what I do for vertical attributes, since they need a new address so often. I have a series of updates in ROM much like your copy32bytes etc that sets $2006 and then writes $2007.
Code:
   pla
   sta $2006
   
   pla
   sta $2006
   
   pla
   sta $2007

_________________
https://kasumi.itch.io/indivisible


PostPosted: Mon Sep 21, 2015 10:07 pm 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10484
Location: Rio de Janeiro - Brazil
Kasumi wrote:
3 cycles slower per case, but since the extra setup of $2006 at the end is 22 cycles, it'd require more than 8 different update streams before it matters. Then again, I suppose attribute bytes would do that.

Yeah. Updating the attributes of a column of metatiles requires 4 separate transfers, and then there's obviously the name table data. Based on this, I think it should be fairly common to have more than 8 updates per frame. And when there aren't, the overhead isn't such a big issue, since fewer updates means more free time, which means the wasted cycles don't matter as much.

Quote:
But, because the routine would no longer need to always start from UpdateVRAM, different addresses could be pushed for different types of streams like:

There's that... Since I figured I could optimize specific types of updates (by not jumping back to UpdateVRAM all the time), the final update count could easily drop below 8, so this is indeed something to consider.

Quote:
That's more or less what I do for vertical attributes, since they need a new address so often. I have a series of updates in ROM much like your copy32bytes etc that sets $2006 and then writes $2007.

I see. Another trick to speed up vertical attribute updates is that you can update 2 bytes per address if you keep the increment 32 mode. Since 32 bytes is the same as 4 attribute rows, for each address you can write 1 attribute byte and the one that comes 4 rows down, so you only have to set the address 4 times for the whole column.
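The address math behind that trick can be checked in Python (assuming the first name table's attribute area at $23C0, 8 bytes per attribute row; those values come from the PPU memory map, not from the post):

```python
# Why increment-32 mode halves the address setups for an attribute
# column: the attribute table is 8 bytes per row, so a VRAM increment
# of 32 lands exactly 4 attribute rows further down.

ATTR_BASE = 0x23C0
BYTES_PER_ATTR_ROW = 8

def attr_addr(row, col):
    return ATTR_BASE + row * BYTES_PER_ATTR_ROW + col

col = 5
pairs = []
for row in range(4):                  # set the address only 4 times...
    first = attr_addr(row, col)
    second = first + 32               # ...each +32 write covers the byte
    pairs.append((first, second))     # 4 rows down in the same column
    assert second == attr_addr(row + 4, col)
```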


PostPosted: Tue Sep 22, 2015 1:25 am 

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7432
Location: Chexbres, VD, Switzerland
Oh, this is pretty clever and cool. A good idea would be to place the "Copy32Bytes" etc. routines so that their address can be calculated directly from the number of bytes to copy, without using a lookup table. Does it allow you to update more data than more conventional VRAM update systems (i.e. is it faster)?


PostPosted: Tue Sep 22, 2015 2:31 am 

Joined: Sun Jan 02, 2011 11:50 am
Posts: 522
Movax12 wrote:
I wanted it to be easy to hook anything into NMI. I load $0100 and up with the addresses of the tasks I want to complete (minus 1), set the SP and RTS to the first routine.


Looking at this after some sleep, I'm not sure I was clear. During "game logic" I can call a macro that adds a new routine to this list in the stack page. In NMI, you swap the SP and start through the list with RTS. I actually have three lists. One is for VRAM updates; they are called as soon as possible in NMI. Then I have lower-priority routines, and a third for things that don't need to be done in any hurry, like reading the controller. And if you need to do something only once, the routine can remove itself from the list quite easily (if it was added to the list last).

So, compared to having some static JSRs in NMI, and/or checking flags to decide what needs to be done (and ignoring the overhead in the "game logic" thread) this is quite fast.


PostPosted: Tue Sep 22, 2015 7:42 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10484
Location: Rio de Janeiro - Brazil
Bregalad wrote:
A good idea would be to force the "Copy32Bytes" etc.... routines so that their address can be calculated directly from the # of bytes to copy without using a lookup table.

Great idea! Since there are 4 bytes of instructions for copying each byte (PLA, STA $2007), the offset can be calculated with bit shifts. The only problem is that the fewer bytes you copy, the higher the address, so in addition to being shifted, the number needs to be complemented. Actually, since RTS takes addresses minus 1, we can use one's complement instead of two's complement:

Code:
   ;X = number of bytes to copy
   lda #>Copy32Bytes ;2
   pha ;5
   txa ;7
   asl ;9
   asl ;11
   eor #$ff ;13
   pha ;16 cycles

For this to work, Copy32Bytes must be at $XX80. That way, if I want to copy 32 bytes, the result will be: (32 * 4) XOR 255 = $XX7F, exactly 1 less than the start of the byte copy routine. If I want to copy 1 byte, the math is: (1 * 4) XOR 255 = $XXFB, exactly one less than the address of the last copy, which is at $XXFC-$XXFF. It's 2 cycles slower than using the jump table, but you save 64 or 66 bytes.
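The trick holds for every length, not just the two endpoints; a quick Python loop verifies all 32 entry points (assuming, as above, that Copy32Bytes sits at offset $80 in its page):

```python
# Verify the shift-and-complement trick: with Copy32Bytes at $XX80 and
# 4 bytes of code (PLA + STA $2007) per copied byte, the entry point
# for copying N bytes is $80 + (32 - N) * 4 within the page, and RTS
# jumps to the pushed address + 1.

COPY32_OFFSET = 0x80   # low byte of Copy32Bytes

for n in range(1, 33):
    entry = COPY32_OFFSET + (32 - n) * 4   # CopyNBytes label
    pushed = (n * 4) ^ 0xFF                # asl, asl, eor #$ff
    assert pushed == entry - 1             # RTS adds 1 to the pushed address
```

This works because XOR with $FF is the same as 255 minus the value, which lines up exactly with the addresses decreasing by 4 per extra byte.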

Quote:
Does it allow to update more data than more regular VRAM update systems (i.e. is it faster ?)

Well, the byte copy itself is as fast as it gets without having to resort to ZP or dynamic code, it's the overhead of going from one update to the next that slows things down. I tried to minimize that overhead as much as possible, but it does add up when you have lots of small transfers.

Movax12 wrote:
Looking at this after some sleep, not sure if I was clear.

It was pretty clear to me. :D Did I say anything that sounded like I didn't get it?


PostPosted: Tue Sep 22, 2015 9:31 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10484
Location: Rio de Janeiro - Brazil
OK, here's another approach, targeting maximum speed: create 2 unrolled copy loops:

Code:
CopyLoop1:
   .rept 32
   pla
   sta $2007
   .endr
   
   pla
   sta $2006
   pla
   sta $2006
   rts

Code:
CopyLoop2:
   .rept 32
   pla
   sta $2007
   .endr
   
   jmp RestoreSP

The first one is used for all updates except for the last, which uses the second. This means there will be no overhead at all (except for setting a new VRAM address, but that's nearly always necessary anyway), since one update jumps directly into the next. It's still customizable, since you can jump to routines other than these 2 loops (to change increment modes or fetch data from other sources).

The only drawback is that filling the buffer with indexed addressing becomes a little awkward, since you never know whether an update is the last one when you're buffering it. You'd have to save the index of the address of the most recent update so you can change it to use the second copy routine once all the logic is done and you know that no more updates will be buffered. Filling the buffer with PHA is more straightforward, since the first update you push is always the last one to be executed, so you can just start with the second copy routine and then use the first for all other updates.

EDIT: It may seem wasteful to have a second copy routine, but when you consider that it eliminates the need for specialized routines in certain cases (not the metatiles I mentioned before, actually: since metatile updates have fixed lengths, it's still faster to use a custom routine for those), this might actually be saving storage space.

I really liked Movax12's idea of one routine calling the next, so I just took it to the next level.

EDIT 2: Actually, if you don't mind the dummy $2006 writes, you can do with just the first routine, and have the RTS break out of the loop, like in my original idea.


Last edited by tokumaru on Tue Sep 22, 2015 9:56 am, edited 4 times in total.

PostPosted: Tue Sep 22, 2015 9:33 am 

Joined: Sun Jan 02, 2011 11:50 am
Posts: 522
tokumaru wrote:
Movax12 wrote:
Looking at this after some sleep, not sure if I was clear.

It was pretty clear to me. :D Did I say anything that sounded like I didn't get it?


Not really, but I wasn't sure if I got it after reading it in the morning.


PostPosted: Sun Sep 27, 2015 8:47 pm 

Joined: Thu Jul 23, 2015 7:54 pm
Posts: 168
Location: USA
Thank you so much for this, tokumaru. I had originally planned on reading 2 bytes from the stack for the $2006 writes, reading another byte for how many bytes to copy, putting that in X, and then using a loop to copy X bytes. I was gonna try to do this with both palettes, 25 tiles in fixed positions, and then try to squeeze a few more tiles in, and just wait until I'd run out of vblank time. This also made me stop being lazy and learn how the RTS trick works.


PostPosted: Wed Feb 03, 2016 10:44 pm 

Joined: Wed Feb 03, 2016 10:39 pm
Posts: 32
I know that I'm totally resurrecting this, but I thought it was interesting enough that I looked into it.

Performance-wise, PLA takes 4 cycles. Because of this, pulling from the stack is no faster than reading any other memory. So while this method is cool, it would be better to devote non-stack memory to updating nametables UNLESS you're not using the stack very much and you're not updating the screen with too much information.


PostPosted: Wed Feb 03, 2016 11:00 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20071
Location: NE Indiana, USA (NTSC)
I think one of the advantages of the PLA/STA $2007 loop is that you can determine the point to jump in by shifting the negative number of bytes to the left by 2.
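That formulation can be checked against the one's-complement version from earlier in the thread; the negative shift yields the entry point itself (one more than the RTS-style value), again assuming the unrolled loop starts at offset $80 in its page:

```python
# Shift the negative byte count left by 2: this gives the entry point
# directly (suitable for a jmp or computed jump rather than RTS).

for n in range(1, 33):
    offset = ((-n) << 2) & 0xFF        # negative count, shifted left by 2
    entry = 0x80 + (32 - n) * 4        # CopyNBytes entry point in the page
    assert offset == entry
```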


PostPosted: Wed Feb 03, 2016 11:25 pm 

Joined: Mon Jan 03, 2005 10:36 am
Posts: 3068
Location: Tampere, Finland
thenewguy wrote:
So while this method is cool, it would be better to devote non-stack memory to updating nametables UNLESS you're not using the stack very much, and you're not updating the screen with too much information.

Or flip that around and get: Use the stack area for PPU update buffers, UNLESS you need most of it for other things (e.g. some recursive algorithm).

_________________
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi

