All times are UTC - 7 hours

Posted: Wed Aug 24, 2016 11:07 pm
Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10052
Location: Rio de Janeiro - Brazil
Everyone knows that the PPU offers a special address auto-increment mode of 32 to make it easier to draw columns of tiles to the name tables, but unfortunately it doesn't offer the option of incrementing the address by 8, which would be really handy for updating attributes.

One common way to avoid having to set the address before writing each attribute byte is to make use of the increment 32 mode anyway and only set the address for every 2 attribute bytes, which must be written non-sequentially (bytes 0 and 4, 1 and 5, 2 and 6, 3 and 7). This works really well when the attribute updates do not cross a screen boundary, which is indeed the case in many scrolling engines.
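As a rough illustration of the pairing described above, here is a small Python sketch that lists which VRAM addresses each address set-up would reach in increment 32 mode. It assumes the attribute table of name table 0 at $23C0 and 8 bytes per attribute row; `paired_writes` and `byte_index` are hypothetical helper names, not part of any existing tool.

```python
ATTR_BASE = 0x23C0  # attribute table of name table 0 (assumed for this sketch)


def paired_writes(column, base=ATTR_BASE):
    """Return (first, second) VRAM addresses for each address set-up.

    With the +32 increment, the second $2007 write lands 32 bytes later,
    which is 4 attribute rows further down the same column.
    """
    pairs = []
    for row in range(4):
        first = base + row * 8 + column
        pairs.append((first, first + 32))
    return pairs


def byte_index(addr, column, base=ATTR_BASE):
    """Position of an attribute byte within its column (0 = top, 7 = bottom)."""
    return (addr - base - column) // 8


if __name__ == "__main__":
    for first, second in paired_writes(column=3):
        print(f"${first:04X} = byte {byte_index(first, 3)}, "
              f"${second:04X} = byte {byte_index(second, 3)}")
```

Four address set-ups cover all 8 bytes of a column, in the order 0 and 4, 1 and 5, 2 and 6, 3 and 7.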

In my case however, I'm using a 4-screen name table layout, and column updates are always split between 2 screens, so there isn't much to gain from the increment 32 mode, because only some bytes can be written in pairs, depending on the alignment between the column and the screens, and handling all the different cases would probably negate the benefits in the end.

So, seeing as I'm stuck with setting the VRAM address for each byte, I started to think of ways to do this as fast as possible. The high byte of the address never changes within the same screen (since each attribute table is only 64 bytes), so it would seem logical to keep it permanently loaded in a register so I could simply write it to $2006 over and over without having to reload it. The low byte could be kept in the accumulator, so I could quickly jump to the next line with ADC #$08, which would leave an index register free for the data itself. I'd rather not load the data from a fixed memory position (since the space for VRAM updates is allocated dynamically), and using the stack is out of the question because it kills the ADC #$08 way of incrementing the address.

Just as I was about to give up on this solution, I remembered that $2006 and $2005 share the toggle that selects between 1st and 2nd write, and thought that maybe I could use that to my advantage. As it turns out, the first $2005 write only affects bits in the lower half of the VRAM address (the coarse X scroll), and the second $2006 write overwrites the entire lower byte, so I don't really need to keep the high byte of the address loaded anywhere. I can simply write junk data to $2005 instead, and then write the actual low byte of the address, which the accumulator holds, to $2006. This means that both X and Y are free for me to load the data from a dynamic location. The code would then look something like this:

Code:
   ;NOTE: X points to the buffer position that contains the data for this update

   ;write the first byte (24 cycles)
   lda UpdateBuffer+0, x
   sta $2006
   lda UpdateBuffer+1, x
   sta $2006
   ldy UpdateBuffer+2, x
   sty $2007

   ;write the remaining bytes (18 cycles per byte)
   sta $2005 ;<- could just as well be stx or sty too, it doesn't matter!
   adc #$08 ;<- assumes carry is clear on entry; adding 8 within one attribute table ($C0-$FF) never sets it
   sta $2006
   ldy UpdateBuffer+3, x
   sty $2007
   ;(...)

Of course, since the update is split across 2 screens, there'll be a "first byte" in each screen, and a total of 7 "remaining bytes", so the final cycle count for this is 174 (significantly better than the 200 something I had with my stack method), disregarding the logic necessary to handle the two variable-length updates (which in the worst case can be eliminated by having unrolled routines for all 8 possible alignments).
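The 174-cycle figure can be checked with a little arithmetic. The sketch below assumes the usual 6502 timings (4 cycles for absolute and absolute,X loads and stores with no page crossing, 2 cycles for an immediate ADC); the table and function names are made up for this example.

```python
# Cycle costs per instruction form (assumed: no page crossing on abs,x reads).
CYCLES = {
    "lda abs,x": 4,
    "ldy abs,x": 4,
    "sta abs": 4,
    "sty abs": 4,
    "adc #imm": 2,
}

# First byte in a screen: full address set-up plus one data byte.
FIRST_BYTE = ["lda abs,x", "sta abs", "lda abs,x", "sta abs", "ldy abs,x", "sty abs"]
# Every following byte: junk $2005 write, ADC #$08, low byte to $2006, data.
NEXT_BYTE = ["sta abs", "adc #imm", "sta abs", "ldy abs,x", "sty abs"]


def cost(sequence):
    """Total cycles for a list of instruction forms."""
    return sum(CYCLES[op] for op in sequence)


def column_cost(screens=2, total_bytes=9):
    """Cycles for a column split across `screens` attribute tables."""
    return screens * cost(FIRST_BYTE) + (total_bytes - screens) * cost(NEXT_BYTE)


if __name__ == "__main__":
    print(cost(FIRST_BYTE), cost(NEXT_BYTE), column_cost())
```

That is 2 x 24 + 7 x 18 cycles for the 9 attribute bytes, not counting the logic that picks the two variable-length updates.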

Now, I'm writing this post for 2 reasons: first, I want to run the idea by you guys to make sure it's solid, and that I didn't overlook anything. Second, I want to share the idea in case it may be useful to anyone, since combined $2005/$2006 writes are not something we usually discuss for purposes other than mid-screen scroll changes. So, what do you guys think? Is it safe to update the VRAM address in this manner? Am I forgetting anything?


Last edited by tokumaru on Thu Aug 25, 2016 12:29 pm, edited 1 time in total.

Posted: Thu Aug 25, 2016 1:13 am
Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5718
Location: Canada
Yes, I believe you're correct that a $2005/$2006 pair is a valid way to update just the low byte of the PPU address.


Posted: Thu Aug 25, 2016 1:34 am
Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10052
Location: Rio de Janeiro - Brazil
Cool. I wonder if this could be useful for other types of VRAM updates...


Posted: Thu Aug 25, 2016 4:06 am
Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7230
Location: Chexbres, VD, Switzerland
I do not get the point of your thread. I think you are overthinking this (although I might be wrong). How fast are you scrolling, exactly? If you are scrolling at 16 pixels per frame or less, this discussion is irrelevant, since the NTSC VBlank is long enough to update 2 NT columns, 1 AT column, and sprite DMA, and there is even a little time left if I am not mistaken. And 16 pixels per frame is already incredibly fast.

So you should probably explain why doing it the "normal" way isn't fast enough for you. Also, the trick you thought of (using the 32 increment mode and writing 2 attribute bytes per address) is relevant whether or not vertical scrolling is used as well. You don't have to start your update where the screen starts and stop the update where the screen stops, you can do a full update from the top of the NT down to the bottom, without caring about how your screen is vertically aligned, if this happens to be more efficient. Then you could unroll the loop and keep $3f in either X or Y in order to not have to reload that value. The code would be something like:

Code:
ldx #$3f
ldy #$c0
stx $2006
sty $2006
lda AttribColumnBuffer
sta $2007
lda AttribColumnBuffer+4
sta $2007

iny
stx $2006
sty $2006
lda AttribColumnBuffer+1
sta $2007
lda AttribColumnBuffer+5
sta $2007

iny
etc...


I think this would be fast enough, and the fastest possible without self-modifying code.


Posted: Thu Aug 25, 2016 5:15 am
Formerly ~J-@D!~
Joined: Sun Mar 12, 2006 12:36 am
Posts: 445
Location: Rive nord de Montréal
iny won't cut it, Bregalad. You should skip 8 bytes between each attribute byte to stay along a column. With your code, you move along the row and you end up updating row #0 and row #4.
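To make the difference concrete, here is a hedged Python sketch that maps attribute addresses to (row, column) positions and compares stepping the low byte by 1 (INY) with stepping it by 8. It assumes the attribute table at $23C0 and the increment 32 mode, so each address set-up writes two bytes 32 apart; the function names are illustrative only.

```python
ATTR_BASE = 0x23C0  # attribute table of name table 0 (assumed)


def attr_position(addr, base=ATTR_BASE):
    """Return (row, column) of an attribute byte inside the 8x8 table."""
    offset = addr - base
    return (offset // 8, offset % 8)


def visited(step, setups=3, start=ATTR_BASE):
    """Positions written when the start address advances by `step` per set-up.

    Each set-up writes two bytes: the start address and start + 32.
    """
    positions = []
    for i in range(setups):
        addr = start + i * step
        positions.append(attr_position(addr))
        positions.append(attr_position(addr + 32))
    return positions


if __name__ == "__main__":
    print("step 1:", visited(1))  # walks along rows 0 and 4
    print("step 8:", visited(8))  # walks down column 0
```

With a step of 1 the column index changes and only rows 0 and 4 are touched; with a step of 8 the column stays fixed and the rows advance.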

BTW, nice trick using 2005/2006 to update the PPU address, clever!


Posted: Thu Aug 25, 2016 8:50 am
Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10052
Location: Rio de Janeiro - Brazil
Bregalad wrote:
I do not get the point of your thread. I think you are overthinking this (although I might be wrong).

This was just an idea that saved me a few cycles compared to what I had before, and I decided to share it in case the basic idea is useful to other people, even if not in the same context. I'm not saying this is THE way to do attribute columns, but in my particular design this does seem like the best way.

Quote:
How fast are you scrolling, exactly? If you are scrolling at 16 pixels per frame or less, this discussion is irrelevant, since the NTSC VBlank is long enough to update 2 NT columns, 1 AT column, and sprite DMA, and there is even a little time left if I am not mistaken. And 16 pixels per frame is already incredibly fast.

This is not just about scrolling speed. I have some requirements I want my engine to meet, and although the particular combination of tasks you mentioned is possible, I have a lot more stuff going on, and in addition to being able to scroll both horizontally and vertically in the same frame, I need a lot of CHR-RAM animations, so by optimizing the time taken by each individual task I increase the chances of being able to do more stuff in any given frame.

Quote:
So you should probably explain why doing it the "normal" way isn't fast enough for you.

To put it simply, I need to scroll horizontally and vertically in the same frame, so I need each of these updates to take at most half of the available vblank time (sprite DMA already deducted, since it happens every frame). I was having trouble making the column update fit, at least without hacks or breaking the pattern that other updates obey.

Quote:
You don't have to start your update where the screen starts and stop the update where the screen stops, you can do a full update from the top of the NT down to the bottom, without caring about how your screen is vertically aligned, if this happens to be more efficient.

This works well when you use vertical mirroring, because you can update from row 0 to row 7 regardless of where the column actually starts, because you'll be writing all 8 bytes anyway. It also works for vertical layouts (horizontal mirroring or 4-screen) like in SMB3, where the whole vertical space is updated, but I don't do that. I only update the part that the camera overlaps, which is 9 attribute blocks that always cross the NT/AT boundary. It can be aligned 8 different ways:

1 on top, 8 on bottom;
2 on top, 7 on bottom;
3 on top, 6 on bottom;
4 on top, 5 on bottom;
5 on top, 4 on bottom;
6 on top, 3 on bottom;
7 on top, 2 on bottom;
8 on top, 1 on bottom;

The number of paired writes you can do when the column is split like this is greatly reduced. Paired bytes sit 4 rows apart in the same attribute table, so you need at least 5 blocks on one side to even be able to write one pair.

Quote:
Code:
ldx #$3f
ldy #$c0
stx $2006
sty $2006
lda AttribColumnBuffer
sta $2007
lda AttribColumnBuffer+4
sta $2007

iny
stx $2006
sty $2006
lda AttribColumnBuffer+1
sta $2007
lda AttribColumnBuffer+5
sta $2007

iny
etc...

Like Jarhmander said, INY won't cut it here, since you have to add 8 to move down one block. You can maintain the same cycle count if you use A for the low byte of the address and do ADC #8, and use Y for the data, though.

But this is loading the data from a fixed location, something I specifically said I wanted to avoid, because I allocate memory for my updates dynamically. Not having to keep the high byte loaded frees an index register I can use to load the data from a dynamic location, without needing more cycles (besides having to load the index in the first place, but I'm accounting for that in the overhead of each update).

Like I said, this "trick" saved me a few cycles on my setup, and even though other people may have different requirements for their column updates and thus don't need the trick for this specific purpose, it might still be useful in other scenarios.


Posted: Thu Aug 25, 2016 9:18 am
Joined: Fri May 08, 2015 7:17 pm
Posts: 1771
Location: DIGDUG
Not...

Code:
ldx #$3f


But...

Code:
ldx #$23


...assuming Nametable #0.

_________________
nesdoug.com -- blog/tutorial on programming for the NES


Posted: Thu Aug 25, 2016 11:10 am
Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7230
Location: Chexbres, VD, Switzerland
OK, then swap the roles of A and Y and use ADC #$08

Code:
clc
ldx #$23    ; Or $27 or $2b or $2f to be determined
lda #$c0
stx $2006
sta $2006
ldy AttribColumnBuffer
sty $2007
ldy AttribColumnBuffer+4
sty $2007

adc #$08
stx $2006
sta $2006
ldy AttribColumnBuffer+1
sty $2007
ldy AttribColumnBuffer+5
sty $2007

adc #$08
etc...


Just a suggestion, nothing more. I do not particularly like the idea of using $2005 to address VRAM during VBlank, but if it works fine for you then great. I personally think it would be very hard to compute the VRAM address corresponding to a $2005 scroll value; the other way around is easier.
Quote:
This works well when you use vertical mirroring, because you can update from row 0 to row 7 regardless of where the column actually starts, because you'll be writing all 8 bytes anyway. It also works for vertical layouts (horizontal mirroring or 4-screen) like in SMB3, where the whole vertical space is updated, but I don't do that. I only update the part that the camera overlaps, which is 9 attribute blocks that always cross the NT/AT boundary. It can be aligned 8 different ways:

Oh that's true but since you have to write only 16 bytes when updating a whole column, I think the overhead of the logic to update only the necessary column might be as expensive as code to upload 8 unnecessary bytes... it's up to you to see.

If all you want is maximum speed, self-modifying code is the only serious way to go and is going to be significantly faster than any classical unrolled loop staying in ROM.


Posted: Thu Aug 25, 2016 11:45 am
Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10052
Location: Rio de Janeiro - Brazil
Bregalad wrote:
OK, then swap the roles of A and Y and use ADC #$08

Yup, that'd be a good solution for most cases. But I personally would prefer not to have a dedicated buffer for attribute columns, and freeing up an index register allows me to load the data from a dynamic location.

Quote:
I personally think it would be very hard to compute the VRAM address corresponding to a $2005 scroll value; the other way around is easier.

But the biggest advantage of this trick is that you don't have to compute anything. The value you write to $2005 is meaningless! The first $2005 write only modifies the lower 5 bits of the VRAM address (and the fine X scroll, which is elsewhere), and the second $2006 write will replace the entire low byte anyway.

The point of the $2005 write is simply to change the state of the 1st/2nd write toggle, without affecting the upper byte of the VRAM address. The actual value you write is meaningless, because it only affects the lower byte of the address, which will be overwritten in the following write ($2006) anyway.
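For anyone who wants to convince themselves, below is a minimal Python model of the PPU's internal address registers as they are commonly documented (temporary address t, current address v, fine X, and the shared write toggle w). It is only a sketch under that assumed register model, not a full PPU emulation, and the class and function names are invented for this example.

```python
class PPUAddr:
    """Toy model of the $2005/$2006/$2007 address logic (assumed behaviour)."""

    def __init__(self):
        self.t = 0  # temporary VRAM address (15 bits)
        self.v = 0  # current VRAM address
        self.x = 0  # fine X scroll
        self.w = 0  # shared first/second write toggle

    def write_2005(self, d):
        if self.w == 0:
            # First write: coarse X (low 5 bits of t) and fine X only.
            self.t = (self.t & ~0x001F) | (d >> 3)
            self.x = d & 0x07
        else:
            # Second write: fine Y and coarse Y.
            self.t = (self.t & ~0x73E0) | ((d & 0x07) << 12) | ((d & 0xF8) << 2)
        self.w ^= 1

    def write_2006(self, d):
        if self.w == 0:
            # First write: high 6 bits of the address, top bit cleared.
            self.t = (self.t & 0x00FF) | ((d & 0x3F) << 8)
        else:
            # Second write: whole low byte, then t is copied to v.
            self.t = (self.t & 0x7F00) | d
            self.v = self.t
        self.w ^= 1

    def write_2007(self, increment=1):
        """Return the address written to, then auto-increment v."""
        addr = self.v & 0x3FFF
        self.v = (self.v + increment) & 0x7FFF
        return addr


def column_addresses(high, low, count, junk=0x00):
    """Addresses hit by the trick: one full set-up, then $2005/$2006 pairs."""
    ppu = PPUAddr()
    ppu.write_2006(high)
    ppu.write_2006(low)
    hit = [ppu.write_2007()]
    for _ in range(count - 1):
        ppu.write_2005(junk)       # only flips w and touches the low 5 bits
        low = (low + 8) & 0xFF     # same as ADC #$08 with carry clear
        ppu.write_2006(low)        # replaces the whole low byte
        hit.append(ppu.write_2007())
    return hit
```

In this model the junk value never shows up in the result: `column_addresses(0x23, 0xC1, 3)` and `column_addresses(0x23, 0xC1, 3, junk=0xFF)` both step through $23C1, $23C9, $23D1.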

