It is currently Tue Oct 17, 2017 3:38 pm

All times are UTC - 7 hours





Post new topic Reply to topic  [ 18 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: Character RAM data write
PostPosted: Sun Feb 01, 2015 2:47 pm 
Offline
User avatar

Joined: Sat Feb 16, 2013 11:52 am
Posts: 218
The wiki states:

Quote:
Assuming moderately unrolled VRAM loading code, the NTSC vblank is long enough to copy about 160 bytes per frame plus the OAM display list without turning rendering on late. This is enough for 10 tiles alone, or 8 tiles plus a nametable row, nametable column, or entire palette.


I'm:
-Not uploading OAM through DMA
-Not writing to nametable space
-Not writing attributes or palettes
-Not doing scanline trickery to cheat on vblank time

and yet I can only upload 6 tiles max. before the PPU gets confused and messes up writing (spilling beyond vblank). Any advice on how I can improve my code (below)? Or do I need to convert character data to sequential LDA/STA instructions?

Code:
;Copies 1 to 256 bytes of data sequentially to a specified address in PPU memory
PPUMemory_Copy:   ;Bankswitch manually before use.
   LDA $2002
   LDA PPU_TGTPTR + 1
   STA $2006
   LDA PPU_TGTPTR
   STA $2006      ;Set up $2007 writes to PPU Target Address
   
   LDA #$FF
   SEC
   SBC PPU_ORGPTR
   TAX            ;Calculate number of increments for overflow
   
   LDY #0
.loop
   LDA [PPU_ORGPTR], y
   STA $2007
   INY
   DEX
   BNE .cont
   INC PPU_ORGPTR + 1
.cont
   CPY Copy_Quantity; NESDev note: I'm loading the corresponding address for the label with the value 16*6. If I set it up to copy one more whole tile it glitches.
   BNE .loop
   
   RTS

_________________
This is a block of text that can be added to posts you make. There is a 255 character limit.


Last edited by Punch on Sun Feb 01, 2015 9:58 pm, edited 1 time in total.

Top
 Profile  
 
PostPosted: Sun Feb 01, 2015 3:15 pm 
Online

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6276
Location: Seattle
You have 20 scanlines during vblank. 20 × 341 ÷ 3 = 2273 cycles.
You're currently uploading 96 bytes, or about 23 cycles per byte.

Let's look at your code:
Code:
   LDA [PPU_ORGPTR], y ; (ind),y is slowish—5/6 cy
   STA $2007 ; +4cy = 9/10cy
   INY ; +2cy=11/12cy
   DEX ; +2cy = 13/14cy
   BNE .cont ; +3cy most of the time = 16/17cy
   INC PPU_ORGPTR + 1 ; skipped most of the time
.cont
   CPY Copy_Quantity ; +3cy, assuming Copy_Quantity is in zpg = 19/20cy
   BNE .loop ; +3cy most of the time = 22/23cy.

Yup, that's why.

You'll need a lighter loop to get more bytes uploaded. Align things to pages, don't use multiple iterators, don't expect to be able to use a single general-purpose loop. You probably don't need to have a dedicated partially-unrolled "upload 16 bytes to replace a tile" but if that's your primary use case you may as well.


Top
 Profile  
 
PostPosted: Sun Feb 01, 2015 3:35 pm 
Online
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10046
Location: Rio de Janeiro - Brazil
Punch wrote:
and yet I can only upload 6 tiles max.

Well, look at all the branches and index register updates you're doing! Note that the wiki says "moderately unrolled VRAM loading code", and this is not unrolled at all.

The first thing you have to consider is that each tile is 16 bytes, so it's pointless to do all this housekeeping every byte if the transfer will only end on 16-byte boundaries. Unrolling means that you do more work per iteration of the loop, effectively reducing on the amount of branches and comparisons, which add up to a lot of cycles if performed frequently.

Since each tile is 16 bytes, the most obvious optimization you can do is copy 16 bytes during each iteration of the loop, without performing any checks along the way, since you know for sure the transfer won't end before the tile does:

Code:
   lda [PPU_ORGPTR], y
   sta $2007
   iny
   lda [PPU_ORGPTR], y
   sta $2007
   iny
   lda [PPU_ORGPTR], y
   sta $2007
   iny
   (...)
   lda [PPU_ORGPTR], y
   sta $2007
   iny

Any particular reason why you are counting down to the pointer overflow using a separate register? The Y register alone will tell you this when it wraps from 255 to 0, there's no need for extra logic here. This frees X to count tiles. Also, as long as the tiles are properly aligned to 16-byte boundaries in the ROM, you don't need to check for pointer overflows, once every tile is fine:

Code:
   bne .cont
   inc PPU_ORGPTR+1
.cont
   dex
   bne .loop

But honestly, even one INY for each byte will cost you a few tiles. If you can buffer the tiles in RAM first you can copy them to VRAM with:

Code:
   lda BUFFER+0, x
   sta $2007
   lda BUFFER+1, x
   sta $2007
   lda BUFFER+2, x
   sta $2007
   (..)
   lda BUFFER+15, x
   sta $2007

Which is obviously faster.

If you store the tiles interleaved in the ROM (16 1st bytes, then 16 2nd bytes, and so on), you can use indexed addressing to read tiles directly from ROM just as fast, using X to index which tile you're reading:

Code:
   lda Byte00, x
   sta $2007
   lda Byte01, x
   sta $2007
   lda Byte02, x
   sta $2007
   (..)
   lda Byte15, x
   sta $2007

The only problem with this is that you'll only be able to address 256 tiles this way. To overcome this, I started addressing groups of 4 (or more) tiles, rather than individual tiles, so you can address a total of 1024 tiles, which occupy 16384 bytes, conveniently the same size of a PRG-ROM bank in many mappers.


Top
 Profile  
 
PostPosted: Sun Feb 01, 2015 4:02 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19091
Location: NE Indiana, USA (NTSC)
tokumaru wrote:
But honestly, even one INY for each byte will cost you a few tiles. If you can buffer the tiles in RAM first you can copy them to VRAM with:

Code:
   lda BUFFER+0, x
   sta $2007
   lda BUFFER+1, x
   sta $2007
   lda BUFFER+2, x
   sta $2007
   (..)
   lda BUFFER+15, x
   sta $2007

Which is obviously faster.

This is the "moderately unrolled" loop I was referring to. Buffer up to 10 tiles in unused parts of the stack ($0100-$019F) and blast them to VRAM during vblank.


Top
 Profile  
 
PostPosted: Sun Feb 01, 2015 4:06 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
LDA (ptr), Y is the slowest form of LDA, aside from the obscure LDA (ptr,X). For starters, try not accessing data through a pointer. If you need timing reference, see: http://www.6502.org/tutorials/6502opcodes.html#LDA

If you can trade size for speed, you can unroll your loop:
Code:
.repeat 128,I
    LDA block + I
    STA $2007
.endrepeat


You could also partially unroll:
Code:
:
    .repeat 32, I
        LDA block + I, X
        STA $2007
    .endrepeat
    TXA
    CLC
    ADC #32
    TAX
    CPX #128
    BCC :-

(There's some more work you could do to this snippet to optimize, but I'm just trying to demonstrate the partial unroll.)

If your data is static, you could store your CHR data as code rather than bytes to upload:
Code:
.byte $12,$34,$56,$12,$56,$78...

vs.
Code:
LDA #$12
STA $2007
LDA #$34
STA $2007
...

or even use A/X/Y to avoid redundant loads:
Code:
LDA #$12
STA $2007
LDX #$34
STX $2007
LDY #$56
STY $2007
STA $2007
STY $2007
LDA #$78
STA $2007
...

Battletoads actually unpacks its CHR data into RAM like this, and during vblank runs that code from RAM. Obviously this requires a lot more space than just the data bytes, though.

One trick that I like is to store your update on the stack. You temporarily save the stack pointer and then move it to your data, then each byte write looks like:
Code:
PLA
STA $2007

No need to increment an index register, keep track of a pointer, etc. it's inherent in the PLA. Very easy to unroll this as needed.


Top
 Profile  
 
PostPosted: Sun Feb 01, 2015 9:09 pm 
Offline
User avatar

Joined: Sat Jul 12, 2014 3:04 pm
Posts: 936
Such use of stack is, however, slower to execute than the "code-as-data" [that is, code as LDA #imm / STA $2007 as shown], as PLA is 4 cycles; immediate loads, 2. Brain does not have sufficient resources tonight, for some reason I thought my rebuttal was ninja'd by the post I was rebutting.


Top
 Profile  
 
PostPosted: Sun Feb 01, 2015 9:55 pm 
Offline
User avatar

Joined: Sat Feb 16, 2013 11:52 am
Posts: 218
Thanks everyone, I think I'll use the strategy of unrolling 16 byte writes (for each tile) and character data alignment in ROM. Maybe even a RAM buffer (on unused stack space?) to uncompress char data if I'm not too lazy to build one.
I can't afford to waste much space (specially with all the program ROM I'm going to use for tile storage -- I'm building an adventure game and each screen can have 144 unique tiles max (!). Poor design? I don't know really but I'm not proficient in art design for tiled games (or any kind of art, for that matter) so I need to follow up with this plan unless I want everything in my game blocky and uglier.


tokumaru wrote:
The only problem with this is that you'll only be able to address 256 tiles this way. To overcome this, I started addressing groups of 4 (or more) tiles, rather than individual tiles, so you can address a total of 1024 tiles, which occupy 16384 bytes, conveniently the same size of a PRG-ROM bank in many mappers.


How exactly? I don't understand how to address groups of 4, do you do something like this below or is it more involved than that?

Code:
$8000  $8100  $8200  $8300
A1        A2     A3     A4
B1        B2     B3     B4
------------
LDA $8000, x
STA $2007
LDA $8100, x
STA $2007
LDA $8200, x
(...)

_________________
This is a block of text that can be added to posts you make. There is a 255 character limit.


Top
 Profile  
 
PostPosted: Mon Feb 02, 2015 5:00 am 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
Myask wrote:
Such use of stack is, however, slower to execute than the "code-as-data" [that is, code as LDA #imm / STA $2007 as shown], as PLA is 4 cycles; immediate loads, 2.


Well, if you want to compare them by speed alone, LDA immediate is the fastest possible, yes. It's a tradeoff of size (either ROM or RAM) for speed though.

The PLA method is reasonably fast, and makes conservative use of RAM. A partially unrolled LDA abs, X is similar in speed to using PLA, but the code required is slightly larger.

The LDA (indirect), Y proposed at the beginning of this thread is more or less the slowest naive method of doing this.


Punch: using striped data like that is an alternative way to unroll your loop. It makes incrementing the index register simpler, at least. Outside the vblank there is probably some increased comlexity somewhere to organize the data (ideally: done offline by a tool).


Top
 Profile  
 
PostPosted: Mon Feb 02, 2015 6:34 am 
Online
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10046
Location: Rio de Janeiro - Brazil
rainwarrior wrote:
Well, if you want to compare them by speed alone, LDA immediate is the fastest possible, yes. It's a tradeoff of size (either ROM or RAM) for speed though.

I considered using the immediate method, in ROM, for tiles that need to be updated constantly, like the main character and animated level objects. This is an incredible waste of ROM though (each byte expands to 5 bytes!), so this is only worth it if you have lots of space available. I started coding a tool to minimize this expansion a little, by using all 3 registers as a small dictionary of recently used values and using INX, DEX, INY, DEY, ASL, LSR, ROR and ROL instead of loading new values whenever possible, but never implemented most of the ideas. The hard part is to figure out which register is the best one for loading a new value to, considering transformations it will go through in the future and how soon the old value will be used again.

Quote:
The PLA method is reasonably fast, and makes conservative use of RAM. A partially unrolled LDA abs, X is similar in speed to using PLA, but the code required is slightly larger.

Unrolled PLAs are interesting because they use less ROM, and when you have it completely unrolled, you can easily jump in the middle of the code to copy variable amounts of bytes, if you ever need that.

Another place you can put the buffer in is zero page, and as long as you use ZP addressing (meaning the code has to be completely unrolled) you can transfer each byte in 7 cycles, slightly faster than the 8 cycles of the PLA or absolute indexed methods.

Quote:
Outside the vblank there is probably some increased comlexity somewhere to organize the data (ideally: done offline by a tool).

Yes, the complexity is all about converting the CHR data from linear to interleaved, which should be done by a little script before the ROM is assembled, so this doesn't have any negative impact on the NES program itself. The main advantage of this method is that you don't have to waste CPU time or RAM preparing the data and buffering, you can blast it straight from ROM to VRAM without any speed penalty.

Quote:
How exactly? I don't understand how to address groups of 4, do you do something like this below or is it more involved than that?

Code:
$8000  $8100  $8200  $8300
A1        A2     A3     A4
B1        B2     B3     B4
------------
LDA $8000, x
STA $2007
LDA $8100, x
STA $2007
LDA $8200, x
(...)

Something like this, but I'm not sure what A1, A2, etc. mean in your diagram. Let me draw a diagram of what I have in mind:

Code:
G = INDEX OF THE GROUP;
B = INDEX OF THE BYTE WITHIN THE GROUP;

$8000: G$00 B$00, G$01 B$00, G$02 B$00, G$03 B$00, G$04 B$00 (...) G$FE B$00, G$FF B$00
$8100: G$00 B$01, G$01 B$01, G$02 B$01, G$03 B$01, G$04 B$01 (...) G$FE B$01, G$FF B$01
$8200: G$00 B$03, G$01 B$03, G$02 B$03, G$03 B$03, G$04 B$03 (...) G$FE B$03, G$FF B$03
(...)
$BD00: G$00 B$3D, G$01 B$3D, G$02 B$3D, G$03 B$3D, G$04 B$3D (...) G$FE B$3D, G$FF B$3D
$BE00: G$00 B$3E, G$01 B$3E, G$02 B$3E, G$03 B$3E, G$04 B$3E (...) G$FE B$3E, G$FF B$3E
$BF00: G$00 B$3F, G$01 B$3F, G$02 B$3F, G$03 B$3F, G$04 B$3F (...) G$FE B$3F, G$FF B$3F

Since each tile is 16 bytes, a group of 4 would be 64 bytes, so you'd need an unrolled loop that transfers that many bytes (this code would use 384 bytes of ROM). The code would look like this:

Code:
   ldx GroupIndex
   ldy GroupCount

TransferBlock:

   lda $8000, x
   sta $2007
   lda $8100, x
   sta $2007
   lda $8200, x
   sta $2007
   (...)
   lda $BE00, x
   sta $2007
   lda $BF00, x
   sta $2007
   
   dey
   beq Done
   inx
   jmp TransferBlock

Done:

Yes, it does use quite a bit of ROM, but it's the only way I'm aware of to avoid the (slow) indirect indexed addressing and still copy the data straight from ROM to VRAM, not wasting CPU time on buffering. If you can program using any language on the PC you can make a tool to interleave the CHR data in 5 minutes.

Each iteration of this loop takes 521 cycles to execute, so you can safely copy 4 blocks of 4 tiles = 16 tiles each frame if you're doing nothing else during VBlank.


Top
 Profile  
 
PostPosted: Mon Feb 02, 2015 7:19 pm 
Offline

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
I don't think I saw it mentioned, but if the tile/sprite data has repeated bytes (at least 2), then you could get a small speed up with just a redundant STA (no need to re-read or load, etc). That, and possibly prime X and/or Y with the most constant used byte for that data segment to upload, that way it doesn't need to be just sequential repeats. It's only a small optimization, but it's something (I thinking more along the lines of generating the code for this during active display, so it wouldn't apply for existing rom code, unrolled or not).

_________________
__________________________
http://pcedev.wordpress.com


Top
 Profile  
 
PostPosted: Mon Feb 02, 2015 7:48 pm 
Online
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10046
Location: Rio de Janeiro - Brazil
tomaitheous wrote:
I don't think I saw it mentioned, but if the tile/sprite data has repeated bytes (at least 2), then you could get a small speed up with just a redundant STA (no need to re-read or load, etc). That, and possibly prime X and/or Y with the most constant used byte for that data segment to upload, that way it doesn't need to be just sequential repeats.

I did mention a tool I was coding to perform such optimizations on code that would be stored in ROM.

I wouldn't try something like this in realtime though, because the time spent on analyzing the data would be too disproportional (e.g. to know what the most common values are you'd have to scan all the data beforehand) to the little time you'd gain with the optimizations. And you can't even guarantee that there will be successful optimizations every time, so you can't count on that extra time anyway. An offline tool on the other hand could easily count the cycles as it generates code, so you'd always know if a particular block of CHR could be updated in the time you have available.

If you really need this little bit of extra time, it would make much more sense to blank some scanlines and get some extra VBlank time.


Top
 Profile  
 
PostPosted: Tue Feb 03, 2015 12:50 am 
Offline
User avatar

Joined: Sat Jul 12, 2014 3:04 pm
Posts: 936
tokumaru wrote:
Yes, it does use quite a bit of ROM
Or RAM, as Rainwarrior pointed out. Unrolled loops are pretty trivial to make loops to write into (W)RAM, after all. You don't have to STORE CHR data as code in the ROM itself, just have it so for when you're going to send it. Granted, generating the three-register-lookback version would be harder than just dropping the LDA #xx STA $2007 (A9xx8D0720) over and over...but having the unrolled loop with N iterations of that and deciding where to jump in rather than having loop-control means you'd only have to write the data each update frame, not rewrite the code. (Inserting PPU address-changes would likewise be relatively easy, just changing a 6 to a 7 as necessary...but you'd need to remember to change it back for the sake of the next update.)
tomaitheous wrote:
I don't think I saw it mentioned, but if the tile/sprite data has repeated bytes (at least 2), then you could get a small speed up with just a redundant STA
it was:
rainwarrior wrote:
Battletoads actually unpacks its CHR data into RAM like this

Including register re-stores and use of A, X, and Y?


Top
 Profile  
 
PostPosted: Tue Feb 03, 2015 4:46 am 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
No, from what I recall Battletoads just has a block of LDA imm / STA $2007 in RAM, and updates the immediate values as needed. If you want to know the details, though, it's easy to just take a look in a debugger.


Top
 Profile  
 
PostPosted: Tue Feb 03, 2015 7:19 am 
Online
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10046
Location: Rio de Janeiro - Brazil
Myask wrote:
tokumaru wrote:
Yes, it does use quite a bit of ROM
Or RAM, as Rainwarrior pointed out.

Yes, that has been discussed before. The comment you quoted however was about the specific technique I was detailing (interleaved tiles accessed with absolute indexed addressing), which allows copying data straight from ROM to VRAM, which has the advantage of not stealing time from the game logic in order to buffer the values.

rainwarrior wrote:
No, from what I recall Battletoads just has a block of LDA imm / STA $2007 in RAM, and updates the immediate values as needed.

I didn't know Battletoads did this. I never expected a game without extra RAM to do this, since even a small block of 8 tiles (128 bytes) would expand to 640 bytes, over 30% of the RAM total (2KB).


Top
 Profile  
 
PostPosted: Wed Feb 04, 2015 7:15 am 
Offline

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
tokumaru wrote:
tomaitheous wrote:
I don't think I saw it mentioned, but if the tile/sprite data has repeated bytes (at least 2), then you could get a small speed up with just a redundant STA (no need to re-read or load, etc). That, and possibly prime X and/or Y with the most constant used byte for that data segment to upload, that way it doesn't need to be just sequential repeats.

I did mention a tool I was coding to perform such optimizations on code that would be stored in ROM.

I wouldn't try something like this in realtime though, because the time spent on analyzing the data would be too disproportional (e.g. to know what the most common values are you'd have to scan all the data beforehand) to the little time you'd gain with the optimizations.


No, definitely not realtime. The cell, or rather a chunk of data, would be processed external of runtime. That gives the added benefit of compression as well, but slight possible bloat as well (1 extra bit per byte, but really it's dependent on how you implement it). But yeah, predetermined at or before assemble time.



Quote:
And you can't even guarantee that there will be successful optimizations every time, so you can't count on that extra time anyway. An offline tool on the other hand could easily count the cycles as it generates code, so you'd always know if a particular block of CHR could be updated in the time you have available.


Exactly. You would know what the cycle count would be for that chunk of data to transfer. Yeah, it might be variable from chunk to chunk, but that doesn't mean you can't optimize for such results. The level could be designed in such a way that only certain objects or tile cells can be uploaded per frame, so you design the level around these factors. And knowing the cycle counts ahead of time, allows you figure that into transfer window of time, as well as the build window (active display area).



Quote:
If you really need this little bit of extra time, it would make much more sense to blank some scanlines and get some extra VBlank time.

Nothing says you can't combine both methods. It's a more extreme optimization, but like all level of optimizations - they have their set of limitations; maybe not great for every situation, but specifically advantageous when applicable.

_________________
__________________________
http://pcedev.wordpress.com


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 18 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: Google [Bot], pinkpuff, strat, tokumaru and 5 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group