All times are UTC - 7 hours





Posted: Thu Jul 23, 2015 1:12 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2283
In this post, tepples wrote:
"Is this Battletoads?"

Some people are fans of CHR ROM because it allows rapid switching of tiles for smooth animation of the player character. But in Kirby's Adventure, it ends up causing a lot of duplication because all frames of all enemies on screen at once need to fit in the same 2K bank of enemy tiles. So instead, I'm a fan of the Battletoads technique of loading sprite tiles into video memory as they're needed. I've already described how this works on Game Boy Advance, but the NES has far less video memory bandwidth and thus needs a somewhat cleverer technique.

The engine I'm developing for this project has four object slots in video memory: one for the hero and three for enemies. These occupy CHR RAM $1800-$19FF, $1A00-$1BFF, $1C00-$1DFF, and $1E00-$1FFF. Each slot is divided into a pair of 16-tile buffers and is tracked by several variables in main RAM:
  • Current cel: The cel ID currently being displayed in this slot.
  • Next cel: The cel ID whose tile data needs to be loaded into the back buffer of this slot.
  • Current buffer: Whether the slot's first or second buffer is its front buffer.
  • Information about what data has been loaded into each buffer of each slot.
In addition, a set of request flags controls which sprites should be switched to the next cel as soon as they are completely loaded.

On each frame that doesn't have any updates to tiles or map caused by scrolling, the sprite cel loader finds pieces of a cel to load. It prioritizes slots whose request bit is set: if the cel is ready, it switches buffers and clears the request bit; if not, it loads a piece into the VRAM transfer buffer. Up to 8 tiles can be copied in each frame (NTSC without extended blanking). If a particular cel uses all 16 tiles, its update is split across two frames.

If there is still no scheduled VRAM transfer after the loader has processed all request bits, it loads pieces of the next cel speculatively. Speculative loading sets the next cel to the frame most likely to follow a slot's current cel, such as the next cel of a walk cycle. I count about five mispredicts per second on average, usually when an enemy spawns or when the player takes an unpredicted action, such as jumping, stopping a walk, beginning a punch combo, allowing a punch combo to expire, or taking a hit. A mispredict may delay loading a cel for a frame or two, but otherwise, speculative loading puts a cel into VRAM just when it is needed, allowing the player and enemies to be animated at an acceptable frame rate.
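
As a rough illustration of the loader described above, here is a minimal Python model of one loader step per frame. It is only a sketch under assumptions: the names (Slot, loader_step, cel_sizes, predicted) are hypothetical, cel sizes are counted in tiles, and the real engine's bookkeeping is certainly different in detail.

```python
from dataclasses import dataclass, field

TILES_PER_FRAME = 8  # per the post: up to 8 tiles copied per NTSC frame


@dataclass
class Slot:
    current_cel: int            # cel shown from the front buffer
    next_cel: int               # cel wanted in the back buffer
    front: int = 0              # 0 or 1: which buffer is the front buffer
    requested: bool = False     # flip to next_cel as soon as it is loaded
    loaded_cel: list = field(default_factory=lambda: [None, None])
    loaded_tiles: list = field(default_factory=lambda: [0, 0])


def back_ready(slot, cel_sizes):
    """True if the back buffer already holds all tiles of next_cel."""
    back = slot.front ^ 1
    return (slot.loaded_cel[back] == slot.next_cel
            and slot.loaded_tiles[back] == cel_sizes[slot.next_cel])


def load_piece(slot, cel_sizes):
    """Schedule up to 8 more tiles of next_cel; return how many."""
    back = slot.front ^ 1
    if slot.loaded_cel[back] != slot.next_cel:
        slot.loaded_cel[back] = slot.next_cel   # start a fresh load
        slot.loaded_tiles[back] = 0
    remaining = cel_sizes[slot.next_cel] - slot.loaded_tiles[back]
    count = min(TILES_PER_FRAME, remaining)
    slot.loaded_tiles[back] += count
    return count


def loader_step(slots, cel_sizes, predicted):
    """Run one frame of the loader; return (slot index, tiles) or None."""
    # Pass 1: slots whose request flag is set take priority.
    for i, slot in enumerate(slots):
        if not slot.requested:
            continue
        if back_ready(slot, cel_sizes):
            slot.front ^= 1                     # swap front and back
            slot.current_cel = slot.next_cel
            slot.requested = False
        else:
            return i, load_piece(slot, cel_sizes)
    # Pass 2: nothing scheduled, so load the likeliest follow-up cel.
    for i, slot in enumerate(slots):
        slot.next_cel = predicted[slot.current_cel]
        if not back_ready(slot, cel_sizes):
            return i, load_piece(slot, cel_sizes)
    return None
```

In this model a requested 16-tile cel takes two frames to load and is flipped in on the third, at which point the loader starts on the predicted follow-up cel. A mispredict simply means the back buffer holds the wrong cel, so the load restarts from zero.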

The metasprite drawing code uses values $00-$7F normally for constant tiles. It uses $80-$8F for these switchable slots, ORing in the start tile of the current buffer of the slot being drawn.
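
The tile-number translation above can be sketched in a few lines of Python. This assumes sprites are drawn from the pattern table at $1000, so that CHR RAM address $1800 is tile $80; that assumption and the function names are mine, not part of the quoted post.

```python
PATTERN_TABLE_BASE = 0x1000   # assumption: sprites use the second pattern table
BYTES_PER_TILE = 16
SLOT_BASES = (0x1800, 0x1A00, 0x1C00, 0x1E00)  # hero, then three enemies
TILES_PER_BUFFER = 16


def buffer_start_tile(slot, buffer):
    """Tile number of the first tile in one 16-tile buffer of a slot."""
    first = (SLOT_BASES[slot] - PATTERN_TABLE_BASE) // BYTES_PER_TILE
    return first + buffer * TILES_PER_BUFFER


def resolve_tile(metasprite_tile, slot, current_buffer):
    """Map a metasprite tile value to the tile number written to OAM."""
    if metasprite_tile < 0x80:
        return metasprite_tile                    # constant tile, used as-is
    if metasprite_tile > 0x8F:
        raise ValueError("switchable tiles are $80-$8F")
    return metasprite_tile | buffer_start_tile(slot, current_buffer)
```

Because every buffer starts on a multiple of 16 at or above $80, a plain OR is enough: metasprite tile $83 drawn from the second buffer of the first enemy slot becomes $83 OR $B0 = $B3.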


Is the NES really that bad with sprites? I wrote down a quick loading routine and counted the cycles and ended up with:

Code:
-;
lda ({tile_address}),y   // 5 cycles
sta {vram_port}          // 4 (running total 9)
iny                      // 2 (11)
cpy #$10                 // 2 (13)
bne -                    // 3 (16 per byte)


It would take only 2048 cycles to upload 8 tiles, and vblank is more than 4096 cycles long.


Posted: Thu Jul 23, 2015 1:29 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19092
Location: NE Indiana, USA (NTSC)
psycopathicteen wrote:
Is the NES really that bad with sprites? I wrote down a quick loading routine and counted the cycles and ended up with:

Code:
-;
lda ({tile_address}),y   // 5 cycles
sta {vram_port}          // 4 (running total 9)
iny                      // 2 (11)
cpy #$10                 // 2 (13)
bne -                    // 3 (16 per byte)


It would take only 2048 cycles to upload 8 tiles, and vblank is more than 4096 cycles long.

Vblank on NTSC NES is closer to 2270 cycles long because the NES PPU always runs in 240-line mode. That budget also has to cover about 600 cycles of other tasks, such as OAM DMA and setting the scroll position. So the pattern loading routine is unrolled by a factor of 16 and always copies from a buffer in an otherwise unused part of the stack page ($0100-$017F).
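
For anyone who wants to check the budget, here is the arithmetic as a small Python sketch. The 341 dots per scanline, 20 vblank scanlines, and 3 PPU dots per CPU cycle are the usual NTSC figures rather than numbers given in this thread; the 600-cycle overhead is the estimate above, and the helper name is made up.

```python
PPU_DOTS_PER_SCANLINE = 341   # NTSC
VBLANK_SCANLINES = 20         # NTSC
PPU_DOTS_PER_CPU_CYCLE = 3    # NTSC

# About 2273 CPU cycles of vblank per frame.
vblank_cycles = VBLANK_SCANLINES * PPU_DOTS_PER_SCANLINE / PPU_DOTS_PER_CPU_CYCLE

OVERHEAD_CYCLES = 600         # OAM DMA, scroll setup, etc. (estimate from the post)
budget = vblank_cycles - OVERHEAD_CYCLES


def tiles_that_fit(cycles_per_tile):
    """Whole 16-byte tiles that fit in the remaining vblank budget."""
    return int(budget // cycles_per_tile)


# The loop quoted above: 16 cycles per byte, 16 bytes per tile, 8 tiles.
naive_loop_cycles = 16 * 16 * 8

print(round(vblank_cycles), naive_loop_cycles, tiles_that_fit(16 * 16))
```

Under these assumptions the 16-cycles-per-byte loop fits about 6 tiles after overhead, not 8, which is why the copy has to be unrolled.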


Posted: Thu Jul 23, 2015 1:39 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2283
I thought most games used forced blank.


Posted: Thu Jul 23, 2015 1:45 pm

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10047
Location: Rio de Janeiro - Brazil
psycopathicteen wrote:
Code:
-;
lda ({tile_address}),y   // 5 cycles
sta {vram_port}          // 4 (running total 9)
iny                      // 2 (11)
cpy #$10                 // 2 (13)
bne -                    // 3 (16 per byte)

You really shouldn't compare and branch every byte, when you know that each tile is 16 bytes. Unrolling this loop to copy 16 bytes at a time already represents a big speed boost. Still, having to increment Y for every byte and using indirect indexed addressing is too slow for my taste. I'd rather interleave the bytes and use indexed addressing with increasing base addresses in an unrolled loop, or even buffer the tiles in RAM beforehand and copy them to VRAM with an unrolled loop.

Quote:
It would take only 2048 cycles to upload 8 tiles, and vblank is more than 4096 cycles long.

As has been pointed out, your math is a little off. With only 2273 cycles of VBlank, you have to do better than this if you expect to animate objects and update other things, such as backgrounds, palettes and OAM.

psycopathicteen wrote:
I thought most games used forced blank.

Most games don't! The ones that do are usually unlicensed.


Posted: Thu Jul 23, 2015 1:55 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19092
Location: NE Indiana, USA (NTSC)
This is what happens in a typical unrolled tile copy, at 140 cycles per 16-byte tile:
Code:
vram_copybuf = $0100
PPUDATA = $2007

; prep code omitted
; carry is clear at this point
copyloop:
  .repeat 16, I
    lda vram_copybuf+I,x
    sta PPUDATA
  .endrepeat
  txa
  adc #16
  tax
  cpx vram_copylen
  bcc copyloop
; fixup code omitted


The .repeat block in ca65 expands into this:
Code:
  lda $0100,x
  sta $2007
  lda $0101,x
  sta $2007
  lda $0102,x
  sta $2007
  ; ...
  lda $010F,x
  sta $2007
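
The 140-cycle figure for one pass of copyloop can be checked with a quick tally. This sketch assumes vram_copylen lives in zero page (3 cycles for cpx), that the branch is taken without crossing a page, and that the indexed loads never cross a page, which holds for a buffer confined to $0100-$017F.

```python
COPY_PAIR = 4 + 4  # lda abs,x (no page cross) + sta abs

LOOP_OVERHEAD = {
    "txa": 2,
    "adc #16": 2,
    "tax": 2,
    "cpx vram_copylen": 3,  # assumes zero page; absolute addressing costs 4
    "bcc copyloop": 3,      # taken branch, no page crossing
}

cycles_per_tile = 16 * COPY_PAIR + sum(LOOP_OVERHEAD.values())
cycles_for_8_tiles = 8 * cycles_per_tile

print(cycles_per_tile, cycles_for_8_tiles)
```

So 128 of the 140 cycles are the copy itself, and 8 tiles cost roughly 1120 cycles per frame.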


Posted: Thu Jul 23, 2015 2:10 pm

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
If you store your data to upload on the stack and unroll your code, you can easily get 8 cycles per byte:
Code:
.repeat 16
    pla ; 4 cycles
    sta $2007 ; 4 cycles
.endrepeat

If you want to write a generator to unroll and store your tiles as code in ROM, you can get down to 6 cycles or less per byte:
Code:
    lda #$05 ; 2 cycles
    sta $2007 ; 4 cycles
    ldx #$39 ; 2 cycles
    stx $2007 ; 4 cycles
    ldy #$73 ; 2 cycles
    sty $2007 ; 4 cycles
    ...

You can also order the choice of register to make loads redundant (e.g. after lda #$00 you can sta $2007 for many bytes of zeroes), saving 2 more cycles each time. (You probably wouldn't do this in combination with a forced vblank, though, since you'd normally need a consistent cycle count for that.)

You can also dynamically build this code in RAM if you want to save ROM space, at the expense of extra setup time outside of vblank.
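
A generator along these lines could look roughly like the Python sketch below. It is only an illustration: the function name and the least-recently-used register choice are my own assumptions, and it emits plain text in the style of the listing above rather than anything tied to a specific assembler.

```python
def generate_upload(data, port="$2007"):
    """Return (assembly lines, cycle count) that write data to the port."""
    regs = {"a": None, "x": None, "y": None}   # known register contents
    order = ["a", "x", "y"]                    # least recently used first
    lines, cycles = [], 0
    for value in data:
        # Reuse a register that already holds this byte, if any.
        reg = next((r for r in order if regs[r] == value), None)
        if reg is None:
            reg = order[0]                     # evict the least recently used
            regs[reg] = value
            lines.append(f"    ld{reg} #${value:02X}")
            cycles += 2                        # immediate load
        order.remove(reg)
        order.append(reg)                      # mark as most recently used
        lines.append(f"    st{reg} {port}")
        cycles += 4                            # absolute store
    return lines, cycles


lines, cycles = generate_upload([0x05, 0x39, 0x73])
print("\n".join(lines))
print(cycles, "cycles")
```

With three distinct bytes this reproduces the lda/ldx/ldy pattern shown above at 6 cycles per byte. A tile of 16 zero bytes needs a single load, so it comes out at 66 cycles instead of 96.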


As for games that use forced vblank, there are very few. If you have bankable CHR-ROM, there's generally no need for it; it's mostly just for animating tiles with CHR-RAM. Not a lot of games actually did that.


Posted: Thu Jul 23, 2015 2:11 pm

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
Tepples, I hope your carry is clear before that adc #16.


Posted: Thu Jul 23, 2015 2:17 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19092
Location: NE Indiana, USA (NTSC)
rainwarrior wrote:
Tepples, I hope your carry is clear before that adc #16.

The prep code clears it.

And PLA is as slow as LDA a,X.


Posted: Thu Jul 23, 2015 3:14 pm

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10047
Location: Rio de Janeiro - Brazil
rainwarrior wrote:
If you store your data to upload on the stack and unroll your code, you can easily get 8 cycles per byte:

You can also get 8 cycles per byte straight off the ROM if you're OK with copying groups of tiles instead of single tiles, and interleaving the bytes of all the groups (creating structures of arrays, as the 6502 likes it). For example, if using groups of 64 bytes (4 tiles) you could address 16KB of CHR data with an 8-bit index:

Code:
.repeat 64, I
   lda $8000 + I * 256, x
   sta $2007
.endrepeat

This would work well for UNROM for example.
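
The interleaved layout would normally be produced by a build-time tool. A minimal Python sketch, assuming 64-byte groups and up to 256 of them (the function name is hypothetical):

```python
GROUP_SIZE = 64    # 4 tiles of 16 bytes each
NUM_GROUPS = 256   # one group per value of the 8-bit index


def interleave(groups):
    """Lay out groups so byte j of group g sits at offset j * 256 + g."""
    if len(groups) > NUM_GROUPS:
        raise ValueError("an 8-bit index addresses at most 256 groups")
    out = bytearray(GROUP_SIZE * NUM_GROUPS)   # 16 KB, unused entries stay 0
    for g, group in enumerate(groups):
        if len(group) != GROUP_SIZE:
            raise ValueError("each group must be exactly 64 bytes")
        for j, value in enumerate(group):
            out[j * NUM_GROUPS + g] = value
    return bytes(out)
```

With the result placed at $8000, the unrolled copy reads byte j of group X from $8000 + j * 256 + X, which is exactly what each lda in the loop above addresses.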

Quote:
If you want to write a generator to unroll and store your tiles as code in ROM, you can get down to 6 cycles or less per byte:

That's something I considered doing for a handful of animated objects, as well as the main character. Definitely not for all the graphics in a game.

Quote:
You can also dynamically build this code in RAM if you want to save ROM space, at the expense of extra setup time outside of vblank.

I have to say I'm not a fan of spending so much time just preparing data like that.

BTW, I just noticed we've had this conversation before.

Anyway, you know what would've been sweet? If there was an option to select $2004 or $2007 as the target for DMA writes. It wouldn't do much for name table updates (besides allowing a full background update in a single frame), but it would've been a great help for managing CHR-RAM. I know it's silly to think of what could have been... the console is what it is and we must accept its limitations, but wouldn't it be nice if a mapper could add this feature?


Posted: Thu Jul 23, 2015 3:21 pm

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
tokumaru wrote:
Anyway, you know what would've been sweet? If there was an option to select $2004 or $2007 as the target for DMA writes. It wouldn't do much for name table updates (besides allowing a full background update in a single frame), but it would've been a great help for managing CHR-RAM. I know it's silly to think of what could have been... the console is what it is and we must accept its limitations, but wouldn't it be nice if a mapper could add this feature?

Wasn't that basically what the dual WRAM/CHR-RAM mapper idea was for?


Posted: Thu Jul 23, 2015 3:38 pm

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10047
Location: Rio de Janeiro - Brazil
rainwarrior wrote:
Wasn't that basically what the dual WRAM/CHR-RAM mapper idea was for?

That was nice, but way too complicated to implement, IMO. A DMA feature built from the ground up would be complicated too, I know. Being able to reuse the existing DMA functionality but routing writes to $2007 instead would be the really cool thing I think, but that's probably not possible.


Posted: Thu Jul 23, 2015 4:17 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2283
So it could do 16 tiles per frame even without forced blank. So that means that if DKC got ported to the NES, the sprites would be half their size, half the amount, and half the framerate.


Posted: Thu Jul 23, 2015 4:26 pm

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
psycopathicteen wrote:
if DKC got ported to the NES

The NES has bankable CHR-ROM solutions, though. Why not just use that? They probably would have used that on SNES if it was capable.


Posted: Thu Jul 23, 2015 4:37 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19092
Location: NE Indiana, USA (NTSC)
rainwarrior wrote:
psycopathicteen wrote:
if DKC got ported to the NES

The NES has bankable CHR-ROM solutions, though. Why not just use that?

The four windows of MMC3 work for the player and three enemies at once. If there are more independently animated enemies, you have to group enemies into enemy sets and duplicate each enemy's sprite tiles in the tile bank associated with each enemy set in which it appears, as Kirby's Adventure does. This is part of why Teenage Mutant Ninja Turtles II stops the scroll so often, so that the two players never encounter more than two distinct enemy types at once.


Posted: Thu Jul 23, 2015 4:46 pm

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5710
Location: Canada
You could make a mapper that divides it as fine as you need? 16 slots would allow 16 characters with up to 16 tiles each (even though you could only display half of it in any given frame). If your characters aren't overlapping vertically, you could also use the MMC3's scanline counter to multiplex its existing 4 banks.

Also, we're forgetting that Hummer Team already ported Donkey Kong Country to the NES:
https://www.youtube.com/watch?v=fBeD-kEHy3E

(As you might have guessed, it uses 1k CHR-ROM banking.)

