tepples wrote:The secret is that the tank demo's tiles are double buffered, and any blank tile is never written to VRAM.
Yeah, I think that's the big one. Another is that the tank demo buffers each frame as much as possible in main RAM before sending it to CHRRAM. Mine doesn't attempt that, so it has to do a lot of work reading data out of CHRRAM and immediately putting it back whenever it updates a tile again. Since I fully double buffer the whole pattern table, I also have to waste time copying anything still dirty between frames (though I can often pair this with a frame's first update).
I'm theoretically able to push 32 tiles per frame (60 cycles per tile: 12 to set the VRAM address, plus 6 for each of the 8 lines). Bell's is actually somewhat slower than that (93 cycles per tile: 29-ish, plus 8 for each of the 8 lines), and then it still needs to update the name table. All of those updates are managed by nice families of subroutines in ROM, though, whereas my code spends a lot of time generating opcodes in RAM that it then runs straight through during vblank.
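To make the arithmetic above easy to check, here is a rough Python sketch of the per-tile costs. The function and constant names are made up for illustration, and the vblank length (about 2273 CPU cycles on NTSC, from 20 scanlines of 341 PPU dots at 3 dots per CPU cycle) is my own assumption, not a figure measured from either demo. It also ignores any other vblank work such as sprite DMA.

```python
# Hypothetical helper: cycles to upload one tile = fixed setup + cost per line.
LINES_PER_TILE = 8

def tile_cycles(setup, per_line, lines=LINES_PER_TILE):
    return setup + per_line * lines

mine = tile_cycles(setup=12, per_line=6)   # 12 + 6*8
bell = tile_cycles(setup=29, per_line=8)   # 29-ish + 8*8

# Assumption: NTSC vblank is 20 scanlines * 341 dots / 3 dots per CPU cycle.
VBLANK_CYCLES = 20 * 341 // 3

TILES = 32
print("mine:", mine, "cycles/tile,", mine * TILES, "for", TILES, "tiles")
print("bell:", bell, "cycles/tile,", bell * TILES, "for", TILES, "tiles")
print("assumed vblank budget:", VBLANK_CYCLES)
```

With those numbers, 32 tiles at 60 cycles each comes to 1920 cycles, which fits under the assumed budget, while 32 tiles at 93 cycles each would not.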
Where I lose is in making updates: if I have to update every line of a tile, it costs 160 cycles (12 to set the address, 4 for a dummy read, 8 × (4 read + 2 immediate op + 3 ZP write) = 72, 12 to set the address again, 8 × 6 = 48 for the writes, and 12 for a JSR/RTS). I could improve that by getting rid of the JSR, but then I wind up using almost twice as much buffer for it (the ZP writes store into immediate ops in a routine that we then jump to for the write phase), and both buffer space and vblank cycles end up being limiting factors at different times.
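The 160-cycle figure can be tallied phase by phase. This is only a bookkeeping sketch in Python; the phase labels are my own shorthand for the steps listed above, not names from the actual code.

```python
# Read-modify-write cost for one fully updated tile, using the per-phase
# cycle counts quoted in the post. Labels are illustrative only.
LINES = 8

update_phases = [
    ("set VRAM address (read)",  12),
    ("dummy read",                4),
    ("read + immediate op + ZP write, per line", LINES * (4 + 2 + 3)),
    ("set VRAM address (write)", 12),
    ("write back, per line",     LINES * 6),
    ("JSR/RTS",                  12),
]

update_cycles = sum(cycles for _, cycles in update_phases)
plain_upload_cycles = 12 + LINES * 6  # a tile that needs no read-back

for label, cycles in update_phases:
    print(f"{cycles:4d}  {label}")
print(f"{update_cycles:4d}  total for a full update")
print(f"{plain_upload_cycles:4d}  total for a plain upload")
```

So a full update costs 160 cycles against 60 for a plain upload, which is where the read-back approach loses ground.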
Besides the massive difference in skills, it's also a difference in goal. I wanted to get as close as possible to a fully bitmapped display. The tank demo wanted to do 3D wireframes (and a nice BG) and does 'em best. Whenever the framerate starts to slow to a crawl, that is something Bell's engine simply couldn't do with its tile budget. I still haven't read through the source, though, so this is based on watching it in FCEUX's debugger.