You should use DMA whenever it's technically possible, but not every situation can be handled with DMA. DMA is substantially faster than code, by around a factor of... 8? 10? I don't know (refs: #1 but all of these talk about clocks, which is not the same thing as CPU cycles). The reason you can't use DMA in your particular case is that you want the source address to advance by a larger stride between reads -- SNES DMA can't do that; it can only step the source address by one byte per byte transferred (or hold it fixed). What most people end up doing is putting the bytes they want written to PPU RAM (at whatever increment) into WRAM, linearly, and then using DMA on that buffer.
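A rough sketch of that gather-then-DMA approach, in the same WLA-style syntax as the rest of this post. The staging buffer at $7F1000, the use of DMA channel 0, and the 32-entry column size are assumptions of mine for illustration; $xxxx is a placeholder to fill in yourself. Treat it as untested.

```asm
; Step 1: any time outside VBlank, gather the strided words into a
; linear staging buffer. Assumed layout (hypothetical): source tilemap
; copy at $7F0000 with $40 bytes per row, buffer at $7F1000.
        phb                 ; save caller's data bank
        sep #$20            ; 8-bit A
        lda #$7f
        pha
        plb                 ; DB = $7F so absolute addressing reaches bank $7F
        rep #$30            ; 16-bit A, X, Y
        ldx #0              ; X = source offset, steps by $40
        ldy #0              ; Y = buffer offset, steps by 2
gather:
        lda.w $0000,x       ; read one tilemap word from $7F0000+X
        sta.w $1000,y       ; store it linearly at $7F1000+Y
        txa
        clc
        adc #$0040
        tax
        iny
        iny
        cpy #64             ; 32 words = 64 bytes (one column)
        bne gather
        plb                 ; restore caller's data bank

; Step 2: during VBlank (or forced blank), DMA the buffer to PPU RAM.
        sep #$20            ; 8-bit A
        lda #%10000001      ; $2115: increment after $2119 write, step 32 words
        sta.l $002115
        rep #$20            ; 16-bit A
        lda #$xxxx          ; PPU RAM word address of the column top; fill in yourself
        sta.l $002116
        lda #$1801          ; $4300 = $01 (CPU->PPU, write $2118 then $2119)
        sta.l $004300       ; $4301 = $18 (B-bus destination $2118)
        lda #$1000
        sta.l $004302       ; $4302/$4303 = source address within the bank
        lda #64
        sta.l $004305       ; $4305/$4306 = number of bytes to transfer
        sep #$20            ; 8-bit A
        lda #$7f
        sta.l $004304       ; $4304 = source bank
        lda #%00000001
        sta.l $00420b       ; start DMA on channel 0
```

The gather loop costs about as much CPU time as a direct copy loop, but it runs outside VBlank where time is plentiful, so only the short DMA setup and the transfer itself eat into VBlank. A is left 8-bit at the end.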
Here are cycle counts for a CPU loop that copies one tilemap column from WRAM into PPU RAM (with the PPU RAM address stepping by 32 words per write). Don't just use this; please read everything I've written. I haven't done SNES code in ~20 years so I may have parts of this wrong (e.g. bits of $2115, PPU RAM layout for tilemap, etc.). Cut me some slack please.
Code:
sep #$20 ; 3 cycles
; $2115 bit 7 = %1 = increment PPU RAM address on write to $2119 (low byte @ $2118, high byte @ $2119)
; $2115 bit 1,0 = %01 = increment PPU RAM address by 32 words after each write, i.e. step down one row of a 32-wide tilemap, so consecutive writes fill one column
lda #%10000001 ; 2 cycles
sta.l $002115 ; 5 cycles
rep #$30 ; 3 cycles
; XXXX = PPU RAM address of tilemap start; fill in yourself
lda #$xxxx ; 3 cycles
sta.l $002116 ; 6 cycles
ldx #0 ; 3 cycles
loop:
lda.l $7f0000,x ; 6 cycles
sta.l $002118 ; 6 cycles
txa ; 2 cycles
clc ; 2 cycles
adc #$40 ; 3 cycles
tax ; 2 cycles
cpx #$800 ; 3 cycles
bne loop ; 3 cycles if branching, 2 cycles if not
The initial setup (everything up to and including ldx #0) takes 25 cycles.
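Each loop iteration (writing 2 bytes to PPU RAM) takes 27 cycles, including the cost of the branch being taken; the final iteration, where the branch isn't taken, takes 26. With cpx #$800 and X stepping by $40, the loop as written runs $800/$40 = 32 times, i.e. it moves 64 bytes (one 32-entry column): 27*31 = 837, plus 26 = 863 cycles for the loop, or 863+25 = 888 cycles for everything you see above. A 128 byte transfer would need 64 iterations: 27*63 = 1701, plus 26 = 1727 cycles for the loop, or 1727+25 = 1752 cycles in total. (Edit: I suspect I may be off by 1 somewhere, as I had to edit my code due to forgetting you can't do stx long.)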
This is a "slow but safe" routine. It can be optimised in several different ways -- examples: not using long addressing when writing to $2118 (will only work in mode 20/LoROM), setting DB=$7F and then using absolute addressing for WRAM reads, switching DB=$00 and using absolute addressing for $2118/2119 writes, doing something like lda #$2100 / tcd / sta $18 (to write to $2118), unrolling the loop entirely + not using X indexing at all since the $7fxxxx addresses can be pre-calculated (this has the most savings but at the cost of ROM space), etc...
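As one illustration of those options, here is a sketch of the direct page variant (lda #$2100 / tcd / sta $18). It assumes the $2115/$2116 setup from the routine above has already been done; the label name loop_dp is mine. I haven't assembled or timed this, so treat it as a starting point only.

```asm
        rep #$30            ; 16-bit A, X, Y
        phd                 ; save the caller's direct page register
        lda #$2100
        tcd                 ; D = $2100, so direct page offset $18 is $002118
        ldx #0
loop_dp:
        lda.l $7f0000,x     ; read one tilemap word from WRAM (unchanged)
        sta.b $18           ; 16-bit direct page write -> $2118/$2119
        txa
        clc
        adc #$0040          ; advance to the same column in the next row
        tax
        cpx #$0800
        bne loop_dp
        pld                 ; restore the caller's direct page register
```

With 16-bit A and the low byte of D equal to zero, sta direct page should take 4 cycles instead of the 6 for sta long, so this saves about 2 cycles per pass in exchange for the phd/tcd/pld overhead. Direct page accesses always go to bank $00, so this does not depend on DB. Anything that runs while D is changed (an interrupt handler, for example) must not assume D=$0000.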
I forget how much time there is in NMI/VBlank on the SNES, but I imagine it's only a bit more than this.