Fastest possible PPU register shadowing?

AWJ
Posts: 433
Joined: Mon Nov 10, 2008 3:09 pm

Fastest possible PPU register shadowing?

Post by AWJ »

Possibly a duplicate post; I'm sure I'm not the first person to figure this out. This is, as far as I can tell, the fastest way to write to all the PPU registers in an NMI handler.

This code requires that your shadow PPU registers reside within a single 256-byte page in low RAM, in the same order as the physical PPU registers and without padding. It clobbers A, X, Y and D. It skips the OAM/VRAM/CGRAM address/data ports for obvious reasons, and COLDATA because it is special: setting all three color channels takes up to three writes to the same address. If some registers are irrelevant to your game (e.g. BG4HOFS/BG4VOFS if you never use Mode 0), or if you always use HDMA to write to them (the M7 registers and window coordinates are likely candidates), then omit them.
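One way to sanity-check a layout that meets these requirements is a quick script. The Python sketch below is only an illustration: the `span`, `layout` and `pair_ok` helpers are made up, the addresses are the standard PPU register map, and it assumes one shadow byte per plain register and two per write-twice register, as the listing that follows does.

```python
def span(first_addr, names, size=1):
    """Consecutive PPU registers starting at first_addr, each with a `size`-byte shadow."""
    return [(name, first_addr + i, size) for i, name in enumerate(names)]

# Shadowed registers in hardware order (CGWSEL is spelled CGSWSEL in the listing).
REGS = (
    span(0x2100, ["INIDISP", "OBJSEL"])
    + span(0x2105, ["BGMODE", "MOSAIC", "BG1SC", "BG2SC", "BG3SC", "BG4SC",
                    "BG12NBA", "BG34NBA"])
    + span(0x210D, ["BG1HOFS", "BG1VOFS", "BG2HOFS", "BG2VOFS",
                    "BG3HOFS", "BG3VOFS", "BG4HOFS", "BG4VOFS"], size=2)
    + span(0x211A, ["M7SEL"])
    + span(0x211B, ["M7A", "M7B", "M7C", "M7D", "M7X", "M7Y"], size=2)
    + span(0x2123, ["W12SEL", "W34SEL", "WOBJSEL", "WH0", "WH1", "WH2", "WH3",
                    "WBGLOG", "WOBJLOG", "TM", "TS", "TMW", "TSW",
                    "CGWSEL", "CGADSUB"])
    + span(0x2133, ["SETINI"])
)

def layout(regs):
    """Assign consecutive offsets within the shadow page, with no padding."""
    offsets, next_free = {}, 0
    for name, _addr, size in regs:
        offsets[name] = next_free
        next_free += size
    return offsets, next_free

offsets, used = layout(REGS)
addr = {name: a for name, a, _size in REGS}

# Register pairs that the routine moves with a single 16-bit ldy/sty.
PAIRS = [("INIDISP", "OBJSEL"), ("BGMODE", "MOSAIC"), ("BG1SC", "BG2SC"),
         ("BG3SC", "BG4SC"), ("BG12NBA", "BG34NBA"), ("W12SEL", "W34SEL"),
         ("WH0", "WH1"), ("WH2", "WH3"), ("WBGLOG", "WOBJLOG"),
         ("TM", "TS"), ("TMW", "TSW"), ("CGWSEL", "CGADSUB")]

def pair_ok(lo, hi):
    """A 16-bit copy only works if the registers and their shadows are both adjacent."""
    return addr[hi] == addr[lo] + 1 and offsets[hi] == offsets[lo] + 1

print(used, all(pair_ok(lo, hi) for lo, hi in PAIRS))
```

With every register from the listing shadowed, this layout comes to 55 bytes, so it fits in one page with plenty of room to spare.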

Code:

; assume A/X/Y/D have just been stacked and m = x = 0
lda #shadowPPUpage
tcd
ldx #($2100 - shadowPPUpage)
sep #$20
ldy z:<shadowINIDISP ; also gets OBJSEL
sty z:<INIDISP,x
ldy z:<shadowBGMODE  ; also gets MOSAIC
sty z:<BGMODE,x
ldy z:<shadowBG1SC   ; also gets BG2SC
sty z:<BG1SC,x
ldy z:<shadowBG3SC   ; also gets BG4SC
sty z:<BG3SC,x
ldy z:<shadowBG12NBA ; also gets BG34NBA
sty z:<BG12NBA,x

; this unrolled code is smaller than it looks--every instruction is direct page
lda z:<shadowBG1HOFS
sta z:<BG1HOFS,x
lda z:<(shadowBG1HOFS+1)
sta z:<BG1HOFS,x
lda z:<shadowBG1VOFS
sta z:<BG1VOFS,x
lda z:<(shadowBG1VOFS+1)
sta z:<BG1VOFS,x

lda z:<shadowBG2HOFS
sta z:<BG2HOFS,x
lda z:<(shadowBG2HOFS+1)
sta z:<BG2HOFS,x
lda z:<shadowBG2VOFS
sta z:<BG2VOFS,x
lda z:<(shadowBG2VOFS+1)
sta z:<BG2VOFS,x

lda z:<shadowBG3HOFS
sta z:<BG3HOFS,x
lda z:<(shadowBG3HOFS+1)
sta z:<BG3HOFS,x
lda z:<shadowBG3VOFS
sta z:<BG3VOFS,x
lda z:<(shadowBG3VOFS+1)
sta z:<BG3VOFS,x

lda z:<shadowBG4HOFS
sta z:<BG4HOFS,x
lda z:<(shadowBG4HOFS+1)
sta z:<BG4HOFS,x
lda z:<shadowBG4VOFS
sta z:<BG4VOFS,x
lda z:<(shadowBG4VOFS+1)
sta z:<BG4VOFS,x

lda z:<shadowM7SEL
sta z:<M7SEL,x

lda z:<shadowM7A
sta z:<M7A,x
lda z:<(shadowM7A+1)
sta z:<M7A,x
lda z:<shadowM7B
sta z:<M7B,x
lda z:<(shadowM7B+1)
sta z:<M7B,x
lda z:<shadowM7C
sta z:<M7C,x
lda z:<(shadowM7C+1)
sta z:<M7C,x
lda z:<shadowM7D
sta z:<M7D,x
lda z:<(shadowM7D+1)
sta z:<M7D,x
lda z:<shadowM7X
sta z:<M7X,x
lda z:<(shadowM7X+1)
sta z:<M7X,x
lda z:<shadowM7Y
sta z:<M7Y,x
lda z:<(shadowM7Y+1)
sta z:<M7Y,x

ldy z:<shadowW12SEL  ; also gets W34SEL
sty z:<W12SEL,x
lda z:<shadowWOBJSEL
sta z:<WOBJSEL,x
ldy z:<shadowWH0     ; also gets WH1
sty z:<WH0,x
ldy z:<shadowWH2     ; also gets WH3
sty z:<WH2,x
ldy z:<shadowWBGLOG  ; also gets WOBJLOG
sty z:<WBGLOG,x
ldy z:<shadowTM      ; also gets TS
sty z:<TM,x
ldy z:<shadowTMW     ; also gets TSW
sty z:<TMW,x
ldy z:<shadowCGSWSEL ; also gets CGADSUB
sty z:<CGSWSEL,x
lda z:<shadowSETINI
sta z:<SETINI,x

The trick here is that direct page indexed addressing takes the same number of cycles as absolute addressing (as long as the low byte of D is zero, which the page-aligned shadow page guarantees), but one of those cycles is an internal IO cycle rather than an operand fetch, so it's two master clocks faster than absolute when executing out of slow memory. It also doesn't depend on DB at all.
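To make the arithmetic concrete, here is a small Python sketch. It assumes the usual SNES access speeds (8 master clocks per fetch from SlowROM, 6 per internal cycle or $21xx register write); the function and constant names are made up for illustration.

```python
# Master clocks per CPU cycle, by what the cycle touches (assumed SNES timings).
SLOW_ROM = 8   # opcode/operand fetch from SlowROM
IO = 6         # internal cycle, no bus access
PPU_REG = 6    # write to a $21xx register

def master_clocks(fetch_bytes, io_cycles, reg_writes, rom_speed=SLOW_ROM):
    """Total master clocks for one instruction, given its cycle breakdown."""
    return fetch_bytes * rom_speed + io_cycles * IO + reg_writes * PPU_REG

# 8-bit stores; both take 4 CPU cycles.
sta_abs = master_clocks(fetch_bytes=3, io_cycles=0, reg_writes=1)  # sta $2100
sta_dpx = master_clocks(fetch_bytes=2, io_cycles=1, reg_writes=1)  # sta <$00,x with DL = 0

print(sta_abs, sta_dpx, sta_abs - sta_dpx)
```

Under the same assumptions but with FastROM (6 clocks per fetch) the two stores come out equal, which is why the saving only applies to slow memory.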

If your shadow registers are in SA-1 IRAM or BWRAM you can still use this without changing anything. ($2100 - shadowPPUpage) will be negative, but direct page indexed addressing wraps within bank 00 so that's just fine.
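A quick Python check of that wraparound, as a sketch only: the shadow page address $3000 is a hypothetical example of a page above $2100, and `dp_indexed_address` is an invented helper that models native-mode direct page indexed addressing.

```python
def dp_indexed_address(d, offset, x):
    """Effective address of `op <offset,x` in native mode: always in bank 00."""
    return (d + offset + x) & 0xFFFF

SHADOW_PAGE = 0x3000                  # hypothetical shadow page above $2100
x = (0x2100 - SHADOW_PAGE) & 0xFFFF   # what ldx #($2100 - shadowPPUpage) loads

print(hex(x))
print(hex(dp_indexed_address(SHADOW_PAGE, 0x00, x)))  # <INIDISP,x
print(hex(dp_indexed_address(SHADOW_PAGE, 0x33, x)))  # <SETINI,x
```

The "negative" index is just $F100 as a 16-bit value, and the sum carries out of bit 15 and lands back on $2100 and $2133.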

More generally, if you can spare the 16-bit X register, then you only have to set D once (to the page you access most often, or the one you need to do indirect addressing out of) and you can reach all of bank 00 with direct page indexed addressing. On the SNES, this is handy when you've got DB set to a bank that doesn't contain the MMIO registers. pea page; pld takes a distressing number of cycles and lda #page; tcd clobbers A and might require a rep #$20, but ldx #(page - directpage) is just one fast instruction.

I'm pretty sure the only way to improve on this is by using self-modifying code in RAM, where the "shadow registers" are the immediate operands of some load instructions.
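How much that would gain depends on where the normal version runs from. A rough Python estimate follows, assuming 8 master clocks per WRAM or SlowROM access and 6 per FastROM access, internal cycle or $21xx write. The function names are invented, and the cost of jumping into RAM and back is ignored.

```python
WRAM = 8       # code or data in work RAM
IO = 6         # internal cycle
PPU_REG = 6    # write to a $21xx register

def shadow_copy(rom_speed):
    """lda <shadow / sta <reg,x with 8-bit A, code running from ROM."""
    lda = 2 * rom_speed + WRAM           # 2 fetches + 1 shadow read
    sta = 2 * rom_speed + IO + PPU_REG   # 2 fetches + internal cycle + write
    return lda + sta

def self_modifying_copy():
    """lda #imm / sta <reg,x with 8-bit A, code running from WRAM."""
    lda = 2 * WRAM                       # opcode + immediate operand
    sta = 2 * WRAM + IO + PPU_REG
    return lda + sta

print(shadow_copy(8), shadow_copy(6), self_modifying_copy())
```

Under these assumptions the RAM-resident version saves 8 master clocks per byte over SlowROM, but only ties with the ordinary version running from FastROM.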
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Fastest possible PPU register shadowing?

Post by psycopathicteen »

Actually it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.
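The stack pointer goes at the end of the range because pei pushes the high byte of the word first, at S, then the low byte at S-1. Here is a small Python model of that; the `pei` function and the shadow offsets are invented for the example.

```python
def pei(regs, s, shadow_page, offset):
    """Model `pei (offset)`: push the 16-bit word at shadow_page[offset], high byte first."""
    lo, hi = shadow_page[offset], shadow_page[offset + 1]
    regs[s] = hi          # first push lands at S
    regs[s - 1] = lo      # second push lands at S-1
    return s - 2          # S after the instruction

# Hypothetical shadow bytes for TMW, TSW, CGWSEL, CGADSUB at offsets 0..3.
shadow_page = [0x11, 0x22, 0x33, 0x44]
regs = {}

s = 0x2131                          # CGADSUB, the last register of the run
s = pei(regs, s, shadow_page, 2)    # writes CGADSUB ($2131), then CGWSEL ($2130)
s = pei(regs, s, shadow_page, 0)    # writes TSW ($212F), then TMW ($212E)

for address in sorted(regs):
    print(hex(address), hex(regs[address]))
```

Each register still receives the byte from its own shadow. Only the order of the writes is reversed, which does not matter for registers that take a single write.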

I like this idea of having a code that writes to every hardware register. This is something I'd like to do with my game, if I find it necessary.
Revenant
Posts: 462
Joined: Sat Apr 25, 2015 1:47 pm
Location: FL

Re: Fastest possible PPU register shadowing?

Post by Revenant »

psycopathicteen wrote: Actually it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.
How would you get this to work with all of the "write twice" registers (BG scrolling, etc) without having to manually re-adjust the stack pointer for each one? You'd also have to do the same thing to avoid touching OAMDATA, VMDATA, CGDATA, etc.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Fastest possible PPU register shadowing?

Post by psycopathicteen »

Revenant wrote:
psycopathicteen wrote: Actually it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.
How would you get this to work with all of the "write twice" registers (BG scrolling, etc) without having to manually re-adjust the stack pointer for each one? You'd also have to do the same thing to avoid touching OAMDATA, VMDATA, CGDATA, etc.
Do the "write twice" registers normally.
AWJ
Posts: 433
Joined: Mon Nov 10, 2008 3:09 pm

Re: Fastest possible PPU register shadowing?

Post by AWJ »

psycopathicteen wrote: Actually it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.
Let's try that and count the cycles:

Code:

; CPU cycles per instruction, split by what each cycle accesses
; instruction       ROM  RAM  IO/PPU
lda #shadowPPUpage   3    0    0
tcd                  1    0    1
tsc                  1    0    1
ldx #CGADSUB         3    0    0
txs                  1    0    1
pei (shadowCGSWSEL)  2    2    2
pei (shadowTMW)      2    2    2
pei (shadowTM)       2    2    2
pei (shadowWBGLOG)   2    2    2
pei (shadowWH2)      2    2    2
pei (shadowWH0)      2    2    2
pei (shadowW34SEL)   2    2    2
ldx #BG34NBA         3    0    0
txs                  1    0    1
pei (shadowBG12NBA)  2    2    2
pei (shadowBG3SC)    2    2    2
pei (shadowBG1SC)    2    2    2
pei (shadowBGMODE)   2    2    2
ldx #OBJSEL          3    0    0
txs                  1    0    1
pei (shadowINIDISP)  2    2    2
tcs                  1    0    1
ldx #(pagediff)      3    0    0
sep #$20             2    0    1
lda z:<shadowW12SEL  2    1    0
sta z:<W12SEL,x      2    0    2
lda z:<shadowSETINI  2    1    0
sta z:<SETINI,x      2    0    2
; skip scroll/M7
; total             55   26   35

For comparison, the same count for my original ldy/sty version:

Code:

; instruction       ROM  RAM  IO/PPU
lda #shadowPPUpage   3    0    0
tcd                  1    0    1
ldx #(pagediff)      3    0    0
sep #$20             2    0    1
ldy z:<shadowINIDISP 2    2    0
sty z:<INIDISP,x     2    0    3
ldy z:<shadowBGMODE  2    2    0
sty z:<BGMODE,x      2    0    3
ldy z:<shadowBG1SC   2    2    0
sty z:<BG1SC,x       2    0    3
ldy z:<shadowBG3SC   2    2    0
sty z:<BG3SC,x       2    0    3
ldy z:<shadowBG12NBA 2    2    0
sty z:<BG12NBA,x     2    0    3
; skip scroll/M7
ldy z:<shadowW12SEL  2    2    0
sty z:<W12SEL,x      2    0    3
lda z:<shadowWOBJSEL 2    1    0
sta z:<WOBJSEL,x     2    0    2
ldy z:<shadowWH0     2    2    0
sty z:<WH0,x         2    0    3
ldy z:<shadowWH2     2    2    0
sty z:<WH2,x         2    0    3
ldy z:<shadowWBGLOG  2    2    0
sty z:<WBGLOG,x      2    0    3
ldy z:<shadowTM      2    2    0
sty z:<TM,x          2    0    3
ldy z:<shadowTMW     2    2    0
sty z:<TMW,x         2    0    3
ldy z:<shadowCGSWSEL 2    2    0
sty z:<CGSWSEL,x     2    0    3
lda z:<shadowSETINI  2    1    0
sta z:<SETINI,x      2    0    2

; total             65   26   42

The pei version is 10 bytes shorter and takes 7 fewer IO cycles. It also doesn't touch the Y register, which potentially saves several more cycles if the rest of your NMI handler can avoid either using Y or setting the X flag (which clobbers the high byte of both X and Y).

So you've beaten me, though my idea is still the fastest way to write the scroll registers and the "odd" registers at the ends of ranges (SETINI, and either W12SEL or CGADSUB).
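To turn the two column totals into master clocks, here is a quick Python sketch. It assumes the shadows live in WRAM (8 clocks per access), that IO cycles and $21xx writes take 6 clocks each, and that ROM fetches take 8 (SlowROM) or 6 (FastROM). The names are made up for illustration.

```python
SLOW, FAST = 8, 6   # master clocks per access

def total_clocks(rom, ram, io, rom_speed):
    """Master clocks for a routine, given its ROM / RAM / IO-PPU cycle totals."""
    return rom * rom_speed + ram * SLOW + io * FAST

pei_version = (55, 26, 35)       # totals from the first table
ldy_sty_version = (65, 26, 42)   # totals from the second table

for name, speed in (("SlowROM", SLOW), ("FastROM", FAST)):
    a = total_clocks(*pei_version, speed)
    b = total_clocks(*ldy_sty_version, speed)
    print(name, a, b, b - a)
```

Under these assumptions the pei version comes out 122 master clocks ahead in SlowROM and 102 ahead in FastROM, not counting the scroll and M7 registers that both versions skip.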