All times are UTC - 7 hours



Posted: Mon Jun 27, 2016 2:46 am

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
Possibly a duplicate post; I'm sure I'm not the first person to figure this out. This is, as far as I can tell, the fastest way to write to all the PPU registers in an NMI handler.

This code requires that your shadow PPU registers reside within a single 256-byte page in low RAM, in the same order as the physical PPU registers and without padding, so that each 16-bit load picks up two adjacent registers. It clobbers A, X, Y and D. It skips the OAM/VRAM/CGRAM address and data ports for obvious reasons, and COLDATA because it needs special handling (setting all three color channels can take up to three writes to the same address). If some registers are irrelevant to your game (e.g. BG4HOFS/BG4VOFS if you never use Mode 0), or if you always use HDMA to write to them (the M7 registers and window coordinates are likely candidates), then omit them.
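
For illustration, one layout that meets these requirements could be declared as below. This is only a sketch in ca65 syntax: the page address $0200 is an arbitrary assumption (any page-aligned address reachable in bank 00 would do), and the offsets are just one way to pack the registers the routine touches. The scroll and Mode 7 shadows take two bytes each because those registers are written twice.

Code:
; hypothetical layout; the address $0200 is an assumption, not taken from the routine
shadowPPUpage = $0200
shadowINIDISP = shadowPPUpage + $00
shadowOBJSEL  = shadowPPUpage + $01 ; directly after INIDISP, so one 16-bit load gets both
shadowBGMODE  = shadowPPUpage + $02
shadowMOSAIC  = shadowPPUpage + $03
shadowBG1SC   = shadowPPUpage + $04
shadowBG2SC   = shadowPPUpage + $05
shadowBG3SC   = shadowPPUpage + $06
shadowBG4SC   = shadowPPUpage + $07
shadowBG12NBA = shadowPPUpage + $08
shadowBG34NBA = shadowPPUpage + $09
shadowBG1HOFS = shadowPPUpage + $0A ; 2 bytes each from here: low byte, then high byte
shadowBG1VOFS = shadowPPUpage + $0C
shadowBG2HOFS = shadowPPUpage + $0E
shadowBG2VOFS = shadowPPUpage + $10
shadowBG3HOFS = shadowPPUpage + $12
shadowBG3VOFS = shadowPPUpage + $14
shadowBG4HOFS = shadowPPUpage + $16
shadowBG4VOFS = shadowPPUpage + $18
shadowM7SEL   = shadowPPUpage + $1A ; 1 byte
shadowM7A     = shadowPPUpage + $1B ; 2 bytes each
shadowM7B     = shadowPPUpage + $1D
shadowM7C     = shadowPPUpage + $1F
shadowM7D     = shadowPPUpage + $21
shadowM7X     = shadowPPUpage + $23
shadowM7Y     = shadowPPUpage + $25
shadowW12SEL  = shadowPPUpage + $27 ; 1 byte each from here
shadowW34SEL  = shadowPPUpage + $28
shadowWOBJSEL = shadowPPUpage + $29
shadowWH0     = shadowPPUpage + $2A
shadowWH1     = shadowPPUpage + $2B
shadowWH2     = shadowPPUpage + $2C
shadowWH3     = shadowPPUpage + $2D
shadowWBGLOG  = shadowPPUpage + $2E
shadowWOBJLOG = shadowPPUpage + $2F
shadowTM      = shadowPPUpage + $30
shadowTS      = shadowPPUpage + $31
shadowTMW     = shadowPPUpage + $32
shadowTSW     = shadowPPUpage + $33
shadowCGSWSEL = shadowPPUpage + $34
shadowCGADSUB = shadowPPUpage + $35
shadowSETINI  = shadowPPUpage + $36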

Code:
; assume A/X/Y/D have just been stacked and m = x = 0
lda #shadowPPUpage
tcd
ldx #($2100 - shadowPPUpage)
sep #$20
ldy z:<shadowINIDISP ; also gets OBJSEL
sty z:<INIDISP,x
ldy z:<shadowBGMODE  ; also gets MOSAIC
sty z:<BGMODE,x
ldy z:<shadowBG1SC   ; also gets BG2SC
sty z:<BG1SC,x
ldy z:<shadowBG3SC   ; also gets BG4SC
sty z:<BG3SC,x
ldy z:<shadowBG12NBA ; also gets BG34NBA
sty z:<BG12NBA,x

; this unrolled code is smaller than it looks--every instruction is direct page
lda z:<shadowBG1HOFS
sta z:<BG1HOFS,x
lda z:<(shadowBG1HOFS+1)
sta z:<BG1HOFS,x
lda z:<shadowBG1VOFS
sta z:<BG1VOFS,x
lda z:<(shadowBG1VOFS+1)
sta z:<BG1VOFS,x

lda z:<shadowBG2HOFS
sta z:<BG2HOFS,x
lda z:<(shadowBG2HOFS+1)
sta z:<BG2HOFS,x
lda z:<shadowBG2VOFS
sta z:<BG2VOFS,x
lda z:<(shadowBG2VOFS+1)
sta z:<BG2VOFS,x

lda z:<shadowBG3HOFS
sta z:<BG3HOFS,x
lda z:<(shadowBG3HOFS+1)
sta z:<BG3HOFS,x
lda z:<shadowBG3VOFS
sta z:<BG3VOFS,x
lda z:<(shadowBG3VOFS+1)
sta z:<BG3VOFS,x

lda z:<shadowBG4HOFS
sta z:<BG4HOFS,x
lda z:<(shadowBG4HOFS+1)
sta z:<BG4HOFS,x
lda z:<shadowBG4VOFS
sta z:<BG4VOFS,x
lda z:<(shadowBG4VOFS+1)
sta z:<BG4VOFS,x

lda z:<shadowM7SEL
sta z:<M7SEL,x

lda z:<shadowM7A
sta z:<M7A,x
lda z:<(shadowM7A+1)
sta z:<M7A,x
lda z:<shadowM7B
sta z:<M7B,x
lda z:<(shadowM7B+1)
sta z:<M7B,x
lda z:<shadowM7C
sta z:<M7C,x
lda z:<(shadowM7C+1)
sta z:<M7C,x
lda z:<shadowM7D
sta z:<M7D,x
lda z:<(shadowM7D+1)
sta z:<M7D,x
lda z:<shadowM7X
sta z:<M7X,x
lda z:<(shadowM7X+1)
sta z:<M7X,x
lda z:<shadowM7Y
sta z:<M7Y,x
lda z:<(shadowM7Y+1)
sta z:<M7Y,x

ldy z:<shadowW12SEL  ; also gets W34SEL
sty z:<W12SEL,x
lda z:<shadowWOBJSEL
sta z:<WOBJSEL,x
ldy z:<shadowWH0     ; also gets WH1
sty z:<WH0,x
ldy z:<shadowWH2     ; also gets WH3
sty z:<WH2,x
ldy z:<shadowWBGLOG  ; also gets WOBJLOG
sty z:<WBGLOG,x
ldy z:<shadowTM      ; also gets TS
sty z:<TM,x
ldy z:<shadowTMW     ; also gets TSW
sty z:<TMW,x
ldy z:<shadowCGSWSEL ; also gets CGADSUB
sty z:<CGSWSEL,x
lda z:<shadowSETINI
sta z:<SETINI,x


The trick here is that direct page indexed addressing takes the same number of cycles as absolute addressing--but one of the cycles is an IO cycle, so it's two master clocks faster than absolute when executing out of slow memory. Also, it doesn't depend on DB at all.
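
To make the arithmetic concrete, here is the same 8-bit store written both ways, with the cost of each cycle in the comments. The timings assume the usual SNES figures (8 master clocks for a SlowROM fetch, 6 for an internal cycle, 6 for a $21xx register access) and that the low byte of D is zero. INIDISP is defined here only so the fragment stands on its own.

Code:
.p816
.a8
.i16
INIDISP = $2100

; absolute: 3 ROM fetches + 1 register write
;   8 + 8 + 8 + 6 = 30 master clocks from SlowROM
sta a:INIDISP

; direct page indexed, with D + X = $2100:
; 2 ROM fetches + 1 internal cycle + 1 register write
;   8 + 8 + 6 + 6 = 28 master clocks from SlowROM
sta z:<INIDISP,x

From FastROM, where a fetch also takes 6 master clocks, both forms come to 24 master clocks, so the saving only shows up when the code runs from slow memory.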

If your shadow registers are in SA-1 IRAM or BWRAM you can still use this without changing anything. ($2100 - shadowPPUpage) will be negative, but direct page indexed addressing wraps within bank 00 so that's just fine.
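
As a worked example of the wraparound, suppose the shadow page sits at $3000, the start of SA-1 IRAM as seen from the SNES CPU (the exact address is an assumption for illustration). The offset is then negative, and an assembler may reject a negative immediate, so this sketch masks it to 16 bits with ca65's .loword.

Code:
.p816
.a16
.i16
; hypothetical address: shadow page at the start of SA-1 IRAM
shadowPPUpage = $3000
shadowINIDISP = shadowPPUpage + $00
INIDISP       = $2100

lda #shadowPPUpage
tcd                                 ; D = $3000
ldx #.loword($2100 - shadowPPUpage) ; X = $F100, which is -$0F00 as a 16-bit value
sep #$20
.a8
lda z:<shadowINIDISP                ; reads $00:3000
sta z:<INIDISP,x                    ; $3000 + $00 + $F100 = $12100, which wraps to $00:2100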

More generally, if you can spare the 16-bit X register, then you only have to set D once (to the page you access most often, or the one you need to do indirect addressing out of) and you can reach all of bank 00 with direct page indexed addressing. On the SNES, this is handy when you've got DB set to a bank that doesn't contain the MMIO registers. pea page; pld takes a distressing number of cycles and lda #page; tcd clobbers A and might require a rep #$20, but ldx #(page - directpage) is just one fast instruction.
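
A minimal sketch of that general case, assuming D already points at a hypothetical page of game variables ($0300 here, chosen arbitrarily) and DB points at a bank without MMIO. RDNMI ($4210) is used only as a familiar bank 00 register to aim at. The cycle counts in the comments assume 16-bit index registers.

Code:
.p816
.a8
.i16
gameVarPage = $0300 ; hypothetical: wherever D already points
RDNMI       = $4210

; D stays at gameVarPage; only X changes
ldx #.loword($4200 - gameVarPage) ; 3 cycles
lda z:<RDNMI,x                    ; gameVarPage + $10 + X = $00:4210, whatever DB holds

; for comparison, re-pointing D at $4200 instead would cost:
;   pea $4200 / pld   5 + 5 = 10 cycles
;   lda #$4200 / tcd  3 + 2 = 5 cycles, needs 16-bit A and clobbers it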

I'm pretty sure the only way to improve on this is by using self-modifying code in RAM, where the "shadow registers" are the immediate operands of some load instructions.
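
A rough sketch of what that might look like, purely as an illustration and not something measured here. It assumes the routine has been copied to RAM and assembled for its RAM address (that setup is not shown), and that D is set to $2100 before it runs. The game writes new values into the operand bytes of the load instructions, which take the place of the shadow page.

Code:
.p816
.a8
.i16
INIDISP = $2100
BGMODE  = $2105

; assumed to execute from RAM with D = $2100 and 16-bit index registers
ramPPUwrites:
ldx #$0000            ; operand bytes are the shadow copies of INIDISP and OBJSEL
shadowINIDISP = * - 2 ; address of that 16-bit operand
stx z:<INIDISP        ; plain direct page store, no index register needed
ldx #$0000            ; operand bytes are the shadow copies of BGMODE and MOSAIC
shadowBGMODE = * - 2
stx z:<BGMODE
; the remaining register pairs would follow the same pattern

Each pair then costs one immediate load and one direct page store, with no separate read from a shadow page.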


Posted: Mon Jun 27, 2016 11:19 am

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
Actually, it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.

I like this idea of having code that writes to every hardware register. This is something I'd like to do with my game, if I find it necessary.


Posted: Tue Jun 28, 2016 2:39 pm

Joined: Sat Apr 25, 2015 1:47 pm
Posts: 336
Location: FL
psycopathicteen wrote:
Actually, it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.


How would you get this to work with all of the "write twice" registers (BG scrolling, etc) without having to manually re-adjust the stack pointer for each one? You'd also have to do the same thing to avoid touching OAMDATA, VMDATA, CGDATA, etc.


Posted: Tue Jun 28, 2016 4:00 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
Revenant wrote:
psycopathicteen wrote:
Actually, it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.


How would you get this to work with all of the "write twice" registers (BG scrolling, etc) without having to manually re-adjust the stack pointer for each one? You'd also have to do the same thing to avoid touching OAMDATA, VMDATA, CGDATA, etc.


Do the "write twice" registers normally.


Posted: Wed Jun 29, 2016 5:30 pm

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
psycopathicteen wrote:
Actually, it can be sped up further by placing the stack pointer at the end of the $21xx area and pushing the values onto the stack using PEI.


Let's try that and count the cycles:

Code:
; instruction       ROM  RAM  IO/PPU
lda #shadowPPUpage   3    0    0
tcd                  1    0    1
tsc                  1    0    1
ldx #CGADSUB         3    0    0
txs                  1    0    1
pei (shadowCGSWSEL)  2    2    2
pei (shadowTMW)      2    2    2
pei (shadowTM)       2    2    2
pei (shadowWBGLOG)   2    2    2
pei (shadowWH2)      2    2    2
pei (shadowWH0)      2    2    2
pei (shadowW34SEL)   2    2    2
ldx #BG34NBA         3    0    0
txs                  1    0    1
pei (shadowBG12NBA)  2    2    2
pei (shadowBG3SC)    2    2    2
pei (shadowBG1SC)    2    2    2
pei (shadowBGMODE)   2    2    2
ldx #OBJSEL          3    0    0
txs                  1    0    1
pei (shadowINIDISP)  2    2    2
tcs                  1    0    1
ldx #(pagediff)      3    0    0
sep #$20             2    0    1
lda z:<shadowW12SEL  2    1    0
sta z:<W12SEL,x      2    0    2
lda z:<shadowSETINI  2    1    0
sta z:<SETINI,x      2    0    2
; skip scroll/M7
; total             55   26   35


Code:
; instruction       ROM  RAM  IO/PPU
lda #shadowPPUpage   3    0    0
tcd                  1    0    1
ldx #(pagediff)      3    0    0
sep #$20             2    0    1
ldy z:<shadowINIDISP 2    2    0
sty z:<INIDISP,x     2    0    3
ldy z:<shadowBGMODE  2    2    0
sty z:<BGMODE,x      2    0    3
ldy z:<shadowBG1SC   2    2    0
sty z:<BG1SC,x       2    0    3
ldy z:<shadowBG3SC   2    2    0
sty z:<BG3SC,x       2    0    3
ldy z:<shadowBG12NBA 2    2    0
sty z:<BG12NBA,x     2    0    3
; skip scroll/M7
ldy z:<shadowW12SEL  2    2    0
sty z:<W12SEL,x      2    0    3
lda z:<shadowWOBJSEL 2    1    0
sta z:<WOBJSEL,x     2    0    2
ldy z:<shadowWH0     2    2    0
sty z:<WH0,x         2    0    3
ldy z:<shadowWH2     2    2    0
sty z:<WH2,x         2    0    3
ldy z:<shadowWBGLOG  2    2    0
sty z:<WBGLOG,x      2    0    3
ldy z:<shadowTM      2    2    0
sty z:<TM,x          2    0    3
ldy z:<shadowTMW     2    2    0
sty z:<TMW,x         2    0    3
ldy z:<shadowCGSWSEL 2    2    0
sty z:<CGSWSEL,x     2    0    3
lda z:<shadowSETINI  2    1    0
sta z:<SETINI,x      2    0    2

; total             65   26   42


The pei version is 10 bytes shorter and uses 7 fewer IO cycles. It also doesn't touch the Y register, which potentially saves several more cycles if the rest of your NMI handler can avoid both using Y and setting the x flag (which clobbers the high byte of both X and Y).

So you've beaten me, though my idea is still the fastest way to write the scroll registers and the "odd" registers at the ends of ranges (SETINI, and either W12SEL or CGADSUB).

