...hey. Couldn't you set the stack pointer to somewhere in fast RAM if you did that mapping trick? That way all stack operations would be fast, including the irq/rti-related ones. The only slow part would be the vector load.
Code:
[irq] ; 6 fast cycles + 2 slow cycles = 52 master clocks
sep #$20 ; 3 fc = 18 mc
pha ; 3 fc = 18 mc
lda #DMA_length ; 2 fc = 12 mc
sta $4375 ; 4 fc = 24 mc
lda #$80 ; 2 fc = 12 mc
sta $00 ; 3 fc = 18 mc
sta $420B ; 4 fc = 24 mc
lda #$0F ; 2 fc = 12 mc
sta $00 ; 3 fc = 18 mc
lda $4211 ; 4 fc = 24 mc
pla ; 4 fc = 24 mc
rti ; 7 fc = 42 mc
That comes to 298 master clocks, an improvement of 84 master clocks (about 22%) over my original function. Just having a fast stack saves 20 master clocks by itself. Remaining compute time for the scanline is 26% vs. 20% for the original. The only way I can see to speed this up, short of reserving an index register, is to ensure that A is always 8-bit in the main code, eliminating the sep #$20.
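If anyone wants to double-check the tally, here's a rough Python sketch that just re-adds the cycle counts from the listing. The 6 and 8 master clocks per fast/slow cycle are what the comments in the listing imply (3 fc = 18 mc, and 6 fast + 2 slow = 52 mc); the names `FAST_MC`, `SLOW_MC`, `handler` and `master_clocks` are made up for this example.

```python
# Master clocks per CPU cycle, as implied by the listing's own comments.
FAST_MC = 6  # e.g. "3 fc = 18 mc"
SLOW_MC = 8  # entry line: 6 fast + 2 slow = 52 mc

# (instruction, fast cycles, slow cycles), copied from the listing.
handler = [
    ("[irq]",           6, 2),
    ("sep #$20",        3, 0),
    ("pha",             3, 0),
    ("lda #DMA_length", 2, 0),
    ("sta $4375",       4, 0),
    ("lda #$80",        2, 0),
    ("sta $00",         3, 0),
    ("sta $420B",       4, 0),
    ("lda #$0F",        2, 0),
    ("sta $00",         3, 0),
    ("lda $4211",       4, 0),
    ("pla",             4, 0),
    ("rti",             7, 0),
]

def master_clocks(rows):
    """Sum the master clocks for a list of (name, fast, slow) rows."""
    return sum(fast * FAST_MC + slow * SLOW_MC for _, fast, slow in rows)

total = master_clocks(handler)
saved = 84                      # saving quoted in the post
original = total + saved        # implied cost of the original handler
percent = round(100 * saved / original)

print(total)    # 298
print(original) # 382
print(percent)  # 22
```

So the 22% is measured against the original handler's implied 382 master clocks, not against the new 298.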
Now, if you were to reserve both index registers and assume that every wai clobbers the accumulator, you could use all three registers, and there's enough room for a DMA with the same length as the force blank and DMA start values:
Code: Select all
[irq] ; 6 fast cycles + 2 slow cycles = 52 master clocks
stx $4375 ; 4 fc = 24 mc
stx $00 ; 3 fc = 18 mc
stx $420B ; 4 fc = 24 mc
sty $00 ; 3 fc = 18 mc
lda $4211 ; 4 fc = 24 mc
rti ; 7 fc = 42 mc
202 master clocks. That's got to be the theoretical limit... Ugly if not outright impossible to code around, and it leaves you with a pretty skinny active area, but we weren't planning on actually doing this anyway, so hey...
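Same sanity check for the stripped-down handler, as a small Python sketch. Again the 6/8 master clocks per fast/slow cycle come from the listing's comments, the 298 figure is the total quoted for the longer handler, and the names (`minimal_handler`, `previous_total`, and so on) are just placeholders for this example.

```python
# Master clocks per CPU cycle, as implied by the listing's comments.
FAST_MC = 6
SLOW_MC = 8

# (instruction, fast cycles, slow cycles) for the three-register version.
minimal_handler = [
    ("[irq]",     6, 2),
    ("stx $4375", 4, 0),
    ("stx $00",   3, 0),
    ("stx $420B", 4, 0),
    ("sty $00",   3, 0),
    ("lda $4211", 4, 0),
    ("rti",       7, 0),
]

minimal_total = sum(f * FAST_MC + s * SLOW_MC for _, f, s in minimal_handler)

previous_total = 298            # total quoted for the longer handler
extra_saving = previous_total - minimal_total
extra_percent = round(100 * extra_saving / previous_total)

print(minimal_total)  # 202
print(extra_saving)   # 96
print(extra_percent)  # 32
```

That puts the extra saving at 96 master clocks, or roughly a third off the fast-stack version, in exchange for giving up X, Y and A.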
...
I might seriously consider doing this FastRAM thing for my shmup port; there's plenty of room for it in the Super FX map. The combination of HDMA and H-IRQ in my raster engine eats about 3/4 of my S-CPU time, and I think using fast RAM would save 54 master clocks per scanline (I can't get rid of the trampoline because I need to repurpose the IRQ on the fly, but at least it doesn't have to be a long jump). Plus there's the advantage that I'd be able to use high-speed memory for game state - I'd basically be running at 3.58 MHz flat out...