Oziphantom wrote:
the 65816's super power is it can give you perfect IRQ/NMIs with a single clock delay. So if you are doing horizontal splits, you can pepper your normal code and as long as you hit a WAI before the interrupt is due to happen you will get it with 1 clock fixed slide.

I suppose you could do that (putting wai at regular intervals), but it seems like a pretty restrictive requirement on the face of it, at least once you start writing loops. Easy to screw up too. You might also want to do the whole thing in the NMI so as to retain frame timing control, unless you're very careful about counting scanlines. And even if you did this, you'd still need to use another method to get a perfect split (not that that's difficult) because interrupting wai still has too much variance.
I stand corrected. If you're desperate enough to try this in the first place, the above may not be much of a deterrent.
I wonder how much code you could fit in between interrupts... ideally you'd want the IRQ jump/return, acknowledge, stack ops and DMA setup (you need to at least set the DMA size regardless) to all happen during active display to maximize bandwidth. Let's see... in FastROM, with D = $2100, one could write
Code:
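[irq]               ; interrupt entry: 2 fast cycles + 6 slow cycles = 60 master clocks
    rep #$20        ; 16-bit A: 3 fc = 18 mc
    pha             ; save A: 2 fc + 2 sc = 28 mc
    lda #DMA_length ; 3 fc = 18 mc
    sta $4375       ; DMA channel 7 byte count (DAS7L/H): 5 fc = 30 mc
    sep #$20        ; 8-bit A: 3 fc = 18 mc
    lda #$80        ; 2 fc = 12 mc
    sta $00         ; $2100 via D: forced blank on: 3 fc = 18 mc
    sta $420B       ; start DMA on channel 7: 4 fc = 24 mc
    lda #$0F        ; 2 fc = 12 mc
    sta $00         ; $2100 via D: display back on, full brightness: 3 fc = 18 mc
    lda $4211       ; acknowledge the IRQ: 4 fc = 24 mc
    rep #$20        ; 16-bit A for the pull: 3 fc = 18 mc
    pla             ; restore A: 3 fc + 2 sc = 34 mc
    rti             ; 3 fc + 4 sc = 50 mc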
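For anyone who wants to check the arithmetic, the per-instruction costs in the listing can be totalled with a few lines of Python. This is only a sanity-check sketch: it assumes 6 master clocks per fast cycle and 8 per slow cycle (which is what the figures in the listing imply), and the table and function names are made up for illustration.

```python
# Master clocks per CPU cycle, as implied by the listing's own figures.
FAST_MC = 6  # FastROM and I/O register accesses
SLOW_MC = 8  # WRAM (stack) and bank $00 vector accesses

# (step, fast cycles, slow cycles), copied from the listing.
IRQ_HANDLER = [
    ("[irq entry]", 2, 6),
    ("rep #$20", 3, 0),
    ("pha", 2, 2),
    ("lda #DMA_length", 3, 0),
    ("sta $4375", 5, 0),
    ("sep #$20", 3, 0),
    ("lda #$80", 2, 0),
    ("sta $00", 3, 0),
    ("sta $420B", 4, 0),
    ("lda #$0F", 2, 0),
    ("sta $00", 3, 0),
    ("lda $4211", 4, 0),
    ("rep #$20", 3, 0),
    ("pla", 3, 2),
    ("rti", 3, 4),
]


def master_clocks(fast, slow):
    """Convert a fast/slow cycle count into master clocks."""
    return fast * FAST_MC + slow * SLOW_MC


def handler_total(rows=IRQ_HANDLER):
    """Total master clocks spent in the handler, excluding the DMA itself."""
    return sum(master_clocks(fast, slow) for _, fast, slow in rows)


if __name__ == "__main__":
    for name, fast, slow in IRQ_HANDLER:
        print(f"{name:<16} {master_clocks(fast, slow):>3} mc")
    print(f"{'total':<16} {handler_total():>3} mc")
```

Summed this way, the handler costs 382 master clocks per interrupt, out of 1364 per scanline, before counting the DMA transfer itself.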
DMA bandwidth seems to be something like 79 bytes per scanline, or about 17 KB per frame if you don't trim vertically, not counting the normal VBlank time. With a vertical size of 184 lines (again, this is a 144-pixel-wide display) on NTSC, total bandwidth might be enough for full-frame 8bpp at 60 fps, including CGRAM updates (but not OAM of course, because there's no point). Mind you, with what I've done here there's no facility for trimming the bottom of the screen, so you'd have to wait for NMI and the image wouldn't be centered vertically. I think there's enough room to add a line counter and test/branch to the IRQ, but it would take a big bite out of the remaining CPU time...
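The per-frame figures can be roughed out like this. It's a back-of-the-envelope sketch, not a measurement: the 79 bytes per line, 184 lines and 144-pixel width come from the post, while the blanking-period DMA rate of 165 bytes per line (1364 master clocks minus 40 for DRAM refresh, at 8 per byte) is my own assumption and ignores DMA setup overhead and whatever is lost around NMI. All the names are invented for the example.

```python
# Figures taken from the post.
ACTIVE_BYTES_PER_LINE = 79   # DMA bytes squeezed into each displayed line
DISPLAY_WIDTH = 144          # pixels, 8bpp so one byte each
DISPLAY_LINES = 184          # vertically trimmed display
FULL_HEIGHT = 224            # untrimmed visible lines

# Assumptions (not from the post).
NTSC_LINES = 262                       # scanlines per NTSC frame
BLANK_BYTES_PER_LINE = (1364 - 40) // 8  # ideal DMA rate while blanked
CGRAM_BYTES = 512                      # 256 colours x 2 bytes


def untrimmed_bytes():
    """Per-line DMA only, with no vertical trimming."""
    return ACTIVE_BYTES_PER_LINE * FULL_HEIGHT


def frame_budget():
    """Per-line DMA on displayed lines plus plain DMA on every blanked line."""
    active = ACTIVE_BYTES_PER_LINE * DISPLAY_LINES
    blanked = BLANK_BYTES_PER_LINE * (NTSC_LINES - DISPLAY_LINES)
    return active + blanked


def frame_need():
    """One full 8bpp frame plus a complete palette update."""
    return DISPLAY_WIDTH * DISPLAY_LINES + CGRAM_BYTES


if __name__ == "__main__":
    print("untrimmed, per-line DMA only:", untrimmed_bytes(), "bytes")
    print("budget with 184 lines:", frame_budget(), "bytes")
    print("needed for 8bpp + CGRAM:", frame_need(), "bytes")
```

Under these assumptions the budget comes to 27,406 bytes against 27,008 needed, so it fits on paper but with only a few hundred bytes of slack.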
That's a min/maxed scenario, of course. You can always get more CPU time at the cost of bandwidth. With a DMA size of only one byte per line, you've got nearly 70% of your compute time left, and you don't actually have to reduce the width of the display at all until you get past about 30 bytes per line, or 20 if you don't do the wai trick (okay, that number depends pretty heavily on how much preload time the PPU needs to display the BG layers properly...).
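To put a rough number on the trade-off, here's a sketch of the CPU time left per scanline as a function of DMA size. It assumes the 382 master clocks of handler overhead from the listing, 8 master clocks per DMA byte, a fixed 16 master clocks of DMA startup overhead (an approximation on my part), and 40 master clocks per line lost to DRAM refresh. It doesn't model the wai itself or whether the transfer fits in the blanked part of the line, and the names are hypothetical.

```python
LINE_MC = 1364      # master clocks per scanline
REFRESH_MC = 40     # CPU is halted for DRAM refresh once per line
HANDLER_MC = 382    # IRQ entry + handler + rti, from the listing
DMA_FIXED_MC = 16   # assumed fixed DMA overhead (approximate)
DMA_BYTE_MC = 8     # master clocks per byte transferred


def cpu_fraction_left(dma_bytes):
    """Fraction of the usable scanline left for normal code."""
    usable = LINE_MC - REFRESH_MC
    spent = HANDLER_MC + DMA_FIXED_MC + DMA_BYTE_MC * dma_bytes
    return max(0.0, (usable - spent) / usable)


if __name__ == "__main__":
    for size in (1, 20, 30, 79):
        print(f"{size:>3} bytes/line: {cpu_fraction_left(size):.1%} CPU left")
```

With a one-byte transfer this model leaves a little over 69% of the line, which lines up with the "nearly 70%" figure; larger transfers eat into that linearly.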
...but yeah, it's not the most elegant way to get extra VRAM bandwidth, even if it works. I'd prefer to try HDMA - see if one could turn off the display with channel 0, transfer data with channels 1-6, and turn the display back on with channel 7. It's only 24 bytes per line (or less if the PPU needs more than 10 or so dots of preload) and it still kills the sprite layer, but it's way more lightweight and requires no special coding techniques.
Señor Ventura wrote:
So, you can shorten the scanlines... but, Could it increase the bandwidth?.

Maybe. I haven't tried it. It depends on how fast VRAM unlocks after rendering is turned off, and on how early you have to turn rendering back on for the BG layers to display on time. And as I said earlier, it would almost certainly cause sprites to glitch out or not work at all.
It wouldn't be at all transparent to the programmer. You'd have to explicitly set up and execute a small DMA transfer on every line, in addition to the main transfer(s) during VBlank.
Raster-synchronized video chips are really not suited to this sort of thing. A horizontal split requires only one register write (well, two, because you have to change it back eventually), but a vertical split requires hundreds of them because it has to happen on every scanline.