Some basic questions...


Post by 93143 »

Oziphantom wrote: The 65816's superpower is that it can give you perfect IRQs/NMIs with a single clock of delay. So if you are doing horizontal splits, you can pepper your normal code with WAI, and as long as you hit a WAI before the interrupt is due to happen, you will get it with a fixed one-clock slide.
I suppose you could do that (putting wai at regular intervals), but it seems like a pretty restrictive requirement on the face of it, at least once you start writing loops. Easy to screw up too. You might also want to do the whole thing in the NMI so as to retain frame timing control, unless you're very careful about counting scanlines. And even if you did this, you'd still need to use another method to get a perfect split (not that that's difficult) because interrupting wai still has too much variance.

I stand corrected. If you're desperate enough to try this in the first place, the above may not be much of a deterrent.

I wonder how much code you could fit in between interrupts... ideally you'd want the IRQ jump/return, acknowledge, stack ops and DMA setup (you need to at least set the DMA size regardless) to all happen during active display to maximize bandwidth. Let's see... in FastROM, with D = $2100, one could write

	[irq]             ; 2 fast cycles + 6 slow cycles = 60 master clocks
	rep #$20          ; 3 fc = 18 mc
	pha               ; 2 fc + 2 sc = 28 mc
	lda #DMA_length   ; 3 fc = 18 mc
	sta $4375         ; 5 fc = 30 mc
	sep #$20          ; 3 fc = 18 mc
	lda #$80          ; 2 fc = 12 mc
	sta $00           ; 3 fc = 18 mc
	sta $420B         ; 4 fc = 24 mc
	lda #$0F          ; 2 fc = 12 mc
	sta $00           ; 3 fc = 18 mc
	lda $4211         ; 4 fc = 24 mc
	rep #$20          ; 3 fc = 18 mc
	pla               ; 3 fc + 2 sc = 34 mc
	rti               ; 3 fc + 4 sc = 50 mc
for a total of 382 master clocks, 202 of which happen before rendering is turned off and 126 of which happen after it's turned on. Now, you do have to turn on rendering well before the left-hand picture border, otherwise the PPU won't have preloaded the BG information, and it's entirely possible that it needs 16 dots or more, so let's say only 58 of the 126 actually happen during the picture. And for margin, let's say that 4 of the 202 clocks happen off the right-hand side of the picture. This gives us 256 master clocks of IRQ during the display area, or 64 pixels. With a 144-pixel-wide display area, that leaves 80 for computation, minus DRAM refresh (10 dots), minus the wai, leaving you about 20% of the originally-available CPU time to run your main code, which is hobbled by the requirement to stop and wait every dozen instructions or so.
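For anyone who wants to check the tally, here is a rough Python sketch that re-adds the listing. It assumes the usual 6 master clocks per fast cycle and 8 per slow cycle, 4 master clocks per dot, and 341 dots per scanline (the last figure is my assumption, not something stated in the post). The 58-clock and 4-clock overlap figures are the guesses from the paragraph above, not measurements.

```python
# Master clocks per CPU cycle: 6 for "fast" accesses, 8 for "slow" ones.
FAST, SLOW = 6, 8

# (mnemonic, fast cycles, slow cycles), copied from the listing above.
IRQ_ROUTINE = [
    ("[irq]",           2, 6),
    ("rep #$20",        3, 0),
    ("pha",             2, 2),
    ("lda #DMA_length", 3, 0),
    ("sta $4375",       5, 0),
    ("sep #$20",        3, 0),
    ("lda #$80",        2, 0),
    ("sta $00",         3, 0),  # force blank on: rendering stops after this
    ("sta $420B",       4, 0),  # start the DMA
    ("lda #$0F",        2, 0),
    ("sta $00",         3, 0),  # force blank off: rendering resumes after this
    ("lda $4211",       4, 0),  # acknowledge the IRQ
    ("rep #$20",        3, 0),
    ("pla",             3, 2),
    ("rti",             3, 4),
]


def clocks(entry):
    """Master clocks for one (mnemonic, fast, slow) entry."""
    _, fast, slow = entry
    return fast * FAST + slow * SLOW


costs = [clocks(e) for e in IRQ_ROUTINE]
total = sum(costs)
before_blank = sum(costs[:8])    # up to and including the first sta $00
after_unblank = sum(costs[11:])  # everything after the second sta $00

# Guesses from the post: 58 of the trailing clocks and all but 4 of the
# leading clocks overlap the visible picture.
visible_irq_clocks = 58 + (before_blank - 4)
visible_irq_dots = visible_irq_clocks // 4   # 4 master clocks per dot
compute_dots = 144 - visible_irq_dots - 10   # 10 dots lost to DRAM refresh
DOTS_PER_LINE = 341                          # assumed scanline length in dots
cpu_fraction = compute_dots / DOTS_PER_LINE

print(total, before_blank, after_unblank)                      # 382 202 126
print(visible_irq_dots, compute_dots, round(cpu_fraction, 2))  # 64 70 0.21
```

The fraction comes out slightly above 20% because the cost of the wai itself is not subtracted here.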

DMA bandwidth seems to be something like 79 bytes per scanline, or about 17 KB per frame if you don't trim vertically, not counting the normal VBlank time. With a vertical size of 184 lines (again, this is a 144-pixel-wide display) on NTSC, total bandwidth might be enough for full-frame 8bpp at 60 fps, including CGRAM updates (but not OAM of course, because there's no point). Mind you, with what I've done here there's no facility for trimming the bottom of the screen, so you'd have to wait for NMI and the image wouldn't be centered vertically. I think there's enough room to add a line counter and test/branch to the IRQ, but it would take a big bite out of the remaining CPU time...
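As a sanity check on those figures, here is a small Python sketch of the arithmetic. The 79 bytes per line and the 144x184 size come from the paragraph above; the 224-line untrimmed height is my assumption (the standard non-overscan mode). It only shows how much data would still have to go through the extended VBlank; whether that actually fits is not something this sketch settles.

```python
BYTES_PER_LINE = 79        # per-scanline DMA estimate from the post
WIDTH, HEIGHT = 144, 184   # trimmed display size from the post
UNTRIMMED_LINES = 224      # assumed: standard visible height, no vertical trim

untrimmed = BYTES_PER_LINE * UNTRIMMED_LINES  # mid-scanline DMA, full height
in_frame = BYTES_PER_LINE * HEIGHT            # mid-scanline DMA over 184 lines
frame_8bpp = WIDTH * HEIGHT                   # one byte per pixel at 8bpp
left_for_vblank = frame_8bpp - in_frame       # remainder for the longer VBlank

print(untrimmed, round(untrimmed / 1024, 1))  # 17696 17.3
print(in_frame, frame_8bpp, left_for_vblank)  # 14536 26496 11960
```

So roughly 12 KB per frame (plus any CGRAM data) would still need to be moved during VBlank in the 184-line case.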

That's a min/maxed scenario, of course. You can always get more CPU time at the cost of bandwidth. With a DMA size of only one byte per line, you've got nearly 70% of your compute time left, and you don't actually have to reduce the width of the display at all until you get past about 30 bytes per line, or 20 if you don't do the wai trick (okay, that number depends pretty heavily on how much preload time the PPU needs to display the BG layers properly...).

...but yeah, it's not the most elegant way to get extra VRAM bandwidth, even if it works. I'd prefer to try HDMA - see if one could turn off the display with channel 0, transfer data with channels 1-6, and turn the display back on with channel 7. It's only 24 bytes per line (or less if the PPU needs more than 10 or so dots of preload) and it still kills the sprite layer, but it's way more lightweight and requires no special coding techniques.
Señor Ventura wrote: So, you can shorten the scanlines... but could that increase the bandwidth?
Maybe. I haven't tried it. It depends on how fast VRAM unlocks after rendering is turned off, and on how early you have to turn rendering back on for the BG layers to display on time. And as I said earlier, it would almost certainly cause sprites to glitch out or not work at all.

It wouldn't be at all transparent to the programmer. You'd have to explicitly set up and execute a small DMA transfer on every line, in addition to the main transfer(s) during VBlank.

Raster-synchronized video chips are really not suited to this sort of thing. A horizontal split requires only one register write (well, two, because you have to change it back eventually), but a vertical split requires hundreds of them because it has to happen on every scanline.
Last edited by 93143 on Wed May 03, 2017 12:43 pm, edited 2 times in total.

Post by AWJ »

93143 wrote:...but yeah, it's not the most elegant way to get extra VRAM bandwidth, even if it works. I'd prefer to try HDMA - see if one could turn off the display with channel 0, transfer data with channels 1-6, and turn the display back on with channel 7. It's only 24 bytes per line (or less if the PPU needs more than 10 or so dots of preload) and it still kills the sprite layer, but it's way more lightweight and requires no special coding techniques.
If you're using all 8 channels for HDMA, then during VBlank you have to reconfigure one channel to use it for your big in-VBlank DMA, then reconfigure it back before the end of VBlank. That's going to eat up a certain amount of VBlank cycles that could otherwise be included in the bulk DMA.

Also, where is the data you're transferring to VRAM coming from? Best scenario: you have a coprocessor decoding your FMV or whatever into a single buffer at a static address, so you can just configure your 6 HDMA channels for indirect mode with tables that point into the appropriate offsets into that buffer per channel and scanline. Any other scenario (double buffering, etc.) and you need to rewrite 6 channels times ~200 lines worth of HDMA tables every frame.

And all this discussion presumes you're trying to do FMV with the absolute smallest possible letterboxes for a given frame rate, and that you're not trying to write an actual game (which presumably needs sprites, and some CPU cycles to run a game engine).

TL;DR:
Can I divide the screen vertically into three sections, to get active scanlines 144 pixels wide?
Yes.
And so, could it be possible to gain some bandwidth if I proceed like that?
No.

Post by Señor Ventura »

AWJ wrote:
And so, could it be possible to gain some bandwidth if I proceed like that?

No.
That makes sense... the PPU has to reach that point before it can interrupt; it doesn't anticipate anything, so it will always take the same amount of time whether it draws or not (but what about the second interrupt, that is to say, the second black margin?).

Too bad, I had already done some calculations.

Post by 93143 »

AWJ wrote:Also, where is the data you're transferring to VRAM coming from? Best scenario: you have a coprocessor decoding your FMV or whatever into a single buffer at a static address, so you can just configure your 6 HDMA channels for indirect mode with tables that point into the appropriate offsets into that buffer per channel and scanline. Any other scenario (double buffering, etc.) and you need to rewrite 6 channels times ~200 lines worth of HDMA tables every frame.
Can't you use repeat mode? That way you only have to rewrite a dozen or so source addresses. Mind you, any indirect address loading breaks the assumptions under which I derived the preload headroom (wasn't thinking, apparently), so there's that to consider...

You could also just store the data in HDMA format...
TL;DR:
And so, could it be possible to gain some bandwidth if I proceed like that?
No.
...that's probably close enough, given the title of the thread...
Señor Ventura wrote: That makes sense... the PPU has to reach that point before it can interrupt; it doesn't anticipate anything, so it will always take the same amount of time whether it draws or not (but what about the second interrupt, that is to say, the second black margin?).
HBlank and IRQ are two separate things. HDMA will trigger at the beginning of normal HBlank, and can write to registers and CGRAM. An IRQ can be set to trigger at any time, so you can force the screen blank and start a VRAM DMA at any point.
Too bad, I had already done some calculations.
What he's saying is that while it might technically be possible, it's difficult, complicated, a huge pain, generally not worth it for most cases (to my knowledge no one has ever done it), and not something a beginner should be messing with. The mere fact that it completely disables sprites means it's not worth considering for the vast majority of applications even if it does work.

If you want extra VRAM bandwidth, trim the top and bottom, not the sides. Trimming the sides is for reducing the amount of data you need to transfer.

Post by AWJ »

93143 wrote:
AWJ wrote:Also, where is the data you're transferring to VRAM coming from? Best scenario: you have a coprocessor decoding your FMV or whatever into a single buffer at a static address, so you can just configure your 6 HDMA channels for indirect mode with tables that point into the appropriate offsets into that buffer per channel and scanline. Any other scenario (double buffering, etc.) and you need to rewrite 6 channels times ~200 lines worth of HDMA tables every frame.
Can't you use repeat mode? That way you only have to rewrite a dozen or so source addresses. Mind you, any indirect address loading breaks the assumptions under which I derived the preload headroom (wasn't thinking, apparently), so there's that to consider...
Yes, you're right about the tables. But then the data you're transferring via HDMA has to be split into planes or stripes so that it ends up in order in VRAM (the channel 1 table has to contain or point to the first 4 bytes out of every 24, the channel 2 table has to have the second 4 bytes, etc.)
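To make the striping concrete, here is a minimal Python sketch of the reordering a tool (or the decoder) would have to do. The function name and parameters are made up for illustration, and it assumes 6 channels at 4 bytes each per scanline, as in the scheme being discussed; index 0 in the result corresponds to HDMA channel 1.

```python
def split_for_hdma(data: bytes, channels: int = 6, chunk: int = 4) -> list[bytes]:
    """Split scanline-ordered data into one byte stream per HDMA channel.

    Each scanline carries channels * chunk bytes. Stream n receives the
    n-th chunk of every scanline, so the data lands in order in VRAM when
    the channels fire one after another.
    """
    line_size = channels * chunk
    if len(data) % line_size:
        raise ValueError("data must be a whole number of scanlines")
    streams = [bytearray() for _ in range(channels)]
    for line_start in range(0, len(data), line_size):
        for ch in range(channels):
            start = line_start + ch * chunk
            streams[ch] += data[start:start + chunk]
    return [bytes(s) for s in streams]


# Two scanlines of 24 bytes each, numbered 0..47 so the striping is visible.
two_lines = bytes(range(48))
streams = split_for_hdma(two_lines)
print(list(streams[0]))  # [0, 1, 2, 3, 24, 25, 26, 27]
print(list(streams[1]))  # [4, 5, 6, 7, 28, 29, 30, 31]
```

In direct mode each stream would additionally need the HDMA line-count bytes interleaved; this sketch only covers the data ordering.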

HDMA transfers occur at the same x-position for both direct and indirect mode. Indirect HDMA is double-buffered: the transfers for all channels come first, followed by loading any indirect addresses for the next scanline. Because of this you can set up all 8 channels to do 4 bytes each in indirect mode and they'll all fit in HBlank.

Post by 93143 »

AWJ wrote:Indirect HDMA is double-buffered: the transfers for all channels come first, followed by loading any indirect addresses for the next scanline. Because of this you can set up all 8 channels to do 4 bytes each in indirect mode and they'll all fit in HBlank.
Really? That makes it easier to do the HDMA for my shmup port then...

Post by psycopathicteen »

Is there any way to get IRQs to fire straight into a FastROM region?

Post by Optiroc »

psycopathicteen wrote: Is there any way to get IRQs to fire straight into a FastROM region?
The interrupt vectors are 16-bit words pointing into bank $00, so no. One common pattern is to have an indirect long jump at that address, so the vectors can be reconfigured in software and you spend as little time as possible in slow address space.

For "one h-irq per line" schemes like those discussed here you kinda spend too much time with setup either way... Timed "speedcode" seems like a better solution. I'm not sure if DRAM refresh and other factors are entirely deterministic across different consoles though (but then again the PPU revision is known, so it would suffice if the timing is known to be deterministic for all consoles with the same chipset).

Post by lidnariq »

Not really usefully...

Anything mapped from $002000-$003FFF or $004200-$005FFF is FastROM timing, but (almost?) no cartridges put any ROM or RAM there.
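For reference, here is a rough Python sketch of the bank $00 access speeds as I understand the usual memory map; the function name is made up, and the region boundaries other than the two fast ranges named above are my assumption rather than something stated in this thread. It shows why handlers reached through the vectors start out in slow memory while the register ranges are fast.

```python
def bank0_access_clocks(offset: int) -> int:
    """Master clocks per CPU access for an offset within bank $00."""
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset must fit in 16 bits")
    if offset < 0x2000:
        return 8    # WRAM mirror: slow
    if offset < 0x4000:
        return 6    # $2000-$3FFF: fast (PPU/B-bus registers live here)
    if offset < 0x4200:
        return 12   # $4000-$41FF: extra-slow joypad port range
    if offset < 0x6000:
        return 6    # $4200-$5FFF: fast (CPU and DMA registers live here)
    return 8        # $6000-$FFFF: slow, including ROM and the vectors


print(bank0_access_clocks(0x2100))  # 6
print(bank0_access_clocks(0x4375))  # 6
print(bank0_access_clocks(0xFFEE))  # 8
```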

Post by Optiroc »

lidnariq wrote:Anything mapped from $002000-$003FFF or $004200-$005FFF is FastROM timing, but (almost?) no cartridges put any ROM or RAM there.
Ah, true, I didn't think of the I/O ranges in bank zero... So yeah, that'd be awesome to jump to in theory at least.

Post by psycopathicteen »

Yeah, and you can have 3.58 MHz RAM, and put the direct page there. Somebody should make a cartridge like that.

Post by Optiroc »

It should be easy enough to add such a mapping to SD2SNES and bsnes+, so for development work we're covered.

Post by tepples »

psycopathicteen wrote: Yeah, and you can have 3.58 MHz RAM, and put the direct page there. Somebody should make a cartridge like that.
Put the direct page at $4300 and you get 11 bytes for each DMA channel you aren't using.

Post by 93143 »

I have no excuse. I actually do the rewritable JML trick in my game engine prototype, for this exact reason. I just forgot about it.

Interesting ideas about mapping ROM/RAM to a fast area, though...

Post by psycopathicteen »

You can still save cycles by using an 8-bit DMA length, because there are fewer than 256 bytes to DMA and the DMA length registers reset to 0 after a DMA takes place (so the high byte stays zero).