Some basic questions...

Discussion of hardware and software development for Super NES and Super Famicom. See the SNESdev wiki for more information.

93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Some basic questions...

Post by 93143 »

psycopathicteen wrote: You can still save cycles by using an 8-bit DMA length, because there are fewer than 256 bytes to DMA and the DMA length registers reset to 0 after a DMA takes place.
Yeah, apparently I was in too much of a hurry. That saves... what, about 64 master cycles? Put it in a fast area in bank 0 and you've got 25% CPU time left instead of 20%. Even using a trampoline only adds back 32 cycles.

Well, I was in the ballpark...
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Some basic questions...

Post by psycopathicteen »

You can do a long jump to itself, if your assembler is set up to always long jump into the fastROM area:
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Some basic questions...

Post by 93143 »

Oh, hang on - the long jump can be in an unused DMA area. Make that 24 master cycles, not 32.

My game can't do that because all 8 channels are in use for HDMA, but there's no reason this scheme couldn't.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Some basic questions...

Post by 93143 »

...hey. Couldn't you set the stack pointer to somewhere in fast RAM if you did that mapping trick? That way all stack operations would be fast, including the irq/rti-related ones. The only slow part would be the vector load.

	[irq]             ; 6 fast cycles + 2 slow cycles = 52 master clocks
	sep #$20          ; 3 fc = 18 mc
	pha               ; 3 fc = 18 mc
	lda #DMA_length   ; 2 fc = 12 mc
	sta $4375         ; 4 fc = 24 mc
	lda #$80          ; 2 fc = 12 mc
	sta $00           ; 3 fc = 18 mc
	sta $420B         ; 4 fc = 24 mc
	lda #$0F          ; 2 fc = 12 mc
	sta $00           ; 3 fc = 18 mc
	lda $4211         ; 4 fc = 24 mc
	pla               ; 4 fc = 24 mc
	rti               ; 7 fc = 42 mc
298 master clocks, for an improvement over my original function of 84 master clocks or about 22%. Just having a fast stack saves 20 master clocks by itself. Remaining compute time for the scanline is 26% vs. 20% for the original. The only way I can see to speed this up, short of reserving an index register, is to ensure that A is always 8-bit in the main code, eliminating the sep #$20.
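
As a cross-check of that total, here is a minimal Python sketch that sums the listing. The cycle counts are copied from the comments in the listing; the names (`IRQ_HANDLER`, `master_clocks`) are only illustrative, and the 6 and 8 master clocks per fast and slow cycle are the figures used in this thread.

```python
FAST_MC = 6  # master clocks per fast CPU cycle
SLOW_MC = 8  # master clocks per slow CPU cycle

# (instruction, fast cycles, slow cycles), copied from the listing's comments
IRQ_HANDLER = [
    ("[irq]",           6, 2),
    ("sep #$20",        3, 0),
    ("pha",             3, 0),
    ("lda #DMA_length", 2, 0),
    ("sta $4375",       4, 0),
    ("lda #$80",        2, 0),
    ("sta $00",         3, 0),
    ("sta $420B",       4, 0),
    ("lda #$0F",        2, 0),
    ("sta $00",         3, 0),
    ("lda $4211",       4, 0),
    ("pla",             4, 0),
    ("rti",             7, 0),
]

def master_clocks(listing):
    """Total master clocks for a list of (name, fast, slow) entries."""
    return sum(fast * FAST_MC + slow * SLOW_MC for _, fast, slow in listing)

total = master_clocks(IRQ_HANDLER)
original = total + 84  # the post states a saving of 84 master clocks
print(total)                          # 298
print(round(84 / original * 100))     # 22 (percent saved)
```

The implied cost of the original handler (382 master clocks) is derived here from the stated saving, not quoted from the earlier post.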

Now, if you were to reserve both index registers and assume that every wai clobbers the accumulator, you could use all three registers, and there's enough room for a DMA with the same length as the force blank and DMA start values:

	[irq]             ; 6 fast cycles + 2 slow cycles = 52 master clocks
	stx $4375         ; 4 fc = 24 mc
	stx $00           ; 3 fc = 18 mc
	stx $420B         ; 4 fc = 24 mc
	sty $00           ; 3 fc = 18 mc
	lda $4211         ; 4 fc = 24 mc
	rti               ; 7 fc = 42 mc
202 master clocks. That's got to be the theoretical limit... Ugly if not outright impossible to code around and leaves you with a pretty skinny active area, but we weren't planning on actually doing this anyway, so hey...
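
The same kind of tally works for the register-reserving version. This is a small Python sketch, again using the cycle counts from the listing's comments; `MINIMAL_HANDLER` is a made-up name, and 298 is the total quoted for the previous listing.

```python
FAST_MC = 6  # master clocks per fast CPU cycle
SLOW_MC = 8  # master clocks per slow CPU cycle

# (instruction, fast cycles, slow cycles), copied from the listing's comments
MINIMAL_HANDLER = [
    ("[irq]",     6, 2),
    ("stx $4375", 4, 0),
    ("stx $00",   3, 0),
    ("stx $420B", 4, 0),
    ("sty $00",   3, 0),
    ("lda $4211", 4, 0),
    ("rti",       7, 0),
]

minimal_total = sum(f * FAST_MC + s * SLOW_MC for _, f, s in MINIMAL_HANDLER)
saved_vs_previous = 298 - minimal_total  # 298 is the previous listing's total

print(minimal_total)      # 202
print(saved_vs_previous)  # 96
```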

...

I might seriously consider doing this FastRAM thing for my shmup port; there's plenty of room for it in the Super FX map. The combination of HDMA and H-IRQ in my raster engine eats about 3/4 of my S-CPU time, and I think using fast RAM would save 54 master clocks per scanline (I can't get rid of the trampoline because I need to repurpose the IRQ on the fly, but at least it doesn't have to be a long jump). Plus there's the advantage that I'd be able to use high-speed memory for game state - I'd basically be running at 3.58 MHz flat out...
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Some basic questions...

Post by psycopathicteen »

You can probably have 16kB of RAM from $6000-$7fff, and mirror the top 7kB into $4400-$5fff, so that you have a power of 2 amount. Just having a lot of RAM in bank $00 would speed up code a lot, because you wouldn't need long addressing as much, and you could use the DP and SP more, and PEA and PEI more.

With fast RAM, you can also make use of self-modifying code.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Some basic questions...

Post by 93143 »

Perhaps I could get rid of the trampoline in my game, at least in the H-IRQ, by overwriting the beginning of the H-IRQ routine with a jump when I want one of the HV-IRQs. And I think I can avoid needing multiple versions of the H-IRQ during a frame by simply rewriting a branch. (Oh wait - there are two of that branch because of the stagger-step. It should still fit...)

Maybe I should figure out whether I'm actually in need of more compute time first. I mean, the display engine does work as matters stand...
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

One more thing:
Nicole wrote:CPU cycles are different. Depending on the area of memory accessed, one CPU cycle = 6 master cycles
So, if the PPUs have 325996 master cycles, then at 3.58 MHz the CPU has 325996/6 CPU cycles?
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Some basic questions...

Post by tepples »

Señor Ventura wrote: So, if the PPUs have 325996 master cycles, then at 3.58 MHz the CPU has 325996/6 CPU cycles?
Your number 325996 appears to be 1364 (the total number of qpels per scanline) times 239 (the number of active picture scanlines per frame with tall display turned on).

But in practice, you won't get exactly that many cycles for several reasons:
  1. Refresh
    The CPU takes a 40-qpel break every scanline to refresh a row of RAM. This means you have 1324, not 1364, qpels per scanline.
  2. HDMA
    The CPU is paused during HDMA, if you're using that. This accounts for 8 qpels per byte, plus some overhead that I'm not sure of.
  3. Slow RAM
    Every RAM read or write incurs a 2-qpel wait state, extending the CPU cycle from 6 to 8 qpels. In particular, every stack access (PE*, PH*, PL*, JSR/RTS, BRK/RTI, and d,S mode), every access through direct page (d, d,X, and d,Y modes), and every access through a pointer ((d), [d], (d),Y, [d],Y, (d,S),Y, and (d,X) modes) will incur a wait state unless you're doing something fairly tricky, such as putting direct page in unused DMA registers (LDA #$4300 TCD).
  4. NMI finishing during vblank
    Your NMI handler is likely to finish its chores before the end of vertical blanking, giving you a few extra scanlines prior to active picture to get work done.
  5. Underscan
    Most games on an NTSC system turn off the tall display bit, making active picture 224, not 239, scanlines long.
Realistically, you might get 1324/7 = 189.1 CPU cycles per scanline while PBR:PC is in fast ROM, or 45205 during a tall active picture. This is still much greater than 113.7 on the NES though.
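
The numbers in this post can be reproduced with a short Python sketch. The constant names are illustrative. The average of 7 master cycles per CPU cycle is the rough mix of 6-cycle and 8-cycle accesses assumed above, and the NES figure is derived here from 341 PPU dots per scanline at 3 dots per CPU cycle (NTSC), which is not stated in the post itself.

```python
MASTER_PER_SCANLINE = 1364   # qpels (master cycles) per scanline
REFRESH_PER_SCANLINE = 40    # DRAM refresh pause per scanline
TALL_ACTIVE_LINES = 239      # active scanlines with tall display on
AVG_MASTER_PER_CPU = 7       # assumed average of fast (6) and slow (8) cycles

master_per_tall_picture = MASTER_PER_SCANLINE * TALL_ACTIVE_LINES
usable_per_scanline = MASTER_PER_SCANLINE - REFRESH_PER_SCANLINE
cpu_per_scanline = usable_per_scanline / AVG_MASTER_PER_CPU
cpu_per_tall_picture = cpu_per_scanline * TALL_ACTIVE_LINES
nes_cpu_per_scanline = 341 / 3  # NTSC NES, for comparison

print(master_per_tall_picture)           # 325996
print(usable_per_scanline)               # 1324
print(round(cpu_per_scanline, 1))        # 189.1
print(int(cpu_per_tall_picture))         # 45205
print(round(nes_cpu_per_scanline, 1))    # 113.7
```

HDMA time and the extra scanlines gained when NMI finishes early are left out, since the post gives no firm numbers for them.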
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

Thank you so much, that is what I was searching for :D

Besides, the CPU doesn't always run at 3.58 MHz, so during the moments within a frame when it runs at 2.68 MHz or 1.79 MHz, DMA adjusts its speed too, right? (If it didn't, the maximum peak would be about 8.43 KB.)

Anyway, all this time everybody believed that the SNES only had 5.72 KB, but it is possible that it may have as much as 7.18 KB, or even a bit more...
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Some basic questions...

Post by psycopathicteen »

DMA always runs at 2.68 MHz, even with fast ROM.
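
For reference, the three speeds mentioned in this thread all come from dividing the same master clock. A minimal Python sketch, assuming the NTSC master clock of about 21.477 MHz (a figure not quoted in this thread) and the 6, 8 and 12 master cycles per access that correspond to those speeds:

```python
MASTER_HZ = 21_477_272  # NTSC master clock in Hz (assumed, approximately)

def mhz(master_cycles_per_access):
    """Access rate in MHz for a given number of master cycles per access."""
    return MASTER_HZ / master_cycles_per_access / 1e6

fast_cpu = mhz(6)    # fast ROM and fast I/O accesses
slow_cpu = mhz(8)    # slow accesses, and also the DMA rate (8 cycles per byte)
xslow_cpu = mhz(12)  # the slowest access speed

print(round(fast_cpu, 2))   # 3.58
print(round(slow_cpu, 2))   # 2.68
print(round(xslow_cpu, 2))  # 1.79
```

DMA moves one byte per 8 master cycles regardless of where the CPU's code is running, which is why its rate matches the 2.68 MHz figure.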
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

psycopathicteen wrote: DMA always runs at 2.68 MHz, even with fast ROM.
How is that possible? What makes the DMA unit run at 2.68 MHz when the CPU runs at 3.58 MHz?
creaothceann
Posts: 611
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany

Re: Some basic questions...

Post by creaothceann »

The 65816 is halted while the 5A22 performs the DMA.
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

creaothceann wrote:The 65816 is halted while the 5A22 performs the DMA.
Are you sure? If I use the CPU for every cycle of a frame, I'm leaving no room for the DMA to work.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Some basic questions...

Post by HihiDanni »

Exactly. CPU code execution cuts into potential DMA time, and DMA cuts into potential CPU time. The more time you spend in your NMI routine, the less time you have for computing everything else in the following frame. But if you're out of CPU time in one frame you're probably not going to be doing any DMA during the following NMI anyway.
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

HihiDanni wrote:Exactly. CPU code execution cuts into potential DMA time, and DMA cuts into potential CPU time. The more time you spend in your NMI routine, the less time you have for computing everything else in the following frame. But if you're out of CPU time in one frame you're probably not going to be doing any DMA during the following NMI anyway.
I thought DMA managed to transfer 6.14 KB per frame because it runs throughout the whole frame while the CPU works alongside it.

So how long does DMA need to be active within a frame to transfer those 6.14 KB?
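
A rough way to estimate that, as a Python sketch under stated assumptions: DMA moves one byte per 8 master cycles (the 2.68 MHz rate mentioned earlier in the thread), a scanline is 1364 master cycles with 40 of them lost to refresh, 1 KB is taken as 1024 bytes, and per-transfer setup overhead is ignored. The function name `dma_scanlines` is made up for illustration.

```python
MASTER_PER_BYTE = 8          # DMA cost per byte, in master cycles
MASTER_PER_SCANLINE = 1364   # master cycles per scanline
REFRESH_PER_SCANLINE = 40    # refresh pause (assumed to stall DMA as well)

def dma_scanlines(num_bytes):
    """Scanlines of uninterrupted DMA needed to move num_bytes."""
    usable = MASTER_PER_SCANLINE - REFRESH_PER_SCANLINE
    return num_bytes * MASTER_PER_BYTE / usable

bytes_per_frame = 6.14 * 1024                 # assumes 1 KB = 1024 bytes
dma_master_cycles = bytes_per_frame * MASTER_PER_BYTE

print(round(dma_master_cycles))                  # 50299
print(round(dma_scanlines(bytes_per_frame), 1))  # 38.0
```

Under these assumptions the transfer occupies roughly 38 scanlines' worth of time during which the CPU is halted, rather than running in parallel with it.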