All times are UTC - 7 hours



PostPosted: Wed May 03, 2017 1:32 pm 
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 787
psycopathicteen wrote:
You can still save cycles by using an 8-bit DMA length, because there are fewer than 256 bytes to DMA and the DMA length registers reset to 0 after a DMA takes place.

Yeah, apparently I was in too much of a hurry. That saves... what, about 64 master cycles? Put it in a fast area in bank 0 and you've got 25% CPU time left instead of 20%. Even using a trampoline only adds back 32 cycles.

Well, I was in the ballpark...


PostPosted: Wed May 03, 2017 3:12 pm 
Joined: Wed May 19, 2010 6:12 pm
Posts: 2290
You can do a long jump to itself, if your assembler is set up to always long jump into the fastROM area:


PostPosted: Wed May 03, 2017 5:58 pm 
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 787
Oh, hang on - the long jump can be in an unused DMA area. Make that 24 master cycles, not 32.

My game can't do that because all 8 channels are in use for HDMA, but there's no reason this scheme couldn't.


PostPosted: Thu May 04, 2017 9:30 pm 
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 787
...hey. Couldn't you set the stack pointer to somewhere in fast RAM if you did that mapping trick? That way all stack operations would be fast, including the irq/rti-related ones. The only slow part would be the vector load.

Code:
   [irq]             ; 6 fast cycles + 2 slow cycles = 52 master clocks
   sep #$20          ; 3 fc = 18 mc
   pha               ; 3 fc = 18 mc
   lda #DMA_length   ; 2 fc = 12 mc
   sta $4375         ; 4 fc = 24 mc
   lda #$80          ; 2 fc = 12 mc
   sta $00           ; 3 fc = 18 mc
   sta $420B         ; 4 fc = 24 mc
   lda #$0F          ; 2 fc = 12 mc
   sta $00           ; 3 fc = 18 mc
   lda $4211         ; 4 fc = 24 mc
   pla               ; 4 fc = 24 mc
   rti               ; 7 fc = 42 mc

298 master clocks, for an improvement over my original function of 84 master clocks or about 22%. Just having a fast stack saves 20 master clocks by itself. Remaining compute time for the scanline is 26% vs. 20% for the original. The only way I can see to speed this up, short of reserving an index register, is to ensure that A is always 8-bit in the main code, eliminating the sep #$20.
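
The tally can be checked mechanically. This is a minimal Python sketch, assuming 6 master clocks per fast cycle and 8 per slow cycle, as in the listing. The names `FAST`, `SLOW` and `routine` are made up for illustration, and the stack-access count is one plausible way to arrive at the 20-clock figure, not necessarily how it was originally worked out:

```python
FAST = 6  # master clocks per fast CPU cycle
SLOW = 8  # master clocks per slow CPU cycle

# (instruction, fast cycles, slow cycles), copied from the listing
routine = [
    ("[irq]",           6, 2),
    ("sep #$20",        3, 0),
    ("pha",             3, 0),
    ("lda #DMA_length", 2, 0),
    ("sta $4375",       4, 0),
    ("lda #$80",        2, 0),
    ("sta $00",         3, 0),
    ("sta $420B",       4, 0),
    ("lda #$0F",        2, 0),
    ("sta $00",         3, 0),
    ("lda $4211",       4, 0),
    ("pla",             4, 0),
    ("rti",             7, 0),
]

total = sum(f * FAST + s * SLOW for _, f, s in routine)

# Assumed stack accesses that become fast: 4 pushes on IRQ entry,
# 1 for pha, 1 for pla, 4 pulls in rti. Each one saves SLOW - FAST clocks.
stack_accesses = 4 + 1 + 1 + 4
fast_stack_saving = stack_accesses * (SLOW - FAST)

original = total + 84          # the original function was 84 clocks longer
improvement = 84 / original    # fraction of the original that was saved

print(total, fast_stack_saving, round(improvement * 100))
```

The only slow cycles left in the table are the two for the vector load, which matches the remark that this is the one part a fast stack cannot help.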

Now, if you were to reserve both index registers and assume that every wai clobbers the accumulator, you could use all three registers, and there's enough room for a DMA with the same length as the force blank and DMA start values:

Code:
   [irq]             ; 6 fast cycles + 2 slow cycles = 52 master clocks
   stx $4375         ; 4 fc = 24 mc
   stx $00           ; 3 fc = 18 mc
   stx $420B         ; 4 fc = 24 mc
   sty $00           ; 3 fc = 18 mc
   lda $4211         ; 4 fc = 24 mc
   rti               ; 7 fc = 42 mc

202 master clocks. That's got to be the theoretical limit... Ugly if not outright impossible to code around and leaves you with a pretty skinny active area, but we weren't planning on actually doing this anyway, so hey...
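
The same kind of check works for the stripped-down handler. Again this is only a sketch under the listing's own assumptions (6 master clocks per fast cycle, 8 per slow cycle); `minimal` and `saved_vs_full` are illustrative names:

```python
FAST, SLOW = 6, 8  # master clocks per fast / slow CPU cycle

# (instruction, fast cycles, slow cycles), copied from the second listing
minimal = [
    ("[irq]",     6, 2),
    ("stx $4375", 4, 0),
    ("stx $00",   3, 0),
    ("stx $420B", 4, 0),
    ("sty $00",   3, 0),
    ("lda $4211", 4, 0),
    ("rti",       7, 0),
]

total = sum(f * FAST + s * SLOW for _, f, s in minimal)

# Difference from the 298-clock handler that saves and restores A
saved_vs_full = 298 - total

print(total, saved_vs_full)
```

Of what remains, 94 master clocks are the interrupt entry and the rti, so there is very little left to trim in the body itself.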


...

I might seriously consider doing this FastRAM thing for my shmup port; there's plenty of room for it in the Super FX map. The combination of HDMA and H-IRQ in my raster engine eats about 3/4 of my S-CPU time, and I think using fast RAM would save 54 master clocks per scanline (I can't get rid of the trampoline because I need to repurpose the IRQ on the fly, but at least it doesn't have to be a long jump). Plus there's the advantage that I'd be able to use high-speed memory for game state - I'd basically be running at 3.58 MHz flat out...


PostPosted: Fri May 05, 2017 10:29 am 
Joined: Wed May 19, 2010 6:12 pm
Posts: 2290
You can probably have 16kB of RAM from $6000-$7fff, and mirror the top 7kB into $4400-$5fff, so that you have a power of 2 amount. Just having a lot of RAM in bank $00 would speed up code a lot, because you wouldn't need long addressing as much, and you could use the DP and SP more, as well as PEA and PEI.
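
For reference, the sizes of the two address windows can be worked out directly. This is a small Python sketch that takes the upper window to end at $7fff; `span_bytes` is a made-up helper, and nothing here says how a real cartridge would decode these ranges:

```python
def span_bytes(first: int, last: int) -> int:
    """Number of bytes in an inclusive address range."""
    return last - first + 1

upper = span_bytes(0x6000, 0x7FFF)   # window at $6000-$7fff
lower = span_bytes(0x4400, 0x5FFF)   # window at $4400-$5fff

print(upper // 1024, lower // 1024, (upper + lower) // 1024)
```

Under that reading the two windows expose 8 kB and 7 kB, 15 kB in total, which would fit inside a 16 kB (power of 2) RAM chip with 1 kB left unmapped.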

With fast RAM, you can also make use of self-modifying code.


PostPosted: Fri May 05, 2017 1:13 pm 
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 787
Perhaps I could get rid of the trampoline in my game, at least in the H-IRQ, by overwriting the beginning of the H-IRQ routine with a jump when I want one of the HV-IRQs. And I think I can avoid needing multiple versions of the H-IRQ during a frame by simply rewriting a branch. (Oh wait - there are two of that branch because of the stagger-step. It should still fit...)

Maybe I should figure out whether I'm actually in need of more compute time first. I mean, the display engine does work as matters stand...


PostPosted: Sat May 20, 2017 4:12 pm 
Joined: Sat Aug 20, 2016 3:58 am
Posts: 34
One more thing:

Nicole wrote:
CPU cycles are different. Depending on the area of memory accessed, one CPU cycle = 6 master cycles


So, if the PPUs have 325996 master cycles, then at 3.58 MHz the CPU has 325996/6 CPU cycles?


PostPosted: Sat May 20, 2017 5:18 pm 
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19110
Location: NE Indiana, USA (NTSC)
Señor Ventura wrote:
So, if the PPUs have 325996 master cycles, then at 3.58 MHz the CPU has 325996/6 CPU cycles?

Your number 325996 appears to be 1364 (the total number of qpels per scanline) times 239 (the number of active picture scanlines per frame with tall display turned on).

But in practice, you won't get exactly that many cycles for several reasons:

  1. Refresh
    The CPU takes a 40-qpel break every scanline to refresh a row of RAM. This means you have 1324, not 1364, qpels per scanline.
  2. HDMA
    The CPU is paused during HDMA, if you're using that. This accounts for 8 qpels per byte, plus some overhead that I'm not sure of.
  3. Slow RAM
    Every RAM read or write incurs a 2-qpel wait state, extending the CPU cycle from 6 to 8 qpels. In particular, every stack access (PE*, PH*, PL*, JSR/RTS, BRK/RTI, and d,S mode), every access through direct page (d, d,X, and d,Y modes), and every access through a pointer ((d), [d], (d),Y, [d],Y, (d,S),Y, and (d,X) modes) will incur a wait state unless you're doing something fairly tricky, such as putting direct page in unused DMA registers (LDA #$4300 TCD).
  4. NMI finishing during vblank
    Your NMI handler is likely to finish its chores before the end of vertical blanking, giving you a few extra scanlines prior to active picture to get work done.
  5. Underscan
    Most games on an NTSC system turn off the tall display bit, making active picture 224, not 239, scanlines long.

Realistically, you might get 1324/7 = 189.1 CPU cycles per scanline while PBR:PC is in fast ROM, or 45205 during a tall active picture. This is still much greater than 113.7 on the NES though.
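
The figures in this reply can be reproduced with a few lines of Python. This is a sketch, not a timing model: the constant names are invented, and reading the divisor 7 as a rough average between 6-qpel fast cycles and 8-qpel slow RAM cycles is an assumption:

```python
QPELS_PER_LINE = 1364      # total qpels (master clocks) per scanline
REFRESH_QPELS = 40         # CPU pause per scanline for RAM refresh
TALL_LINES = 239           # active scanlines with tall display on
SHORT_LINES = 224          # active scanlines with tall display off
AVG_QPELS_PER_CYCLE = 7    # assumed mix of 6-qpel and 8-qpel cycles

total_qpels = QPELS_PER_LINE * TALL_LINES           # the 325996 in the question
usable_per_line = QPELS_PER_LINE - REFRESH_QPELS    # 1324 after refresh
cycles_per_line = usable_per_line / AVG_QPELS_PER_CYCLE
cycles_tall = cycles_per_line * TALL_LINES
cycles_short = cycles_per_line * SHORT_LINES

print(total_qpels, round(cycles_per_line, 1), round(cycles_tall), round(cycles_short))
```

HDMA overhead and any spare vblank time are left out, so the results are only the starting point described in the list.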


PostPosted: Sat May 20, 2017 6:57 pm 
Joined: Sat Aug 20, 2016 3:58 am
Posts: 34
tepples wrote:
Realistically, you might get 1324/7 = 189.1 CPU cycles per scanline while PBR:PC is in fast ROM, or 45205 during a tall active picture. This is still much greater than 113.7 on the NES though.


Thank you so much, that is what I was searching for :D

Besides, the CPU doesn't always run at 3.58 MHz, so during the moments within a frame when it runs at 2.68 MHz or 1.79 MHz, DMA adjusts its speed too, right? (If it didn't, the maximum peak would be about 8.43 KB.)

Anyway, all this time everybody believed that the SNES only had 5.72 KB, but it's possible that it has as much as 7.18 KB, or even a bit more...


PostPosted: Sat May 20, 2017 7:12 pm 
Joined: Wed May 19, 2010 6:12 pm
Posts: 2290
DMA always runs at 2.68 MHz, even with fast ROM.
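
One way to see where that figure comes from, as a Python sketch with stated assumptions: the NTSC master clock is taken as roughly 21.477 MHz (six times 3.58 MHz), DMA is taken to move one byte every 8 master clocks, and the 40-clock refresh pause is assumed to hold DMA as well. Setup overhead is ignored and all names are illustrative:

```python
MASTER_HZ = 21_477_272        # assumed NTSC master clock, about 6 x 3.58 MHz
CLOCKS_PER_DMA_BYTE = 8       # assumed fixed cost of one DMA byte
USABLE_CLOCKS_PER_LINE = 1364 - 40   # scanline length minus refresh pause

dma_bytes_per_second = MASTER_HZ / CLOCKS_PER_DMA_BYTE
dma_bytes_per_line = USABLE_CLOCKS_PER_LINE / CLOCKS_PER_DMA_BYTE

def scanlines_needed(n_bytes: float) -> float:
    """Scanlines of uninterrupted DMA needed to move n_bytes."""
    return n_bytes / dma_bytes_per_line

print(round(dma_bytes_per_second / 1e6, 2), dma_bytes_per_line)
```

Because the cost per byte is counted in master clocks rather than CPU cycles, the rate stays the same whether the code that started the DMA runs from fast or slow memory.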


PostPosted: Sun May 21, 2017 8:14 am 
Joined: Sat Aug 20, 2016 3:58 am
Posts: 34
psycopathicteen wrote:
DMA always runs at 2.68 MHz, even with fast ROM.


How is that possible? What makes the DMA unit run at 2.68 MHz when the CPU runs at 3.58 MHz?


PostPosted: Sun May 21, 2017 8:49 am 
Joined: Mon Jan 23, 2006 7:47 am
Posts: 70
The 65816 is halted while the 5A22 performs the DMA.


PostPosted: Sun May 21, 2017 9:04 am 
Joined: Sat Aug 20, 2016 3:58 am
Posts: 34
creaothceann wrote:
The 65816 is halted while the 5A22 performs the DMA.


Are you sure? If I use the CPU for all of its cycles in a frame, I'm not leaving any room for the DMA to work.


PostPosted: Sun May 21, 2017 9:16 am 
Joined: Tue Apr 05, 2016 5:25 pm
Posts: 121
Exactly. CPU code execution cuts into potential DMA time, and DMA cuts into potential CPU time. The more time you spend in your NMI routine, the less time you have for computing everything else in the following frame. But if you're out of CPU time in one frame you're probably not going to be doing any DMA during the following NMI anyway.

_________________
SNES NTSC 2/1/3 1CHIP | serial number UN318588627


PostPosted: Sun May 21, 2017 9:40 am 
Joined: Sat Aug 20, 2016 3:58 am
Posts: 34
HihiDanni wrote:
Exactly. CPU code execution cuts into potential DMA time, and DMA cuts into potential CPU time. The more time you spend in your NMI routine, the less time you have for computing everything else in the following frame. But if you're out of CPU time in one frame you're probably not going to be doing any DMA during the following NMI anyway.


I thought the DMA managed to transfer 6.14 KB per frame because it works during the whole frame, alongside the CPU.

For how long does the DMA need to be active within a frame to transfer those 6.14 KB, then?

