All times are UTC - 7 hours

 Post subject: SNES Timing Questions
PostPosted: Fri Jul 15, 2016 9:53 pm 
Offline

Joined: Fri Jul 15, 2016 9:47 pm
Posts: 13
Hey all, I posted this on Reddit and they forwarded me here (after giving me some helpful information). Here's the question:

I've been working on a SNES emulator for fun, and have found some decent information, except for actual timing information. I'm probably just overthinking this, or maybe I just need more information, but here's my conundrum. There's this page with some timing information: http://wiki.superfamicom.org/snes/show/Timing

It talks about the SNES master clock running at about 21.477 MHz, says that internal IO instructions take 6 cycles, that memory accesses can take 6, 8 or 12 cycles, and that there are 1364 cycles per scanline (most of the time). This is cool; I just need to figure out which instructions take which timings and I'll have an idea of how many instructions to process per frame.
Then, I get to the instruction timings: http://wiki.superfamicom.org/snes/show/65816+Reference

This references CPU cycles, and none are even more than 8 cycles (many are between 1 and 6), and CPU cycles are not the same as master clock cycles. I found that the CPU can run at 2.68MHz most of the time, but can also run at 3.58 MHz or 1.79 MHz.

I'm wondering, does anyone have any good information on this stuff? Most of what I find appears to be from the exact same source, and has these two different ways of talking about timing, which doesn't make sense to me. Can someone help me make sense of this, or point me to a source that can give me a good idea about these things? Thanks ahead of time!


PostPosted: Fri Jul 15, 2016 10:09 pm 
Online

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6511
Location: Seattle
hatfarm wrote:
This references CPU cycles, and none are even more than 8 cycles (many are between 1 and 6), and CPU cycles are not the same as master clock cycles. I found that the CPU can run at 2.68MHz most of the time, but can also run at 3.58 MHz or 1.79 MHz.
Right. Every CPU instruction takes some number of CPU cycles; each CPU cycle in turn takes 6, 8, or 12 master clock cycles depending on which memory it's accessing.

"Internal" cycles on the CPU, and reads from or writes to "Fast" memory regions, specifically the upper 3/8th of memory when enabled (banks $80-$BF pages $80-$FF and banks $C0-$FF all pages) and most registers (banks $00-$3F and $80-$BF, pages $20-$3F and $42-$5F), take place in 6 master clock cycles.

Cycles reading or writing to "normal" memory regions (banks $00-$3F, pages $00-$1F and $60-$FF; banks $80-$BF, pages $00-$1F and $60-$7F; banks $40-$7F all pages; and that same upper 3/8th of address space when fast memory is not enabled) take 8 master clock cycles.

Finally, cycles reading or writing to "slow" memory, which is only banks $00-$3F and $80-$BF, pages $40 and $41, take 12 master clock cycles.


PostPosted: Fri Jul 15, 2016 10:36 pm 
Offline

Joined: Fri Jul 15, 2016 9:47 pm
Posts: 13
Okay, so if I'm understanding this correctly then, the instruction CPU cycles are multiplied by the master clock cycles to get the total number of cycles an instruction takes? Is that right?

Thanks for the link, I hadn't found that site, I'll definitely be giving it a thorough read.


PostPosted: Fri Jul 15, 2016 10:47 pm 
Online

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6511
Location: Seattle
No, it's far more involved than just multiplied.

For example, a NOP instruction takes two CPU cycles: one to fetch the byte that is the instruction, and one internal. That's either going to take 8+6=14 master clock cycles, or 6+6=12 master clock cycles, depending on where it's executing from.
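To put that NOP arithmetic into code, here is a minimal sketch. The names `kInternalCycle` and `nopMasterClocks` are made up for illustration; the only inputs are the per-access speeds described above.

```cpp
#include <cassert>

// Internal CPU cycles always take 6 master clocks.
const unsigned kInternalCycle = 6;

// Master clocks for one NOP, given the access speed (6, 8 or 12 master
// clocks) of the address the opcode byte is fetched from.
inline unsigned nopMasterClocks(unsigned fetchSpeed) {
  assert(fetchSpeed == 6 || fetchSpeed == 8 || fetchSpeed == 12);
  // CPU cycle 1: opcode fetch, speed depends on the memory region.
  // CPU cycle 2: internal operation.
  return fetchSpeed + kInternalCycle;
}
```

So `nopMasterClocks(8)` gives 14 and `nopMasterClocks(6)` gives 12, which is why a flat "CPU cycles times N" multiplication doesn't work.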


PostPosted: Sat Jul 16, 2016 12:02 am 
Offline

Joined: Fri Jul 04, 2014 9:31 pm
Posts: 818
Also note that the CPU reference on superfamicom.org, while useful for programming, is grossly oversimplified. For one thing, it only gives the minimum number of CPU cycles for each instruction, ignoring cycle-adding cases like 16-bit mode, non-page-aligned DP, and so on. Also the explanations sometimes leave a lot to be desired. This and this have more detail. So does fullsnes, apparently, but it's a bit cryptic...

EDIT: I'm sorry; the superfamicom.org page does in fact mention the cycle-adding cases at the bottom. Just make sure you look that far...

Also look up DRAM refresh. The CPU actually only gets 1324 master cycles per scanline, because the memory controller stalls it for 40 master clocks right about the middle of each scanline to refresh the contents of WRAM. Exact timing on this is kinda squirrelly and varies between models.
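As a rough budget calculation, assuming the typical 1364-clock scanline and a flat 40-clock refresh stall (real hardware is less tidy, as noted), something like this gives an upper bound on work per scanline. The constant and function names are hypothetical.

```cpp
#include <cassert>

// Figures quoted in this thread for a typical scanline.
const unsigned kMasterClocksPerScanline = 1364;
const unsigned kDramRefreshClocks = 40;

// Master clocks left for the CPU on a typical scanline.
inline unsigned cpuClocksPerScanline() {
  return kMasterClocksPerScanline - kDramRefreshClocks;
}

// Upper bound on how many instructions of a given master-clock cost
// fit into one scanline.
inline unsigned maxInstructionsPerScanline(unsigned clocksPerInstruction) {
  assert(clocksPerInstruction > 0);
  return cpuClocksPerScanline() / clocksPerInstruction;
}
```

For instance, at 14 master clocks per NOP this allows at most 94 NOPs per scanline.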

Also, DMA has some funky timing associated with it, but the Timing page on superfamicom.org seems to have the appropriate information.


Last edited by 93143 on Sat Jul 16, 2016 1:55 am, edited 3 times in total.

PostPosted: Sat Jul 16, 2016 12:18 am 
Offline

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3153
Location: Nacogdoches, Texas
93143 wrote:
non-page-aligned DP

What is "page-aligned"? I know I've heard you say that before, but I'm curious now. I thought using a 16-bit accumulator instead of 8 was the only thing that could add a cycle. Actually, I may be hallucinating, but does another cycle get added if you're using 16 bit x and y, even if you're not moving data in or out of them (so not stx, ldy, etc.)? In other words, is "lda $00,x" going to take one more cycle with a 16 bit x?

I don't need to worry about per cycle stuff yet, but I have a bad feeling I will as stuff gets tight.


PostPosted: Sat Jul 16, 2016 12:33 am 
Offline

Joined: Sun Mar 27, 2016 7:56 pm
Posts: 138
You can set the direct page to start anywhere from $00:0000 to $00:ffff via the 16-bit DP register. If you set it to something that isn't a multiple of $0100, however, it will add an extra cycle to direct page addressing instructions.

There are actually a number of different things that can add extra cycles to instructions. This is probably the best resource I've found at the moment; it's better than the one on the SFC wiki at least, though, uh... it should be noted that I did spot a typo for either bytes or cycles in this table at one point, and then forgot where it was. Good luck?


PostPosted: Sat Jul 16, 2016 12:42 am 
Offline

Joined: Fri Jul 04, 2014 9:31 pm
Posts: 818
Espozo wrote:
What is "page-aligned"?

Zero bottom byte. A "page" is 256 bytes, just like a bank is 65,536 bytes. If the bottom byte of DP is zero, the CPU can just take the operand of a direct-page instruction (an 8-bit address) and stick the top byte of DP on top to generate the absolute address. But if the bottom byte of DP is not zero, it has to actually add the 8-bit direct page address to the full 16-bit DP to generate the absolute address, and that takes longer.
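In emulator terms, a sketch of that could look like the following. It covers only plain (non-indexed) direct-page addressing, and the function names are invented for this example.

```cpp
#include <cstdint>

// Effective address (always in bank $00) for a plain direct-page access.
// The sum wraps at 16 bits, so it never leaves bank $00.
inline uint16_t directPageAddress(uint16_t d, uint8_t operand) {
  return static_cast<uint16_t>(d + operand);
}

// Extra CPU cycle charged when the low byte of D is not zero, i.e. when
// the CPU really has to add instead of just gluing the bytes together.
inline unsigned directPagePenalty(uint16_t d) {
  return (d & 0x00FF) != 0 ? 1 : 0;
}
```

With D = $1200 and operand $34 the address is $1234 and there is no penalty; with D = $12F0 the address is $1324 and the access costs one more cycle.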

Quote:
is "lda $00,x" going to take one more cycle with a 16 bit x?

No, but "lda $0000,x" does. Direct page instructions take an extra cycle for indexing regardless of the size of X/Y.


PostPosted: Sat Jul 16, 2016 12:49 am 
Offline

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Espozo wrote:
93143 wrote:
non-page-aligned DP

What is "page-aligned"? I know I've heard you say that before, but I'm curious now. I thought using a 16-bit accumulator instead of 8 was the only thing that could add a cycle. Actually, I may be hallucinating, but does another cycle get added if you're using 16 bit x and y, even if you're not moving data in or out of them (so not stx, ldy, etc.)? In other words, is "lda $00,x" going to take one more cycle with a 16 bit x?

"Pages" on 65xx CPUs are 256 bytes, i.e. $0000-00FF is page 0 (hence the term "zero page" in 6502/65c02), page 1 is $0100-01FF, etc.

The "common" cycle additions on 65816, for some operations (it varies per addressing mode):

1. Add 1 cycle if 16-bit accum (or 16-bit X/Y, for opcodes like ldx and ldy)
2. Add 1 cycle if low byte of D (direct page register) is a value other than $00
3. Add 1 cycle if, when using indexed addressing (ex. lda $12FF,x), the access crosses a page boundary

#1: Should be obvious.

#2: Already covered by Nicole and 93143.

#3: Consider what happens if you do (assume 16-bit accumulator) ldx #1 ; lda $12FF,x. The accumulator will get loaded with data from address $1300 and address $1301. This costs an extra cycle because the effective address has to wrap a page ($12-->$13) when doing the calculation ($12FF + 1).

Branch instructions also have cycle penalties for page crossing, as well as whether or not the branch is taken (branches taken cost an extra cycle). Other opcodes and addressing modes have similar cycle penalties.

And yes, the cycle penalties can "stack" (meaning you can have two of them applying at the same time to cause, say, a 2-cycle penalty).
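To show the stacking, here is a small sketch for one concrete case: lda with plain direct-page addressing (opcode $A5), where rules 1 and 2 can both apply. The `CpuState` struct and function name are hypothetical, and only the fields needed here are modeled.

```cpp
#include <cstdint>

// Hypothetical slice of CPU state, just enough for this example.
struct CpuState {
  bool accumulator16;  // true when the M flag is clear (16-bit accumulator)
  uint16_t d;          // direct page register
};

// CPU cycles for lda with plain direct-page addressing (opcode $A5).
inline unsigned ldaDirectPageCycles(const CpuState& cpu) {
  unsigned cycles = 3;                     // opcode fetch, operand fetch, data read
  if (cpu.accumulator16) cycles += 1;      // rule 1: second data byte
  if ((cpu.d & 0x00FF) != 0) cycles += 1;  // rule 2: low byte of D is not $00
  return cycles;
}
```

With an 8-bit accumulator and D = $0000 this gives 3 cycles; with a 16-bit accumulator and D = $0010 both penalties apply and it gives 5.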

Please refer to the book Programming the 65816 (including the 6502, 65C02, and 65802) by Western Design Center. For the 2015/03/17 (54MByte) version, refer to Chapter 18 and pay attention to the subscript items / footnotes at the end of each opcode.

Welcome to how/why counting cycles for program efficiency/timing is difficult.


PostPosted: Sat Jul 16, 2016 2:37 am 
Offline

Joined: Fri Jul 15, 2016 9:47 pm
Posts: 13
koitsu wrote:
Welcome to how/why counting cycles for program efficiency/timing is difficult.

Is there an easier way? I'm not 100% concerned about pixel perfect reproduction, but would like to be pretty close.

Thanks everyone for the help!


PostPosted: Sat Jul 16, 2016 10:13 am 
Offline

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
hatfarm wrote:
koitsu wrote:
Welcome to how/why counting cycles for program efficiency/timing is difficult.

Is there an easier way? I'm not 100% concerned about pixel perfect reproduction, but would like to be pretty close.

For programmers: no, there is not an easier way. You literally sit down and start counting cycles manually. Here's an example (but for a routine someone wrote for the NES/6502).

For CPU emulation, "instruction timing" is not that difficult, because the cycle counts and the "adjustments" for certain criteria (see my previous post) are documented. I even refer to the WDC document you can use in said previous post.

I can't really help you with timing/frequencies involving separate clocks, but lidnariq already covered that.

As for the different operational speeds (specifically 1.79MHz vs. 2.68MHz vs. 3.58MHz): these are for NTSC (PAL is different). 1.79MHz is the speed when accessing things like controller ports (for buttons or peripheral I/O; specifically MMIO regs $4000-41FF in banks $00-3F). 2.68MHz (a.k.a. "SlowROM") is the normal operating speed for most things (see lidnariq's post), and 3.58MHz (a.k.a. "FastROM") is what can be used for certain banks/memory regions (and can be toggled in real-time via MMIO register $420D bit 0). All these frequencies are divisions of the master clock speed (crystal) of 21.47727MHz.
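The three speeds fall straight out of dividing the master clock by 12, 8 and 6. A quick sketch of that arithmetic (names are made up; the crystal frequency is the NTSC figure quoted above):

```cpp
#include <cassert>

// NTSC master clock in Hz, as quoted above.
const double kMasterClockHz = 21477270.0;

// Effective CPU frequency in MHz if every CPU cycle took
// `clocksPerCycle` master clocks (6, 8 or 12).
inline double cpuMHz(unsigned clocksPerCycle) {
  assert(clocksPerCycle > 0);
  return kMasterClockHz / clocksPerCycle / 1e6;
}
```

`cpuMHz(12)` is about 1.79, `cpuMHz(8)` about 2.68 and `cpuMHz(6)` about 3.58. In practice the CPU switches between these on every cycle depending on the address, so none of them is "the" CPU speed.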

My advice is that if you want to do a SNES emulator, start on a 65816 emulation core (unless there's already one out there you can use -- no idea). You're not going to get "pretty graphics and sound" up and running, but you'd at least start to get some actual games *running* (though they'll likely get stuck in infinite loops waiting on SNES MMIO registers to return certain values -- that's normal. Take baby steps!).


PostPosted: Sat Jul 16, 2016 1:57 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
Here is how you compute the speed (6,8,12 clocks) of any memory address on the SNES:

Code:
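unsigned CPU::speed(unsigned addr) const {
  // Bank bit 22 or offset bit 15 set: banks $40-$7f and $c0-$ff, plus offsets
  // $8000-$ffff elsewhere. Only banks $80-$ff follow the $420d.d0 setting.
  if(addr & 0x408000) return (addr & 0x800000) ? romSpeed : 8;
  // Offsets $0000-$1fff and $6000-$7fff.
  if((addr + 0x6000) & 0x4000) return 8;
  // Offsets $2000-$3fff and $4200-$5fff: most I/O registers.
  if((addr - 0x4000) & 0x7e00) return 6;
  // Offsets $4000-$41ff: slow I/O.
  return 12;
}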


Where romSpeed is 6 when $420d.d0=1, and 8 when $420d.d0=0.

I know this routine is cryptic (it took me a long time to come up with this), but it's the smallest and fastest possible implementation of the logic. Many smart people have tried to best me on this routine with lookup tables and other such tricks, but nothing ends up faster or simpler than the above.

If you want to know the regions, you can reference the docs linked earlier.
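For anyone who wants to check the routine against the region list given earlier in the thread, here is a standalone variant. Turning romSpeed into a `fastRom` parameter and the name `accessSpeed` are choices made for this sketch, not part of the original code.

```cpp
#include <cassert>

// Standalone copy of the speed() logic. fastRom stands for $420d.d0 = 1.
// Returns master clocks per access (6, 8 or 12) for a 24-bit address.
inline unsigned accessSpeed(unsigned addr, bool fastRom) {
  assert(addr <= 0xFFFFFF);
  const unsigned romSpeed = fastRom ? 6 : 8;
  if (addr & 0x408000) return (addr & 0x800000) ? romSpeed : 8;
  if ((addr + 0x6000) & 0x4000) return 8;
  if ((addr - 0x4000) & 0x7e00) return 6;  // unsigned wraparound is intended
  return 12;
}
```

Spot checks: $7E:0000 (WRAM) gives 8, $00:2100 (a PPU register) gives 6, $00:4016 (controller port) gives 12, and $80:8000 gives 6 or 8 depending on `fastRom`.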


PostPosted: Mon Jul 18, 2016 9:20 pm 
Offline

Joined: Fri Jul 15, 2016 9:47 pm
Posts: 13
Thank you so much for that! It's way better than what I had.

I want to make sure I understand the timing.

Here's roughly what I have for the BRA instruction:
Code:
CPUCycleCount = FAST_CPU_CYCLE + (this.memory.getMemAccessCycleTime(this.pbr, this.pc) << 1);


My reasoning is that there is one memory access for the instruction fetch and another for the PC offset operand. The FAST_CPU_CYCLE (which is 6 master cycles) then covers the internal addition and the move into the PC, which gets us to the 3 CPU cycles that the instruction is supposed to use.

Is this the right thinking? Looking at the manual, sometimes a single instruction can grab a word vs a byte, but I'm not seeing that be reflected with instruction fetches and operand fetches.

Thanks again for all your help!


PostPosted: Mon Jul 18, 2016 11:00 pm 
Offline

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
hatfarm wrote:
Is this the right thinking? Looking at the manual, sometimes a single instruction can grab a word vs a byte, but I'm not seeing that be reflected with instruction fetches and operand fetches.

Speaking strictly about branch opcodes (excluding the brl opcode):

1. These branch instructions are 2 bytes in length: 1 for the opcode, 1 for the operand. The operand byte is signed, so a branch can only go back 128 bytes or forward 127 bytes, measured from the address of the instruction that follows the branch. They're "PC relative" rather than absolute addresses.

2. bra always costs 3 CPU cycles. Conditional branches cost 2 CPU cycles when not taken and 3 when taken.

3. If emulation mode is enabled (CPU flag e=1), and the branch is taken (i.e. for conditional branches, the condition proves true), and the effective address calculated crosses a page boundary, then there is an additional 1 cycle penalty.
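A sketch of how an emulator might count this, assuming the usual 65816 figures: 2 CPU cycles for a conditional branch that is not taken, one more when it is taken (bra is always taken, hence 3), and the emulation-mode page-crossing penalty on top. The function name and parameters are made up for the example.

```cpp
#include <cstdint>

// CPU cycles for a 2-byte relative branch (everything except brl).
// pcAfter is the address of the instruction following the branch,
// which is what the signed offset is added to.
inline unsigned branchCycles(bool taken, bool emulationMode,
                             uint16_t pcAfter, int8_t offset) {
  unsigned cycles = 2;        // opcode fetch + operand fetch
  if (!taken) return cycles;
  cycles += 1;                // internal cycle to add the offset to PC
  const uint16_t target = static_cast<uint16_t>(pcAfter + offset);
  // Emulation mode only: one more cycle if the target is on another page.
  if (emulationMode && (target & 0xFF00) != (pcAfter & 0xFF00)) cycles += 1;
  return cycles;
}
```

For bra, pass `taken = true`. In master clocks, the two fetches run at the speed of the program bank and the added cycles are internal (6 master clocks each).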

Speaking generally about instructions and their lengths:

There are several instructions which "grab" more than a word (word = 16-bits) as part of their operand. Opcodes that use long addressing, for example, have operands that consist of 3 bytes (so the entire instruction is 4 bytes). An example would be opcode $af (ex. lda $123456), which uses absolute long addressing.
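For timing purposes, each of those bytes is its own CPU cycle with its own memory access. A minimal sketch for lda with absolute long addressing, assuming the standard count of 5 CPU cycles plus 1 for a 16-bit accumulator; `SpeedFn` and `ldaLongMasterClocks` are hypothetical names, and `speedOf` is assumed to return 6, 8 or 12 master clocks for a 24-bit address.

```cpp
#include <cstdint>
#include <functional>

// Assumed to map a 24-bit address to 6, 8 or 12 master clocks.
using SpeedFn = std::function<unsigned(uint32_t)>;

// Master clocks for lda with absolute long addressing (opcode $af).
// pc is the 24-bit address of the opcode byte, target the 24-bit operand.
inline unsigned ldaLongMasterClocks(uint32_t pc, uint32_t target,
                                    bool accumulator16, const SpeedFn& speedOf) {
  unsigned clocks = 0;
  // Four program fetches: the opcode, then the three operand bytes.
  // The program counter wraps inside its bank.
  for (uint32_t i = 0; i < 4; ++i) {
    clocks += speedOf((pc & 0xFF0000) | ((pc + i) & 0x00FFFF));
  }
  // One data read for an 8-bit accumulator, two for a 16-bit one.
  clocks += speedOf(target & 0xFFFFFF);
  if (accumulator16) clocks += speedOf((target + 1) & 0xFFFFFF);
  return clocks;
}
```

So a 4-byte instruction spends 4 CPU cycles just being fetched, and the data accesses come on top, each at the speed of whatever region it touches.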


PostPosted: Mon Jul 18, 2016 11:15 pm 
Offline

Joined: Fri Jul 15, 2016 9:47 pm
Posts: 13
Yeah, but the manual implies that sometimes it takes a single cycle to do that, and sometimes it takes multiple cycles (at least in the case of a word vs a byte). If it's a 4 byte instruction, how many cycles is that going to take?

