Learning the 65816 cycle penalties

tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Learning the 65816 cycle penalties

Post by tepples »

The cycle penalties for 65816 instructions are listed at "65816 Reference" on Super Famicom Development Wiki. For someone moving from 6502 to 65816, that's a fairly big table to memorize. I find it easier to memorize the rules for when a penalty applies.

[^1]: !P.m (16-bit accumulator)
Main ALU ops (adc, and, etc.)
[^2]: low(D)>0 (direct page not 256-byte aligned)
dp; dp,X; dp,Y; (dp); (dp),Y; [dp]; [dp],Y
[^3]: hibyte(a16)!=hibyte(EA)||!P.x (16-bit index or page crossed)
(dp),Y; [dp],Y; a16,Y; a16,X
[^4]: 2*!P.m (2 cycles for 16-bit accumulator)
Read-modify-write (RMW) instructions (asl, dec, etc.)
[^5]: EA!=PC (branch taken)
Branches
[^6]: EA!=PC&&P.e&&hibyte(EA)!=hibyte(PC) (taken, cross page, and emulation mode)
Branches
[^7]: !P.e (native mode)
brk, cop, rti
[^8]: !P.x (16-bit index)
cpx, cpy, etc.

Other rules:
- sr,S and (sr,S),Y have the same timing as dp and (dp),Y with an unaligned direct page and P.x=0 (16-bit index)
- As on the 6502, indexed writes (STA, STX, STY, STZ) and RMWs always incur the penalty for 16-bit index or page crossed. This is a penalty compared to LDA/LDX/LDY, but it is not listed because it is incorporated into the instruction's base cycle count.
- REP and SEP are 3 cycles to let the MX bits propagate to the entire CPU. XBA is also 3 cycles for some reason.
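
The rules above can be turned into a small calculator. This is only a minimal Python sketch, not something taken from a manual: the function names are made up, the base cycle counts (4 for lda a16,X, 3 for lda dp, 2 for a branch, 6 for inc a16) are assumed from the usual 65816 opcode tables, and p_m, p_x and p_e stand for the P.m, P.x and emulation flags as booleans.

```python
def hibyte(value):
    """Bits 15-8 of a 16-bit address."""
    return (value >> 8) & 0xFF


def lda_abs_x_cycles(a16, x, p_m, p_x):
    """lda a16,X: assumed base of 4 cycles plus penalties [^1] and [^3]."""
    ea = (a16 + x) & 0xFFFF
    cycles = 4
    if not p_m:                                # [^1] 16-bit accumulator
        cycles += 1
    if not p_x or hibyte(a16) != hibyte(ea):   # [^3] 16-bit index or page crossed
        cycles += 1
    return cycles


def lda_dp_cycles(d, p_m):
    """lda dp: assumed base of 3 cycles plus penalties [^1] and [^2]."""
    cycles = 3
    if not p_m:                                # [^1] 16-bit accumulator
        cycles += 1
    if d & 0xFF:                               # [^2] direct page not 256-byte aligned
        cycles += 1
    return cycles


def branch_cycles(pc, target, taken, p_e):
    """Conditional branch: assumed base of 2 cycles plus penalties [^5] and [^6].

    pc is the address of the instruction following the branch.
    """
    cycles = 2
    if taken:                                  # [^5] branch taken
        cycles += 1
        if p_e and hibyte(pc) != hibyte(target):   # [^6] page crossed in emulation mode
            cycles += 1
    return cycles


def inc_abs_cycles(p_m):
    """inc a16 (an RMW): assumed base of 6 cycles plus penalty [^4]."""
    return 6 + (0 if p_m else 2)               # [^4] two extra cycles for 16-bit data
```

For instance, lda_abs_x_cycles(0x12F0, 0x20, True, True) gives 5, because the index pushes the effective address from page $12 into page $13.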

Now whether these penalties are considered internal operation cycles (and thus fast even if in WRAM or slow ROM) or read cycles (and thus slow from WRAM or slow ROM) is another question. In what manual could I find that?
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Learning the 65816 cycle penalties

Post by Oziphantom »

XBA

A -> temp
B -> A
temp -> B

thus 3 cycles

The SNES only incurs the clock penalty if the address on the bus falls in one of the slow ranges. So you would need to look up the addressing mode tables for the 65816 and work out whether the CPU is doing a dummy read from the target address range. See Table 5-7 in the data sheet.
For example, for abs,X (absolute indexed) the address bus sequence is

Code: Select all

PBR,PC
PBR,PC+1
PBR,PC+2
DBR, AAH, AAL+XL ; this only happens with a 16-bit index (or a page crossing): a dummy read at the target without the carry into the high byte
DBR,A+X
DBR,A+X+1
So if you, say, read $0001,X with X = $4001 in a LoROM FastROM configuration:

Code: Select all

PBR,PC           ; 3.58
PBR,PC+1         ; 3.58
PBR,PC+2         ; 3.58
DBR, AAH, AAL+XL ; $7e0001 2.68
DBR,A+X          ; $004001 1.79
DBR,A+X+1        ; $004002 1.79
stan423321
Posts: 53
Joined: Wed Sep 09, 2020 3:08 am

Re: Learning the 65816 cycle penalties

Post by stan423321 »

Can you explain this once again, in simpler terms? Say we run across XBA while in "8-mini-cycles" ROM area or while in "6-mini-cycles" ROM area, how much time is it going to take and why?
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Learning the 65816 cycle penalties

Post by 93143 »

tepples wrote: Mon Mar 15, 2021 7:12 am In what manual could I find that?
https://github.com/larsbrinkhoff/awesom ... 65c816.txt

stan423321 wrote: Mon Mar 15, 2021 11:47 am Say we run across XBA while in "8-mini-cycles" ROM area or while in "6-mini-cycles" ROM area, how much time is it going to take and why?
When encountering an XBA, the CPU takes one cycle to load the opcode. That will be either 8 or 6 master clocks depending on where the ROM is and what the $420D setting is. It then takes two more internal cycles to do its job, because you can't directly transfer data from one register to the other while also transferring from the other to the one; you need at least one intermediate step. Both of those internal operation cycles should be 6 master clocks, according to both anomie's docs and nocash's docs.

Code: Select all

 *6b Implied -- i
   (XBA)
   (1 Op Code)
   (1 byte)
   (3 cycles)

		1	1  1   1  1	PBR,PC		Op Code		1
		2	1  1   0  0	PBR,PC+1	IO		1
		3	1  1   0  0	PBR,PC+1	IO		1
Technically the CPU has a single-stage pipeline, but unlike the Super FX it's pretty transparent to the programmer, so don't worry about it.
Last edited by 93143 on Mon Mar 15, 2021 1:20 pm, edited 1 time in total.
nocash
Posts: 1405
Joined: Fri Feb 24, 2012 12:09 pm

Re: Learning the 65816 cycle penalties

Post by nocash »

The 21 MHz clock is usually called the master clock; "mini clock" sounds a bit unfamiliar ; )

This document lists timings in detail https://github.com/larsbrinkhoff/awesom ... 65c816.txt
Oziphantom wrote: Mon Mar 15, 2021 9:48 am DBR, AAH, AAL+XL ; $7e0001 2.68
No, that step should be an Internal Operation "IO" thus 3.58MHz.
It might output something on the address bus, but doesn't actually transfer anything on the data bus.

As far as I remember, the 65816 doesn't do dummy memory transfers (unlike the 6502) and instead uses those internal operation cycles.
So the general rule is "total number of cycles - number of bytes transferred = number of internal cycles".

For example, NOP takes 2 cycles and transfers only one byte (the opcode byte, without further parameter/address/data bytes), so the second cycle is an internal one.
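
That rule of thumb lends itself to a quick estimate of master clocks per instruction. Here is a minimal Python sketch, assuming every transferred byte comes from a region of the same speed and every internal cycle costs 6 master clocks (as stated earlier in the thread). The function and constant names are made up for illustration.

```python
FAST_CLOCKS = 6     # master clocks per cycle: fast regions and internal cycles
SLOW_CLOCKS = 8     # master clocks per cycle: slow regions (WRAM, slow ROM)
XSLOW_CLOCKS = 12   # master clocks per cycle: $4000-$41FF


def instruction_master_clocks(total_cycles, bytes_transferred, memory_clocks):
    """Estimate master clocks for one instruction.

    Applies "total cycles - bytes transferred = internal cycles", charging
    memory_clocks for each transferred byte and FAST_CLOCKS for each
    internal cycle. Assumes all bytes come from one speed region.
    """
    internal_cycles = total_cycles - bytes_transferred
    return bytes_transferred * memory_clocks + internal_cycles * FAST_CLOCKS


nop_slow = instruction_master_clocks(2, 1, SLOW_CLOCKS)   # 8 + 6
xba_slow = instruction_master_clocks(3, 1, SLOW_CLOCKS)   # 8 + 6 + 6
xba_fast = instruction_master_clocks(3, 1, FAST_CLOCKS)   # 6 + 6 + 6
```

So under these assumptions XBA comes out at 20 master clocks when fetched from slow ROM and 18 when fetched from fast ROM.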
homepage - patreon - you can think of a bit as a bottle that is either half full or half empty
creaothceann
Posts: 611
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany

Re: Learning the 65816 cycle penalties

Post by creaothceann »

Table 5-7 in the 65c816 data sheet has the "VDA" (valid data address) and "VPA" (valid program address) columns, corresponding to the pins shown on page 12. Every Internal Operation cycle has zeroes in both of these columns. The 5A22 then knows to add no extra waitstate cycles to the 6 minimum master clock cycles per 1 CPU core cycle, even if the core's address bus points to a "slow" or "xslow" region.

nocash wrote: Mon Mar 15, 2021 1:16 pm The 21MHz clock is usually called master clock, mini clock sounds a bit unfamiliar ; )
To add more names to the mix: it seems that on most SNES versions, the main clock crystal is labeled "X1" and the audio crystal is labeled "X2".
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Learning the 65816 cycle penalties

Post by tepples »

creaothceann wrote: Mon Mar 15, 2021 2:59 pm Table 5-7 in the 65c816 data sheet has the "VDA" (valid data address) and "VPA" (valid program address) columns, corresponding to the pins shown on page 12. Every Internal Operation cycle has zeroes in both of these columns. The 5A22 then knows to add no extra waitstate cycles to the 6 minimum master clock cycles per 1 CPU core cycle, even if the core's address bus points to a "slow" or "xslow" region.
In particular, decoding is complete enough and fast enough that the second cycle of an implied, accumulator, or push/pull instruction, or of RTS/RTL/RTI, already has VDA=VPA=0, so /RD is not asserted. BRK and COP have VPA=1 for the signature byte.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Learning the 65816 cycle penalties

Post by Oziphantom »

nocash wrote: Mon Mar 15, 2021 1:16 pm The 21MHz clock is usually called master clock, mini clock sounds a bit unfamiliar ; )

This document lists timings in detail https://github.com/larsbrinkhoff/awesom ... 65c816.txt
Oziphantom wrote: Mon Mar 15, 2021 9:48 am DBR, AAH, AAL+XL ; $7e0001 2.68
No, that step should be an Internal Operation "IO" thus 3.58MHz.
It might output something on the address bus, but doesn't actually transfer anything on the data bus.

As far as I remember, the 65816 doesn't do dummy memory transfers (unlike 6502), and does instead use those internal operation cycles.
So the general rule is "total number of clocks - number of bytes = number of internal clocks".

For example NOP has 2 cycles, and transfers only one byte (the opcode byte, without further parameter/address/data bytes), so the second cycle is an internal one.
So the SNES is smart enough to work out and gate the IO operations?
creaothceann
Posts: 611
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany

Re: Learning the 65816 cycle penalties

Post by creaothceann »

It just takes a few transistors.

In fact, the entire address map decoding logic is relatively simple: all you need to do is combine some bits.

Code: Select all

//  banks  |  addresses  | speed | mapping                              banks  |  addresses  | speed | mapping
// --------+-------------+-------+-----------------------------        --------+-------------+-------+-----------------------------
// $00-$3F | $0000-$1FFF |  slow | address bus A + /WRAM mirror        $80-$BF | $0000-$1FFF |  slow | address bus A + /WRAM mirror
//         | $2000-$20FF |  fast | address bus A                               | $2000-$20FF |  fast | address bus A
//         | $2100-$21FF |  fast | address bus B                               | $2100-$21FF |  fast | address bus B
//         | $2200-$3FFF |  fast | address bus A                               | $2200-$3FFF |  fast | address bus A
//         | $4000-$41FF | xslow | internal CPU registers                      | $4000-$41FF | xslow | internal CPU registers
//         | $4200-$43FF |  fast | internal CPU registers                      | $4200-$43FF |  fast | internal CPU registers
//         | $4400-$5FFF |  fast | address bus A                               | $4400-$5FFF |  fast | address bus A
//         | $6000-$7FFF |  slow | address bus A                               | $6000-$7FFF |  slow | address bus A
//         | $8000-$FFFF |  slow | address bus A + /CART                       | $8000-$FFFF | note2 | address bus A + /CART
// --------+-------------+-------+-----------------------------        --------+-------------+-------+-----------------------------
// $40-$7D | $0000-$FFFF |  slow | address bus A + /CART               $C0-$FD | $0000-$FFFF | note2 | address bus A + /CART
// $7E-$7F | $0000-$FFFF |  slow | address bus A + /WRAM               $FE-$FF | $0000-$FFFF | note2 | address bus A + /CART
// --------+-------------+-------+-----------------------------        --------+-------------+-------+-----------------------------

// 65c816 VDA ( 1 bit ): is valid data     address
// 65c816 VPA ( 1 bit ): is valid program  address
// 65c816 RWB ( 1 bit ): is read operation
// 65c816 MAR (24 bits): Memory Address Register (address bus value)

// as per jwdonal's schematics:
// /CPURD and /CPUWR = A address bus indicator
// /PARD  and /PAWR  = B address bus indicator

is_valid_address := VDA or VPA;
is_reading       := is_valid_address and (    RWB);
is_writing       := is_valid_address and (not RWB);

is_lower_bank        :=                        ((MAR.BankByte AND %01000000) = %00000000);  // banks 00..3F and 80..BF
is_WRAM_bank         :=                        ((MAR.BankByte AND %11111110) = %01111110);  // banks 7E..7F
is_CART_bank         := (not is_WRAM_bank) and ((MAR.BankByte AND %01000000) = %01000000);  // banks 40..7F and C0..FF, except for 7E..7F
is_WRAM_mirror       :=      is_lower_bank and ((MAR.HighByte AND %11100000) = %00000000);  // offsets 0000..1FFF
is_B_bus             :=      is_lower_bank and ((MAR.HighByte              ) = $21      );  // offsets 2100..21FF
is_internal_register :=      is_lower_bank and ((MAR.HighByte AND %11111100) = %01000000);  // offsets 4000..43FF
is_upper_offset      :=                        ((MAR.HighByte AND %10000000) = %10000000);  // offsets 8000..FFFF
is_A_bus             := not (is_internal_register or is_B_bus);

/CPURD := not (is_A_bus and is_reading);
/CPUWR := not (is_A_bus and is_writing);
/PARD  := not (is_B_bus and is_reading);
/PAWR  := not (is_B_bus and is_writing);

/CART  := not ((is_upper_offset and not is_WRAM_bank) or is_CART_bank);  // offsets 8000..FFFF select the cartridge only outside banks 7E..7F
/WRAM  := not (is_WRAM_mirror  or is_WRAM_bank);
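
The speed column of the table and the VDA/VPA rule from earlier in the thread can be combined into one lookup. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, and since "note2" is not defined in the table, it is assumed here to mean "fast when FastROM is enabled via $420D, otherwise slow".

```python
FAST, SLOW, XSLOW = 6, 8, 12   # master clocks per CPU cycle


def region_clocks(addr, fastrom=False):
    """Master clocks per access for a 24-bit address, following the table."""
    bank = (addr >> 16) & 0xFF
    offset = addr & 0xFFFF
    # ASSUMPTION: "note2" = fast only in banks $80-$FF with FastROM enabled.
    rom_clocks = FAST if (fastrom and bank >= 0x80) else SLOW
    if bank & 0x40:             # banks $40-$7F and $C0-$FF (WRAM at $7E-$7F is slow too)
        return rom_clocks
    # Banks $00-$3F and $80-$BF share the same low-offset layout.
    if offset < 0x2000:
        return SLOW             # WRAM mirror
    if offset < 0x4000:
        return FAST             # address bus A and B registers
    if offset < 0x4200:
        return XSLOW            # internal CPU registers at $4000-$41FF
    if offset < 0x6000:
        return FAST
    if offset < 0x8000:
        return SLOW
    return rom_clocks           # $8000-$FFFF


def cycle_clocks(vda, vpa, addr, fastrom=False):
    """Internal cycles (VDA=VPA=0) always take 6 clocks, whatever the address."""
    if not (vda or vpa):
        return FAST
    return region_clocks(addr, fastrom)
```

With this sketch, an internal cycle whose address bus happens to point at $7E0001 costs 6 master clocks, while a real read from $004001 costs 12.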